Getting C++ Threads Right


>> Why are we interested in threads? Well, this is a fairly canonical argument by now: future performance enhancements are, as we all know, likely to come from multi-core processors. Performance is going to come from having parallel machines rather than from enhancing single-core performance. Programmability is a limiting factor, and there is in fact lots of research on alternate programming models for multi-core processors, and for parallel machines in general. There are many people working on transactional memory. There's a lot of interest in message passing models, clearly. There's also interest in more flexible data-parallel models that we could use to program these machines. On the other hand, if we're talking about parallelizing a single non-numerical application, I think if we look at the status quo at least, threads and locks pretty clearly dominate, particularly if you look at what's going on on desktops. The main reason they do, I suspect, is essentially that threads and locks have been around for a long time. They clearly predate the widespread use of multi-core processors. They were there primarily to deal with multiple event streams and things like that that had nothing to do with parallel machines. They were originally used primarily as a programming model for other reasons, but that made them widely available, and hence the natural mechanism to exploit.

On the other hand, if we look at threads and locks in general, in many circles at least, they have a bad reputation. Certainly we all have horror stories about intermittent failures of multi-threaded applications. There's a general feeling that they're too complicated. We commonly hear statements like "locks don't compose," which I think is somewhat true, though it's actually hard to make that precise. If you do a web search for the phrase "threads are evil" you'll get lots of hits. So the point of this talk is not so much that these problems aren't real at all; I think some of these problems are real and we should be looking at alternate programming models as well. On the other hand, not all of the bad reputation here is inherently deserved. Not all of it is really rooted in profound problems. Sometimes the issue, I'll argue, is just that we essentially got things wrong the first time and we haven't gone back and fixed some of those things. Up until now, a lot of this wasn't that important, because parallel machines—basically multi-processor machines—were relatively esoteric; most people didn't deal with them. Now that's clearly changed, and in fact I think it's important to fix these problems because they have become important, and I'll try to convince you that most of these problems are in fact fixable. The problems aren't really inherent.

I'm going to be a little bit sloppy here about what's actually my work and what's other people's work; there were many contributors to a lot of this. There's a partial list up here. Last time I checked, two of the people up there actually work for Google. That may have changed by now; it's hard to keep track. So there are many other people who worked on this. As always, when looking back, it's easy to complain about stuff that was done in the past. In fact—I don't know if I'll get to [INDISTINCT]—some of what was done in the past is in fact quite clever but did not quite make it; it's a very interesting near miss, in my opinion. So it's very easy to take cheap shots at stuff that was done 10 to 15 years ago, and I apologize for that, but nonetheless I think these are things that need to be corrected. Some of these problems actually are difficult technical problems; some of them are not. So some of the solutions I'll point to here are in fact quite obvious when you look at them, and in some sense that's part of the point: in some cases we got things wrong that are, I think, not that profound.

So as a group, what we've been doing here is basically trying to identify some of the problems that we have with multi-threaded programs at the moment, and trying to resolve those whenever that's possible. And there are a number of things we have been trying to do. Some of the technical solutions in fact turned out to be non-trivial, and I will try to give some hint of that at least here. So working on the technical problems has certainly been part of this effort. There's also been an effort to push some of this through the C++ standards committee. Lawrence [indistinct], who is in the room here, has been heavily involved with that as well. The goal here is to provide proper support for threads in the next C++ standard, which is, for historical reasons—well, maybe historical reasons, it's not quite clear yet—referred to as C++0x. The joke here for a long time was that the x may actually be 0xA.

The status, as of the moment, is that the current working paper for the next C++ standard actually includes a lot of what I'm going to be talking about, but not everything.

So it includes the threads memory model that I'll be touching on here, which basically defines what it means to access a shared variable from a thread—that's sort of the foundational question that we're addressing here. It also includes, thanks in large part to Lawrence, an atomic types and operations library in the current draft.

And it almost includes the threads API as part of the language definition, as opposed to defining it in an auxiliary library that you use in order to get threads. We're trying to do some other things as well. In addition to the C++ committee, several of us have been talking to the C and POSIX committees. Currently, at least my understanding is that the C committee is fairly likely to do things in a way that's similar to what the C++ committee is doing, which will be a very good thing. Several of us have also been spending time talking to other interested parties—and the other interested parties here turn out to be mostly processor vendors—in order to address some of the other issues. In particular, the issues here, as I'll show you, extend down to the hardware level in a few cases. So one of the things that's actually improved dramatically recently, from my perspective, is that both Intel and AMD have released new memory model specifications for what's commonly known as the x86 architecture, which resolve some of the issues that I'll tell you about later. And we've been trying to spread the word about some of these issues, to try to resolve past misconceptions.

The rest of the talk is basically structured as a list of problems, and then explanations of the solutions that we expect to use in order to resolve those problems. Most of these are now in the C++0x working paper, but I should warn you they're not yet in a formal standard, so anything can still change at this point. So, again, the main goal here is to try to convince you that, first of all, there are problems here; on the other hand, these problems are solvable, and for the most part there are solutions on the way.

So, starting with the list of problems that we encountered, I actually want to start at sort of the lowest level, the hardware level. If we're talking about multi-threaded programs, basically everything relies on what it means to assign to a shared variable and to access the value of a shared variable. That gets us to this issue of memory consistency, or memory ordering, which underlies most of these. To give you a better feeling for what we're talking about here, if you're not familiar with the issue, I'll give you one simple example. Say we have two threads, one of which assigns to x and the other of which assigns to y, both of which are shared variables, and then each thread reads the other variable. The question is: can both of these threads see the initial value of the variable, and fail to see the assignment in the other thread? With most people's intuition of how threads work, that shouldn't be possible, because one of these threads has to go first, intuitively. That is, one of the assignments has to go first, and therefore the other thread should be able to see a one. There are in fact some basic algorithms that rely on the fact that this can't happen. And a lot of programmers, when they first start to program with threads, basically assume that this is the case. They implicitly assume what's normally called sequential consistency, which I'll informally define as: the threads behave as though the steps of the individual threads are just interleaved. So, by people's normal intuition, this outcome should not be allowed.

Whether or not this is allowed is in fact determined by all sorts of things, including the hardware, the compiler, the thread libraries and so on. And since everything is built on top of the hardware, for all of these things to do their job correctly—to provide the right semantics here—we need to get the understanding of the hardware right. The interesting observation is that if we look at this particular outcome, most hardware will in fact allow it, even if you assume that the other layers of the stack translate this directly to hardware instructions the way you expect. Most hardware will in fact allow both of those reads to see zero. The reason is that most hardware, when it executes a store instruction, does not make the result of that store immediately visible to other processors. It just goes into a write buffer on the processor, where it's not yet visible.
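The two-thread example above can be sketched in C++11-style syntax, which is close to what the C++0x working paper proposed (the names x, y, r1, r2 are mine, not from the talk). With sequentially consistent atomics—the default—the "both read zero" outcome is forbidden:

```cpp
#include <atomic>
#include <thread>

// Shared variables from the example above (names are mine).
std::atomic<int> x{0}, y{0};
int r1, r2;

// Each thread stores to one variable, then reads the other.
// With the default sequentially consistent ordering, at least one
// thread must observe the other's store, so r1 == 0 && r2 == 0
// can never happen.
bool both_read_zero(int trials) {
    for (int i = 0; i < trials; ++i) {
        x = 0; y = 0;
        std::thread t1([] { x.store(1); r1 = y.load(); });
        std::thread t2([] { y.store(1); r2 = x.load(); });
        t1.join(); t2.join();
        if (r1 == 0 && r2 == 0) return true;  // forbidden under seq_cst
    }
    return false;
}
```

If the loads and stores were instead tagged std::memory_order_relaxed, the forbidden outcome can actually show up on real x86 hardware, for exactly the write-buffer reason described above.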

So if these threads execute basically exactly in lock step, the initial stores may not actually have made it to visible memory, or to a coherent cache, by the time the second statements are executed, so both of them may in fact see the initial value. So what happens here? Well, fundamentally, the fact that the hardware doesn't quite implement what we expect to see shouldn't be a problem, because we have that in all sorts of other respects as well. We normally just try to hide any misfeatures like that higher up in the software stack, right? So provided we actually understand the hardware rules, we should be able to exploit those rules to implement some usable programming model that we can all agree on, and that we can then program to—one that can be implemented reasonably on the hardware, possibly by telling the hardware something else.

The widely held belief about this, at least as far as I can tell, is that in general, allowing the hardware to cheat like this and give unexpected results is probably a good thing, because most of the time we actually don't care about simultaneous accesses to shared variables like that. As you'll see, we generally try to discourage people from introducing simultaneous accesses like that anyway. And most of the time we don't care, and since we don't care, the hardware should be allowed to do whatever is cheaper, which includes giving us these unexpected outcomes. Occasionally we will care about disallowing things like the outcome on the last slide. In that case, the hardware generally provides other instructions that allow us to notify it that we do care about memory visibility ordering—these are normally called memory fence instructions on most hardware; they take that form in any case—and those allow us to get back to sequential consistency, the behavior that programmers expect. So in those cases, basically, the software stack can hide the fact that the hardware will reorder these memory accesses underneath the covers.

So that's the theory. What actually happened? Well, as I said, things have fortunately gotten better here recently. On the other hand, if we go back to sometime around 2006, and we look at many of the architectures, what actually happened was that not only did they allow this kind of weaker memory ordering, they often didn't specify very carefully—in ways that some of us could understand—exactly what reordering was allowed, what the semantics were and what they weren't. And this was shared among several architectures at the time, in any case. The result, as far as I can tell, was general confusion, and some weird things happened higher up in the software stack, which probably tended to aggravate the issue, because it also made it less clear to the hardware designers what was really expected of them.

I've put up a couple of interesting consequences that we found in this process. I guess GCC is commonly used here, so if you look at what the GCC compiler provides, it actually exports a primitive directly to the programmer that allows you to control this sort of memory reordering. It has a call named __sync_synchronize, which is intended to ensure that memory operations are not reordered across the call. So it's defined in the documentation as a full memory barrier. On the other hand, if you look at its implementation on x86, at least last time I checked, it actually generates a no-op—which, it turns out, by no reasonable interpretation of the x86 memory model, is what it should do to be consistent with that specification. And the problem, as I'll hit on the next slide, is that it's actually not clear this is fixable. Just fixing the implementation here probably won't solve the problem, for performance reasons.

Another interesting situation, which at least to me seems strange, and for which I've never gotten a complete explanation from Intel or AMD engineers, is that around the time of the Pentium 4, the architecture introduced instructions called lfence and sfence—load fence and store fence. On the other hand, based on the understanding of the x86 memory model that most of us have, but which wasn't clear from the description at the time, loads and stores on x86 are in fact ordered anyway. So it's not at all clear what these do. As far as I can tell, for most code, lfence and sfence on the Pentium 4 are in fact something like hundred-cycle no-ops. This is not completely true, actually—there are some funny corner cases where you might actually need them, which makes this more interesting. If you look at the performance assumptions that went into this, the assumption was that by limiting memory ordering to explicit fence instructions, we only require the hardware to enforce ordering of loads and stores precisely in those cases where we need it, so we should always be ahead. In fact, that's also not the way it worked out, as far as I can tell.

If we look at the performance of these instructions—and I'm picking on this slightly obsolete chip here, which is probably unfair in some ways—for a Pentium 4, the cost of one of the instructions that actually enforces ordering was in general on the order of 100 cycles in the best case, assuming no cache misses and so on. So that's kind of expensive. On the other hand, as far as I know, around that time there was really only one architecture that enforced sequential consistency at the hardware level, and that was PA-RISC. PA-RISC was in fact reasonably performance-competitive, which suggests to me that it certainly did not cost 100 cycles per memory operation to enforce ordering on the PA-RISC processor. So the theory here doesn't seem to have quite meshed with reality either.

The other interesting thing that happened is that we ended up with some memory models in which it was actually not at all clear that the fence instructions provided by the hardware could be used to get you back to full sequential consistency. And if full sequential consistency is not implementable, it turns out you also can't implement the Java memory model—so you can't implement some of the programming languages defined on top of the hardware. The example here illustrates why this might be an issue. This is a by-now semi-famous example, usually referred to as IRIW, the Independent Reads of Independent Writes example. It looks kind of complicated at first glance, but what's going on is that threads 1 and 2 just do independent writes to different shared variables, and threads 3 and 4 read them in opposite orders and each convinces itself that the other variable was set first. So thread 3 reads things in such a way as to prove conclusively that x was set first, and thread 4 reads things in such a way as to prove conclusively that y was set first. The question is: can this happen? The problem is that it's intuitively not at all clear that just by inserting fence instructions between the two loads in thread 3 and in thread 4, we preclude this sort of outcome—because one could still intuitively think of a machine where thread 1 and thread 3 are somehow located close together, for example as threads on the same processor core, and thread 2 and thread 4 are located close together, so that each pair sees each other's writes first. Just inserting a fence there prevents those loads from being reordered, but doesn't change much of anything. Yet this program is fully fenced—we have fences between every pair of instructions. So it's not clear that we can do any more either.
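For concreteness, here is the IRIW example sketched with C++11-style sequentially consistent atomics (variable names are mine). Under sequential consistency, all four threads must agree on a single interleaving of the two stores, so threads 3 and 4 cannot disagree about which of x and y was written first:

```cpp
#include <atomic>
#include <thread>

std::atomic<int> x{0}, y{0};
int r1, r2, r3, r4;

// Threads 1 and 2 write independently; threads 3 and 4 read in
// opposite orders. Under sequential consistency, thread 3 concluding
// "x was set first" (r1 == 1, r2 == 0) and thread 4 concluding
// "y was set first" (r3 == 1, r4 == 0) cannot both happen in the
// same execution.
bool disagreement_seen(int trials) {
    for (int i = 0; i < trials; ++i) {
        x = 0; y = 0;
        std::thread t1([] { x.store(1); });
        std::thread t2([] { y.store(1); });
        std::thread t3([] { r1 = x.load(); r2 = y.load(); });
        std::thread t4([] { r3 = y.load(); r4 = x.load(); });
        t1.join(); t2.join(); t3.join(); t4.join();
        if (r1 == 1 && r2 == 0 && r3 == 1 && r4 == 0) return true;
    }
    return false;
}
```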

Okay. Going on here, the good news is that this has actually gotten significantly better recently. If we look at the status now, as I mentioned earlier, Intel and AMD have in fact published much better descriptions of the memory models of those processors, which clarify a lot of these issues. So in particular, if we look back at the example on the preceding slide, the IRIW example, it actually turns out that on x86 we can get sequential consistency for that example. On the other hand—somewhat surprisingly, and I think nobody knew this before the specs came out—it actually requires that you implement the stores in the first two threads, to x and y, using an exchange instruction. So the good news is that we have a solution; the bad news is that I suspect we still have a bunch of JVMs out there that are technically broken and don't follow the rules, because nobody knew what the rules were at the time they were written. In general, similar things are happening on other architectures, and we seem to have fairly good stories pretty much everywhere, as far as I can tell—either they have fairly good stories, or we've convinced the vendors that they have to do something about it.

Okay. Moving up a level in the stack to languages and compilers.

The programming rules for dealing with threads, I will claim, have in general been pretty unclear—in some languages more so than in others. Probably the language that's in the best shape here is Java. The Java memory model, which basically defines the semantics of shared variables in Java, was fixed in 2005, and it turns out there are some ongoing discussions about some of these issues; in fact the issue I just talked about, implementability on various architectures, is one of those that came up since then. So things haven't completely settled, but there is a mostly consistent story there. If we look at .NET and OpenMP, my personal opinion is that there's a lot of clarification that could be used there. And for the rest of this talk, as I hinted not too inconspicuously in the title, I'll focus mostly on C and C++, where the assumption so far has been—since these languages don't explicitly talk about threads—that I'm programming with the base language plus a thread library, generally, for my purposes, Pthreads, because that's backed by a standard specification.

So let's look at C and C++ with those programming rules. The general rule—probably everywhere, though in the past that was not completely clear on the Windows platform, I think—is that we disallow simultaneous access to shared variables, what's commonly referred to as a data race. So we don't get to access the same variable from two threads simultaneously if at least one of the accesses is a write. That's, for example, the rule if you look at the POSIX standard. Unfortunately, many people who program with threads in C++ don't look at the POSIX standard. But it's fairly clear that that was intended to be the rule. This is actually a rule that goes back fairly far: it was basically what was prescribed in the Ada 83 standard, and it was explored in the [INDISTINCT] thesis after that.

It turns out that this solves a whole bunch of problems. By just insisting that programmers not have data races in their programs, basically you can't tell that a lot of nasty stuff is going on under the covers. You cannot tell, for example, whether your compiler reorders loads and stores underneath, because any program that could detect that two stores were reordered would have to read them essentially at the same time, without synchronization. So any program that does that would have some execution in which the loads and the stores occur simultaneously, so it would have a data race. Basically, this gives the implementation license to cheat without getting caught. Now, this applies not only to compiler reordering; it also applies to the sort of hardware reordering that I talked about at the beginning. The hardware is allowed to delay store instructions by storing to the write buffer, because you can't tell—any code that can tell is wrong, by definition. It also has the interesting consequence, which simplifies the compiler writer's job significantly, that the C or C++ compiler may rely on the assumption that variables don't asynchronously change out from underneath it. This potentially has some fairly profound consequences which I think a lot of people don't appreciate, though they seem to rarely occur in practice—which is a common issue here.

So here's an example of the kind of assumption that a compiler is currently allowed to make, and what can happen if that assumption in fact turns out to be false. Assume we have some unsigned integer variable x, and we check that x is less than three. Then, after doing some other stuff that doesn't touch x, we execute a switch statement on x. Let's assume that the switch statement is implemented by indexing into a branch table and branching indirectly to one of the locations, depending on which case we're in. Since we previously checked that x is either zero, one, or two, the compiler is allowed to elide the range check on the indirect branch through the branch table and just assume that x is within range. And that works, so long as we follow the rules. Now, what happens if somebody doesn't follow the rules and asynchronously changes x between the conditional and the switch statement—say x is changed to 5 there by some other thread in the middle? What happens is that the switch statement will simply go ahead and use x to index into the branch table, indexing off the end of the table and branching to whatever location is there, causing some weird stuff to happen—basically causing the program to take a wild branch. So the important thing is that, in this model, by introducing a data race where we weren't allowed to have one, we get something that does not correspond to just reading a different value from the shared variable. We get complete nonsense at the end. And that's allowed by current standards.

So what's the reality here? As far as I can tell, the common attitude is that data races in fact aren't so bad.

In fact there was a paper published in [indistinct] fairly recently about benign data races, so clearly, in some cases people believe that data races are okay. For example, the kind of thing people commonly do is maintain approximate counters in ordinary shared variables, by just incrementing the counter from multiple threads without any sort of locking or synchronization, and then reading the counter asynchronously. Since these are approximate counters that are only supposed to be statistically valid, the assumption is: well, this may go wrong every once in a while, because you have a race, so you might actually lose one of the updates; on the other hand, that's not so bad. Another fairly common idiom that people use is something called double-checked locking. The idea here is that you lazily initialize a value, and you avoid doing the expensive synchronization on every access to that value by checking, ahead of time, a flag that says "has this been initialized?", and only doing the synchronization if it appears not to have been initialized. It's another one of these things that mostly works, and it's a fairly common idiom; on the other hand, it involves a data race. In many environments, people actually combine this with atomic operations that are provided in platform-specific ways. On Windows, there are the Interlocked operations. I already mentioned that GCC has the __sync operations, which are often used for reference counting. And these things are often combined with reading the resulting value using an ordinary variable reference. All of this really isn't particularly well defined at the moment, and the read access generally appears to be a data race to the compiler. So you have to worry about things like what happened on the preceding slide, where you ended up taking a wild branch. Fortunately, this doesn't happen very often in practice, so we rarely get caught by this.
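A data-race-free version of the double-checked locking idiom described above can be sketched with C++11-style atomics (the Widget type and function names here are mine, purely for illustration):

```cpp
#include <atomic>
#include <mutex>

struct Widget { int value = 42; };   // hypothetical lazily-built object

std::atomic<Widget*> instance{nullptr};
std::mutex init_mutex;

Widget* get_widget() {
    // Fast path: an acquire load, so a thread that sees the pointer
    // also sees the fully constructed Widget behind it.
    Widget* p = instance.load(std::memory_order_acquire);
    if (p == nullptr) {
        std::lock_guard<std::mutex> lock(init_mutex);
        p = instance.load(std::memory_order_relaxed);
        if (p == nullptr) {          // re-check under the lock
            p = new Widget;
            instance.store(p, std::memory_order_release);
        }
    }
    return p;
}
```

The acquire/release pairing is what the racy flag-plus-plain-read version fails to guarantee: without it, a thread can see the flag set while the initialized contents are not yet visible.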

On the other hand, it seems fairly clearly against the rules. And yet we have the problem that locks are expensive enough that you can't quite tell people not to do this and to properly synchronize their programs—especially for something like reference counting in C++, which is fairly prevalent, the cost of doing it any other way is in fact too high.

Okay. Here's an interesting complication which very often enters the discussion at this point when I bring this up: well, if you share these variables and they can change asynchronously, shouldn't you just declare the variable volatile? And the answer here, unfortunately, is: it depends. If we look at the current Java definition—after 2005, at least—then the answer there is fairly clearly "yes". For something like double-checked locking and some other idioms, that in fact works just fine. So in Java, if I have an initialization flag called x_init, declared volatile, and I assign to x and then set the initialization flag, that in fact guarantees that the assignment to x becomes visible to other threads before the assignment to x_init becomes visible to them. So if I check x_init in another thread and then read x, I'm fine.
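The same publication idiom can be sketched in C++ with atomics rather than Java volatiles (x and x_init as in the discussion; the concrete values are mine): the release store to x_init guarantees that the write to x is visible first.

```cpp
#include <atomic>

int x = 0;                           // ordinary (non-atomic) data
std::atomic<bool> x_init{false};     // the initialization flag

void publish() {
    x = 17;                                          // ordinary write
    x_init.store(true, std::memory_order_release);   // set the flag last
}

// Returns x once the flag is visible; -1 is a hypothetical
// "not initialized yet" value for illustration.
int read_if_ready() {
    if (x_init.load(std::memory_order_acquire))
        return x;    // guaranteed to see 17, not the stale 0
    return -1;
}
```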

This is often implemented by putting a hardware fence instruction before the volatile store, and there's other stuff required to implement it as well. I want to point that out explicitly because there's an interesting contrast here.

In contrast, if we're programming in OpenMP, here's a quotation from the OpenMP 2.5 spec—and I think it's essentially unmodified in the 3.0 draft: a reference that modifies the value of an object with volatile-qualified type behaves as if there were a flush operation on that object at the next sequence point. Which essentially means that in OpenMP, the implementation should insert the fence after the assignment to x_init, if I declare x_init volatile here—and I haven't yet figured out why it says that.

I've had a lot of trouble figuring out any use case in which that's actually what you wanted to say, but nonetheless it's clearly not consistent. And if we look at C with POSIX threads, the relevant quotation here—because it makes things clearer than what's actually in the standard—is from Dave Butenhof, one of the people who originally wrote the standard: volatile provides no help whatsoever in making code like that safe. And if you read the standard carefully, that's what it says—not as clearly, but... So we have a bunch of really different situations here.

So, what's actually in the C++0x working paper? Basically—again, largely thanks to Lawrence here—concurrent access to variables is allowed, but the variables have to be specially declared as atomic objects. So, for example, you can declare a variable as being of type atomic int, and that provides operations to concurrently access the variable safely. And for all purposes, concurrent accesses to an atomic int are no longer viewed as a data race; that's the way this is dealt with formally. Other than that, the C++ working paper actually doesn't make much change to what's currently the official status quo. It again reaffirms that there are no benign data races in C++. The other way to say that is: wherever you have a data race, you should be using atomics; otherwise the behavior is undefined, and that includes taking wild branches. It's important to be explicit here that the C++ volatile qualifier is still of no help, so we haven't changed that either. There was a lot of discussion about changing this; it turns out to be very difficult to do, because C++ volatile in fact has some current semantics which would be difficult to preserve if we also used it for this purpose. So that's why the decision was made to instead invent something new, called atomics rather than volatiles, in C++—even though that introduces this unfortunate translation problem between C++ and Java, in that Java volatiles are C++ atomics, and C++ volatiles are not Java volatiles. And it turns out there are in fact some other, more technical problems here that come up.

So far, we've told the programmer not to write code that includes data races. On the other hand, it turns out that compilers sometimes introduce code that adds data races, which is rather unfortunate. There's one fairly well-known case where this happens, and that's when you're updating a small struct field; I'll give you a detailed example of that. It turns out this can also happen in more complicated cases, if clever optimizations result in speculative stores, basically. If they result in writes back to locations that weren't written in the source, the compiler has to be careful that what it's writing back is what it thought was there before. In a multithreaded application, though, that's not necessarily safe, because by just writing back what was there before, you may be hiding an update from a different thread. Very occasionally this actually does happen, and the annoying thing is that, if you look at the current standards, it's not at all clear that this is disallowed. So, for the struct field update example, which is probably the one that's bitten the most people in practice: consider the structure up there, which is a particularly tricky case that is, in fact, by the current working paper, miscompiled by nearly every compiler. What happens is, I have a bunch of bit fields in between two character fields in a 32-bit struct, and I assign one of the bit fields and one of the character fields simultaneously. The way

that's normally compiled these days, particularly on 32-bit machines but probably also on 64-bit machines, is this: I copy the structure as a whole into a temporary, I implement the bit field update by updating the bit field in the temporary value, and then I store the result back to the whole structure. On the other hand, the character field update in the other thread is implemented simply as a byte store, so that's fine and uncontroversial. It's the bit field update there in thread 1 that's controversial. As I suspect most of you already realize, this isn't going to work, so we can step through it quickly. If we look at how this can actually be executed: we copy x into a temporary; we update the c bit field in the temporary; now we assign to the d field in the other thread; and then we copy the temporary back to x, and we've just overwritten the update to d. So what happens here, as a result, is that I've lost the effect of thread 2. As I already said, this behavior is something that's both allowed and fairly common. If you look at current standards, in fact, it's worse than that. This is what implementations typically do; implementations typically run into this problem with bit fields.

The standard actually allows this to happen even if I'm just writing to fields A and D

in the structure. Yeah? >> Suppose that X, instead of being a structure with bit fields, was, let's say, an array. >> Yeah. >> And let's suppose thread 1 wrote to array element zero. >> Yeah. >> Thread 2 wrote to array element 22. >> Yeah. >> No one would expect those to be atomic, because X is not atomic, but the writes to elements zero and 22 would be fine. So if you don't expect that when X is an array, why would anyone ever expect it to be true in this case? >> Oh, okay. The question, if I can summarize it correctly, is: why would you expect this to work, given that X is not declared to be atomic?

>> Everything [INDISTINCT] right? I would never expect that to work, for the same reason I would never expect the array case to work. >> Okay. In fact, I'm not sure I quite understood the array example; you're asking about simultaneous updates to two array elements? >> I did not mean to blow up the talk, it's just that... >> Yeah. >> It's just, I do not think... >> Oh, okay. No, no, this is a good question, and I think it's good to be clear about this. In fact, if you look at a lot of existing code, it makes assumptions along these lines. Usually it's in the form of: I have a structure, and it has, say, these fields in it, and I would like to protect C and D by different locks. >> [INDISTINCT] proposal. >> Just one rule so far, or a slight [INDISTINCT]: never use [INDISTINCT], how about that? >> I... >> I mean, it might be hard to follow that rule now.

>> ...well, actually, let me postpone that. What we do is actually somewhere in the middle, which is, hopefully you'll agree, a better solution. Okay. So here's a quick summary of what we actually say about all of this in the current working paper. Subject to the no-data-races rule, and, in response to that question, I should be clear that the no-data-races rule applies at the level of scalar values, so you may not simultaneously access a single scalar value from two different threads. In particular, that does not automatically prevent simultaneous access from two threads to distinct fields or distinct array elements. Subject to that, each update affects a memory location, which the standard defines as either a scalar value or a contiguous sequence of bit fields. The standard then defines exactly which assignments can be seen by each reference to a memory location, which is essentially what you need in order to define the semantics of shared memory here. For ordinary accesses, there has to be exactly one assignment to a particular memory location that's visible to a particular access, to a particular load. The way this is structured, if we look back at the preceding example, a reference to x.d after completion of both of those threads in the last example is, in fact, required to see a value of one. And because those two threads access distinct memory locations, they are, in fact, not defined to be involved in a data race. So, as a result, the preceding implementation of bit field assignment, as I outlined it there, is incorrect. We no longer allow that implementation. Okay?

>> Is this supposed to address, you know, 32-bit values on 8-bit processors or anything like that, or is that going to go? >> Thirty-two-bit values on 8-bit processors in fact work out generally okay, because of the fact that we don't allow races. So that's in fact not an issue. >> Okay. I am fine. >> We do not. There's an issue in the implementation of atomics on such processors. >> My question, like this... >> For non-atomic accesses, we do not allow simultaneous access without some sort of synchronization between them, so you can't tell whether a particular 32-bit store is implemented in four pieces or in one.

>> Okay. Good. >> So, as a result of this, the preceding implementation is not okay. On the other hand, assignments to the bit fields in the middle may in fact not occur concurrently: those two form a data race if I have one thread assigning to bit field x.b and one to x.c. Hopefully that answers the last question. So that's sort of a compromise solution: you can still use bit fields, but they have weaker semantics than you might otherwise expect. And it only affects bit fields; it does not affect the adjacent character fields. It turns out this actually has some other, more subtle consequences, which are of interest primarily to compiler writers. I mentioned earlier this notion of speculative stores introduced by compilers.

So here's an example of the kind of optimization that, abstractly, you would like to see in your compiler; on the other hand, it's not one that we allow anymore. This is a simple loop that traverses a linked list and counts the number of positive elements in the list. I should emphasize here that the count in this particular case is a global variable that might be shared between threads. It's fairly common for current compilers to optimize this to code that loads the initial value of count into a register before the beginning of the loop, then executes the loop that traverses the linked list, just incrementing the register value, and then stores the register value back to the global at the end. This transformation is not technically safe with more than one thread, because if, for some reason, I know that I'm invoking this on a list with no positive elements, then I know that the original code never touches count. So there is in fact no data race involved if I run this concurrently with another thread that does update count. On the other hand, the transformed version always stores to count, so the compiler potentially introduces a data race by performing this transformation. There are ways to avoid this; in this particular example, you can perform the store only if the register value is non-zero. Sorry, actually you have to do this a little differently: you have to count in a local variable initialized to zero, and then, if its value has changed from zero, you add it to the global at the end. That will work. Okay.

So the consequence of this is that we are losing some useful optimizations. On the other hand, it does give the programmer a much simpler and more consistent story than what we had before. It prevents some really mysterious compiler-introduced bugs. And it outlaws primarily optimizations that we can replace by code that, as far as we can tell, is only slightly slower, so I don't think this is going to have a major impact on code performance. So that was sort of at the low level. Here's another interesting problem, actually; another one that Lawrence has been spending a lot of time thinking about. This one, unlike the previous one, is C++ specific. What's happening here is, I think it's fairly common to write code which at some point during

its execution starts some daemon helper thread. So we have a main thread running; at some point, the helper thread here starts up; and at some point the main thread exits. In a C++ program, when the main thread exits, various static destructors, destructors for statically allocated variables, get invoked, and those objects go away. So at some point, say, there's a variable down here which is used by a shared library that's referenced from the daemon thread. The shared variable that's used by this library goes away, because it's destroyed.

Now the problem is, in this model, the daemon thread is still running. There's nothing to prevent the daemon thread from accidentally waking up sometime after that and trying to call into that shared library, now accessing one of the static variables of the library that has already been destroyed at that point. As far as I can tell, this is a fairly common scenario in multithreaded C++ programs, which I suspect hasn't generally been identified as a major problem because processes only crash when they're about to shut down anyway, and nobody notices that much. But nonetheless, it's not a good state of affairs. For this particular one, actually, we don't have what I would consider to be a complete solution.

The current state of affairs in the working paper is that you basically have two choices. You can either shut down all threads properly before process exit; that's not necessarily an easy thing to do, because some of those threads may be waiting on I/O, and you don't have a very good mechanism for getting them to break out of the I/O wait. The other alternative, and this is Lawrence's invention, is a way to exit the process by executing only certain special clean-up actions and not destructors. Not all destructors, at which point you sort of get a semi-graceful exit, which is guaranteed not to crash; on the other hand, some of the things that you expect to be cleaned up might not be cleaned up. Okay. Another problem here: shared libraries. Part of the story

in making all of this work is that, in order to prevent races in programs, we need to have synchronization operations, like the lock and unlock in this slide, that ensure mutual exclusion and so on. And we would like them to prevent memory reorderings with respect to those synchronization operations. So if we look at this particular example here, this is basically just incrementing x inside the critical section. We certainly want to make sure that the implementation of the synchronization [INDISTINCT] such that we can't just take, say, the load of x into the temporary variable and somehow reorder that out of the critical section. So we want to prevent this reordering from happening, either by the compiler or by the hardware.

There are a couple of mechanisms that are normally used for this. We normally prevent the reordering by the compiler by neglecting to inform the compiler of what lock and unlock are at all. So they look like opaque functions to the compiler, and therefore it looks to the compiler like lock and unlock might touch x, and therefore you can't do this sort of movement of the memory operation, because that already looks to the compiler like it might be incorrect. The other part of this is that we have to prevent the hardware from doing the equivalent of that. We normally do that by putting fence instructions into the implementation of lock and unlock. These fence instructions are typically somewhere between fairly and very expensive, so we need to be careful that we understand exactly where they're needed.

On the other hand, so far it's actually been fairly unclear what kind of reordering is actually allowed, and therefore what kind of fence instructions we need. If we look at Java, for example, it's fairly clear we're not allowed to move things out of the critical section, as one might expect; we are allowed to move things into the critical section, in both directions. If we look at the Pthreads standard and read it very naively, basically what it seems to say is that we can't move things either way: these operations just prevent reordering across them. If we study that really hard and think about what it really means, you might think you can prove that moving things in is allowed, and that the result is not observable. It actually turns out you can't do that.

You can't move things in past the lock, and the reason is this example. And as I said, don't ever write code like this. The idea here is that I set some variable, x equals 42, and then say I'm done with accessing variable x by acquiring a lock, which is sort of the opposite of the way things are supposed to work. I then go ahead, in the other thread, and wait until the lock is set, which I can do by calling trylock repeatedly. And then I claim that, at this point here, it should be the case that x is 42. The problem is, this should work; it's expected to work in some sense, at least with sufficiently [INDISTINCT] or something; but it breaks if I reorder the assignment x equals 42 and the lock acquisition. So that means that, in this particular case, essentially I can't move the assignment x equals 42 into the critical section, past the lock acquisition. So what do actual implementations do? Well, we

looked at some open-source implementations a while ago. This has, I think fortunately, probably improved since. So, on weakly ordered hardware, a bunch of implementations, NPTL locks, tended to in fact assume that they could get away with the weaker implementation here. So they tended to allow movement of things into the critical sections in both directions, but not out, which in practice is probably okay, but technically incorrect. Some of them, I think often just because the hardware gives stronger guarantees anyway, implement the version where nothing moves either way. In some cases that's probably fine; it's stronger than required. I think on [INDISTINCT], this is probably too slow. Some of them did the right thing. And we found one interesting case that did something you don't want to use; I think that's been fixed by now. Okay. The C++ solution to this is

slightly weird. Movement into the critical section is allowed in both directions. So how do we deal with that funny example that we didn't want anybody to write? Well, we tell you that trylock may in fact fail spuriously, which makes the example that I have up there clearly incorrect. So it still misbehaves, but I no longer have a memory consistency problem; I just have a wrong piece of code. Which actually was the goal here, I claim, because, as I said, this is not code you should be writing anyway. And I know of no cases, though I'm told there are some really obscure ones, where this actually interferes with legitimate code; all uses of trylock that I've ever made continue to work under this interpretation. So, together with all the other rules, the important point from this slide is that, by using this slightly weird definition, we basically allow the standard to guarantee that memory operation reordering is completely invisible to the programmer for data-race-free programs. So even if you use trylock or whatever, the bottom line is, you don't have to worry about all this reordering stuff that we've been talking about here. That's not quite true, because in fact there are some low-level library facilities that actually allow you to write really fast hand-tuned code that violates this rule. However, it's very clearly identified when you're doing that, and so long as you don't do that, you get the guarantee. So without doing that, basically, data-race-free programs behave as though the steps of the individual threads are just interleaved. Let me see, I have a couple more quick slides

here. Libraries in general, ones that don't deal specifically with threads: I think there have also been some missteps here. The general wisdom about locks and, for example, general-purpose container libraries, I think it's safe to say, is that if you're accessing a general-purpose container library in multithreaded code, the client knows whether or not that container is being shared between threads, so the locking should occur in the client. This doesn't apply to things that are specifically designed for threaded use; there the rules may be different. If you don't do that, you get performance problems and possibly deadlock. And there's a lot of experience to substantiate that at this point, I think.

On the other hand, if you look at what we actually have, it's sort of a mixed bag, because people were learning at the time. So the original Java collections, like Vector, are synchronized implicitly and violate this rule. And on the Posix side, in fact, we have this problem as well. There's this interesting phenomenon that if you write a simple file copy program using putc and getc in Posix and time it on your favorite Linux machine, then if you fork an empty thread at the beginning of the copy program, one which immediately returns and does nothing, it slows down by about a factor of 10. That's because Posix requires that putc and getc implicitly acquire and release a lock. So they're made thread-safe in a usually unnecessary sense, and that behavior is triggered, on Linux at least, once you've forked the initial thread. So, the C++0x solution here

is that we basically follow what's sort of the STL convention, and the recent Java convention, which is that containers do not visibly acquire a lock by default. The details here still have to be determined; we're not really done with this yet. So where does this leave us? The one thing that this has taught me, at least, is that if we start out with a single-threaded language and add threads as an afterthought, that in general doesn't work all that well. I think we really do need to fix the remaining problems that we introduced as a result of this. It's quite likely that there actually is a better general-purpose parallel programming model, at least for certain applications; I'm not sure there's one that's quite as general. On the other hand, I think it's useful to be able to fix the threads-and-locks programming model underneath, if for no other reason than that we can actually get a reasonable evaluation of the alternatives as a result. And these alternative programming models are generally implemented on top of threads and locks anyway, so we'd better get that right. Okay, thank you. Any questions? >> I know in Java there's a concurrent

[INDISTINCT] which basically tries to put things into separate HashMaps in order to improve performance when it's being accessed concurrently. Where does that fit into this idea that the client does all the locking, given that the client does not have visibility into the internal implementation of the data structure? >> I tried to be fairly careful to weasel out of this, but... There are clearly some data structures that need to know about threads specifically, and that's an example: structures that are specifically designed for multithreaded access. So in that particular case, it's not a data structure you should ever use if you only have single-threaded access. The idea is that the interface, basically, is designed for concurrent access, and as a result the implementation can take advantage of that. That certainly makes sense. So I'm all in favor of providing that sort of facility in addition to the basic containers. >> Earlier in the talk, you said certain kinds

of race conditions are common, but failures from them are fortunately rare. I was a little

surprised to hear you use the word "fortunately". I've been writing concurrency bugs for several

decades now, and I am always happy when they show up early and sad when they show up late.

>> I actually agree with you; I was not being entirely serious when I said it that way. >> I wonder if there are any facilities contemplated, planned, or recommended for improving the things you didn't want to do but did anyway. You know, when I write those concurrency bugs for the hundred and first time, to help me find them sooner. Are there language features or processor design features or programming practices to help?

>> We've been thinking about this a little bit. I think it's clear that the eventual story here has to involve a lot of help from tools, and some of those are out there. Where we had a choice, in some sense, we've tried to make things a little bit easier for the tools. So the absolute prohibition against data races here in some sense makes things a little bit easier for data race detection tools, because if you can detect that you actually have a race, you have a bug; there is no notion of a false positive in that sense. But I think that's mostly an orthogonal issue. It's an important issue, but I don't think we have any great answers there yet.

>> Fairly early on, one of your first slides said that some of the objections to threads were that they're dangerous; I forget, you know, "threads are evil" is the thing you put up there. Considering the previous question, you haven't done anything to convince me otherwise yet. I still don't see any reason to trust anyone in this room, particularly including myself, to write threads correctly; I mean, not to implement them, but to use them correctly. Is there anything going into the C++ standards to help address that, to help us be able to trust more people to use these incredibly dangerous tools? >> I think there's a consensus that there

is a serious problem there, and that probably most programmers should be using higher-level libraries that are built on top of these. So the problem at the moment, as far as the C++ committee is concerned, is primarily one of timing. That is, we were under a relatively short deadline, and we have, hopefully, enough time to get the foundations right. We unfortunately don't currently have enough time to get those things in, but everybody realizes that those are important as well.

>> I just want to address that last question, too. It is true that threads are still hard to use, and evil, but if you don't know what your threads mean, then it is completely and utterly impossible to write correct code, not just merely hard. So this is a really important [INDISTINCT] to actually define what it means when a thread executes. Thanks. >> Any other questions?

>> If I could get there. >> Oh, okay.

>> Which libraries are you, or would you like to see, implemented, but obviously don't have time to implement? >> We haven't invested as much effort into this as I would like. The kinds of things that have been under discussion are probably only, sort of, one level up, and maybe not high enough. I mean, we've been looking at thread pools and threading-building-blocks kinds of approaches. So Intel has proposed the Threading Building Blocks library to the standards committee, and that was sort of pushed back for timing reasons. I think that's probably not high enough; I mean, there are lots of small things that need to be added as well. There are some interesting things we keep stumbling over, like getting reference counting right. It's really hard; getting it right and really optimal is really hard, and that's certainly something that should be provided in the library. And there are a bunch of things like that. >> So my question is about shared variables

and whether you can make them safer by disallowing things like taking their address, so that it is easier for analysis to catch bugs. >> We haven't really been looking at disallowing things like that. I mean, certainly you're right: you can make the analysis easier by not using those features. I just haven't heard much of an argument to actually disallow them in the language; I think it would be problematic to do that at this stage, but unfortunately... >> [INDISTINCT] make analysis much more usable? Once you take addresses of shared variables, anything [INDISTINCT]. >> Yeah... >> It's hard to track them around.

>> On the other hand, I mean, a tool certainly can detect violations of that rule, and there's nothing to prevent you from warning about violations of that rule. >> Right. >> So you're free to do that anyway, in some sense. I would be worried about trying to convince people to actually restrict the language in that way, because it's very hard to convince the committee to outlaw something that's in fairly widespread use, particularly if you can work around it somehow. >> It seems like the Java world has spent an awful lot of time putting things like concurrent container classes and that sort of stuff into their standard library. Do you ever see that happening for the C++ standard? And if so, how many years are we talking? Is this, you know, five years, ten years?

>> It's actually conceivable that this may happen in a shorter time period. I'm not sure; Lawrence, who's sitting there, has more experience predicting how the committee behaves. But there is an effort to put out another library technical report, which is sort of a non-binding addendum to the standard, on a fairly short time scale. In fact, at some point there were discussions of possibly getting that out earlier than the actual next standard, because there are fewer bureaucratic hurdles it has to go through. So it's conceivable that it might happen within the next two or three years. But I'm not very good at predicting these things.

>> Historically, the compiler development process usually involves something like: write a compiler, convince yourself it's close enough to right, send it out the door, somebody finds a bug, sends in a bug report. Certainly, one of the problems with concurrency-related bugs is that they're very often unpredictable, and it's hard to put your finger on the bug. And I wonder if you have recommendations about things a compiler writer can do to help verify that they're following the concurrency-related rules, to reduce the number of these very hard-to-pin-down errors that compilers are introducing.

>> I think that's an interesting problem. It's probably also still a research topic, unfortunately. I think some other people have been working on building infrastructures for testing concurrent programs, to try to, sort of, increase the chances of catching these bugs. But I don't know of any good solutions. I mean, in some sense, in some cases, we were trying to sort of minimize the number of possible untestable bugs. So the absolute prohibition against data races, again, is probably a good thing, because my feeling is that current compilers make the assumption that there are no asynchronous changes, and those assumptions only rarely, and in fairly unpredictable contexts, actually impact the generated code. So even though violating that assumption would only rarely actually cause the sort of behavior that I illustrated with the switch statement, I think testing for its absence would be absolutely hopeless. But, I mean, other than that, I don't have any great ideas. It's still a hard problem, I think. Okay, thank you.
