>> Why are we interested in threads? Well, this is sort of the fairly canonical argument:
future performance enhancements, as we all know, are likely to come from multi-core
processors. They're going to come from having parallel machines rather than enhancing single-core
performance. Often, though [INDISTINCT] not in all cases, especially around
here, programmability is a limiting factor, and there is in fact lots of research on alternate
programming models for multi-core processors and parallel machines in general. There are
many people working on transactional memory. There's a lot of interest in message passing
models, clearly. There's also interest in more flexible data parallel models
that we could use to program these machines. On the other hand, if we're talking about
parallelizing a single non-numerical application, I think if we look at the status quo at least,
pretty clearly threads and locks actually dominate, particularly if you look at what's
going on on desktops. The main reason that they do, I suspect, is essentially because
threads and locks have been around for a long time. They clearly predate the
widespread use of multi-core processors. They were there primarily to deal
with multiple event streams and things like that that had nothing to do with parallel
machines. They were originally used primarily as a programming
model for other reasons, but that makes them widely available and the natural
mechanism to exploit. On the other hand, if we look at threads and locks in general, in
many circles at least, they have a bad reputation. Certainly we all have
horror stories about intermittent failures of multi-threaded applications. There's a general
feeling that they're too complicated. We commonly hear statements like locks don't compose which
I think is somewhat true, though it's actually hard to make that precise. If you do--if you
do a web search for the phrase "threads are evil" you'll get lots of hits. So--the point
of this talk really is that, not so much that these problems aren't real at all; I think
some of these problems are real and we should be looking at alternate programming models
as well. On the other hand, not all of the--not all of the bad reputation here is inherently
deserved. Not all of it is really rooted in profound problems. Sometimes the issue, I'll
argue, is just that we, essentially we got things wrong the first time and we haven't
gone back and fixed some of these things. Up until now, a lot of this wasn't that important
because parallel machines were basically multiprocessor machines and were relatively
esoteric; most people didn't deal with them. On the other hand, now that has clearly
changed, and in fact I think it's important to fix these problems because now they
they have become important and I'll try to convince you that most of these problems in
fact are fixable. The problems aren't really inherent. I am going to be a little bit sloppy
here about what's actually my work and--and other people's work here, there were many
contributors to a lot of this. There's a partial list up here. Last time I checked, two of
the people up there actually work for Google. That may have changed by now, it's hard to
keep track. But--so there are many other people who worked on this. As always, when looking
back, it's easy to complain about stuff that was done in the past. In fact, I don't know
if I'll get to Sarkovski, but some of what was done in the past is in fact quite clever;
it's quite clever but it did not quite make it. It's sort of a very interesting
near miss, in my opinion. So it's very easy to take cheap shots at stuff that was done
10 to 15 years ago, and I apologize for that, but nonetheless I think these are things
that need to be corrected. Some of these problems actually are
difficult technical problems, some of them are not. So some of the solutions I'll point
to here are in fact quite obvious when you look at them, and in some sense that's part of the
point. We, in some cases, got things wrong that are in fact, I think, not that
profound. And so as a group, what we've been doing here is basically trying to identify
some of the problems that we--that we have with multi-threaded programs at the moment
and trying to resolve those whenever--whenever that's possible. And there are a number of
things we have been trying to do. Some of the technical solutions in fact turned out
to be non-trivial, and I will try to get some hint of that at least here. So working on
the technical problems has certainly been part of this effort. There's also been an
effort to try to push some of this [indistinct] the C++ standards committee. Lawrence [indistinct],
who is in the room here, has been heavily involved with that as well. So the goal here
is to provide proper support for threads in the next C++ standard,
which is, for historical reasons--well, maybe historical reasons, it's not quite clear yet--referred
to as C++0x. The joke here for a long time was that X may actually be 0xA.
The status of this, as of the moment, is that the current working paper for the next C++
standard actually includes a lot of what I'm going to be talking about, but not everything.
So it includes the threads memory model that I'll be touching on here, which
basically defines what it means to access a shared variable from a thread, which is
sort of the foundational question that we're addressing here. It also includes, thanks
in large part to Lawrence, an atomic types and operations library in the current draft.
And it almost includes the threads API as part of the language definition, as opposed
to the definition of the auxiliary library that you use in order to get threads. We're
trying to do some other things as well here. In addition to the C++ committee, we've--several
of us have been talking to the C and Posix committees. Currently, at least my understanding
is that the C committee is fairly likely to--to do things in a way that's similar to what
the C++ committee is doing, which will be a very good thing. We've also--several of
us have been spending time talking to other interested parties, and the other
interested parties here turn out to be mostly processor vendors, in order to address some
of the other issues. In particular, the issues here, as I'll show you, extend down
to the hardware level in a few cases. So one of the things that's actually improved dramatically
recently, from my perspective, is that both Intel and AMD have released new memory model
specifications for the--for what's commonly known as the X86 architecture which resolve
some of the issues that I'll tell you about later here. And we've been trying to sort
of spread the word about some of these issues to try to resolve past misconceptions. The
rest of the talk here is basically structured as a list of problems and then explanations
of the solutions that we expect to use in order to resolve those problems. Most of these
in fact are now in the C++0x working paper, but I should warn you they're
not yet in a formal standard, so anything can still change at this point. So, again
the main goal here is to try to convince you that, first of all, there are problems here;
on the other hand, these problems are solvable and, for the most part,
there are solutions on the way. So, starting with the list of problems here that we encountered,
I actually want to start down at sort of the lowest level, at the hardware level. If
we're talking about multi-threaded programs, basically everything relies on what it
means to assign to a shared variable and to access the value of a shared variable. That
basically gets us to this issue of memory consistency or memory ordering, which underlies
most of these. In order to get you a better feeling for what we're talking about here, if
you're not familiar with the issue, I will give you one simple example. If we have two
threads, one of which assigns to X and the other one assigns to Y, both
of which are shared variables, and then each thread reads the other variable. The question
is, can both of these threads see the initial value of the variable, that is, fail to
see the assignment in the other thread? With most people's intuition of how threads work,
that shouldn't be possible, because one of these threads has to go first intuitively. That is,
one of the assignments has to go first, and therefore the other thread should be able
to see a one. There are in fact some basic algorithms that rely on the fact that this
can't happen. And a lot of programmers, when they first start to program with threads, basically
assume that this is the case. They implicitly assume what's normally called sequential
consistency, which I'll informally define as: the threads behave as though the steps
of the individual threads are just interleaved. So, by
people's normal intuition, this outcome here should not be allowed.
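To make this concrete, here is a minimal sketch of the example as I understand it; the variable and function names are illustrative, not from the slide:

    int x = 0, y = 0;            // shared, both initially zero
    int r1, r2;

    void thread1() { x = 1; r1 = y; }
    void thread2() { y = 1; r2 = x; }

    // The question: can an execution end with r1 == 0 and r2 == 0, i.e.
    // with each thread failing to see the other thread's assignment?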
Whether or not this is allowed in fact is determined by all sorts of things, including the hardware,
the compiler, and thread libraries, and so on. And basically, since everything
is built on top of the hardware, for all of these things to do their job correctly, to
sort of provide the right semantics here, we need to get the understanding of the hardware
right. The interesting observation here is that if we look at this particular outcome,
most hardware in fact, even if you assume that the other layers of the stack
translate this directly to hardware instructions the way you expect, will in
fact allow this outcome here. Most hardware will in fact allow both of those to read zero.
The reason for that is that most hardware, when it executes a store instruction,
does not make the results of that store instruction immediately visible
to other processors. It just goes into a write buffer on the processor, where it's not visible.
So if these execute basically exactly in lock step, then the initial store may not
actually have made it to visible memory yet, or to a coherent cache, by the time the second
statements are executed, so both of them may in fact see the initial value. So what
happens here? Well, fundamentally, the fact that the hardware doesn't quite implement
what we expect to see shouldn't be a problem, because we have that in all sorts of other
respects as well. We normally just try to hide any sort of mis-features like that
in the software, higher up in the stack, right? So provided we actually understand the hardware
rules, we should be able to exploit those hardware rules to implement some usable programming
model that we can all agree on, and that we can then program to. And that can be implemented
reasonably on the hardware, possibly by telling the hardware something else. The widely held belief,
at least as far as I can tell, is that in general allowing the hardware to cheat
like this and give unexpected results is probably a good thing because what happens is that
most of the time we actually don't care about simultaneous accesses to shared variables
like that. As you'll see, we generally try to discourage people from introducing simultaneous
accesses like that anyway. And most of the time we don't care, and since we don't care, the
hardware should be allowed to do whatever is cheaper, which includes giving us these
unexpected outcomes. Occasionally we will care about disallowing things
like the outcome on the last slide. In that case, the hardware
generally provides other instructions that allow us to notify it that we do
care about the memory visibility ordering; these are normally called memory fence instructions
on most hardware. They take that form in any case. Those allow us to get back to sequential
consistency, to the behavior that programmers expect. So, in those cases, basically, the
software stack can hide the fact that the hardware will reorder these memory accesses
underneath the covers. So that's the theory. So what actually happened? Well, as
I said, recently things have gotten better here fortunately. On the other hand, if we
go back to sometime around 2006, in fact, if we look at many of the architectures,
what actually happened was that not only did they allow this kind of weaker memory ordering,
they often didn't specify very carefully, in ways that some of us could understand, exactly what
reordering was allowed, what the semantics were and what they weren't. And this was shared
between several architectures at the time, in any case. The result of this, as far as
I can tell, was general confusion, and some weird things happened higher
up in the software stack, which probably tended to aggravate the issue because it
also made it less clear to the hardware designers what was really expected of them. So I
put up a couple of interesting consequences here that we found in this process, I guess
GCC is commonly used here. So if you look at what the GCC compiler provides, it actually
exports a primitive directly to the program that allows you to control this sort of memory
reordering. It has a call called __sync_synchronize which is intended
to ensure that memory operations are not reordered across this call; it's defined in the documentation
as a full memory barrier. On the other hand, if you look at its implementation on X86,
at least the last time I checked, it actually generates a no-op, which, it turns
out, by no reasonable interpretation of the X86 memory model, is actually what it should
do to be consistent with that specification. And the problem is, well, I'll hit that on the
next slide, it's actually not clear that this
is fixable. Just fixing the implementation here probably won't solve the problem, for
performance reasons. Another interesting situation, which at least to me seems strange
and for which I have never gotten a complete explanation
from Intel or AMD engineers, is that around the time of the Pentium 4, the architecture
introduced something called lfence and sfence, load fence and store fence instructions. On
the other hand, based on the understanding of the X86 memory model that most of us have,
but which wasn't clear from the description at the time, loads and stores on X86 in fact
are ordered anyway. So it's not at all clear what these do. So, as far as I can tell, for
most code, lfence and sfence on Pentium 4 are in fact something like hundred-cycle
no-ops. This is not completely true actually; there are some funny corner cases
where you might actually need these, which makes this more interesting. If you look at the
performance assumptions that went into this, the assumption here was that by limiting memory
ordering to explicit fence instructions, we only require the hardware
to enforce ordering of loads and stores precisely in those cases where we need it. So we should
always be ahead. In fact, that's also not the way it worked out, as far as I can tell.
If we look at the performance of these instructions, and I'm sort of picking on
this slightly obsolete chip here, which was probably the worst offender in some ways:
for a Pentium 4, the cost of one of these instructions that actually enforces ordering
was in general on the order of 100 cycles in the best case, that's assuming no cache
misses and so on. So that's kind of expensive. On the other hand, if I compare that to, as
far as I know, around that time there was really only one architecture that really enforced
sequential consistency at the hardware level, and that was PA-RISC. PA-RISC in fact was
reasonably performance competitive, which suggests to me that it certainly did not cost 100
cycles per memory operation in order to enforce ordering on the PA-RISC processor. So in fact
the theory here doesn't seem to have quite meshed with reality either. The other interesting
thing that happened is that we ended up with some memory models in which it was actually
not at all clear that the fence instructions that were provided by the hardware actually
could be used to get you back to full sequential consistency. And hence, if in fact
full sequential consistency is not implementable, it turns out you also can't implement the
Java memory model. So you can't implement some of the programming languages as defined
on top of it. So the example here illustrates why this might be an issue. This is, by now, a
semi-famous example, referred to as the IRIW, Independent Reads of Independent
Writes, example. So the idea here, this looks kind of complicated at first glance, but what's
going on here is that threads 1 and 2 are just independent writes to different shared
variables. And threads 3 and 4 read them in opposite orders and each convinces itself that
the other one was set first. So, thread 3 reads things in such a way as to prove conclusively
that X was set first, and thread 4 reads things in such a way as to prove conclusively that
Y was set first. The question is, can this happen?
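Here is a minimal sketch of the IRIW example; the variable names and the particular result values are my own illustration:

    int x = 0, y = 0;            // shared, both initially zero
    int r1, r2, r3, r4;

    void thread1() { x = 1; }
    void thread2() { y = 1; }
    void thread3() { r1 = x; r2 = y; }   // suppose this sees r1 == 1, r2 == 0
    void thread4() { r3 = y; r4 = x; }   // suppose this sees r3 == 1, r4 == 0

    // Thread 3 then concludes that X was set before Y, while thread 4
    // concludes that Y was set before X; under sequential consistency
    // that combination of results is impossible.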
The problem is that it's intuitively not at all clear that, just by inserting fence instructions
between the two loads in thread 3 and thread 4, we as a result preclude
this sort of outcome, because one could still intuitively think of a machine where thread
1 and thread 3 are somehow located close together, for example as threads on the same processor
core, and thread 2 and thread 4 are located close together, so they can see each other's
writes first. And just inserting a fence there that prevents those loads from being reordered
sometimes doesn't change much of anything. Yet this program is fully fenced; we
have fences between every pair of instructions. So it's not clear that we can do any more either.
Okay. Going on here, the good news is that, this has actually gotten significantly better
recently. If we look at the status now, as I mentioned earlier, Intel and AMD in fact
published much better descriptions of the memory model of those processors, which
clarify a lot of those issues. So in particular, if we look back at the example on the preceding
slide, the IRIW example, it actually turns out that on X86 we can get sequential consistency
for that example. On the other hand, somewhat surprisingly, and I think nobody knew this before
the specs came out, this actually requires that you implement the stores in
the first two threads, to X and Y, using an exchange instruction. So actually the
good news is that we have a solution; the bad news is that I suspect we still have a
bunch of JVMs out there that are in fact technically broken and don't follow the rules,
because they didn't know what the rules were at the time at which they were written.
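As an illustration of what that means in practice, here is a rough sketch of such a store using GCC-style inline assembly; this is my own illustration, not code from the talk:

    // Store 1 into the shared variable x so that the store behaves
    // sequentially consistently on x86: use xchg rather than a plain mov
    // (xchg with a memory operand is implicitly locked).
    int x;

    void sc_store_one(void) {
        int tmp = 1;
        __asm__ __volatile__("xchgl %0, %1"
                             : "+r"(tmp), "+m"(x)
                             :
                             : "memory");
    }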
In general, on other architectures, similar things are happening, and we
seem to have fairly good stories pretty much everywhere as far as I can tell. They either
have fairly good stories or we've convinced the vendors that they have to do something
about it. Okay. Moving up a level in the stack here, to languages and compilers.
The programming rules, I will claim, for dealing with threads in general
have been pretty unclear, in some languages more so than in others. Probably the language
that's in the best shape here is Java. The Java memory model,
which defines basically the semantics of shared variables in Java, was fixed in 2005, and it
turns out there are some ongoing discussions about some of these issues; in fact
the issue that I just talked about, about implementability on various architectures,
was one of those that came up since then. So things haven't completely settled, but
there is a mostly consistent story there. If we look at .NET and OpenMP,
my personal opinion is that there's a lot of clarification that could be used
there. And for the rest of this talk, as I hinted not too inconspicuously
in the title, I'll focus mostly on C and C++, where the assumption so far has been,
since these languages don't explicitly talk about threads, that I am programming with the
base language plus a thread library, generally for my purposes [INDISTINCT], because
that's backed by a standard specification. So let's look
at C and C++ and their programming rules. The general rule, I think probably everywhere,
though certainly that's the case now, in the past it was not completely clear
on the Windows platform I think, but in general the rule is that we disallow simultaneous
access to shared variables, what's commonly referred to as a data race. So we don't
get to access the same variable from two threads simultaneously if at least one of the accesses is
a write. That's, for example, the rule if you look at the POSIX standard. Unfortunately,
many people who program with threads in C++ don't look at the POSIX standard. But
it's fairly clear that that was intended to be the rule. This actually is a rule that
goes back fairly far. It was basically what was prescribed in the Ada 83 standard
and it was explored in the [INDISTINCT] thesis after that. It turns out that this solves
a whole bunch of problems. By just insisting that programmers not have data races
in their programs, basically you can't tell whether a lot of nasty stuff is going on under the
covers. So you cannot tell, for example, whether your compiler reorders loads and stores
underneath, because any program that would be able to detect whether two stores were
reordered, for example, would have to read those essentially at the same
time without synchronization. So any program that does that would have some execution in which
the loads and the stores occur simultaneously, so it would have a data race. So basically
it gives the implementation license to cheat without getting caught. Now,
this applies not only to compiler reordering, it also applies to the sort of hardware reordering
that I talked about at the beginning. Now, the hardware is allowed
to delay store instructions by storing to the write buffer, because you can't tell. Any
code that can tell is wrong, by definition. It also has the interesting consequence, which
simplifies the compiler writer's job significantly, that the C or C++ compiler may in fact
rely on the assumption that variables don't asynchronously
change out from underneath it. This in fact potentially has some fairly profound consequences,
which I think a lot of people don't appreciate, though this
seems to rarely occur in practice, which is a common issue here. So here's an example
of the kind of assumptions that a compiler is currently allowed to make and what can
happen if that assumption in fact turns out to be false. So we assume we have some unsigned
integer variable X, and then we check that X is less than three. And then, after doing
some other stuff that doesn't touch X, we execute a switch statement on X. Let's assume
that that switch statement is implemented by indexing into a branch table and branching
indirectly to one of the locations, depending on which case we're in. Since we previously
checked that X is either zero, one, or two, the compiler is allowed to elide the
range check on the indirect branch through the branch table and just assume
that X is within range. And that works so long as we follow the rules.
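Here is a minimal sketch of the pattern being described; the details and names are my own illustration, not the exact code from the slide:

    extern unsigned x;           // shared; some other thread may write to it

    void dispatch(void) {
        if (x < 3) {
            // ... other work that doesn't touch x ...
            switch (x) {                // may be compiled as an indirect jump through
            case 0: /* ... */ break;    //   a three-entry branch table with no range
            case 1: /* ... */ break;    //   check, because the compiler "knows"
            case 2: /* ... */ break;    //   x is still 0, 1, or 2
            }
        }
    }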
Now, what happens if in fact somebody doesn't follow the rules and asynchronously changes X between the conditional
and the switch statement? What happens if, say, X is changed to 5 there by some other
thread in the middle? What happens is that the switch statement will merrily go ahead
and use X to index into the branch table, indexing off the end of the branch table and branching
to whatever location is there, causing some weird stuff to happen, basically causing the
program to take a wild branch. So the important thing is that, in this model, by
introducing a data race where we weren't allowed to have one, we get something which in fact
doesn't correspond to just reading a different value from the shared variable. We get complete
nonsense at the end. And that's allowed by current standards. So, what's the reality
here? As far as I can tell, the common attitude is that data races in fact, aren't so bad.
In fact there was a paper published in [indistinct] fairly
recently about benign data races. So, I mean, clearly in some cases people believe
that data races are okay. For example, the kind of thing that people commonly do
is maintain approximate counters in ordinary shared variables by just incrementing
the counter in multiple threads without any sort of locking or synchronization, and then
reading the counter asynchronously. Since these are approximate counters that are only
supposed to be statistically valid, the assumption is, "Well, this may go wrong every once
in a while because you have a race, so you might actually lose one of the updates; on the
other hand, that's not so bad." Another fairly common idiom that people use is something
called double-checked locking. The idea here is that you lazily initialize a value, and
you avoid doing the expensive synchronization on every access to that variable
by checking a flag that says "has this been initialized ahead of time" and only doing
the synchronization if it appears to have not been initialized. It's another one of
these things that mostly works and is a fairly common idiom; on the other hand, it
involves a data race. In many environments, people actually combine this with atomic operations
that are somehow provided in platform-specific ways. So on Windows, there are the Interlocked
operations. I already mentioned GCC has the __sync operations, which are often used for
reference counting. And these things are often combined with reading the resulting value
using ordinary variable references.
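As a reminder, here is roughly what the double-checked locking idiom looks like; the type, the names, and the mutex interface are my own sketch, not code from the talk:

    #include <pthread.h>
    struct Widget { int value; };

    static pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
    static Widget* instance = 0;         // read below without synchronization

    Widget* get_instance() {
        if (instance == 0) {             // unsynchronized read: the data race
            pthread_mutex_lock(&m);
            if (instance == 0)           // re-check under the lock
                instance = new Widget(); // the pointer can become visible before
            pthread_mutex_unlock(&m);    //   the Widget's fields do
        }
        return instance;
    }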
All of this really isn't particularly well defined at the moment, and the read accesses
generally appear as data races to the compiler. So you have to worry about things like what
happened on the preceding slide, where you ended up taking a wild branch. Fortunately,
you know, this doesn't happen very often in practice, so we rarely get caught by this.
On the other hand, it seems fairly clearly against the rules. On the other hand, we have
the problem that locks are expensive enough that you can't quite tell people
not to do this, to properly synchronize their programs, because, especially for something
like reference counting in C++, which is fairly prevalent, the cost of doing this any other
way is in fact too high. Okay. Here's an interesting complication which always
enters, or very often enters, the discussion at this point when I bring this up, and that's: well,
if you share these variables and they can change asynchronously, shouldn't you
just declare the variable volatile? And the answer here, unfortunately, is sort of, it
depends. So, if we look at the current Java definition, after 2005 at least, then the
answer there is fairly clearly "yes". For something like double-checked locking and
some other idioms, that in fact works just fine. So in Java, if I have an initialization
flag there called x_init and I declare it volatile, and I then assign to X and set the
initialization flag, that in fact guarantees that X becomes visible
to other threads before x_init becomes visible
to other threads. So if I check x_init in another thread and then read X, I'm fine.
This is often implemented by putting a hardware fence instruction
before the volatile store there, and there's other stuff that's required to implement
this. I want to point that out explicitly because there's an interesting contrast here.
In contrast, if we are programming in OpenMP, here's a quotation from the OpenMP 2.5 spec,
and I think it's essentially unmodified in the 3.0 draft: a reference that modifies
the value of an object with volatile-qualified type behaves as if there were
a flush operation on that object at the next sequence point. Which essentially means that
in OpenMP the implementation should insert the fence after the assignment to x_init if
I declare x_init volatile here, and I haven't yet figured out why it says that.
I've had a lot of trouble figuring out any use case in which that's
actually what you wanted to say, but nonetheless it's clearly not consistent. If we look
at the [INDISTINCT] case, the relevant quotation here, because it makes it clearer
than what's actually in the standard, is a quotation from Dave Butenhof, one
of the people who wrote the standard there originally: volatile provides no help whatsoever
in making code thread-safe. And if you read the standard carefully, that's what
it says, not as clearly, but... So we have a bunch of really different
situations here. So, what's actually in the C++0x Working Paper? Basically, again,
largely thanks to Lawrence here, concurrent access to variables is allowed, but they
have to be declared specially as atomic objects. So, for example, you can declare a variable
as being of type atomic int, and then that provides operations
to concurrently access the variable safely. And for all purposes, concurrent accesses
to an atomic int are no longer viewed as a data race.
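Here is a minimal sketch of what that looks like using the atomics that eventually ended up in the standard; the exact header and spellings settled later, so treat this as illustrative:

    #include <atomic>

    std::atomic<int> counter(0);        // concurrent access is allowed; not a data race

    void from_any_thread() {
        counter.fetch_add(1);           // atomic increment
        int snapshot = counter.load();  // atomic read
        (void)snapshot;
    }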
That's the way this is dealt with formally. Other than that, the C++ Working Paper actually doesn't make much
change to what's currently the official status quo. It again reaffirms
that there are no benign data races in C++. The other
way to say that is, wherever you have a data race, you should be using atomics; otherwise
the behavior is undefined, and that includes taking wild branches. It's important to be
explicit here that the C++ volatile qualifier is in fact still of no help here, so we
haven't changed that either. There was a lot of discussion about changing this; it turns
out it's very difficult to do because C++ volatile in fact has some current semantics
which would be difficult to preserve if we also used it for this purpose. So that's why
the decision was made to instead invent something new and call it atomics rather than volatiles
in C++, even though that introduces this unfortunate translation problem between C++ and Java,
in that Java volatiles are C++ atomics, and C++ volatiles are not Java volatiles. And
it turns out there are in fact some other, more technical problems here that come up.
So far we said--we've told the programmer to not write code that includes data races.
On the other hand, it turns out that compilers in fact sometimes introduce code that adds
data races, which is rather unfortunate. There's one sort of fairly well known case
where this happens, and that's when you're updating a small struct field; I'll
give you a detailed example of that. It turns out this can also
happen in more complicated cases if clever optimizations result in speculative stores,
basically. If they result in writes back to locations that weren't written in the source,
the compiler has to be careful that what it's writing back is what it thought
was there before. On the other hand, in a multithreaded application it's not
necessarily safe to do that, because by just writing back what was there before, you may be
hiding an update from a different thread. Very occasionally this actually does happen, and
the annoying thing is that if you look at the current standards, it's not at all clear
that this is disallowed at least. So, for the struct field update example, which
is probably the one that bites the most people in practice: consider
the structure up there, which is sort of a particularly tricky case and which is in
fact, by the current working paper, mis-compiled by nearly every compiler. So, what happens
is I have a bunch of bit fields in between two character fields in a 32-bit struct. And
I assign one of the bit fields and one of the character fields simultaneously, from two different threads.
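A sketch of the kind of structure and the two conflicting updates being described; the field names are my own, not the ones on the slide:

    struct S {
        char a;        // character field
        int  b : 9;    // some bit fields in the middle
        int  c : 7;
        char d;        // character field
    };
    S x;

    void thread1() { x.c = 1; }   // the controversial bit field update
    void thread2() { x.d = 1; }   // plain byte store, uncontroversial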
The way that's normally compiled these days, particularly on 32-bit machines but
probably also on 64-bit machines, is that I copy the structure as a whole into a temporary,
I implement the bit field update by updating the bit field in the temporary value, and
then I install the result back to the whole structure. On the other hand, the character
field update in the other thread is implemented simply as a byte store, so that's fine and
uncontroversial. It's the bit field update there in thread 1 that's controversial. And
as I suspect most of you already realize, this isn't going to work, so we can actually
step through this here quickly. So if we look at how this can actually
be executed: we copy X into a temporary, we update the C bit field in the temporary, now
we assign to the D field in the other thread, and we copy the temporary back to X, and we've
just overwritten the update to D. So what happens here, as a result, is that I've lost the effect
of thread 2. So, as I already said, this behavior is actually something that's both allowed
and fairly common if you look at current standards; in fact, it's worse than that. This
is what implementations typically do. Implementations typically run into this problem with bit fields.
The standard actually allows this to happen even if I'm just writing to fields A and D
in the structure. Yeah? >> Suppose that X, instead of being a structure
with bit fields, was, let's say, an array of X.
>> Yeah. >> And let's suppose thread 1 wrote to array
element zero. >> Yeah.
>> Thread 2 wrote to array element 22. >> Yeah.
>> No one would expect those to be atomic because X is not atomic, but writes to elements zero
and 22 would be, so if you don't expect that when X is an array, why would anyone ever
expect it to be true in this case? >> Oh, okay. I--the question if I can summarize
it correctly, is why would you expect this to work because X is not declared to be atomic?
>> Everything [INDISTINCT] right? I would never expect that to work for the same reason
I would never expect the array case to work. >> Okay. In fact, I am not sure I quite understood
the array example; you're asking about simultaneous updates to two array elements?
>> I did not mean to blow up the talk, it is just that...
>> Yeah. >> It is just, I do not--I do not think...
>> Oh, okay. No--no. That is--this is a good question. I think it's good to be clear about
this. In fact, if you look at a lot of existing code, it makes assumptions along these lines,
usually in the form of: I have a structure and it has, say, these fields in it, and I
would like to protect C and D by different locks.
>> ...so [INDISTINCT] proposal. >> Just one rule so far or a slight [INDISTINCT]
never use [INDISTINCT] how about that. >> I...
>> I mean, it might be to follow that rule now.
>> ...I, in some applications... well, actually, let me postpone
that. What we do is actually somewhere in the middle, which I think is probably,
well, hopefully you agree it's a better solution, but... Okay. So here's a quick summary
of what we actually say about all of this in the current working paper. So, subject to
the no-data-races rule, and in response to that question, I should be clear that the
no-data-races rule applies at the level of scalar values, so you may not simultaneously
access the same scalar value from two different threads. So, in particular,
that does not automatically prevent simultaneous access from two
threads to distinct fields or distinct array elements. Subject to that, each update
affects a memory location, which the standard defines as either a scalar value or a contiguous
sequence of bit fields. The standard then defines exactly which assignments can be seen
by each reference to a memory location, which is essentially what you need in order to define
the semantics of shared memory here. For ordinary accesses, there has to be exactly
one assignment to a particular memory location that's visible to a particular access, to
a particular load. The way this is structured, if we look back at the preceding example,
is that a reference to x.d, after completion of both of those threads in the last example,
is in fact required to see a value of one. And because those two threads access distinct
memory locations, they in fact are not defined to be involved in a data race. So as a result,
the preceding implementation of bit field assignment, as I outlined it there, is incorrect.
So we no longer allow that implementation. Okay?
>> Is this supposed to address, you know, 32-bit values on 8-bit processors or anything
like that, or is that going to go? >> Thirty-two-bit values on
8-bit processors in fact work out generally okay, because of the fact that we don't allow
races. So that's in fact not an issue. >> Okay. I am fine.
>> We do not. There's an issue in the implementation of atomics on such processors.
>> My question like this... >> For non-atomic accesses, we do not allow
simultaneous access without some sort of synchronization between them. You can't tell
whether a particular 32-bit store is implemented in four pieces or in one.
>> Okay. Good. >> So, as a result of this, the preceding
implementation is not okay. On the other hand, assignments to the bit fields in the
middle in fact may not occur concurrently. Those two form a data race if I have
one thread assigning to bit field x.b and one to x.c, and hopefully
that answers the last question. So that's sort of a compromise solution. You can still
use bit fields, but they have weaker semantics than you might otherwise expect. But it only
affects bit fields; it does not affect the adjacent character fields. This, it
turns out, actually has some other, more subtle consequences, which are of interest primarily to compiler
writers. I mentioned earlier this notion of speculative stores introduced by compilers.
So here's an example of the kind of optimization that, abstractly, you would kind of like to
see in your compiler; on the other hand, it's not one that we allow anymore. So, this is
a simple loop that traverses a linked list and counts the number of positive elements in
the linked list. I should emphasize here that the count in this particular case is a global variable
that might be shared between threads. It's fairly common in current compilers to optimize
this to code that loads the initial value of count into a register before the beginning
of the loop, then executes the loop that traverses the linked
list, just incrementing the register value, and then stores the register value back
to the global at the end.
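A minimal sketch of the loop and of the transformed version the compiler might produce; the names and exact shape are my own illustration:

    struct node { int val; node* next; };
    int count;                           // global, possibly shared

    void count_positives(node* p) {
        for (; p != 0; p = p->next)
            if (p->val > 0)
                ++count;                 // original code touches count only
    }                                    //   when a positive element exists

    // Roughly what the (no longer allowed) optimization does:
    //   reg = count;
    //   for (...) if (p->val > 0) ++reg;
    //   count = reg;                    // unconditional store, even for a
    //                                   //   list with no positive elements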
This transformation is not technically safe with more than one
thread, because if, for some reason, I know that I'm invoking this on a list with no positive
elements, I know that the original code never touches count. So there is in fact no data
race involved if I run this concurrently with another thread that does update count. On
the other hand, the transformed version always stores to count. So the compiler
potentially introduces a data race by performing this transformation. There are ways to avoid
this in this particular example: if you perform the store
only if the register value is non-zero, that will generally take care of the
problem. Sorry, actually you have to do this a little differently: you have to count in
a local variable, you have to initialize the local variable to zero, and then, if you changed
its value from zero, you add it to the global at the end. That will work. Okay.
So the consequence of this is that we are losing some useful optimizations. On the other hand,
it does give the programmer a much simpler and more consistent story than what we
had before. It prevents some really mysterious compiler-introduced bugs. And it outlaws
primarily optimizations that we can at least replace by code that, as far as we can tell,
is only slightly slower. So I don't think this is going to have a major impact on code
performance. So that was sort of at the low level. Here's another interesting problem,
actually. Another one that Lawrence has been spending a lot of time thinking about. This
one is, unlike the previous one, C++ specific. So what's happening here
is, I think, fairly common: you write code which at some point during
its execution starts some daemon helper thread. So we have a main thread running. At some
point, the helper thread here starts up. And at some point then the main thread exits.
In a C++ program, when the main thread exits, various static destructors, destructors for
statically allocated variables, get invoked, and those variables go away. So at some point, say there's
a variable down here which is used by a shared library that's referenced from the
daemon thread. The shared variable that's used by this library goes away, because it's destroyed.
Now the problem is, in this model, the daemon thread is still running. There's nothing to
prevent the daemon thread from accidentally waking up sometime after that and trying to
call into that shared library, now accessing one of the static variables of the
library that has already been destroyed at that point. As far as I can tell, this is a fairly
common scenario in multithreaded C++ programs, which I suspect hasn't generally been identified
as a major problem because the process only crashes when it's about to shut down anyway,
and nobody notices that that much. But nonetheless, it's not a good state of affairs. For this
particular one actually, we don't have a, what I would consider to be a complete solution.
The current state of affairs in the working paper is that you basically have
two choices. You can either shut down all threads properly before process exit; that's
not necessarily an easy thing to do, because some of those threads may be waiting on I/O
and you don't have a very good mechanism for getting them to break out of the I/O wait.
The other alternative, and this is Lawrence's invention, is that there's a way to exit
the process by only executing certain special clean-up actions and not destructors, not
all destructors, at which point you sort of get a semi-graceful exit which is guaranteed
not to crash, however. On the other hand, some of the things that you expect to be cleaned up
might not be cleaned up. Okay. Another problem here. Shared libraries. Part of the story
in making all of this work is that, in order to prevent races in programs, we need to have
synchronization operations, like lock and unlock in this slide, that ensure mutual exclusion
and so on. And we would like them to prevent memory reorderings with respect to those synchronization
operations. So if we look at this particular example here, this is basically just incrementing
X inside the critical section. We certainly want to make sure that the implementation
of this synchronization is such that we can't just take, say, the load of X
into the temp variable and somehow reorder that out of the critical section. So we want
to prevent this reordering from happening, either by the compiler or by the hardware.
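A minimal sketch of the kind of code on the slide, written here with pthreads standing in for whatever mutex API is in use:

    #include <pthread.h>

    static pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
    static int x;

    void increment_x(void) {
        pthread_mutex_lock(&m);
        int tmp = x;                // neither this load nor the store below may
        x = tmp + 1;                //   be allowed to migrate outside the locked region
        pthread_mutex_unlock(&m);
    }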
There are a couple of mechanisms that are normally used for this. We normally prevent
this reordering by the compiler by neglecting to inform the compiler what lock and unlock
are at all. So they look like opaque functions to the compiler, and therefore it looks
to the compiler like lock and unlock might touch X, and therefore you can't
do this sort of movement of the memory operation, because that already looks
to the compiler like it might be incorrect. On the other hand, the other part of this
has to be that we have to prevent the hardware from doing the equivalent of that. We normally
do that by putting fence instructions into the implementation of lock and unlock. So
clearly this is disallowed. These fence instructions are typically somewhere between fairly
and very expensive. So we need to be careful that we understand exactly where they're needed.
On the other hand, so far it's actually been fairly unclear what kind of reordering
is actually allowed and, therefore, what kind
of fence instructions we need. So, if we look at Java, for example, it's fairly clear
we're not allowed to move things out of the critical section, as one might expect. We
are allowed to move things into the critical section in both directions. If we look at
the Pthreads standard and read it very naively, basically what it seems to say is that
we can't move things either way; these calls just prevent reordering across them. If
we study that really hard and think about what it really means, you might hope to prove that you
can do this and the result is not observable. It actually turns out you can't do that.
You can't move things in past the lock. And the reason is this example. And as I said,
"Don't ever write code like this." So the idea here is that I set some variable, X equals
42, and then say--say I am done with accessing variable X by acquiring a lock which is some
of the opposite of the way things is supposed to work. I then go ahead and wait until the
lock is set, which I can do by calling "tie lock" repeatedly. And then I claim that, at
this point here, it should be case that X is 42. The problem is, this should work, this
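A sketch of the (bad) idiom being described, written with pthreads; as the speaker says, don't write code like this:

    #include <assert.h>
    #include <pthread.h>

    static pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
    static int x;

    void* thread1(void* arg) {
        x = 42;                    // "I'm done with x..."
        pthread_mutex_lock(&m);    // "...signalled by acquiring the lock"
        return arg;
    }

    void* thread2(void* arg) {
        while (pthread_mutex_trylock(&m) == 0)   // spin until thread 1 holds the lock
            pthread_mutex_unlock(&m);
        assert(x == 42);           // relies on the x = 42 store not moving past
        return arg;                //   the lock acquisition in thread 1
    }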
The problem is, this should work, this
is expected to work in some sense, at least with sufficiently [INDISTINCT] or something,
but it breaks if I reorder the assignment of X equals 42 and the lock acquisition. So
that means that in this particular case, essentially, I can't move the assignment of X equals 42
into the critical section, past the lock. So what do actual implementations do? Well, we
looked at some open source implementations a while ago. I think, fortunately,
this has also probably improved since. So, on weakly ordered hardware, a bunch of
NPTL lock implementations tended to in fact assume that we could get away with that implementation here. So they in
fact tended to allow movement of things into the critical sections in both directions,
but not out, which in practice is probably okay, but technically incorrect.
Some of them, I think often just because the hardware gives stronger guarantees
anyway, implement this such that nothing moves either way. In some cases that's probably
true; that's stronger than required. I think on [INDISTINCT], this is probably too slow.
Some of them did the right thing. And we found one interesting case that did something which you
don't want to use. I think that's about fixed. Okay. The C++0x solution to this is
slightly weird. So movement into the critical section is allowed in both directions. So
how do we deal with that funny example that we didn't want anybody to write? Well, we
tell you that trylock may in fact fail spuriously, which makes the example that
I have up there clearly incorrect. So it still misbehaves, but I no longer have a memory consistency
problem, I just have a wrong piece of code. Which actually was the goal here, I claim, because,
as I said, this is not code you should be writing anyway. And I know of no
cases, though I'm told that there are some really obscure cases, where this actually interferes
with legitimate code; all uses of trylock that I've ever made continue to work under this
interpretation. So, together with all the other rules, the important point from the
slide here is that, by using this slightly weird definition, we basically allow the standard
to guarantee that memory operation reordering is completely invisible
to the programmer for data-race-free programs. So even if you use trylock or
whatever, you don't have to--the bottom line is you don't have to worry about all this
reordering stuff that we've been talking about here. That's not quite
true, because in fact there are some low-level library facilities that actually allow you
to write really fast hand-tuned code that violates this rule. However, it's very clearly identified
when you're doing that, and so long as you don't do that, you get that guarantee. So
without doing that, basically, data-race-free programs behave as though the steps of
the individual threads are just interleaved. Let me see. I have a couple more quick slides
here. Libraries, in general, that don't deal specifically with threads: I think there have
also been some missteps here. The general wisdom about locks and, for example, general
purpose container libraries, I think it's safe to say, is that if you're accessing a general
purpose container library in multithreaded code, the client knows whether or not that
container is being shared between threads, so the locking should occur in the client. This
doesn't apply to things that are specifically designed for threaded use; there the rules
may be different. If you don't do that, we have performance problems and
possibly deadlock, and there's a lot of experience to substantiate that at this point, I think.
On the other hand, if you look at what we actually have, it's sort of a mixed bag,
because people were learning at the time. So the original Java collections, like Vector,
are synchronized implicitly and violate this rule. And on the Posix side, in fact, we have
this problem as well. There's this interesting phenomenon that if you write a simple file
copy program using putc and getc in Posix and time this on your favorite Linux
machine, if you fork an empty thread at the beginning of this copy program, which immediately
returns and does nothing, it slows down by about a factor of 10. That's because Posix
requires that putc and getc implicitly acquire and release a lock. So they're [indistinct]
safe in a not very strong and usually unnecessary sense, and that behavior is triggered,
on Linux at least, once you've forked the initial thread.
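For what it's worth, the experiment being described is roughly this; my own sketch, with error handling omitted:

    #include <stdio.h>
    #include <pthread.h>

    static void* empty(void* arg) { return arg; }   // does nothing

    int main(void) {
        pthread_t t;
        pthread_create(&t, 0, empty, 0);   // forking this thread makes putc/getc
        pthread_join(t, 0);                //   take the locking path from then on

        int c;
        while ((c = getc(stdin)) != EOF)   // each call implicitly locks and
            putc(c, stdout);               //   unlocks the FILE
        return 0;
    }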
So, the C++0x solution here is that we basically follow what's sort of the STL convention and the recent Java convention,
which is that containers do not visibly acquire a lock
by default. The details here still have to be determined; we're not really done with
this part yet. So where does this leave us? The one thing that this has taught me,
at least, is that if we start out with a single-threaded language and add threads as an
afterthought, that, in general, doesn't work all that well. I think we really do need
to get the remaining problems that we introduced as a result of this fixed. It's quite
likely that there actually is a better general purpose parallel programming model, at least
for certain applications. I'm not sure there's one that's quite as general. On the other
hand, I think it's useful to be able to fix the threads and locks programming
model underneath, if for no other reason than that we can actually get a reasonable evaluation
of the alternatives as a result. And these alternative programming models
generally use threads and locks underneath in the implementation anyway, so we'd better
get that right. Okay, thank you. Any questions? >> I know in Java there's a concurrent
[INDISTINCT] which basically tries to put things into separate HashMaps in order to
improve performance when it's being accessed concurrently. Where does that fit into this idea that the
client does all the locking, because it does not have visibility into the internal implementation
of the data structure? >> I was trying to be fairly careful to
weasel out of this initially... There are clearly some data structures that will
need to know about threads specifically, and that's an example of that, things that are specifically
designed for multithreaded access. So in that particular case, it's not a data structure
you should ever use if you only have single-threaded access. So the idea is that the interface,
basically, is designed for concurrent access and as a result, the implementation can take
advantage of that. That certainly makes sense. So I'm all in favor of providing that sort
of facility in addition to the basic containers. >> Earlier in the talk, you said certain kinds
of race conditions are common, but failures from them are fortunately rare. I was a little
surprised to hear you use the word "fortunately". I've been writing concurrency bugs for several
decades now, and I am always happy when they show up early and sad when they show up late.
>> I--I actually agree with you. I was not being entirely serious when I said it that
way. >> I wonder if there are any facilities contemplated, planned, or recommended for dealing with
the things you didn't want to do but did anyway. You know, when I write those concurrency bugs
for the hundred and first time, to help me find them sooner. Are there language features
or processor design features or programming practices to help?
>> We've been thinking about this a little bit. I think it's clear that the eventual
story here has to involve a lot of help from tools, and some of those are out there. Where
we had a choice, in some sense, we've tried to make things a little bit easier for the
tools. So the sort of absolute prohibition against data races here in some
sense makes things a little bit easier for data race detection tools, because if you can
detect that you actually have a race, you have a bug; there is no notion of
a false positive in that sense. But I think that's mostly an orthogonal issue. It's an important
issue, but I don't think we have any great answers there yet.
>> Fairly early on, one of your first slides said that some of the objections to threads
were that they're dangerous and, I forget, you know, "threads are evil" is the thing
you put up there. Particularly considering the previous question, you haven't
done anything to convince me otherwise yet. I still don't see any reason to trust
anyone in this room, particularly including myself, to write threads correctly. I mean,
not to implement them but to use them correctly. Is there anything going into the
C++ standard to help address that, to help us trust more people to use these incredibly
dangerous tools? >> I think there's a consensus that there
is a serious problem there and that probably most programmers should be using higher level
libraries that are built on top of these. So the problem at the moment, as far as the
C++ committee is concerned, is primarily one of timing. And that is, we were under a relatively
short deadline and we have, hopefully, enough time to get the
foundations right. We unfortunately don't currently have enough time to get those other things
in, but everybody realizes that those are important as well.
>> I just want to address that last question, too. It is true that threads are still hard
to use and evil, but if you don't know what your threads mean, then it is completely
and utterly impossible to write correct code, not just merely hard. So this is
a really important [INDISTINCT] to actually define what it means when a thread
actually executes. Thanks. >> Any other questions?
>> If I could get there. >> Oh, okay.
>> Which libraries would you like to see implemented but obviously don't have
time to implement? >> We haven't invested as much
effort into this as I would like. The kinds of things that have been under discussion
are probably only sort of one level up and maybe not high enough. I mean, we've
been looking at thread pools and Threading Building Blocks kinds of approaches.
So Intel has proposed the Threading Building Blocks library to the standards committee,
and that was sort of pushed back for timing reasons. I think that's probably not high
enough. I mean, there are lots of small things that need to be added as well.
There are some interesting things we keep stumbling over, like getting reference
counting right. It's really hard. Writing it and really having it optimal is really hard, and
that's certainly something that should be provided in the library. And there
are a bunch of things like that. >> So my question is about shared variables
and can you make them more safe by disallowing things like having their address taken, so that
it is easy for analysis to catch bugs?
>> We haven't really been looking at disallowing things like that. I mean, certainly, you're right, you can make
the analysis easier by not using those features. I haven't heard much of an argument to actually
disallow them in the language. I think it would be problematic to do that at this stage,
but unfortunately... >> [indistinct] make analysis much more usable?
And once you take addresses of shared variables, anything [indistinct].
>> Yeah... >> It's hard to track them around.
>> On the other hand, I mean, you certainly can detect violations of that rule, and there's
nothing to prevent you from warning about violations of that rule.
>> Right. >> So you're free to do that anyway,
in some sense. So I would be worried about trying to convince people
to actually restrict the language in that way, because it's very hard to convince the
committee to outlaw something that's in fairly widespread use, particularly if you can
work around it somehow. >> It seems like the Java world has spent
an awful lot of time putting things like concurrent container classes and that sort of stuff,
into their standard library. Do you ever see that happening for the C++ standard? And if
so, how many years are we talking? Is this, you know, five years, ten years?
>> It's actually conceivable that this may happen in a shorter time period. I'm
not sure. Probably Lawrence, who's sitting there, has more experience
predicting how the committee behaves. But there is an effort to put out another library
technical report, which is sort of a non-binding addendum to the standard, on a fairly short
time scale. In fact, at some point there were discussions of
possibly getting that out earlier than the actual next standard, because there are fewer bureaucratic
hurdles that it has to go through. So it's conceivable that it might happen within the next
two or three years. But I'm not very good at predicting these things.
>> Historically, the compiler development process usually involves something like: write
a compiler, convince yourself it's close enough to right, send it out the door, somebody finds
a bug, sends in a bug report. Certainly, one of the problems with concurrency-related bugs
is that they're very often unpredictable and it's hard to put your finger on the bug. And I
wonder if you have recommendations about things a compiler writer can do to help verify that
they're following the concurrency-related rules, to reduce the number of these very hard
to pin down errors that the compilers are introducing.
>> I think that's an interesting problem. It's probably also still a research topic,
unfortunately. I think some other people have been working on building infrastructures for
testing concurrent programs, to try to sort of increase the cross-section of these bugs, but
I don't know of any good solutions. I mean, in some cases we were trying
to sort of minimize the number of possible untestable bugs. So, I mean, the absolute
prohibition against data races, again, is probably a good thing, because my feeling is
that current compilers make the assumption that there are no asynchronous changes, and
those assumptions, sort of rarely and in fairly unknown contexts, actually impact the generated
code. So even though violating that assumption would only rarely actually cause the
sort of behavior that I illustrated with the switch statement, I think testing for its
absence would be absolutely hopeless. But, I mean, other than that, I don't have any
great ideas. It's still a hard problem, I think. Okay, thank you.