Follow US:

Practice English Speaking&Listening with: Lec 1 | MIT 6.172 Performance Engineering of Software Systems, Fall 2010

Difficulty: 0

The following content is provided under a Creative

Commons license.

Your support will help MIT OpenCourseWare continue to

offer high-quality educational resources for free.

To make a donation, or view additional materials from

hundreds of MIT courses, visit MIT OpenCourseWare at

PROFESSOR: Welcome to 6.172, Performance Engineering of

Software Systems.

This is a fun class.

It's all about going fast.

We all like to go fast--

well, not all of us.

But everybody in this class likes to go fast.

Otherwise you won't like the class very much.

It's all about making stuff go fast.

So a couple of brief announcements.

We're going to be videotaping for OpenCourseWare.

So if you don't want to be videotaped, you should sit it

in a place where the camera won't catch you.

Because we're in such a large classroom, I'd like everybody

to sort of every day, without me having to draw--

I'm not really an artist.

My sister's an artist, actually.

She's a painter.

But maybe that gene escaped me.

It went that way and not this way.

So I don't want to have to draw it up again.

So everybody, try to confine yourself to this sort of

quarter of the classroom.

And that way, it'll feel more like a size room that

we ought to be in.

Let me go to our course information sheet.

Did everybody get a copy of the

course information handout?

So we're going to the class-- we're going to start at the

beginning with the Administrivia stuff, try to

run through that as quickly as we can.

And then my co-lecturer, professor Saman Amarasinghe--

if you can't pronounce or spell his last name, you can

just say "Saman"--

will give today's lecture.

And we have four teaching assistants.

So teaching assistants, please stand up.

They are Josh, John, Reed, and Eric.

And let me tell you, what they have done in the last few days

to get us ready for this course is nothing short of


So you really need to accord them your respect, and they're

going to really work hard to help you folks do

well in this class.

We have, I think, a pretty decent ratio.

I think even with concerns about cutbacks, et cetera, to

have four TAs for this is really good.

And we need it, because it's a software class.

And software is, as you know, bugs can take an arbitrarily

long time to find.

You guys can sit.

Besides our teaching assistants, we have two

support staff people helping us, Mary McDavitt, who works

for Saman, and Marcia Davidson, who works for me.

And they're generally pretty good about being able to track

us down if we're in some far corner of

the world or something.

Our lectures are going to be held on Tuesdays and Thursdays

in this room, 2:30 to 4:00 PM.

You should plan to attend regularly.

We dispense all kinds of wonderful gems that you will

miss if you think you're going to get it just from reading

and even looking at videotapes or whatever.

I'm not sure what our arrangement is going to be

about having videos available.

OK, they're not going to be available, except unless

there's something really extenuating circumstance.

You have to go into a hospital to have your forebrain removed

or something--

something serious, OK?

All material in the lecture, even if it's delivered oral,

is fair game for projects and quizzes.

We will post the lecture slides, but they don't

generally convey all of the information.

And we're going to be

video-recorded, as I mentioned.

If you don't want to appear, sit behind the camera, and you

won't appear.

We have no recitations in this class.

However, the teaching assistants will periodically

hold primers where we will bring you up to speed on

various technologies that you may not be up to speed in,

such as C. How many people have programmed in C before?

And how many have not programmed in C before?

A few.


Tools such as VTune.

Who's used VTune before?

Who has not used VTune?


Pin-- who's used Pin?


Who has used Cilk++?

OK, a couple people.

So you see, we're going to have some primers to bring you

up to speed in those kinds of things.

We don't expect that you came in knowing any of those

things, particularly.

We're going to be using Stellar for our course


And here's the URL.

And you should make sure that you're registered as a member

of 6.172, because otherwise, you're not going to get any


We're going to do all of our

administration through Stellar.

So make sure that you're registered there.

At the end of today, I think it is, we'll send out an

announcement to the student mailing list.

If you don't get something by the end today--

so you wake up tomorrow morning, you have no email

from the course staff.

What that means is you should email us.

AUDIENCE: Well, that mailing list would be the

preregistered ones.

PROFESSOR: That'd be the preregistered ones.

And then that way, we'll get all the people.

So if you know you're not preregistered, you can send us

email before then.

There are two one-hour quizzes given during class time.

AUDIENCE: One and a half.

PROFESSOR: What's that?

AUDIENCE: One and a half hour.

PROFESSOR: Oh, you're right.

It's one and a half hour quizzes.

Yes, two one and a half hour quizzes.

We'll correct that in the online version.

Given during class time.

Can somebody, by the way, correct this as we go along?


The dates are on the course calendar, which is available

on Stellar.

We've had mixed results with the ICS feed from Stellar, but

you're supposed to be able to download the calendar to your

iPhone or other ICS-capable devices.

But we were successful in downloading, but all the times

were screwed up.

But anyway, hopefully, maybe they'll allow that.

But anyway, the calendar is there.

The quizzes will generally be closed book and closed notes,

but you will be permitted crib sheets, one sheet that you

summarize anything that you feel that you need to know.

Generally, everything that we cover in here is fair game for

the quizzes, including things that are in

prerequisite courses.

There is no final exam because we have a final project.

This is a fast-moving class, so if you get caught behind,

it's not a good sign, because it's really hard to work

double-hard to keep up in this class.

You really have to keep on top of things.

And so for that reason and others--

because it's really difficult for us--

it's fast moving for us as well--

we generally cannot accept late projects.

If you haven't been able to complete a project, you should

just package up your partial solution and hand that in, and

you'll get partial credit on it.

Do not just say, oh, I'm not going to hand something in,

and expect it to get it done later.

If you do find yourself in an unusual situation which you

think there are some extenuating circumstances,

like the aforementioned brain surgery or something, let us

know in advance.

So if you're going to get in a car accident, it's really

helpful if you tell us the day before so that we can plan.

Obviously can't always do that, but if it is something

for which there is any reasonable expectation it

should be done beforehand, you should let us know.

And we may require, for those kinds of exceptions, some kind

of confirmation from a dean or medical professional.

So if that's the route that you're in, you can plan a

little bit and make sure that those confirmations

are on their way.

The grading, this is essentially approximately how

we're going to grade.

As you can see, there's a bunch of content in the

projects and a bunch in the quizzes.

And then there's also participation grade.

And the participation is both in--

we will have design reviews and such, so that's part of

the participation, as well as asking

questions and such in class.

And because we're video-recording everything, we

know exactly who's asking questions, unless you choose

to sit outside the video zone.

In addition, if there is a significant missing element to

your work, it doesn't matter how well you did on the other

things, you will receive a failing grade.

You must, essentially, do all the work.

So if you do not get substantial credit on the

final project or any other two assignments, for example, you

can expect to receive a failing grade.

It goes downhill very quickly.

If you're missing one assignment, generally don't

expect to get higher than a C, for example.

It goes downhill very, very quickly, because that's what

this is all about is hands on.

What's permissible in this class?

So it's your responsibility to satisfy both the letter and

spirit of the rules.

So if you have any questions, let us know.

Here's what the rules are.

Also, let me just say how Saman and I treat this.

We have been involved in cheating incidents off and on

over the years.

I've been at MIT almost 30 years.

Saman, you've been here--


PROFESSOR: 13 years.

So we've been here long time, longer than you folks.

We've been through this a bunch of times.

It's really a pain when somebody cheats.

It robs me of a huge amount of my time that I could otherwise

be spending on teaching and students and my own

professional development.

However, we both take cheating incidents extremely seriously.

So we have this thing that is sometimes really inconvenient

called integrity.

And I'd rather, if I find somebody's cheating, just give

you an F and not deal with it.


Instead, when somebody's cheating, I tend to want to

take it to the Committee on Discipline, where you face

things like expulsion, et cetera.

So I don't like to do that.

It's very time consuming for me.

But I find it hard to just look the other way, because I

view the issue of integrity as protecting the vast majority

of students who are obeying the rules.

So when a cheating incident occurs, I sort of feel honor

bound to push it to its proper conclusion.

So please don't cheat.

We will use a variety of means at our disposal, including

technological ones, to detect cheating.

Also, if you think that sharing your results with

somebody else means that they're

cheating and you're not--

in some cultures that's the way it's treated--


In this culture, in particular the culture in this classroom,

they're both equally guilty of academic dishonesty, whether

you are the giver or the taker.

Equally, we treat that as equal offense.

So here's the particular things.

You may not share, generally, your solutions with people who

are not in your group.

So we will have a bunch of group projects,

generally in pairs.

And that includes both people in the class and people

outside the class.

So generally, you keep things to yourself.

When you're in a group, of course, you can share ideas

within the group.

But we expect that everybody's making a fair contribution.

So you should be concerned if you are not holding up your

end of your group participation.

And we're going to ask you to describe the contributions of

the people in your group.

Generally, the group, as I say, is two.

So it's basically saying, what did you do?

How did you divide up your project?

What did you do?

What did the other person do?

You generally can't share it with anybody else.

You can't copy or transcribe solutions from other places.

The work you submit must be your own.

Pretty straightforward.

You may go out and use general conceptual material, such that

you might get from a good textbook or from Wikipedia or

someplace like that.

That's perfectly fine, resources you would ordinarily

have available.

But if you do use any material from an external source, you

should properly cite it in normal academic that fashion.

Cite where it is so that somebody else can go and find

and look to see what it is that you're citing.

If you feel that you have transgressed in

some way, let us know.

It always goes way easier.

I generally tend to be kind of a nice guy when somebody comes

to me saying, oops, verses I hear from somewhere else or

one of our tools detects something else.

Then I'm kind of cranky.

So I'd rather, if you come to me, we could work something

out, generally.

Most of your out-of-class time will spent

completing six projects.

And there's an outline of what the projects are.

They're all fun projects.

You can ask the TAs.

The TAs will tell you how much fun they are, because they've

generally helped to design and built their own

solutions to them.

Some will be done in pairs, some individually.

The final project can either be done in pairs or in threes,

I think we said.


PROFESSOR: Yes, we also have-- we didn't put that in here.

We also have a Project 0, which will be available today

after class.

So you should go to Stellar and get Project 0.

Project 0 is not to be handed in.

It's getting warmed up with all of the machines and tools

that we'll be using in the class.

We have been very fortunate to have very nice contributions

from Dell and Intel of a new Cluster facility that has 16

12-core machines.

And they're Intel Core i7 machines.

Each chip has six cores, and there's two cores per machine,

each with 48 gigabytes of memory.

So these are kind of honking modern facility.

So as I say, it's fun.

It's fun when you do something on your laptop, and then you

go and you do it on one of these machines and

it goes like this.

It's really fun when these things go fast.

So Project 0, you should start on today.

And it's basically the handout called Getting Started.

It will be available on Stellar.

And it will take you through getting access to these

machines and so forth.

In particular, in order to use these machines, you must have

a CSAIL account.

Who does not have a CSAIL account?

So that is going to be the first thing you need to do,

because you've got to get that.

So what you want to do is get

registered for a CSAIL account--

and we give instructions in there.

It's basically going to a web page--

pretty much as soon as you get out of the classroom.

Because you won't be able to start anything until you have

this CSAIL account.

And hopefully, we'll be able to give you access within a

few hours after you register.

Yeah, question?

AUDIENCE: What operating systems are the tools--


We'll be using Linux.

So there are a variety of ways that we describe for how you

use these machines remotely.

We have the Andrew File System--


which you can use so that you can edit things locally but

have an SSH terminal window on the machine, which makes it

very convenient.

On the other hand, for some people, often some dorm rooms

with poor connectivity, doing things by AFS may be

too slow for you.

But there are a variety of other ways that you can-- you

can actually edit right on the machine.

And there are a variety of ways of having a repository on

your machine, because we'll be using Git as

our versioning mechanism.

So you'd be able to have it there and be able to make

copies back on the other machine.

So there are a variety of ways to work.

And you're free to make the software run on your own

laptops or workstations.

The one thing I'll say is we will not support that.

We don't have the resources to support beyond using the

machines that we actually provide.

I think I've got to go faster here.

So let's see.

So software engineering.

So research in programmer productivity has demonstrated

that pair programming, where two programmers sit together,

one at the keyboard and one looking on, is actually faster

than two programmers coding separately.

Why do you suppose that is?


AUDIENCE: You don't get stuck.

The other person can tell you-- it's like when you're

[INAUDIBLE] writing on the blackboard, you get nervous.

You don't know what you're doing.


PROFESSOR: Yeah, you could be stuck.

What else?

AUDIENCE: You don't procrastinate.

PROFESSOR: Yeah, you don't procrastinate.

That's a good one.

Don't procrastinate.

AUDIENCE: Finding bugs.

PROFESSOR: Yeah, finding bugs.

So the main one is finding bugs, that the person who's

looking on finds bugs.

In fact, I was just pair programming with Reed the

other day, and we were trying to figure out, why is this

thing not working?

And I say, gee, it looks fine.

It says, next y equals x.

And he says, no, next y is supposed to equal y.

And it's like, we had both read over that code, but as

soon as I spelled it out, I didn't even notice.

But of course, him in hearing it says, oh, it's supposed to

be x and y and y, not x and y.

So anyway, that kind of thing, you pick up very quickly.

It makes it very easy to debug.

Likewise, regression testing demonstrably promotes fast


So this is where you build a battery of tests that includes

tests for every mistake that you've ever made.

And it can also be helpful to write unit tests, where you

test individual ones.

And you should also use tools like the assert package.

And some of these things will be taken through in Project 0.


We have recruited senior software engineers in the

region to share with you their invaluable knowledge and


We call them Masters In The Practice Of Software Systems

Engineering, which has the great acronym MIT POSSE.

These are people who are not being paid.

And they're senior.

They're doing this because they're volunteering, because

they want to help.

You have a lot to learn from these people, and we will have

code and design reviews with these people with your

assigned master.

You want to treat them nicely.

They're not doing this so that some snotty-nosed kid can say,

I don't care about you.

That's not what they're doing.

They're there to help you get a good grade.

And these are some really talented people, people who

are very famous in their own areas of expertise.

So be gracious, is the main thing I would say.

Be gracious.

Let's see.

The assignments.

You can read about the assignments.

Basically, generally, what we're going to do is have a

beta submission in which we're going to test everything.

And we're also going to take all your tests.

We're going to throw all the tests into a big pool and run

everybody's tests against everybody software.

You will lose points if you fail some of the tests,

assuming the tests are correct tests.

You will lose points if you have an incorrect test.

You will lose points if somebody passes your test and

they have a bug.

You will get points if your test picks up something that

very few other groups tested for.

So you have to both submit code and submit tests, which

is OK, because you've got to test your own stuff before you

hand it in.

But we're also going to be using it to test each other.

After beta, we will have a design and code review with

your master.

And let me just say, the master's there not affecting

your grade in any way.

They do give us reports about what happened in their

meeting, but they have no bearing on the grade.

They're just there to help.

So you don't have to view them as--

they're not part of the course staff.

If you want to get angry at somebody about how things are

done, get angry at Saman and me.

Don't get angry at them.

The final submission is generally due a couple weeks

after the beta.

And it's after you've gotten feedback on your design,

feedback on things.

We'll give you the whole battery of everybody's

regression tests, so you can test out your code against

everybody's regression test and use that.

And then generally, the scoring is such that people

who get fast code get high marks.

People who get slow codes get lower marks.

If you need help, [UNINTELLIGIBLE].

Speed is fun.

Enjoy the class.

Yeah, question?

AUDIENCE: In terms of the bug testing, I'm

thinking of bugs like--

I don't know--


or something like that.

If you happen to not have made a bug in your own program


would that count against you in the final testing.

PROFESSOR: You have to view your beta test as testing

other peoples' code also.

PROFESSOR: But if you [UNINTELLIGIBLE] a test that

finds bugs in somebody's program, and if very few

people wrote that test, then you will get points.

So you don't have to test for the world, everything.

If you fail somebody's test, but not yours, you will be

deducted points.

If your test finds other people's bugs, you will gain

points for that.

But on the other hand, if you don't have a test for--

what you said, I think you don't have to go and test for

every possible condition in there.

The negative side is, if you miss something,

we will deduct points.

Positive side is, if your test finds somebody else's bugs,

you'll get points for it.

We'll go into detail exactly how [UNINTELLIGIBLE].

Another thing we did last year, which was kind of


I think there were mixed--

some people really liked it, some people

were not that happy--

is a little bit of competitive grading.

So here, what we are doing is not the full grade--

small part of the grade.

What you do is when you first submit your beta, rerun

performance tests of everybody.

And of course, one of you are going to get the

fastest code for that.

At that point, we are going to say, OK, this

is the fastest code.


everybody else with that code?

And the person who gets fastest code for that part of

the grade gets the full points.

And everybody else will get a partial part of the points.

However, you get a chance in your final submission to

basically improve your code.

If you do better than the last time's best code, you won't

get any more additional points, because then we don't

want to create [? an arms race. ?]

So you at least had to match the previous

times to get full points.

So you get a chance to see how you stand against your peers

and get a chance to actually [UNINTELLIGIBLE] math with

your peers.

So first time we did it, we said, OK, it's the fraction of

your performance against the performance of the best code.

The problem was, there were a couple of codes that 1000x

performance improvement, and everybody got 10x.

So you've got, like, 0.25 grade.

That didn't help.

So now we actually are doing a logarithmic log scale.

So we'll get the fraction, and we'll take the log off that.

So there will be a little bit more equality.

So you will know how worse off you are, and you get a chance


Because it is very important, because a lot of times, there

might be a crucial thing that you missed.

And instead of trying to be, "aha, got you" in this class,

we are trying to give you a chance to actually learn.

So if you missed it, OK, you missed it the first time.

But you get the second chance to come and fix it and see

whether you can do well.

And of course, if you do better than the best grade,

there are bragging rights, but you don't get extra points.

With that, let me--

where are we?

So again, as Charles pointed out, I'm pretty excited about

this class.

This is a really fun class.

Unfortunately, I'm a little bit under the weather.

So I'm going to sit down.

And if I don't sound that enthusiastic, hopefully I'll

get enthusiastic later.

And that's one reason we ask you to come forward, because

we don't have to come here and shout.

So hopefully people in the back can here.

If you don't, still, there are some seats in the front.

You can move forward.

So for the rest of the class, what we want to do is go

through an example and look at why performance matters.

For that, I selected matrix multiply.

And I think your Project 0, you are going to also play

with matrix multiply.

And I think that you'll get a feel for what kind of things

will happened and how much room you have in

doing well in here.

And at the same time, as I go through this, you'll get a

feel for what's the process that you might be following

through the class and what kind of things--

because linear programming, after taking a bunch of

programming classes, probably, you have a methodology set up

in your mind, how to go about doing that.

And the same thing after this class, we want you to have a

methodology of going through and improving performance.

And this is basically a little bit of a contrite example, but

you'll get a feel.

So what's matrix multiply?

It's a very fundamental operation for many


So you are going to produce Matrix A by multiplying Matrix

B and C. So you want to calculate each value in here

to take a row of B and a column of C, and you get each

element of A. And you do that for every element of A. And

this is the possible simplest code you can write in a

language like C, a [? three-loop nest, ?]

and this matrix multiply.

So, so far, things should be pretty simple.

So what I did was I looked at matrix multiply and said,

look, you have taken software engineering class.

You have thought about object-oriented programming,

lot of nice techniques to build very

useful, portable software.

And let's try to write matrix multiply with

some of those concepts.

So the first thing I want to do is matrix multiply write

object oriented using immutable classes, and that,

in fact, I want matrix class to represent both

integers and doubles.

So here's my class structuring here.

And actually, I'm going to show you a bunch of code now,

next couple of slides.

I'm not expecting you to follow everything


I'm going to go through somewhat faster.

But you have the slides.

So I always expect you to, perhaps, in some free time, to

go and look at some of the slides, get a little bit of

better feel for what happened.

So here's what matrix multiply does.

I have a value class that figures out the matrix type.

And if it's integer, it keep integers.

If it's a double, I keep a double.

And here, basically, do this value in here.

And then I have a matrix class.

Basically, given a matrix type and a matrix, it can create a

matrix here.

This basically creates your matrix in here by first

creating a bunch of rows here, and then instantiating each

row in here, whether it's integer or double, depending

on the type you want.

And then, of course, there are other things in matrix,

basically, in here.

And you can update a matrix.

This is going to be a little bit interesting.

I will give a little bit here, because what I'm going is I'm

doing immutable types.

That means I can't change anything in the matrix.

If you are updating, you have to create a new matrix.

So I'll show a little bit about how to

go about doing that.

And then there's this row class.

I'm going through this fast, because

it's not that important.

If you get a chance, go through this and get a feel

for what I'm doing.

But the idea is doing this very software engineering way

of implementing a matrix multiply.

So then here's my matrix multiply in here, that what

I'm doing is I'm updating each of these matrices by giving a

value of B and C and adding it to A and

updating A again in here.

So when I run matrix multiply, 1,024 by 1,024 matrix, this is

the time I got.

Actually, it took a long time, in fact.

And the key thing, is the performance good?

That's the first question to ask.

Because you run, you get a result.

When the result is correct, OK, great.

You get the correct result.

But how do you do?

And this is sometimes a very hard thing, to know that you

have a problem is itself a huge win--

that you figure out you have a problem.

And what I'm going to do is go through a little bit of a

calculation to see how we did it.

It took about five hours to do this matrix multiply, and that

doesn't look right already.

So if you think about matrix multiply, it's an N-cube

operation, basically, what matrix multiply is doing.

That's the algorithmic [? class. ?]

If you look at Charles' book, that's what it would say.

It's N-cube operation.

So this is the number of operations you want to do.

And for each matrix operation, you only do three index

updaters, add, a multiply, and a check--

probably six ops to do, basically.

And that means I need to do this many operations.

Basically, you can think about machine operations to do,

basically, because you have to read three values, update one,

do a multiply and add, and do a branch.

And if I look at that with my hours, I am doing this many

operations per second, which looks high.

Big number.

But my machine, actually, on my somewhat old PC, transfers

even much faster.

So if you think about it, I am taking about 8,300 cycles to

do any kind of a machine operation.

This doesn't look right.

So at this point, this is what I would call a "backup

[UNINTELLIGIBLE] calculation." I haven't proven anything.

This one worked well in the theory class, but if you can

just jot down a couple of these things, you can get a

feel for where you stand very fast.

And then you realize, OK, I am a couple of orders of

magnitude off here on what best I could do.

And that should give you a feel for what's going on.

OK, so what's going on here?

How can we improve performance?

Because we know we have a problem.

So what we can do is we can look deeply

into the program execution.

And at some points, OK, this program runs slow.


time spent.

So we can figure out time spent, basically, things like

by method--

which method you spend most of the time running.

Sometimes by line.

And we will learn a bunch of profiling tools that will give

you this feel.

So instead of reading a million lines of code, looking

at every line in there, you can very fast converge onto

the culprits where most of the time being spent.

And that's a very easy way of triaging your problem.

And you look at things like time spent, cumulative time

spent within that and everything called underneath,

number of invocations of the functions.

And a lot times, you have a feel for what it should be.

And when you look at this, if it doesn't look right, you

say, wait a minute, this goes against my intuition.

Perhaps there's a problem.

So what you can do is, by doing this, you can

identify hot spots.

Basically, if 90% of the time it's it one routine, that's

what you want to fix.

And hopefully, that routine is doing something too slow.

Then you can actually get good performance in here.

So here's the profile I ran for a matrix multiply.

And interestingly, what I found was, most of the time

spent in basically creating doubles--

number of calls, time spent here, this much time was

spending, basically creating new values in here, this

initializing a double in here.

So I'm doing matrix multiply.

Why should I be initializing values?

So this is an issue.

And I have the most number of calls to that too.

So we can go look at it and say, hey, that would seem to

be a problem here.

So the problem here was kind of obvious.

What happened was I didn't use immutable [? class. ?]

So this is my matrix representation.

I have a bunch of pointers pointing to each row.

And now assume I won't update this element.

And of course, I can't modify the entire thing.

It's immutable.

So what I had to do is create a copy that updates.

And then, of course, I had to create a new matrix.

I don't have to copy everybody else, because these guys were

not changed.

So I can place a pointer to these things and create this

new matrix that points to these things.

And then, if you think about what's going on, in fact, and

this is my new matrix after I do that, I [? updated ?] two

end copies for each update, because what I had to do is I

had to copy this entire row up to get that.

And I had to create this entire index again for each

point in each row.

So just to update one element, I am creating two end copies.

So what that means is now our N cube matrix multiply ends up

being N to the power of 4 algorithm.

So suddenly, you realize, wait a minute, I'm doing something

bad in here.


N-cube algorithm, and the way I saw, it becomes n to the

power of 4 because I'm making all these copies in there.

So this is a bad thing to do, of course.

And copying is very costly, because you're making

duplicates in here.

And of course, you're creating a huge amount of garbage,

because you are just copying and leaving the previous value

for garbage collection.

So garbage collector's coming up all the time, and huge

memory footprint.

So this is no good.

And can we do better?

Hopefully, we can.

So the next thing we did was, OK, just get rid of these

immutable types.

That seemed to be a [UNINTELLIGIBLE]


And then, I actually have--

I will go faster over the code, so if you are

interested, you can go look at the code in here.

So here, what we did was, still we have the issues like

object oriented and ability to represent both

integers and doubles.

So what that means is each matrix can be the integer

matrix or a double matrix.

So this is where you're instantiating with integers or

doubles in here.

So we do that.

And how much you think this can improve performance by

doing that?

In fact, this was pretty nice.

I actually got a 219x performance improvement

because I went from n to the power 4 algorithm to n to the

power 3 algorithm.

So here is interesting, algorithm has changed with it,

and we have a pretty nice performance improvement.

So if I look at this now, what I find is I'm spending a lot

of method time, I'm doing matrix multiply.

That's good.

But I'm spending all this time doing get

double and get methods.

I'm trying to do matrix multiply.

Why am I spending all this time trying to do

get double and get?

What I found was issue is the method call overhead.

Now, what happened is matrix row has two different matrix

types, the integer row and double row.

So every time I want to get a row, I don't know if I have

integer row or double row.

I have to do a look-up to figure out my type.

Am I working with the integer matrix or matrix in doubles?

And this is what we call "dynamic dispatch." So instead

of knowing exactly the methods we are to go, I had to

actually look up, OK, which method I'm supposed to go.

I have to do that since, if the matrix is integer, I might

have to call a different method than matrix is double.

I have to check that.

And then that call, instead of direct call,

becomes indirect branch.

I have to check that and basically branch on that.

And these indirect branches are costly, because what that

means is I don't where I am going until I do that test.

So no machine--

we'll go into details and pipelines and

stuff as we go on.

What that means is normally, a machine can have a lot of

instruction in flight.

So that means that I have to know a lot more information to

get the instructions going.

But here, until you resolve the instructions, you can't do

the next one.

So it's called a, basically, pipeline stall.

So instead of having hundreds of instructions going,

basically, suddenly I can't do anything until the previous

instruction is done.

So the machine slows down a lot because of this.

Normally, what you can do is keep fetching next

instructions here.

We have to fetch, test, then the test result is available,

then that's when you can fetch the next instruction.

Let me go through that file.

So direct batch means target address is known.

You can fetch ahead of the targets/ If I know where I am

going, even if I don't get there, I can fetch ahead of

the target.

I can go start fetching what happens next ahead, basically,

into [? the branch. ?]

Sometimes, you can even fetch both directions, or the

direction you might always go might get

fetched by the architect.

We'll talk about a lot of architecture details in the

future lectures.

Indirect branch means until I calculate the address, I have

no idea where I'm going.

That address is only calculated.

And because of that, hardware slows down like crazy.

So this seems to be not a good thing to have these integers

and doubles instead of having one nice class of just


might be nice.

I just basically create a double class.

It's all matrices are doubles.

If I want integers, I will just rewrite my matrix class

or copy and replace.

Now my matrix class becomes much more simpler.

This is what I got, and this is my row in here.

And voila, minute I do that, I basically went up about a 2.4x

performance improvement, because now I don't have to

keep doing that.

And altogether now, I have improved my performance by

about 500x here.

And for ops per cycle, we have gone down from

8,000 to 38 to 16.

So we do still better.

And now we look at, again, my profile data.

And what I see is, OK, I am spending a lot of time in

matrix multiply.

That's OK.

But also, I am spending time in this matrix get method.

I am trying to get each, basically, row and data

elements of the matrix.

So here is the summary of these things.

I'm going to just keep that one, because I gave you


So the problem in this object-orientedness is I

[? handle ?] each row is represented separately--

nice objects.

These objects overload the memory.

And in fact, if I want to get to my row, I have to appoint a

[? chase ?]

to go get that.

It's not contiguous in memory.

And to get to every row, I have to actually go through

this [? point ?] indexing in here.

[UNINTELLIGIBLE] matrices are one, nice, simple objects.

So we can say, OK, wait a minute.

This method call overhead actually keeps dominating.

And because I have done this nice object-oriented thing in

here, so I can get to the object-oriented file here and

say, OK, get to the objects.

And I write this nice loop of [UNINTELLIGIBLE].

And I have two-dimensional matrices in here.

And voila, I got another 2.2x performance

improvement in here.

OK, now I am down to about seven cycles for each

operation in here.

That's good.

So what can we do here?

So we did all of these things in Java, because it's supposed

to be a nice language.

But Java has a lot of overhead.

It's like you do memory-bound checks.

You have to do this bytecode interpreter

and stuff like that.

On the other hand, C can run much faster.

You don't have to do any kind of memory-bound checks.

The compilation happens before you run, so it is not part of

your cost of doing anything.

So this can work very much like C. You can just almost

cut and paste.

And if you know, C is [? index chain, ?]

[? and this ?]

to C is just changing a few lines.

And I did that, and this is, basically, the C code in here.

This is the matrix multiply.

This is the kind of allocation in here.

C can be a little bit painful to allocate.

This is my matrix multiply.

It will look similar.

And I got 2.1x just by going from Java to C. This, I was

surprised, because that piece of Java code looked very much

close to C. But by just doing that, I actually got another

2.1x performance gain.

Now this is kind of the high-level stuff.

Now we have to go to nitty-gritty

details in the hardware.

So in modern hardware, there are these things called

"performance counters" that hardware [UNINTELLIGIBLE]

that we can look at what's happening inside hardware.

We can get a feel for what happens inside hardware.

And there are things called CPI, which clock cycles per


So that means if the instructions are running very

slow, you will have multiple clock cycles per instructions.

That means instructions are stalling.

Something is wrong.

Cache misses.


[? pay attention to ?]

caches and stuff like that.

If data are supposed to be in cache and have cache hits all

the time, but if you are getting cache misses,

something is wrong.

Accesses are not good.

And then also the instructions retired.

That means, how many instructions are being

executed and retired?

If you are doing too much work, the number of

instructions retired will go high in here.

This will give you a few how many [? work been done. ?]

So if you look at this one, I have a CPI of 4.

That means for every instruction, need about almost

close to five clock cycles, which is not good.

I have pretty high cache miss rate, too--

L1 cache miss rate and some L2 cache miss rate.

This shows how many instruction are in what they

call [? vector in the ?] instruction form called SSE


And here's the number of instructions got retired.

So we just use that baseline case.

So one thing I can look at was my cache miss rates seem to be

pretty darn high.

What might be happening here?

So if you look at matrix multiply--

so I have Matrix A in here and Matrix B--

what happens is, to calculate this value in [? Matrix A ?]

here, basically, I'm going to go through this row in here,

and I'm going through column in here.

Problem with the column is it's all [? low in ?]

the cache.

This is your linear memory.

So because this is here two dimensional, but inside your

hardware, there's no two-dimensional memory.

You have a linear memory, so this is your linear memory.

This is how the memory look like.

And this is how the memory look like.

So each row and column is in these areas.

But here, to calculate this value, this is nice.

I go through memory that is contiguous.

But C, I am jumping all over my memory.

And jumping on memory is not that great cache behavior.

And contiguous can do much better, because what you do

is, when you get a cache, you get a cache line that has more

than one word.

So in here, what you're doing is you are getting large cache

lines only using one word before you go to the next one.

So that's not great in cache behavior.

So what you can do is you can make this memory contiguous.

So what you can do is, in matrix multiply, normally, n

to the cubed computation to the squared data.

So to help on n to the cubed, it's OK to do some n to the

squared processing.

That means you can do a data transport before you start

matrix multiply.

Do some pre-processing.

So this will also come in handy.

A lot of these problems you'll do, you'll say, wait a minute,

can I do a pre-process that's cheaper that will really

improve my main processing?

This is a theme that comes again and again.

And then we can get this n to the cubed running faster since

this n to the cubed is much larger than n to the squared.

You can amortize the cost of doing that.

And basically, to get that, we can transpose some matrix

going into the squared operations.

And then now n to the cubed might run much faster.

And in fact, what I have done here is do the transpose of

this matrix.

And now we run this one, and then everything will be in

contiguous memory.

And voila, I got another 3.4x performance improvement.

Now I am running almost at one cycle per op,

which is really nice.

But modern superscalars actually can do better, so we

can actually end up being even better.

So let's see.

So we are doing that.

So here is the interesting thing.

Now I look at my transpose [? spot. ?]

So if you look at that number of retired instructions, it's

more or less the same.

Number of SSE instructions, more or less the same.

I have a 2x improvement in my L1 cache miss rate.

But more than that, I have a 5x improvement in the CPI.

That means my instructions are stalling a lot

less now than before.

So this is where I got all my performance [UNINTELLIGIBLE],

because my instructions are running much faster.

So by looking at data--

this is the kind of things you're

learning during the class--

then you can get a feel for what actually is happening and

why you are getting that.

You can go do more in the cache, in memory system.

So remember, the memory system, if you're

[UNINTELLIGIBLE], if you look at cache, what they're doing

is we have a small amount of memory called cache that are

very fast access.

And you have a large amount of memory that is sitting out

that has slow access.

So what you want to give user feel as if they have a lot of

memory, everything is very fast access.

So hardware does a lot of crazy things to give you the

illusion that you have very fast memory and you have a

huge amount of that.

Of course, that's not true, because you have these caches.

And what happens is, if you do things wrong,

these get to slow down.

So in the cache system, you have cache [UNINTELLIGIBLE]

You have each multicore going to its own L1 cache that goes

L2 cache, which goes to L3 cache, which goes to the main

memory system.

And you want all of the data to be here.

If that's the case, things run very fast.

If not, you have to go here.

It will be a lot more slow, here, even

slower, here, very slow.

So the key thing is to get the data as much as possible here.

So here's the kind of cycles, basically.

One cycle to get here-- three cycles.

If you go to L2 to L3, you have to do 14 cycles


If you go to L3 to main memory, you might sometimes

get hundreds of cycles delaying here,

so it's very costly.

So the caches are very temperamental beasts.

They work beautifully, but when they don't work, these go

really haywire.

So a lot of times, if there's a performance issue, look at

the cache miss rate and say, aha, the

caches are not behaving.

And can I help in here?

So the interesting way to look at that is

how much data I touch.

These graphics didn't show up.

Let me explain.

So if you do matrix multiply, I am trying to calculate one

value in here.

So this is one value.

I have to calculate row and the column.

This slide got screwed up a little bit.

So just look at this calculating [? row bar. ?]

So if I want to calculate a row of values here, what I

have to do is I have to get this row in B, and I have to

use entire C to calculate one row of A. That means I have

touched this much data items just to get one row of A. So

that means I am calculating 1,024 values of A. I am

touching this much data items to get that.

On the other hand, if I do block matrix multiply, that

means I calculate, basically, a block in here.

This block is also 32 by 32.

It has 1,024 elements in here.

But I only need a block here and block here.

Voila, I only have looked at 25,000 elements.

I got the same number of output calculated.

But I have only touched much less elements altogether to

calculate that.

Do you see that?

And nice thing is, if this thing fits in the cache, I am

much better than this might not fit in the cache.

So I can calculate blocks in matrix multiply by looking at

a lot less number of data, that has a good chance at

fitting in cache-- and of course, when you calculate.

you make sure it fits in cache--

than trying to just calculate row by row.

So one thing you can do is--

this is a lot of common things we do.

There's many ways to get same results.

You can change the execution order.

You can change the algorithm.

You contain data structures.

And some of them actually might do

better than the other.

And here, we can actually say we can do the execution of the

change and make sure we are only looking at small amount

of data, you get the same number of data calculated.

And of course, sometimes you have to be careful, because if

you do things in different order, you might not get the

exact same results.

So if you are a numerical [UNINTELLIGIBLE], you say, OK,

of course, if you're doing A plus B plus C, what order you

do will make a difference [UNINTELLIGIBLE].

So if you're really careful, you can't do too many things.

But most times, you can do some of these things.

And in fact, in our final thing, we might say--

the final project--

we will give you something that you can change the

algorithm as long as the outcome of the [? image ?]

is the same.

If you look at the [? image, ?] and the

[? image ?] outcome is the same, you

can go change things.

So a lot of times, you might have a lot of freedom to

change the amount of errors you get, the precision, and,

in fact, doing that, that you can actually get a lot of good


So in here, we are not trying to change the matrix

multiply, of course.

But we are going to change the way order of operations is

being done.

So some people say you can't do that.

It would make matrix multiply different.

But in most cases, it's OK.

So what we have done is done what we call

block matrix multiply.

So instead of calculating rows, you

calculate blocks at a time.

So calculating each block, you only need a

few amount of data.

So this is block matrix multiply.

And voila, by doing block matrix multiply, I got even

1.7x performance gain from that.

Now, it's very interesting.

Now with each block cycle, I am doing two operations.

So my [? superscale ?]

is finally coming to being.

So every clock cycle, I am doing

multiple operations here.

So this is very interesting, the number here.

So now when you look at what happens here, I am actually

executing a little bit more instructions here.


Because I did block matrix multiply.

That means I am doing a little bit more overhead in here.

I have more loops in here.

My inner loops are small.

So I'm doing more instructions to do execute there.

So I'm doing more instructions, but I have a

huge cache performance gain.

My cache miss rate now going down to very, very little--


And I have an 8x improvement in cache, and that means I got

a 3x improvement in my CPI in here.

That's very nice.

That means even though I'm doing a little bit more work,

my cache behavior really, really helped me in here.

So kind of says why I am getting this performance gain.

So what else can we do?

We can go look at trying to do a lot more crazy optimizations

because if you look at the modern Intel, they have this

thing called [? retro ?] instructions.

We'll get a little bit more detail into that later.

And what we can do sometimes is you can nudge the compiler

to take advantage of these instructions.

Of course, if nothing helps, you can actually go write to


But hey, I don't want you guys to write assembly code, at

least in this class, even though you can get really good

performance sometimes.

But you can kind of give enough hints to compiler and

pray and kind of [UNINTELLIGIBLE]

there's no nice way on these things, and hope for the

compiler to do the right thing.

So in here, we can play with, a lot of

times, compiler flags.

And this is not exact size.

We keep trying those things, and sometimes we can even get

information back and say, OK, what did the compiler do?

It might give you some good hints and say, I couldn't do

something because of this part of the code.

And worst comes to worst, you can actually generate assembly

and stare at it.

And sometimes in this class, you might actually want to

stare at assembly and see what goes on in here.

So here is the assembly for here.

And this is my loop body in here.

And then what I can see, I'm not going to go through that.

Everything is converted into SSE instructions.

At least I know what SSE instructions mean, so this

seems to be doing well, so I'm happy.

And in fact, I'm really happy because I got a 2.8x

improvement by doing this.

Now what I am actually doing each clock cycle about

[? five of my ?]


So I'm going really fast in here.

So now, incidentally, I'm running my program about

30,000x faster than when I started.

So if you haven't noticed, these things

slowly add up in here.

And now, if we look at what's going on here, it's very


What I have done is actually, my each instruction is now CPI

has gone down.

That means each instruction is running slower because I'm

doing these SSE instructions.

And in fact, my cache miss rate has also gone up.

But I am doing a lot less instructions, because SSE

means in one instruction, I am doing a vector.

So I'm actually getting four operations done in one


So the number of instructions I execute has

drastically gone down.

So even though my instructions are running a little bit

slower, I can basically amortize that, and I get good

performance in here.

And of course, this one says, OK, now 88% of the

instructions are SSEs.


most of the things to run on vector mode in here.

And of course, you can keep doing more and more and more.

And there are these four things [UNINTELLIGIBLE] matrix

multiply, there are people who spend their entire lifetime

trying to run it faster.

And they have libraries like that, and this thing called a

BLAS library and this thing called Intel Math Kernel.

So they have done a lot more other things, like

pre-fetching, getting everything right.

So here what I do is basically call their library, don't do

anything myself, and this library will do it for me.

And in fact, that library itself gave me another 2.7x

performance gain in there.

Now I'm actually doing 11 instructions every clock cycle

by doing that.

And if I look at it carefully, what they have done is they

are also very into SSE heavy, so they are running almost the

same amount of instructions.

But they managed to actually get, again,

a better miss rate.

They probably figured out some stuff in here, things like

pre-fetching instructions.

And that means they actually brought down the CPI back to

where it was before.

And that's why they gave that performance [INAUDIBLE].

And finally, we are doing a little

bit of parallel execution.

So as you know, multicores are here.

And you're actually going to these 12-core machines, which

is really nice.

Fastest possible machines you can get your

hands on these days.

And it's kind of fun to work with them.

And we can basically get concurrency for parallel


And a large part of this class at the end is trying to get

good parallel performance.

But there are a lot of issues with parallelism.

I will go a couple of issues now.

Of course, when we get to that part, we will go in detail.

One thing is called Amdahl's Law.

Because if you're trying to run something parallel, you

can't run everything parallel.

When you start something, there's part of the code that

basically starts the computation.

And then there are the parts you can run parallel, and then

other parts you have to also wind down,

measure, things like that.

So what Amdahl's Law says is, even if you have an infinite

number of processors, the maximum speedup you can get is

this-- this the part of the code that are sequential and

[? parts of ?] the code are parallel.

That means if you have only 90% of the code can be

parallelized, the maximum speedup you can get is 10,

even if you have an infinite number of processors.

So 99% of the code is parallel, the maximum speedup

you can get is 100.

So this is very interesting, because even if you have huge

[UNINTELLIGIBLE], if your code has a certain amount of

sequential part, there is a limit of what you can do.


also think of load balancing.

That means when you get parallel, what you are

assuming is all processes are busy all the time.

That's very hard to achieve.

Sometimes some processes are busy.

Sometimes others are not.

Sometimes something's finished early.

So if that happens, even though you have parallels, you

might not get really good performance in here.

And then, of course, there's a thing called granularity of


Normally, what happens in programming, you run

sequential part, you run parallel part, you run

sequential parallel part.

To start parallelism is a big task.

You have to tell all the other processors, OK, look, you guys

have some work.

Go do that, and finish that, and come back.

And that itself is very expensive.

And the work you give is very small.

You give the work.

The start-up and tear-down cost of parallelism itself

might be too much than the speedup you get.

In fact, you can make programs parallel and slow them down

really drastically.

So you have to be very careful in that, because you can say,

oh, I'm running in parallel, but it's running slow.


Because you might have a granularity issue, because we

are giving so little work that even though you're getting

parallelism, you might not get that much performance.

So if you look at matrix multiply, what happens here is

you can divide the matrix into two different parts in here

and calculate as two different matrix multiplies in a

different core or a different process.

Or in here, I can make it two or whatever in matrix

multiplies to do that.

So this is nicely parallelized.

No big issue in here.

And when I parallelize, I got rather 3.5x performance


Of course, I could have gotten more here, but because I had

to start here, since I didn't want to wait for a week for

this to finish, I had to come up with a smaller matrix size.

But now, when I get here, this is so fast, my

granularity is too small.

So that's why I didn't get--

because this was an eight-core machine, I didn't get 8x

because the minute I get parallelism, it just is over

because it's running so fast in here.

Now, at this point, using eight cores, I am actually

getting 36.

So this is not that interesting, because this is

not running on one core, every clock cycle, I am trying to

get 36 operations done.

So the interesting thing to follow in here is that this

example might be somewhat contrived, because I started

with this immutable-type matrix multiply.

Hopefully, you and your colleagues who haven't taken

this class probably won't go that far.

But even if you start somewhere here, still, there's

a huge gain to be made.

So to put this in perspective, OK, so if you look at here,

even if I start here somewhere out in C, I already got 131x

performance improvement.

That's huge.

So that means in this class, we are not

talking about small things.

And in fact, this is why I had to go log scale in doing

absolute grades, because we had people who figured out

exactly how, in some problems, to change data structures,

change execution ordering, do some really cool thing, get

1,000x performance improvement.

And there are other people who did little bit things and got

2x improvement.

So the fact that what you can do the best here can be

really, really high.

So that's why we actually started giving you the second

chance to also go basically see, if you missed a critical

point, to go back and learn that.

So to put this in perspective--

so in summary, what we have done is--

matrix multiply might be an exception.

Every time you go, you might not get that much, but in

here, going from this very software-engineered, nice


to a BLAS [UNINTELLIGIBLE], I got a 296,000 times

performance improvement.

This is huge.

If you put this in perspective, if you look at

miles per gallon between these two-- a

supertanker and this scooter--

it's only 14,000x, basically.

So what that means is you have to have 20 supertankers to

basically match this much improvement.

So if somebody can tell you, look, I can't get a

supertanker to run at the same miles per gallon as your

scooter, that would be revolutionary.

But you can do something similar in your computer.

And in fact, if you look at these large data centers, the

computers are actually burning a lot of

energy, a lot of cost.

And in fact, if you can reduce that, that can have impact.

So what you get out of this class is to think through

these programs this way.

Learn about correctness before [UNINTELLIGIBLE]

6.170, you hopefully are good at writing correct programs.

But that's not good enough in a lot of times.

And how to think about programs, how to make it run

faster, there are multiple [UNINTELLIGIBLE] necessary.

So for example, in here, it's basically direct energy.

So can you get the data center to burn 10% less energy,

that's huge.

For example, if you look at today's supercomputing

centers, they spend about $10 million a year on energy, just

for the power grid.

So if you can say, I'm [? playing with ?]

10%, that's huge.

Other thing, one thing Charles says a lot of times is, what

performance gives you is currency, because performance

itself might not be that directly useful, but it gives

you currency to buy something else.

What it is?

So assume you have a nice application.

Your GUI is doing OK.

It's up to the point that people don't feel stalled.

But if you want to do something addition, if you add

a feature, you don't want to slow down the program too

much, because your users are going to complain.

So what performance gives you is, if you improve the

performance, it gives you currency, the ability to buy

some additional features into your system.

Or you have a system that you created a start up, and you

created this nice server somewhere.

Suddenly, it became very popular, and a million people

want to use your software.

And the software might not be ready to do that.

It might just crash and burn, or you might have to pay a lot

of money for Amazon for the [? cycles, ?]

so by actually making programs faster, you can make it

scalable, you can make it efficient.

So this gives you the ability, this currency, to go do a lot

more other interesting things to your software because now

you can run it faster.

It gives you this margin to play with.

So in this class, we are going to learn a

lot about these things.

We are going to learn about architecture, how looking at

the architectural issues, how to identify when

something is wrong.

We are going to look at algorithmic issues in here,

how to look at things.

We are looking at architectural things like

memory systems, parallelism.

And also we will talk about all the tools, so you will

actually know something has gone wrong.

Because today, you run the program, it runs, it runs

correctly, is it good enough or not?

How do you identify that?

That's, I think, the very interesting place to start.

So that's all I have today.

Make sure you get a Project 0.

And especially people who haven't done C, we will--

Charles, are we going to do a C remediation?

We did last time, didn't we?

OK, guys, one more thing.

Last year, we did an evening C remediation class.

So if you know Java, if you want to know C, we will send

out a time with TAs that will give you kind of a one-hour

crash course into C in there.

And we will send information.

So this class moves very fast.

Get on with Project 0, and see you next week.

The Description of Lec 1 | MIT 6.172 Performance Engineering of Software Systems, Fall 2010