Mercurial Project

LESLIE: Mercurial revision control system, sans slides

for the moment, but soon with slides.

By Bryan's own admission, he builds

large distributed stuff.

BRYAN O'SULLIVAN: Not large by your standards.

LESLIE: Not large by our standards, though.

And I just wanted to let you know this talk is going to go

up on Google Video, so if you have any questions that you

think might contain information that's

particularly Google-y, let's hold those until the end after

the camera is off.

BRYAN O'SULLIVAN: OK, thank you, Leslie.

Let me see what a good distance is here.

All right.

So about 12 months ago, I was casting about for a revision

control system that I could use to write the next late,

great desktop email system that nobody was going to use.

And unfortunately, during my casting around the place, I

found that there was nothing entirely suitable to my needs.

So I found a piece of software that was almost suitable,

which had just been started by a guy that I happen to

know, and this was a tool called Mercurial.

And the origins of Mercurial are steeped in the great

BitKeeper debacle of 2005, when Larry McVoy took his

marbles and left the playground.

So at that time, Linus Torvalds was left without a

revision control system, and started writing his own.

At the same time, Matt Mackall started working on a revision

control system.

And they converged on fairly similar designs, although in

substantially different ways.

And so now the world has two 14 month old revision control

systems, instead of one, to join a field of a huge pile of

free revision control systems already.

So why am I actually interested in

this particular one?

Why am I in here to talk to you about this?

Well, there's a couple of different reasons that it's

interesting to me, here.

And one is that Mercurial kind of matches a few aspects of

the Google state religion, as I

understand it from the outside.

And those are that it's written in Python, it is

distributed, and it does things very fast. So these are

kind of nice properties to have.

Now, it's not completely Python.

It's only 95% Python.

There's a couple of core routines that

are written in C.

But in the 12 months or so since people have

started actually using this stuff, a fair number of

third party people, both open source and commercial

projects, have actually started using it.

Which is really kind of an interesting vote of

confidence, right?

Normally, revision control holds the crown jewels of

whatever it is that you're doing, so it had, by God,

better work properly, or else you're going to have serious

problems.

But in that time we've had a couple of interesting people

start to use Mercurial.

We have the XenSource people, who are doing the open

source Linux hypervisor.

We have the One Laptop per Child project, the OpenSolaris

project, the MoinMoin wiki.

A pile of other people who are playing around to various

different extents, some of them large,

some of them small.

But it's pretty fun to be working on something that's

young, and yet, has interesting and

active users already.

So from a developer's point of view, you're sitting down in

front of your computer, and you're about to start your

magnum opus.

You're going to work on the next great novel, or the next

great huge source base, or whatever it's going to be.

You're looking for criteria to choose one of these revision

control systems. There's got to be at least 50 of them

extant at the moment.

Ah, thanks, Chuck.

AUDIENCE: We'll see if this actually does the job.

BRYAN O'SULLIVAN: So what are some criteria?

Well, I can tell you what I used, at least, for my own

personal purposes.

I wanted something that was straightforward to understand,

so I was going to be able to spend my time thinking about

my problem instead of my revision control problem.

I wanted something that was pretty quick, so that my tea

would still be warm when I was finished with a particular

revision control operation.

And I wanted something that would help me to work

efficiently with other people, that being kind of a

motivating factor.

Now, you want to get this onto a Windows box?

I think we're going to be slide-less.

MALE SPEAKER: It'll work.

BRYAN O'SULLIVAN: It'll work?

MALE SPEAKER: Yeah, Open Office, you say?

BRYAN O'SULLIVAN: Yup.

MALE SPEAKER: It'll work.

BRYAN O'SULLIVAN: OK.

So the Mercurial revision control system has a

conceptual model that's very simple.

I've had it described to me as, you can carry it around in

your head, and I like to think of it that way myself.

There are basically three different things that you need

to pay attention to if you're thinking about revision

control, in Mercurial and in many other distributed

revision control systems. The first is that you have a

repository, which is where your stuff lives.

The second is you have a working directory, which is

where the stuff you're working on lives, and the last one is

a change set, which is a snapshot of what's going on.

So in Mercurial terms, a repository is not a

heavyweight thing.

It does not have a database behind it, it's

just a pile of files.

So these are things that are easy to create, they're easy

to administer, you could put one together and blow one away

in a matter of moments.

And that's how people prefer to work.

They are, as I mentioned, lightweight, and they are

pretty much everywhere.

Everywhere that you happen to be doing some work in a

working directory, there is a repository that happens to be

wedded to it.

So inside a repository, there are only really three

different things.

There's a change log, which says, I've done this work.

There is a manifest, which says, I did the work on these

versions of these files, and then

there's the file metadata.

That's all you need to know to understand the underpinnings

of the revision control software.

It's very, very straightforward.
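
As a rough illustration of that three-part layout, here is a minimal Python sketch. The names (`ChangelogEntry`, `manifests`, `filelogs`, `checkout`) and the in-memory lists are assumptions made for this example; they are not Mercurial's actual on-disk format or API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChangelogEntry:
    """One committed change: what was done, and which manifest it points at."""
    description: str
    manifest_id: int


# Change log: "I've done this work."
changelog = [
    ChangelogEntry("add readme", manifest_id=0),
    ChangelogEntry("fix typo", manifest_id=1),
]

# Manifests: "I did the work on these versions of these files."
manifests = [
    {"README": 0},
    {"README": 1},
]

# Per-file history: every stored revision of each file (hypothetical contents).
filelogs = {"README": ["helo\n", "hello\n"]}


def checkout(rev):
    """Rebuild the working directory contents for one change log revision."""
    manifest = manifests[changelog[rev].manifest_id]
    return {name: filelogs[name][filerev] for name, filerev in manifest.items()}
```

Reconstructing a revision is then two lookups: change log to manifest, and manifest to file revisions.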

Now you contrast that model with the internals of a more

traditional large scale revision control system, like

ClearCase, or Perforce, or something like that.

And if the internals are visible to you at all--

they're typically not-- they're going to be a big pile

of different things.

The Mercurial source code is, I think it might have grown to

almost 12,000 lines.

But it's a fully functional system, and it's

12,000 lines of Python.

So you can keep all of it in your head as a developer, you

can keep all that you need to know in your head as a user.

It's got a couple of very good properties in that respect.

Now a change set, as I mentioned, is a snapshot of

your project as it stands on a particular point in time.

It corresponds to a revision in Perforce, or Subversion, or

any of these other tools.

And the terminology that we use for creating a change set

is committing it.

Now unfortunately, I have some pretty graphics here that

you're not able to see at the moment about how you go about

creating these things.

And what starts being interesting then is how you go

about merging and branching with people.

So I'll have to hand wave since I have no graphics to

offer here.

When two people work, they will create clones of each

other's repositories.

Yes?

AUDIENCE: Would a whiteboard help?

BRYAN O'SULLIVAN: A whiteboard would actually help, yeah.

When I start working, I create a repository, and it's just a

little directory called .hg somewhere.

And outside that repository is the actual directory.

Let's say I'm working on a project called foo.

So I have foo/.hg, that's where all my metadata lives.

I don't need to actually care about

anything that's in there.

The working directory, where I have my files, like COPYING,

and README, and foo.c, lives around this .hg directory, and

that's where I do all my actual work.

So somebody else will make a clone of this, and they'll

have an identical copy of the working directory, and an

identical copy of the .hg directory.

And they start working there, and they

start creating a revision.

And let's say they create revision one.

And I create something, and I call it revision one, too,

because I wouldn't know about their changes, because this is

a distributed system, right?

So we go on, and we create revision two, say, as well.

Now the next thing that we need to do here is we need to

be able to communicate with each other to say, we're

working on the same project, and now we have diverging

views of the world.

How do we cause them to reconverge?

Well, what I do is, I pull my changes from one repository

into the other, and that literally means I just take

the directed acyclic graph that is all of my

revisions, and I plop them in here.

And after I do a pull, I do a merge.

And what the merge does is it just creates a structure in

the working directory such that I have my revision one,

your revision one, my revision two, your revision two.

And then, in the working directory, the working

directory has a notion of there being parents.

So you can think of the working directory as a

floating revision.

It's the last stuff that I had, the merger of these two

guys, and the stuff that I'm about to commit as a

result of the merge.

So the working directory gets all these things, I do my

commit, and I'm done.

And that's all there is to it.

So a merge in Mercurial is just a revision that has two

different parents.

Now I haven't actually drawn a branch here.

Look, here's a branch.

This is a revision that has two children, and this, being

a merge, is a revision that has two parents.

That's all a merge is.

So there's no special sauce there.

If you think of a branch in Subversion as being something

very simple, it's just a copy, the analog in Mercurial is

also very simple, it's just two revisions that happen

to have the same parent.

And of course, you can have an arbitrary number of branches.

We only allow merges one at a time, because nobody's really

found a good way to explain multi-way merge to people,

and their heads explode.

Heads exploding, not so good.
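
The whiteboard picture can be approximated in a few lines of Python. This is a sketch under stated assumptions: the `Repo` class, its methods, and the hash-based revision identifiers are invented for illustration and are not Mercurial's real API. It shows a branch as a revision with two children, a pull as a union of two revision graphs, and a merge as a revision with two parents.

```python
import hashlib


def make_node(parents, description):
    """Hypothetical revision id: a short hash of the parents and description."""
    payload = "\n".join(sorted(parents) + [description]).encode()
    return hashlib.sha1(payload).hexdigest()[:12]


class Repo:
    def __init__(self):
        self.parents = {}  # node -> tuple of parent nodes

    def commit(self, parents, description):
        node = make_node(parents, description)
        self.parents[node] = tuple(parents)
        return node

    def pull(self, other):
        """Copy in every revision from the other repository that we lack."""
        for node, node_parents in other.parents.items():
            self.parents.setdefault(node, node_parents)

    def children(self, node):
        return sorted(n for n, ps in self.parents.items() if node in ps)

    def heads(self):
        """Revisions that nothing else is based on yet."""
        has_child = {p for ps in self.parents.values() for p in ps}
        return sorted(n for n in self.parents if n not in has_child)


mine = Repo()
root = mine.commit((), "initial import")

theirs = Repo()
theirs.pull(mine)                      # a clone starts as an identical copy

a = mine.commit((root,), "my revision one")
b = theirs.commit((root,), "your revision one")

mine.pull(theirs)                      # root now has two children: a branch
merged = mine.commit((a, b), "merge")  # two parents: a merge
```

After the pull, `mine.heads()` holds both `a` and `b`; after the merge commit it holds only `merged`.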

So when people work in parallel, and they make these

changes, and they commit them, and then they merge after they

change, that's kind of a nice property to have. Because if

you think back to what you used to have to do, in the

days of CVS, history was linear.

People would do a change, they'd do a

change, they'd do a change.

And if you won that commit race before I managed to make

a commit, I had to merge with your changes before

I could do a commit.

And that meant that there was no permanent

record of my changes.

Now with this model, you don't end up having

that kind of risk.

With CVS, it was very straightforward to shoot

yourself in the foot by screwing up a merge.

It happened all the time.

You'd see changes that got checked in with conflict

markers, badness ensued.

Subversion sort of avoids this, but the default policy

is to work the same way that CVS does.

In Subversion, you have to explicitly choose to work on a

separate branch.

It's the same thing with Perforce.

The distributed tools necessarily don't work that

way, and it's a slightly safer way to do things, because one

default policy is less safe than the other.

In practice, I don't know that that makes a huge difference.

You don't hear people saying that I threw away three weeks

of work in Subversion or in Perforce, or in any of these

other tools very frequently.

So with many of these things what you get is a matter of

degree in terms of loss or gain of functionality.

OK, a couple of things that are interesting to know about

Mercurial for getting going and keeping going quickly.

If you're working in a small, lightweight environment, it's

useful to have, for example, a built-in web server, which we

have. So this is a web server that you can both interact

with as a person, and view revisions in your tree, and

annotate things, and download tarballs.

But it's also what Mercurial uses to fetch data and to,

very soon now, push data over the network as well.

It also works over SSH if you happen to prefer security.

In addition, Mercurial has essentially a single way to

stream bytes onto the disk and to stream them off the disk.

We call that abstraction a bundle.

So you can put bundles onto USB drives, or you can send

them around by email, and you get a complete set of

change history that you can transfer without having to be

online at the time.

This is kind of a useful property for people doing

distributed development.

So sharing is also a, what would you call it, a

symmetrical operation.

In other words, when I make a clone of a repository, and I

start making changes, and I push my changes back to the

parent repository--

I don't know where this buzzing is coming from--

I end up with both repositories being identical

afterwards.

OK, the buzz has stopped.

I can step back and look at my slides again.

MALE SPEAKER: You're probably standing on a wire.

BRYAN O'SULLIVAN: Ah.

So the final thing, and the thing that is sort of the special

sauce in Mercurial, is that Mercurial is, albeit written

in Python, an extremely fast system.

We've benchmarked it under various different scenarios,

and there are only a few other revision control systems that

compare in terms of performance.

Now performance is interesting in its own right, because it

means that you don't get distracted while you're

waiting for the tool to do things.

But it also enables you to do certain kinds of operations

that are not necessarily otherwise possible.

So in order to give a little bit of an apples-to-apples

comparison, the other day I went to the Subversion

self-hosting repository, where they've had Subversion

developed for the past four or five years, and I just did a

Subversion check out of the Subversion source tree.

And the head of the source tree as a working directory is

about 72 megabytes in size.

So out of curiosity, I sucked all of the Subversion history

into Mercurial, and created a local Mercurial repository

that's an identical copy of what's on

the Subversion website.

And it turns out that the Mercurial repository plus

working directory is about the same size as the Subversion

working copy.

So I have 50,000 revisions and a working directory in 76

megabytes, versus just a working copy in 72 megabytes,

which I thought was kind of interesting.

It means that you don't have to pay much in order to get a

complete history of everything onto your machine, where you

are no longer talking to the network.

One of the things that you've probably noticed if you've

been dealing with Perforce servers here is that things

take an awfully long time once your servers are busy.

Central servers don't scale terribly well.

You'd prefer not to talk to them if you don't have to.

In a distributed tool, you tend to not talk to them very

often at all, maybe a couple of times a week.

And so also out of curiosity, I ran a couple of very

simple performance tests.

Now the Subversion repository is very small.

It's only about 1,200 files, and the actual working copy,

when you ignore all the .svn directories and so on, it's

only 25 megabytes in size.

And pound for pound, Mercurial and Subversion worked out

about the same.

There were a few instances where one was faster and a few

instances where the other one was faster.

But what's interesting about one being faster or the other

being faster on a small test case is primarily that you can

write something in Python, and have it be as fast as

something that's written in pure C, if you're clever about

how you do it.

And I assert without immediate proof that in fact, Mercurial

will scale to large projects better than many revision

control systems that are written in C, because of the

underlying abstractions that we use.

And the implementation techniques, might I add.

So there are a couple of different things that we do in

order to make the implementation go fast. The

primary one is that we've desperately tried, at every

step, to avoid seeks.

Disk seeks are not your friends.

Disk seeks are things that cause you to just sit around

and wait for good things to happen.

Good things that happen are streaming I/O

linearly off your disk.

That's what we really like.

So in order to stream I/O linearly off your disk, there

are a couple of things that you would like

to be able to do.

The first is you don't want to write more than you have to,

and the second is that you don't want to read more than

you have to.

So what are the necessary properties for revision

control systems to not write more than you have to?

Well, it's been dogma in revision control for a while

that what you really want to do is you want to store the

most recent revision of your file as the very first thing

on the disk, and then everything else wants to be

reverse deltas based on that.

Because that means you get the nice property of something

that you've accessed recently you can just read with a

single read.

Now, Mercurial sort of turns that a little bit on its head.

We do forward deltas from the very first revision.

But that sounds like it's a terrible implementation plan.

It's actually rather better than that, because what we do

is we have O(1) retrieval properties.

So instead of having a very first revision and 10,000,000

little tiny deltas on top of that that you have to

reconstruct the final revision out of, what we do is we have

two different techniques.

One is that every so often, when the accumulated quantity

of stuff that you've deltaed gets to be too big, we

store a full text.

So you pay something of an extra space cost on disk, but

you end up with O(1) retrieval properties.

The other thing that we do is, rather than applying each

delta as we go, we compose our deltas, and then we just apply

one single union delta at the end.

And that gets rid of some of the nasty properties that you

have when you're composing deltas and munging

strings in Python.

So these are two things that make it quite fast and quite

efficient to get nice linear accesses.
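
To make the forward-delta idea concrete, here is a toy Python sketch. Everything in it is an assumption for illustration: the `ToyRevlog` class, the `snapshot_factor` threshold, the 8-byte per-edit cost, and the use of `difflib` to compute deltas are not taken from Mercurial's implementation. It stores each revision as a delta against the previously stored one, and falls back to a full text once the accumulated deltas grow past a multiple of the text size, so the amount of data read for any revision stays bounded. For simplicity it applies deltas one at a time rather than composing them first.

```python
import difflib


def make_delta(old, new):
    """List of (start, end, replacement) edits that turn old into new."""
    matcher = difflib.SequenceMatcher(a=old, b=new, autojunk=False)
    return [(i1, i2, new[j1:j2])
            for tag, i1, i2, j1, j2 in matcher.get_opcodes() if tag != "equal"]


def apply_delta(text, delta):
    """Apply edits (sorted by start offset) to text in a single pass."""
    out, pos = [], 0
    for start, end, replacement in delta:
        out.append(text[pos:start])
        out.append(replacement)
        pos = end
    out.append(text[pos:])
    return "".join(out)


def delta_size(delta):
    """Rough storage cost: a hypothetical 8-byte header per edit plus its data."""
    return sum(8 + len(replacement) for _start, _end, replacement in delta)


class ToyRevlog:
    def __init__(self, snapshot_factor=2.0):
        self.entries = []  # ("full", text) or ("delta", edits), in append order
        self.snapshot_factor = snapshot_factor

    def _chain_start(self, rev):
        """Walk back to the nearest full text at or before rev."""
        while self.entries[rev][0] != "full":
            rev -= 1
        return rev

    def add(self, text):
        if self.entries:
            tip = len(self.entries) - 1
            delta = make_delta(self.read(tip), text)
            base = self._chain_start(tip)
            chain = sum(delta_size(d) for _kind, d in self.entries[base + 1:])
            if chain + delta_size(delta) <= self.snapshot_factor * len(text):
                self.entries.append(("delta", delta))
                return tip + 1
        # First revision, or the delta chain grew too long: store a full text.
        self.entries.append(("full", text))
        return len(self.entries) - 1

    def read(self, rev):
        base = self._chain_start(rev)
        text = self.entries[base][1]
        for _kind, delta in self.entries[base + 1:rev + 1]:
            text = apply_delta(text, delta)
        return text
```

The trade-off is the one described above: an occasional full text costs extra disk space, but no read ever has to replay an unbounded chain of deltas.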

Another thing that you can do that gives you linear

accesses, at something of a cost in space is, you might

think that if you're doing a delta, you want to do a delta

against your parent.

So if I'm at revision three, and I have a child that is

revision seven, that is really based on revision three, I

want to do a delta against revision three?

Not necessarily.

Because if you're doing a delta against revision three

and you're revision seven, there

might be a seek involved.

So what if at revision seven, you do a delta against

revision six, and you don't actually care whether revision

six was related to you at all?

Well, in that case, you pay something of a greater space

penalty, but you end up with a linear space, or pardon me, a

linear disk access, and a guaranteed lower

probability of seeks.

So again, it's another case of make the seeks go away, do

linear stuff instead, even if it costs me slightly more

space, even if it looks like it ought to

be a less good choice.

Many times these are not the case when you subject them to

a little bit of inspection.

Right, back to my hints as to where I am.

The file formats that we use are very straightforward.

They're binary files, but they're easy to parse in

Python, and the reason that they're easy to parse in

Python is that we use the struct

unpack and pack methods a lot, and we use the string

splitting and unsplitting a lot.

And the reason that these are good things to use is that

they don't go through the Python interpreter at all.

They go straight from the Python interpreter into C.

Fast things occur, and you get dropped back into Python land.

We also try and avoid things like conditionals in our inner

loops, because those tend to cost in performance as well.

So if you look at the Mercurial inner loops, they

tend to be quite tight.

They're still written in Python, but they do very, very

simple things, and then they deal in terms of

tuples or arrays, or whatever happens to come out

the other side.
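
A small Python sketch of that style of parsing follows. The record layout (`RECORD`) and the manifest line format are hypothetical, chosen only to show `struct` packing and unpacking and string splitting and joining doing the bulk work in C; they are not Mercurial's real file formats.

```python
import struct

# Hypothetical fixed-size index record: data offset, data length,
# two parent revision numbers, and a 20-byte hash. Big-endian, no padding.
RECORD = struct.Struct(">QIii20s")


def pack_index(entries):
    """Serialize (offset, length, p1, p2, node) tuples into one bytes object."""
    return b"".join(RECORD.pack(*entry) for entry in entries)


def parse_index(data):
    """Decode every record; the per-record work happens inside the struct module."""
    return list(RECORD.iter_unpack(data))


def parse_manifest(text):
    """Parse 'filename<NUL>hexhash' lines using only str.split."""
    return dict(line.split("\0") for line in text.split("\n") if line)


def format_manifest(files):
    """Inverse of parse_manifest, built with str.join (the 'unsplitting')."""
    return "".join(f"{name}\0{node}\n" for name, node in sorted(files.items()))
```

The Python-level code here contains no per-byte loops or conditionals; each call hands a whole buffer or string to a C routine and gets tuples back.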

We're out of luck?

MALE SPEAKER: I can open the file, but it doesn't show me

any content.

BRYAN O'SULLIVAN: You have Open Office 1, and I

have Open Office 2.

I'm sorry.

MALE SPEAKER: That's OK.

BRYAN O'SULLIVAN: OK, it continues to be a mime show

for the rest of the presentation, I'm afraid.

The final thing that's kind of interesting about

implementation techniques is that I

mentioned avoiding reads.

Well, why read at all if you can stat?

What if you were to store, instead of just spewing your

files out onto disk, you also stored the information as to

what the stat of each file was when you wrote the file?

Now, what do I mean by that?

I mean, you store the modification time, the size,

and the access time of each file, and the owner.

And then, when you're looking through a tree to see what was

the last thing that happened to my tree, instead of reading

the file, you do a stat of the file.

You look at the last time you stat'ed the file, you

see if they are different in any way, and then you do a

read to see if you really need to say yes, this file has been

modified, or no.

So what this means is in a 40,000 file tree, Mercurial

does a stat 40,000 times, but it doesn't read and

reconstruct each file 40,000--

or it doesn't really reconstruct 40,000 files.

Again, this is a significant space win when compared to

doing things by hand.
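
The stat-before-read idea can be sketched in Python as follows. The function names `snapshot` and `modified_files` are made up for this example, and it records only size and modification time plus a content hash, which is an assumption about what is worth comparing rather than a description of Mercurial's actual bookkeeping. Deleted files are not handled.

```python
import hashlib
import os


def snapshot(paths):
    """Record size, mtime and a content hash for each file (done at write time)."""
    state = {}
    for path in paths:
        st = os.stat(path)
        with open(path, "rb") as f:
            digest = hashlib.sha1(f.read()).hexdigest()
        state[path] = (st.st_size, st.st_mtime_ns, digest)
    return state


def modified_files(state):
    """Return (modified paths, number of files that actually had to be read)."""
    modified, reads = [], 0
    for path, (size, mtime_ns, digest) in state.items():
        st = os.stat(path)
        if (st.st_size, st.st_mtime_ns) == (size, mtime_ns):
            continue  # stat matches the recorded one: assume unchanged, no read
        reads += 1    # stat differs: only now pay for reading the contents
        with open(path, "rb") as f:
            if hashlib.sha1(f.read()).hexdigest() != digest:
                modified.append(path)
    return modified, reads
```

On an untouched tree this performs one stat per file and zero reads; a file is read only when its stat information no longer matches what was recorded.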

So I've mentioned that things are fast. Yes?

AUDIENCE: It sounds like you are designing for local disks,

not an NFS home directory.

BRYAN O'SULLIVAN: Things are not as fast over

NFS, this is true.

So an obvious place that you can make things fast over NFS

is by using an implementation technique like Perforce, where

you have to tell a server everything that you're doing.

That's a potentially less friendly thing to do.

I was able to live with p4 edit back when I had to use

Perforce, but having to not do it, and being able to do

things on a local disk happens to suit me pretty well.

So speed is an end in itself.

All other things being equal, it's nice to have something

fast instead of something slow.

But what speed also lets you do is it lets you do things

that are not straightforward using other tools.

So Mercurial was originally developed by kernel hackers,

and the kernel is a reasonably large tree.

It's not super large, it's like 20,000 files, or so.

But one of the things that kernel developers tend to have

to do a lot is deal with patches, because there are

certain gatekeepers for pieces of subsystems, and you may be

developing something that isn't ready to go off to

somebody else yet, so you maintain a pile of patches

that your software has to work with.

And working with patches is traditionally a kind of a

painful thing, because revision control tools don't

normally have a concept of patches.

But what if you did have a concept of patches?

Well, Mercurial has an extension

called Mercurial Queues.

If you've been working with open source tools that you

have to patch in order to get working, you may have come

across a tool called Quilt.

And Quilt basically lets you maintain a stack of patches on

top of a source tree.

It doesn't care what the source tree is, it has no

notion of there being an underlying revision control

system of any kind.

So Quilt has the nice property that it will

sit on top of anything.

It doesn't matter whether it's Perforce, CVS,

or an exploded tarball.

And what Quilt lets you do is it lets you push a patch onto

your stack, edit some files, refresh the patch, push

another patch on top of your stack, refresh the files,

et cetera, et cetera, et cetera.

You can pop and push, and work on the top of stack

arbitrarily.
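
The push, refresh, and pop cycle can be modeled with a short Python sketch. The `PatchStack` class and its representation of a patch as a mapping from path to (old contents, new contents) are assumptions made for illustration; neither Quilt nor Mercurial Queues is implemented this way, and real patches are diffs rather than whole-file replacements.

```python
class PatchStack:
    """A toy Quilt-style stack over an in-memory tree (filename -> contents)."""

    def __init__(self, tree, series):
        self.tree = dict(tree)
        self.series = list(series)  # ordered (name, {path: (old, new)}) pairs
        self.applied = 0            # how many patches of the series are applied

    def push(self):
        """Apply the next unapplied patch; refuse if it does not fit the tree."""
        if self.applied == len(self.series):
            raise IndexError("no more patches to push")
        name, changes = self.series[self.applied]
        for path, (old, _new) in changes.items():
            if self.tree.get(path) != old:
                raise ValueError(f"{name} does not apply to {path}")
        for path, (_old, new) in changes.items():
            self._set(path, new)
        self.applied += 1
        return name

    def pop(self):
        """Undo the topmost applied patch."""
        if self.applied == 0:
            raise IndexError("no patches applied")
        name, changes = self.series[self.applied - 1]
        for path, (old, _new) in changes.items():
            self._set(path, old)
        self.applied -= 1
        return name

    def refresh(self, path, new_contents):
        """Fold an edit to one file into the topmost applied patch."""
        if self.applied == 0:
            raise IndexError("no patches applied")
        _name, changes = self.series[self.applied - 1]
        old = changes[path][0] if path in changes else self.tree.get(path)
        changes[path] = (old, new_contents)
        self._set(path, new_contents)

    def _set(self, path, contents):
        # None stands for "file does not exist".
        if contents is None:
            self.tree.pop(path, None)
        else:
            self.tree[path] = contents
```

Popping a patch restores the tree to what it was before that patch, and pushing it again reapplies whatever the last refresh recorded.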

Mercurial Queues works the same way, but it's integrated

into Mercurial.

So what this means is after you've pushed a pile of stuff,

patches, onto your stack, you can now use the regular log

commands, or the annotate command, to find out which

revision that has turned into a change set in the tree made

a particular change.

When you pop your patches, the change sets go away, and once

you're done, you can push the change sets off to somebody

else as regular Mercurial revisions, and they see them

as regular Mercurial revisions, and then they start

distributing your changes, and goodness occurs.

But the nice thing about this is the

integrated nature of it.

So for example, if you're doing a bug search, one of the

things that people have started using in the past year

or two with tools like Git and Mercurial is dichotomic search

of your revision graph.

Now what that means is you're essentially doing a

bisectional search.

I know I have a revision that was bad.

I know I have a revision that was good.

And I want to narrow my way down to the revision that was

the thing that caused the bit to flip, as

quickly as I can.

So if you can do this using your revision control tools

rather than having to worry about the join, the boundary,

between where my patches start and where my revisions end, if

everything is just seamless and you don't have to worry,

that's kind of a benefit.
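
The search itself is an ordinary bisection. The Python sketch below assumes a linear list of revisions and a caller-supplied `is_bad` test, both of which are simplifications: real history is a graph, and the function name `first_bad` is invented for this example.

```python
def first_bad(revisions, is_bad):
    """Find the first bad revision, given revisions[0] is good, revisions[-1] is bad.

    Assumes that once a revision is bad, every later one is bad too.
    """
    good, bad = 0, len(revisions) - 1  # indexes of a known-good and known-bad rev
    while bad - good > 1:
        middle = (good + bad) // 2
        if is_bad(revisions[middle]):
            bad = middle
        else:
            good = middle
    return revisions[bad]
```

Each test halves the remaining range, so about 1,500 revisions need on the order of eleven checks rather than 1,500.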

As an example of how well Mercurial Queues works, Andrew

Morton maintains a pile of patches against the Linux

kernel that is, I think it's about 1,500 patches at the

moment that are just a single Quilt patch series.

I can apply those on top of a Linux kernel repository in, I

think, something on the order of three minutes.

So I can create seven change sets per second for three

minutes to get all of these patches into my tree.

And then, as far as Mercurial is concerned, it's just a

regular old tree that I'm working in that has 1,500 new

revisions in there that I can deal with, that I can serve

up, that I can communicate with other people, but that I

can also modify after the fact by popping,

editing, and repushing.

And then I can continue to share those changes with a

Quilt user, because Mercurial Queues is Quilt compatible.

This is a fairly tremendous win, not just for dealing with

upstream software projects, but also for prototyping and

developing your own code.

Quite often, if I'm working on a feature, what I'll do is I

won't actually start committing changes, because

I'm kind of an idiot.

I don't tend to know where I'm going a lot of the time.

I'll start off, and I'll go down a blind alley, and I'll

go down another two blind alleys along the way, and

after a couple of days I'll have converged on something

that looks like a solution.

But along the way, what I'll have done is I'll have thought in

terms of here is the underpinnings of

my work, one patch.

Here is another thing that I need to have in

place, another patch.

And at each point when I'm refactoring my code, or I'm

moving something between one layer to another layer, I just

move hunks from one patch into another patch, and

I push and pop my context at various different times to be

working on a different patch at each point.

So this is not just for working with other people.

It's for collaborating with yourself as your clue

evolves over time.

A very good property to have.

So finally, there are a couple of things that are very

interesting to me about distributed revision control

that are not specific to Mercurial, but that might be

of interest to people who do things with free software and

open source tools in general.

I have this idea that choosing a particular revision control

tool is actually making a statement about how you want

your project to evolve, right?

So if you work in a large company, and everybody has

essentially a level playing field, everybody is more or

less likely to have commit access to much the same stuff.

And everybody can pull the same changes, everybody can

push the same changes, or integrate changes, or whatever

the particular tool's language lets you do.

But out in the open source world, that's really not the

case, right?

If you're using a tool like CVS, or you're using a tool

like Subversion, there's a world of haves, and there's a

world of have-nots.

There are the people who have commit access to the one

central repository that everybody has to use, and then

there are the people who can maybe read from that

repository, they can check out a working copy,

but they can't commit.

And they may not be able to earn the right to commit until

they've proven themselves, over the course of a number of

patches that they've submitted and had to maintain.

Now if you've been in the position of having to maintain

a patch against an upstream source tree, it's kind of a

painful thing to have to do.

Tools like Quilt will make it more straightforward, but

really, what you would like to be able to do is work with

other people, speaking the same language,

using the same tools.

With a distributed tool, you can do that.

With a centralized tool, unless they're willing to

create a little sandbox that people can go wild in, which

wikis' history, and the history of other collaborative

commons, have proven is not necessarily a

very scalable thing, they can't do that.

So you think of it as being analogous to the ascent of

man, right?

In the days of RCS and SCCS, everybody crawled on all fours

and they had to be on the same machine in order to

get any work done.

And then, suddenly, everybody's tail fell off and

they started going around with hunched backs, and they were

able to talk to the central repository over the network.

But they had to be on the network in order to

get any work done.

Right?

If I unplug your workstation from the Perforce server,

there's nothing you can do.

If I unplug your workstation from a Subversion server, you

can run diff, and nothing else.

With the distributed tool, that's no longer the case.

I can work on the train, I can work on a mountaintop.

So long as I have history, and I can access my

hard disk, I'm set.

So I can work anywhere, I can contribute to any

project, I can work with anybody with

a distributed tool.

The tools don't make the boundary, it's the social

norms of your project that you explicitly choose that make

the boundary.

You choose the model yourself with a distributed tool.

You don't have it imposed on you by the technology.

So I would encourage you to give this stuff a try.

Mercurial, as I said, it lets you work as a centralized

system, if that's what you prefer.

If you want to work in a distributed fashion, you can.

And one of the reasons that people have cited to me a few

times for using centralized tools is that

they're afraid of forks.

So Guido works here, for example, and he does not like

Stackless Python at all.

Stackless Python is this project that started about

eight years ago, that went off in a direction that was

fundamentally different to the way he wanted to

bring Python itself.

And for example, the GCC people have had the same

problem, right?

EGCS forked off from GCC many years ago, and they

eventually managed to reconcile their differences

and start working together.

But what's interesting about using a tool that has good

support for merging and good support for branching is that,

everybody forks all the time, right?

Forking is just what you do.

And merging is what you do when you're done forking.

So once somebody has decided that they want to play nice

again, and they want to cooperate with

you, you just merge.

Whereas in a central tool, what somebody's got to do is

they've had to suck all the history out of your central

repository.

They've had to shove it all back into another central

repository.

If you want to reconcile your differences, you've got a

serious problem on your hands all of a sudden, because

there's no way to make the two communicate.

That's just not the case with a distributed tool.

So what would you say, it makes it easier to reconcile

your differences.

A final couple of comments that I have

before we finish off.

There's a number of people who work on Subversion here at

Google, and I've been very conscious along the way, as

I've been trying to shepherd people into sending patches

into Mercurial, and so on, of the great job the Subversion

people have done in terms of building a good community

around their tool.

If you're working on open source software, the only

thing you have going for you is A, technical merit, and B,

credibility.

And Karl Fogel and Jim Blandy, and Brian Fitzpatrick, and Ben

Collins-Sussman, and Garrett Rooney, and all those other

people who've worked on Subversion over the years,

they've done a very good job of making themselves

accessible, and making the Subversion community a

place that is a good place to contribute to.

You send a patch in, somebody's going to review it

and say, yeah, could you tweak this, yeah,

could you tweak that.

It's a nice property to have.

And I've been very explicit in trying to emulate their

example as we've been building Mercurial, because it's always

nice to try and learn from somebody else's good examples,

rather than to try and blaze a trail of your own.

And that's, I think, stood us in good stead.

I actually ran a survey of our users a couple of months ago,

just to get a sense of where people thought we were.

It's very easy when you're a developer to stay down in the

trenches and look at the next line of code you need to

write, or look at the next patch you need to issue, or

the next bug that you need to fix.

But it's nice to get a sense of what your users think about

you, and people have been pretty complimentary about us.

People are also quite happy with the software, in terms of

the fact that it's easy to install, easy to use, not too

different from tools like CVS and Subversion.

And it's been very rewarding to actually be able to talk to

people and say, look, here's this nice shiny toy that we

have, that you can use for something, be it

small or be it large.

It'll scale, and it'll work for you across all of these

different sizes of thing.

That comes to the end of my prepared comments.

Thanks very much for bearing with me as I've had to

essentially hand wave my way through.

And I'd be happy to take any questions people have. Yeah?

AUDIENCE: So [INAUDIBLE PHRASE]

BRYAN O'SULLIVAN: So the question is, what's the

computational cost of a merge when one person does a large

amount of work, and another person does a

small amount of work?

And the answer is that there's not very much cost to it.

So bringing in changes from the outside is essentially a

linear operation in the number of changes that you've made.

So cloning a repository, all 20,000 revisions or just

pulling two changes, they have approximately linear costs.

So one costs about 10,000 times as much as the other.

So if you've done 10,000 things and I've done two, and

I pull in your 10,000 things, it takes me about 10,000 units

of time to process those changes.

The actual merge afterwards is primarily a matter of updating

the working directory.

Most of your changes are not going to conflict with most of

my changes, and then what happens with the few conflicts

that there are is really, it's almost not a matter for

Mercurial itself.

We have a couple of different merge strategies that people

can use when there are conflicts.

AUDIENCE: But can you tell the common

ancestor between the branches?

BRYAN O'SULLIVAN: Yes.

Yes, can you tell the original common ancestor?

Yes you can.
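An illustration, not from the talk: one way to find the common ancestor of two branch heads is to walk the parent links from each head and intersect the results. The sketch below assumes history is a plain dict mapping each revision name to a list of its parents. The function names and revision labels are hypothetical, and this is not Mercurial's actual implementation.

```python
def ancestors(parents, rev):
    """Return rev plus every revision reachable from it through parent links."""
    seen = set()
    stack = [rev]
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(parents.get(node, ()))
    return seen


def common_ancestor_heads(parents, a, b):
    """Return the most recent revisions that both a and b descend from."""
    common = ancestors(parents, a) & ancestors(parents, b)
    heads = set(common)
    for node in common:
        # Drop everything that is a proper ancestor of another common ancestor.
        heads -= ancestors(parents, node) - {node}
    return heads


# Hypothetical history: two branches that both start from "fork_point".
history = {
    "root": [],
    "fork_point": ["root"],
    "mine_1": ["fork_point"],
    "mine_2": ["mine_1"],
    "yours_1": ["fork_point"],
}

base = common_ancestor_heads(history, "mine_2", "yours_1")
```

Here `base` contains only `fork_point`: `root` is also shared, but it is filtered out because it is older than `fork_point`.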

I'm sorry if I'm repeating your questions, but I'm just

trying to make sure that the folks back home can

hear what you say.

Yes?

AUDIENCE: You talked about using patch sets as sort of a

local mechanism for yourself.

Why wouldn't you just clone the repository, and actually

use the real Mercurial revisions to commit, and then

remerge them [UNINTELLIGIBLE]?

BRYAN O'SULLIVAN: So the question was about using a set

of patches for doing development of your own stuff

rather than using regular Mercurial tools in order to

capture all of the history.

And what would you say, that's partly a style thing, and it's

partly a wanting to not clutter the history thing.

So you may have heard me allude to the fact that I go

down blind alleys?

Well, I don't necessarily want my idiocy caught on the

permanent record if I can necessarily avoid it, right?

So it's nice, for example, particularly if you're dealing

with an environment like the Linux kernel, where there is

quite a high standard for your changes to meet in order for

them to get in.

You really want things to be packaged cleanly.

That's one environment where it makes sense to think in

terms of patches, because you want to submit the pristine

final thing, not the 45 different idiot things that

you did along the way.

For my own purposes as well, it helps me to think in terms

of patches because then, I'm both putting

layers on my thought.

I'm linearizing my thought in space and in time, right?

So each patch captures a layer that I'm worrying about.

I can actually revision control the patches

themselves, so I do capture the history of the changes,

but just not the actual repository that they're

eventually going to end up in.

But I also have the ability to go back and erase history and

make myself look better, which is very nice.

Yeah?

AUDIENCE: How well does it handle refactoring

[UNINTELLIGIBLE] and file renames and moving

code, lots of code?

BRYAN O'SULLIVAN: The question is how well Mercurial handles

refactoring, renames,

and moving code around.

So there are three answers to that.

One is that it doesn't, the second is that it does it

really well, and the third is that it'll be

really all there soon.

And that these are all true at the same time.

So right now, Mercurial has a shell script

that handles merging.

So it knows how to figure out the basis for doing a

three-way merge, for example.

And it will hand those off to the shell script.

Now, we have a couple of different

shell scripts in place.

One is a shell script that will run a

three-way merge tool.
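An illustration, not from the talk: the decision rule behind a three-way merge can be sketched at whole-file granularity. The code assumes each side is a dict mapping a file path to its content, with a missing key meaning the file does not exist on that side. The name `three_way_merge` and the sample data are hypothetical; a real merge tool works on regions within a file rather than on whole files.

```python
def three_way_merge(base, local, other):
    """Merge two sets of files against their common base.

    Returns (merged, conflicts): merged maps path -> content, and
    conflicts lists the paths that both sides changed differently.
    """
    merged = {}
    conflicts = []
    for path in sorted(set(base) | set(local) | set(other)):
        b, l, o = base.get(path), local.get(path), other.get(path)
        if l == o:
            result = l          # both sides agree (or neither changed it)
        elif l == b:
            result = o          # only the other side changed it
        elif o == b:
            result = l          # only the local side changed it
        else:
            conflicts.append(path)  # both changed it, in different ways
            continue
        if result is not None:  # None means the file was deleted
            merged[path] = result
    return merged, conflicts


base = {"a.txt": "1", "b.txt": "1", "c.txt": "1"}
local = {"a.txt": "2", "b.txt": "1", "c.txt": "2"}
other = {"a.txt": "1", "b.txt": "3", "c.txt": "4"}

merged, conflicts = three_way_merge(base, local, other)
```

Only `c.txt` ends up in `conflicts`, because it is the only file that both sides changed, and they changed it differently.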

But three-way merge is not very satisfying when you're

doing distributed development.

Because you frequently have cases where there are

crisscrossed merges, right?

I pull your changes at the same time that you've pulled

my changes, and we both commit the results of our merges, we

end up with this thing where we have to merge again.

And that can iterate a few times.

In those kinds of cases, you would really like to have some

sort of more history sensitive merge that will cause us to

converge more quickly.
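An illustration, not from the talk: a criss-cross is awkward for a plain three-way merge because the two merge results no longer have a single best common ancestor. The sketch below assumes history is a dict from revision name to parent list. The helper names and revision labels are hypothetical.

```python
def ancestors(parents, rev):
    """Return rev together with all of its ancestors."""
    result = {rev}
    for parent in parents.get(rev, ()):
        result |= ancestors(parents, parent)
    return result


def merge_bases(parents, a, b):
    """Return the common ancestors of a and b that no other common ancestor descends from."""
    common = ancestors(parents, a) & ancestors(parents, b)
    return {
        c for c in common
        if not any(c != d and c in ancestors(parents, d) for d in common)
    }


# Hypothetical criss-cross: each of us merges the other's change.
history = {
    "base": [],
    "mine": ["base"],
    "yours": ["base"],
    "my_merge": ["mine", "yours"],
    "your_merge": ["yours", "mine"],
}

bases = merge_bases(history, "my_merge", "your_merge")
```

With this history, `bases` holds both `mine` and `yours`, so a three-way merge has to pick one of two equally good bases.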

There is a branch of Mercurial that has

that facility available.

In terms of handling renaming, though, right now we track

rename information, and Matt Mackall, who is the guy who

wrote much of the Mercurial code, who originally started

the project, is working on actually having your changes

follow across renames.

And by the way, the question that is sort of implicit there

is if I make an edit to a file, and you've renamed the

file to a different name, you really want the changes to

show up under the different name, after we've resolved our

differences.
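An illustration, not from the talk: the desired outcome can be sketched by carrying one side's edits across the other side's renames. The code assumes edits are a dict from path to new content, and renames are a dict from old path to new path. The function name and the file names are hypothetical, and this only shows the intended result, not how Mercurial would implement it.

```python
def apply_edits_across_renames(edits, renames):
    """Re-key each edit so it lands on the file's new name, if it was renamed."""
    return {renames.get(path, path): content for path, content in edits.items()}


# I edited util.py and main.py; you renamed util.py to helpers.py.
my_edits = {"util.py": "edited contents", "main.py": "other contents"}
your_renames = {"util.py": "helpers.py"}

resolved = apply_edits_across_renames(my_edits, your_renames)
```

After resolution, the edit to `util.py` shows up under `helpers.py`, and the edit to `main.py` stays where it was.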

So that's not there yet, but the actual machinery

underlying it that's necessary is there, and the feature will

be present soon.

Sorry, I'll give somebody else a chance, first. Yeah?

AUDIENCE: What platforms are supported?

BRYAN O'SULLIVAN: What platforms are supported?

Pretty much anything that Python runs on, that has a

file system behind it that looks vaguely Unix-y.

So Windows, you name the brand of Unix, and Macs, and so on.

Yeah?

AUDIENCE: How do your shell scripts work on Windows?

BRYAN O'SULLIVAN: The shell scripts work on Windows like

they run as .bat files.

Of course, they're not actual shell scripts there, they're

.bat files.

But nevertheless, they work.

Yes.

Yes?

AUDIENCE: Do you support the partial [UNINTELLIGIBLE]

bring over, whatever?

Partially bringing over?

BRYAN O'SULLIVAN: Do we support partially bringing

stuff over?

Somebody is working on that at the moment.

And there are two different kinds of partially bringing

stuff over, right?

One is, you want maybe only the last 2,000 revisions of

your project history, because you don't want to be carrying

around the [UNINTELLIGIBLE]

gigabytes of earlier stuff that was done ten years ago.

And the other thing is that perhaps you're only interested

in working in a certain portion of the tree, so you

don't have to check out the other ten gigabytes of stuff

that you don't care about.

And by the way, when I say gigabytes, some people are

actually using Mercurial to work on

multi-gigabyte source trees.

For example, the FreeBSD ports tree has a port that sits in

Mercurial instead of in Perforce.

And it contains, I think, something on the order of

150,000 files and 150,000 change sets.

So that's a reasonably large amount of stuff, and you

really want to be able to focus on a certain aspect of

it rather than do all of it.

It's in progress.

Yes?

AUDIENCE: A generalization of the rename scenario is if you

have a block file, let's say you start splitting it up into

[UNINTELLIGIBLE] files.

Will those scripts handle that as well?

BRYAN O'SULLIVAN: Yes.

So the question is, if you split a file up into multiple

files, will Mercurial handle that?

And I can't speak for Matt, because it turns out that I'm

not actually him.

But I believe that his plan of record is to, when a rename or

copy has been detected, do a merge into each of the

children, the descendants of the original file.

So if you copy one file into three different files, and

then you whack off the first third, middle third, and the

final third in those three different files, somebody else

makes an edit in the original file, the logically

appropriate thing ought to happen.

Now, I'm not actually doing the implementation.

So I can say yes, it will be a better world, and everybody

will be happy.

I don't know how hard it's going to be.

Yes.

AUDIENCE: On a smaller scale, can you customize Mercurial so

that a small number of [INAUDIBLE PHRASE]

so that they can, when they do updates,

[UNINTELLIGIBLE PHRASE]

if people are writing to it, and people are reading what

other people have written, but they don't need to worry about

merge, or anything like that?

BRYAN O'SULLIVAN: Sure.

So the question is, can you extend Mercurial so that it

essentially behaves like CVS, so that when you do an update,

it does the logical equivalent of a pull, so that you don't

have to worry about merging so that you have a

straightforward, simple way for people to

get their feet wet.

And the answer is, yes, Mercurial is extensible in

those terms. No, nobody has explicitly done

the work to do that.

I do know that there's at least one other distributed

revision control system that does exactly what you

described, because they want to give people who have used

CVS essentially training wheels.

So it is possible to do that, and would be quite

straightforward.

Contributions of code to do things like

that are always welcome.

Yes?

AUDIENCE: Can [UNINTELLIGIBLE] push a

patch back into history?

BRYAN O'SULLIVAN: Can I push a patch back into history?

AUDIENCE: [UNINTELLIGIBLE]

I released version 2.1, and now I'm

working on version 2.10.

And then I found a bug that's already existed in 2.1.

I pushed [UNINTELLIGIBLE PHRASE]

BRYAN O'SULLIVAN: OK.

So the question here is, let's say I'm working on a revision

2, and a revision 2.1.

And revision 2 I've frozen, because I've released it, and

there are CD-ROMs out in the wild, or tarballs, or whatever

the bits the kids use these days are.

And I found a bug in revision 2, I've fixed it in my 2.1

branch, and I want to backport that fix.

This is something that revision control weenies tend

to call cherry-picking.

I'm a self-labelled weenie, by the way.

This is not a pejorative term.

The answer is you can do it using a patch.

As something that you would support as a first-class

operation, cherry-picking is a very difficult thing to do.

There are maybe two or three revision control systems that

handle it relatively well.

Perforce is one.

Another would be, actually, Subversion almost handles it

well, because Subversion has no notion of

merging at all, right?

Another one would be arch, which is explicitly built in

those terms, but I wouldn't recommend

that anybody use arch.

A final one would be Darcs, which is one of the

theoretically interesting but not practical ones, that is

written in Haskell, of all languages.

And Darcs has this wonderful quantum mechanical, I kid you

not, theory of patches that it is built up on, so that you

can talk in terms of patches commuting with each other, and

boundaries beyond which they cannot go, and

so on and so forth.

And it tends to go exponential in space and time quite

frequently.

So it's got some fundamental theoretical problems that are

not addressable.

Yes?

AUDIENCE: So now it seems with Mercurial that first, an

engineer does some work in his local area, passes it out, and

in a corporate setting, or even in a project setting, you

have to then publish your changes out to the world.

But since it's now, it feels like it's a two-stage commit,

you have to make your changes, and then you have to actually

publish them, it seems like it's easier to accidentally

forget to publish them.

Is there any way to make that less painful?

BRYAN O'SULLIVAN: The question is, is there an easy way to

publish your stuff with Mercurial, or presumably, by

extension, with other distributed tools so that

other people can find them?

And the answer is that right now, you have to do stuff by

hand, because we've been focusing on the core of the

software rather than on these larger usability questions.

That sort of thing, where you want to be able to see, oh, I

haven't actually published this, even though I wanted to,

or oh, this repository that I've made changes based on

has actually diverged from me by this much.

You want to be able to tell those kinds of things without

having to explicitly do it by hand all the time.

Those are things I would really love to see, but

they're not quite there yet because we've been preoccupied

with just getting the core functionality into

1.0 form so far.

Do remember that we've only been around for about 14

months, and that the set of core

developers is quite small.

Yes, more questions?

Yeah?

AUDIENCE: Do you have any ideas on how you would use

Mercurial to supplement other sorts of control systems?

BRYAN O'SULLIVAN: The question is, would it be possible to

supplement an existing revision control

system using Mercurial?

And the answer is there are various different ways that

you can do that.

So somebody has, for example, written an incremental

Perforce importer for Mercurial.

So there exists a proof that it is possible to do what you

want today.

I don't know how well it works.

AUDIENCE: One directional?

BRYAN O'SULLIVAN: I imagine that it is one

directional, yes.

There is also a tool called Tailor, written by a guy in

Italy whose name is [? Emmanuel Guyfax ?].

And Tailor is sort of the Rosetta Stone of revision

control tools.

It will convert between arbitrary revision control

tools up to a point.

It doesn't have a very good notion of branching or

merging, so it tends to lose information when talking

between distributed revision control tools.

But if what you're looking to do is an incremental

conversion, and then stuff some things back into a host

revision control tool, it is actually pretty nice, and it's

relatively straightforward to use.

AUDIENCE: [INAUDIBLE PHRASE]

BRYAN O'SULLIVAN: The question is about maintaining patches and

submitting them, whether it would be suitable for that.

In that kind of a case, probably the easiest thing to

do would be to use a Perforce importer to pull your stuff

into Mercurial, maintain things as patches, then commit

them back to the native revision control tool, perhaps

by hand or perhaps by automating it.

I don't speak Perforce very much anymore, so I can't say

whether it would be completely trivial or not.

My imagination tells me that would be a relatively small

amount of scripting to do.

More questions?

MALE SPEAKER: For those of you who have 12:00 meetings, we're

running over our time right now, which is not a problem,

or currently not a problem in this room, but you may have

your own priorities.

BRYAN O'SULLIVAN: Yes?

AUDIENCE: So the format of Mercurial, is it Python objects

[UNINTELLIGIBLE]?

BRYAN O'SULLIVAN: The question is, is the Mercurial data

stream Python objects?

The answer is no.

And the reason that it's not is that Python has been

somewhat willing to change the data stream format, and that's

not a terribly good thing.

Also, it's not a very efficient storage mechanism.

Instead, what we do is we explicitly lay out the bytes

ourselves, using tools like struct.pack, and using string

operations, and just plain old write.

So we know exactly what the bytes are supposed to be.
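An illustration, not from the talk: laying out bytes explicitly with Python's `struct` module looks roughly like this. The record layout (a 4-byte offset, a 4-byte length, and a 20-byte hash, all big-endian) and the names `RECORD`, `encode_record`, and `decode_record` are assumptions made for the example, not Mercurial's actual on-disk format.

```python
import struct

# Hypothetical fixed-size record: big-endian, no padding.
#   I   -> 4-byte unsigned offset
#   I   -> 4-byte unsigned length
#   20s -> 20-byte hash
RECORD = struct.Struct(">II20s")


def encode_record(offset, length, node):
    """Pack one record into exactly RECORD.size bytes."""
    return RECORD.pack(offset, length, node)


def decode_record(data):
    """Unpack the bytes back into (offset, length, node)."""
    return RECORD.unpack(data)


node = bytes(range(20))
raw = encode_record(1, 2, node)
```

Because the format string fixes the byte order and field widths, `raw` is always 28 bytes with the same layout on every platform, which is the property that serialized language objects do not guarantee.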

Yes?

AUDIENCE: How does it deal with authorization?

BRYAN O'SULLIVAN: How do we deal with

authorization is the question.

And there are two or three answers to that, depending on

how you want to look at it.

The first is that we have no notion of authorization at

all, because Mercurial doesn't care.

The second, which is a more satisfactory answer, is that

if you want to be able to share changes with other

people, you can push to a shared repository, which you

can secure using, for example, if you're all on the same file

system, Unix groups, or Windows permissions.

You can also tunnel over SSH, so that you can do that over

the insecure internet.

Somebody is in the process of adding support for pushing

changes over HTTP, which will use, I presume, some form of

user authentication, whatever patch he happens to provide,

and will be secured over SSL.

So Mercurial itself doesn't have to care, but it has

various different transports that do allow you to specify

things in different ways.

And for example, there is an extension to Mercurial

available that will let you lock down individual user

accounts, and put ACLs

on the subtrees that people are allowed to push to.

So if you have changes that push stuff into a tree that

you're not allowed to push to, you will be forbidden from

doing that, and other people won't be able to pull those

changes, because they won't get in in the first place.

So there are various different ways that you

can lock things down.
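An illustration, not from the talk: a per-user check on subtrees can be sketched as a prefix match on the paths that an incoming push touches. The `acl` mapping, the user names, and the function names are hypothetical, and the real extension's configuration and behavior may differ.

```python
def allowed(user, path, acl):
    """Return True if path is inside one of the subtrees the user may push to."""
    prefixes = acl.get(user, ())
    return any(
        path == prefix or path.startswith(prefix.rstrip("/") + "/")
        for prefix in prefixes
    )


def check_push(user, changed_paths, acl):
    """Reject the whole push if any touched path is outside the user's subtrees."""
    denied = [p for p in changed_paths if not allowed(user, p, acl)]
    if denied:
        raise PermissionError(f"{user} may not push to: {', '.join(denied)}")


# Hypothetical policy: alice owns docs/, bob owns src/ and tests/.
acl = {"alice": ["docs"], "bob": ["src", "tests"]}

check_push("alice", ["docs/intro.txt"], acl)  # accepted, returns None

try:
    check_push("alice", ["docs/intro.txt", "src/main.py"], acl)
    rejected = False
except PermissionError:
    rejected = True
```

The second push is rejected as a whole, so the forbidden change never enters the shared repository and nobody else can pull it.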

Yes?

AUDIENCE: Is SSH tunneling built-in in Mercurial like it

is in Subversion, or is it manual to open up SSH?

BRYAN O'SULLIVAN: The question is, is SSH tunneling built-in,

and the answer is yes.

We use ssh:// blah-de-blah-de-blah URLs.

Any more questions?

OK, thank you all very much for listening.
