LESLIE: The Mercurial revision control system, sans slides
for the moment, but soon with slides.
By Bryan's own admission, he builds
large distributed stuff.
BRYAN O'SULLIVAN: Not large by your standards.
LESLIE: Not large by our standards, though.
And I just wanted to let you know this talk is going to go
up on Google Video, so if you have any questions that you
think might contain information that's
particularly Google-y, let's hold those until the end after
the camera is off.
BRYAN O'SULLIVAN: OK, thank you, Leslie.
Let me see what a good distance is here.
So about 12 months ago, I was casting about for a revision
control system that I could use to write the next late,
great desktop email system that nobody was going to use.
And unfortunately, during my poking around, I found that
there was nothing entirely suitable to my needs.
So I found a piece of software that was almost suitable,
which had just been started by a guy that I happen to
know, and this was a tool called Mercurial.
And the origins of Mercurial are steeped in the great
BitKeeper debacle of 2005, when Larry McVoy took his
marbles and left the playground.
So at that time, Linus Torvalds was left without a
revision control system, and started writing his own.
At the same time, Matt Mackall started working on a revision
control system of his own.
And they converged on fairly similar designs, although in
substantially different ways.
And so now the world has two 14-month-old revision control
systems, instead of one, to join an already huge field of
free revision control systems.
So why am I actually interested in
this particular one?
Why am I here to talk to you about this?
Well, there's a couple of different reasons that it's
interesting to me, here.
And one is that Mercurial kind of matches a few aspects of
the Google state religion, as I
understand it from the outside.
And those are that it's written in Python, it is
distributed, and it does things very fast. So these are
kind of nice properties to have.
Now, it's not completely Python.
It's only 95% Python.
There's a couple of core routines that
are written in C.
But what we've done in the 12 months or so since people have
started actually using this stuff is a fair number of
third party people, both open source and commercial
projects, have actually started using it.
Which is really kind of an interesting vote of confidence.
Normally, revision control holds the crown jewels of
whatever it is that you're doing, so it had, by God,
better work properly, or else you're going to have serious
problems.
But in that time we've had a couple of interesting people
start to use Mercurial.
We have the XenSource people, who are doing the open
source Linux hypervisor.
We have the One Laptop per Child project, the OpenSolaris
project, the MoinMoin wiki.
A pile of other people who are playing around to various
different extents, some of them large,
some of them small.
But it's pretty fun to be working on something that's
young, and yet, has interesting and
active users already.
So from a developer's point of view, you're sitting down in
front of your computer, and you're about to start your
project.
You're going to work on the next great novel, or the next
great huge source base, or whatever it's going to be.
You're looking for criteria to choose one of these revision
control systems. There's got to be at least 50 of them
extant at the moment.
Ah, thanks, Chuck.
AUDIENCE: We'll see if this actually does the job.
BRYAN O'SULLIVAN: So what are some criteria?
Well, I can tell you what I used, at least, for my own
purposes.
I wanted something that was straightforward to understand,
so I was going to be able to spend my time thinking about
my problem instead of my revision control problem.
I wanted something that was pretty quick, so that my tea
would still be warm when I was finished with a particular
revision control operation.
And I wanted something that would help me to work
efficiently with other people, that being kind of a
priority.
Now, do you want to try to get this onto a Windows box?
Otherwise, I think we're going to be slide-less.
MALE SPEAKER: It'll work.
BRYAN O'SULLIVAN: It'll work?
MALE SPEAKER: Yeah, Open Office, you say?
BRYAN O'SULLIVAN: Yup.
MALE SPEAKER: It'll work.
BRYAN O'SULLIVAN: OK.
So the Mercurial revision control system has a
conceptual model that's very simple.
I've had it described to me as, you can carry it around in
your head, and I like to think of it that way myself.
There are basically three different things that you need
to pay attention to if you're thinking about revision
control, in Mercurial and in many other distributed
revision control systems. The first is that you have a
repository, which is where your stuff lives.
The second is you have a working directory, which is
where the stuff you're working on lives, and the last one is
a change set, which is a snapshot of what's going on.
So in Mercurial terms, a repository is not a heavyweight
thing.
It does not have a database behind it; it's
just a pile of files.
So these are things that are easy to create, they're easy
to administer, you could put one together and blow one away
in a matter of moments.
And that's how people prefer to work.
They are, as I mentioned lightweight, and they are
pretty much everywhere.
Everywhere that you happen to be doing some work in a
working directory, there is a repository that happens to be
wedded to it.
So inside a repository, there are really only three
things.
There's a change log, which says, I've done this work.
There is a manifest, which says, I did the work on these
versions of these files, and then
there's the file metadata.
That's all you need to know to understand the underpinnings
of the revision control software.
It's very, very straightforward.
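To make that three-part model concrete, here is a toy sketch in Python. This is illustrative only, not Mercurial's actual data structures: a changelog of change sets, manifests mapping file names to file revisions, and per-file history.

```python
# A toy model of the three structures inside a Mercurial repository.
# Illustration of the concepts only, not Mercurial's real code.

# The changelog records "I've done this work": one entry per change set.
changelog = [
    {"user": "bryan", "desc": "initial import", "manifest": 0},
]

# Each manifest records which versions of which files a change set saw.
manifests = [
    {"README": 0, "foo.c": 0},   # manifest 0: file name -> file revision
]

# The per-file metadata: each file's revision history.
filelogs = {
    "README": ["first draft"],
    "foo.c":  ["int main(void) { return 0; }"],
}

def snapshot(changeset_index):
    """Reconstruct the whole project as of a given change set."""
    manifest = manifests[changelog[changeset_index]["manifest"]]
    return {name: filelogs[name][rev] for name, rev in manifest.items()}

assert snapshot(0)["README"] == "first draft"
```

Everything a revision control operation needs to answer, it answers by walking from the changelog, through a manifest, down to a file revision.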
Now you contrast that model with the internals of a more
traditional large scale revision control system, like
ClearCase, or Perforce, or something like that.
And if the internals are visible to you at all--
they're typically not-- they're going to be a big pile
of different things.
The Mercurial source code is, I think it might have grown to
almost 12,000 lines.
But it's a fully functional system, and it's
12,000 lines of python.
So you can keep all of it in your head as a developer, you
can keep all that you need to know in your head as a user.
It's got a couple of very good properties in that respect.
Now a change set, as I mentioned, is a snapshot of
your project as it stands at a particular point in time.
It corresponds to a revision in Perforce, or Subversion, or
any of these other tools.
And the terminology that we use for creating a change set
is committing it.
Now unfortunately, I have some pretty graphics here that
you're not able to see at the moment about how you go about
creating these things.
And what starts being interesting then is how you go
about merging and branching with people.
So I'll have to hand wave, since I have no graphics to
show.
When two people work, they will create clones of each
other's repositories.
AUDIENCE: Would a whiteboard help?
BRYAN O'SULLIVAN: A whiteboard would actually help, yeah.
When I start working, I create a repository, and it's just a
little directory called .hg somewhere.
And outside that repository is the actual working directory.
Let's say I'm working on a project called foo.
So I have foo/.hg, that's where all my metadata lives.
I don't need to actually care about
anything that's in there.
The working directory, where I have my files, like COPYING,
and README, and foo.c, lives around this .hg directory, and
that's where I do all my actual work.
So somebody else will make a clone of this, and they'll
have an identical copy of the working directory, and an
identical copy of the .hg directory.
And they start working there, and they
start creating a revision.
And let's say they create revision one.
And I create something, and I call it revision one, too,
because I wouldn't know about their changes, because this is
a distributed system, right?
So we go on, and we create revision two, say, as well.
Now the next thing that we need to do here is we need to
be able to communicate with each other to say, we're
working on the same project, and now we have diverging
views of the world.
How do we cause them to reconverge?
Well, what I do is, I pull my changes from one repository
into the other, and that literally means I just take
the directed acyclic graph that is all of my
revisions, and I plop them in here.
And after I do a pull, I do a merge.
And what the merge does is it just creates a structure in
the working directory such that I have my revision one,
your revision one, my revision two, your revision two.
And then, in the working directory, the working
directory has a notion of there being parents.
So you can think of the working directory as a
change set in the making.
It's the last stuff that I had, the merger of these two
guys, and the stuff that I'm about to commit as a
result of the merge.
So the working directory gets all these things, I do my
commit, and I'm done.
And that's all there is to it.
So a branch in Mercurial is just a revision that has two
children.
Now I haven't actually drawn a branch here.
Look, here's a branch.
This is a revision that has two children, and this, being
a merge, is a revision that has two parents.
That's all a merge is.
So there's no special sauce there.
If you think of a branch in Subversion as being something
very simple, it's just a copy, the analog in Mercurial is
also very simple: it's just two revisions that happen
to have the same parent.
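That model is small enough to sketch directly. Here is a toy revision graph in Python, with hypothetical revision numbers, showing that a branch point is just a revision with two children and a merge is just a revision with two parents:

```python
# A sketch of the revision graph model: each revision records its parents.
# The revision numbers here are invented, purely for illustration.

parents = {
    0: (),        # the root revision
    1: (0,),      # my revision one
    2: (0,),      # your revision one: 1 and 2 share a parent, so 0 is a branch point
    3: (1, 2),    # the merge: one revision with two parents
}

def children(rev):
    """All revisions that list `rev` as a parent."""
    return [r for r, ps in parents.items() if rev in ps]

assert len(children(0)) == 2   # two children: a branch, no special sauce
assert len(parents[3]) == 2    # two parents: a merge, no special sauce
```

The whole branching and merging story is just bookkeeping on this parent relation.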
And of course, you can have an arbitrary number of branches.
We only allow merges one at a time, because nobody's really
found a good way to explain multi-lane merge to people,
and their heads explode.
Heads exploding, not so good.
So when people work in parallel, and they make these
changes, and they commit them, and then they merge
afterwards, that's kind of a nice property to have. Because if
you think back to what you used to have to do, in the
days of CVS, history was linear.
People would do a change, they'd do a
change, they'd do a change.
And if you won that commit race before I managed to make
a commit, I had to merge with your changes before
I could do a commit.
And that meant that there was no permanent
record of my changes.
Now with this model, you don't end up having
that kind of risk.
With CVS, it was very straightforward to shoot
yourself in the foot by screwing up a merge.
It happened all the time.
You'd see changes that got checked in with conflict
markers, badness ensued.
Subversion sort of avoids this, but the default policy
is to work the same way that CVS does.
In Subversion, you have to explicitly choose to work on a
branch.
It's the same thing with Perforce.
The distributed tools, by their nature, don't work that
way, and it's a slightly safer way to do things, because one
default policy is less safe than the other.
In practice, I don't know that it makes a huge difference.
You don't hear people saying that I threw away three weeks
of work in Subversion or in Perforce, or in any of these
other tools very frequently.
So with many of these things what you get is a matter of
degree in terms of loss or gain of functionality.
OK, a couple of things that are interesting to know about
Mercurial for getting going and keeping going quickly.
If you're working in a small, lightweight environment, it's
useful to have, for example, a built-in web server, which we
have. So this is a web server that you can both interact
with as a person, and view revisions in your tree, and
annotate things, and download tarballs.
But it's also what Mercurial uses to fetch data and to,
very soon now, push data over the network as well.
It also works over SSH if you happen to prefer security.
In addition, Mercurial has essentially a single way to
stream bytes onto the disk and to stream them off the disk.
We call that abstraction a bundle.
So you can put bundles onto USB drives, or you can send
them around to the email, and you get a complete set of
change history that you can transfer without having to be
online at the time.
This is kind of a useful property for people doing
offline work.
So sharing is also a, what would you call it, a
symmetric operation.
In other words, when I make a clone of a repository, and I
start making changes, and I push my changes back to the
original--
I don't know where this buzzing is coming from--
I end up with both repositories being identical.
OK, the buzz has stopped.
I can step back and look at my slides again.
MALE SPEAKER: You're probably standing on a wire.
BRYAN O'SULLIVAN: Ah.
So the final thing, and this is sort of the special
sauce in Mercurial, is that Mercurial is, albeit written
in Python, an extremely fast system.
We've benchmarked it under various different scenarios,
and there are only a few other revision control systems that
compare in terms of performance.
Now performance is interesting in its own right, because it
means that you don't get distracted while you're
waiting for the tool to do things.
But it also enables you to do certain kinds of operations
that are not necessarily otherwise possible.
So in order to give a little bit of an apples-to-apples
comparison, the other day I went to the Subversion
self-hosting repository, where they've had Subversion
developed for the past four or five years, and I just did a
Subversion check out of the Subversion source tree.
And the head of the source tree as a working directory is
about 72 megabytes in size.
So out of curiosity, I sucked all of the Subversion history
into Mercurial, and created a local Mercurial repository
that's an identical copy of what's on
the Subversion website.
And it turns out that the Mercurial repository plus
working directory is about the same size as the Subversion
working copy.
So I have 50,000 revisions and a working directory in 76
megabytes, versus just a working copy in 72 megabytes,
which I thought was kind of interesting.
It means that you don't have to pay much in order to get a
complete history of everything onto your machine, where you
are no longer talking to the network.
One of the things that you've probably noticed, if you've
been dealing with Perforce servers here, is that things
take an awfully long time once your servers are busy.
Central servers don't scale terribly well.
We prefer not to talk to them if we don't have to.
In a distributed tool, you tend to not talk to them very
often at all, maybe a couple of times a week.
And so, also out of curiosity, I ran a couple of very
simple performance tests.
Now the Subversion repository is very small.
It's only about 1,200 files, and the actual working copy,
when you ignore all the .svn directories and so on, is
only 25 megabytes in size.
And pound for pound, Mercurial and Subversion worked out
about the same.
There were a few instances where one was faster and a few
instances where the other one was faster.
But what's interesting about one being faster or the other
being faster on a small test case is primarily that you can
write something in Python, and have it be as fast as
something that's written in pure C, if you're clever about
how you do it.
And I assert without immediate proof that, in fact, Mercurial
will scale to large projects better than many revision
control systems that are written in C, because of the
underlying abstractions that we use.
And the implementation techniques, might I add.
So there are a couple of different things that we do in
order to make the implementation go fast. The
primary one is that we've desperately tried, at every
step, to avoid seeks.
Disk seeks are not your friends.
Disk seeks are things that cause you to just sit around
and wait for good things to happen.
Good things that happen are streaming I/O
linearly off your disk.
That's what we really like.
So in order to stream I/O linearly off your disk, there
are a couple of things that you would like
to be able to do.
The first is you don't want to write more than you have to,
and the second is that you don't want to read more than
you have to.
So what are the necessary properties for revision
control systems to not write more than you have to?
Well, it's been dogma in revision control for a while
that what you really want to do is you want to store the
most recent revision of your file as the very first thing
on the disk, and then everything else wants to be
reverse deltas based on that.
Because that means you get the nice property that something
you've accessed recently you can just read with a
single read.
Now, Mercurial sort of turns that a little bit on its head.
We do forward deltas from the very first revision.
But that sounds like it's a terrible implementation plan.
It's actually rather better than that, because what we do
is we have O(1) retrieval properties.
So instead of having a very first revision and 10,000,000
little tiny deltas on top of that that you have to
reconstruct the final revision out of, what we do is we have
two different techniques.
One is that every so often, when the accumulated quantity
of stuff that you've deltaed gets to be too big, we
store a full text.
So you pay something of an extra space cost on disk, but
you end up with O(1) retrieval properties.
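Here is a much simplified sketch of that snapshot-plus-deltas idea. Real revlogs store binary deltas and compress them; this toy version treats a delta as an appended suffix, purely to illustrate the bounded reconstruction cost:

```python
# Toy revlog: store a full snapshot whenever the run of deltas grows past
# a threshold; otherwise store a delta. Retrieval then reads at most one
# snapshot plus a bounded run of deltas. Simplified illustration only:
# real Mercurial revlogs use binary deltas and compression.

SNAPSHOT_EVERY = 3  # force a full text after this many consecutive deltas

entries = []  # each entry: ("full", text) or ("delta", appended_suffix)

def add_revision(text):
    if not entries:
        entries.append(("full", text))
        return
    # count deltas since the last full text
    run = 0
    for kind, _ in reversed(entries):
        if kind == "full":
            break
        run += 1
    prev = get_revision(len(entries) - 1)
    if run >= SNAPSHOT_EVERY:
        entries.append(("full", text))       # pay space, cap read cost
    else:
        # our "delta" is just the appended suffix, for simplicity
        entries.append(("delta", text[len(prev):]))

def get_revision(i):
    # walk back to the nearest full text, then apply the deltas forward
    j = i
    while entries[j][0] != "full":
        j -= 1
    text = entries[j][1]
    for k in range(j + 1, i + 1):
        text += entries[k][1]
    return text

for n in range(6):
    add_revision("line\n" * (n + 1))
assert get_revision(5) == "line\n" * 6
assert sum(1 for kind, _ in entries if kind == "full") == 2
```

The threshold is the knob: a smaller value costs more disk but caps how many deltas any retrieval has to replay.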
The other thing that we do is, rather than applying each
delta as we go, we compose our deltas, and then we just apply
one single union delta at the end.
And that gets rid of some of the nasty properties that you
have when you're composing deltas and munging
strings in Python.
So these are two things that make it quite fast and quite
efficient to get nice linear accesses.
Another thing that you can do that gives you linear
accesses, at something of a cost in space is, you might
think that if you're doing a delta, you want to do a delta
against your parent.
So if I'm at revision three, and I have a child that is
revision seven, that is really based on revision three, do I
want to do a delta against revision three?
Because if you're doing a delta against revision three
and you're at revision seven, there
might be a seek involved.
So what if, at revision seven, you do a delta against
revision six, and you don't actually care whether revision
six was related to you at all?
Well, in that case, you pay something of a greater space
penalty, but you end up with a linear space, or pardon me, a
linear disk access, and a guaranteed lower
probability of seeks.
So again, it's another case of make the seeks go away, do
linear stuff instead, even if it costs me slightly more
space, even if it looks like it ought to
be a less good choice.
Many times, these intuitions turn out not to hold when you
subject them to a little bit of inspection.
Right, back to my hints as to where I am.
The file formats that we use are very straightforward.
They're binary files, but they're easy to parse in
Python, and the reason that they're easy to parse in
Python is that we use the struct
unpack and pack methods a lot, and we use string
splitting and joining a lot.
And the reason that these are good things for you is that
they don't go through the Python interpreter at all.
They go straight from Python into C.
Fast things occur, and you get dropped back into Python land.
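As an illustration of that struct technique, here is a hypothetical fixed-width index record; the field layout is invented for the example, not Mercurial's actual on-disk format. Each pack and unpack call is a single trip into C, with no per-field Python bytecode:

```python
import struct

# Hypothetical fixed-width index record, for illustration only:
# offset (u64), length (u32), base revision (u32), link revision (u32).
RECORD = struct.Struct(">QIII")  # big-endian, 20 bytes per record

def pack_record(offset, length, base, link):
    return RECORD.pack(offset, length, base, link)

def unpack_records(data):
    # One C-level unpack call per record; the field parsing never
    # touches the Python interpreter loop.
    return [RECORD.unpack_from(data, i * RECORD.size)
            for i in range(len(data) // RECORD.size)]

blob = pack_record(0, 120, 0, 0) + pack_record(120, 64, 0, 1)
assert RECORD.size == 20
assert unpack_records(blob) == [(0, 120, 0, 0), (120, 64, 0, 1)]
```

Fixed-width records also mean you can seek straight to record i at byte offset i * 20, with no scanning.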
We also try and avoid things like conditionals in our inner
loops, because those tend to cost in performance as well.
So if you look at the Mercurial inner loops, they
tend to be quite tight.
They're still written in Python, but they do very, very
simple things, and then they deal in terms of tuples
or arrays, or whatever happens to come out
the other side.
We're out of luck?
MALE SPEAKER: I can open the file, but it doesn't show me anything.
BRYAN O'SULLIVAN: You have Open Office 1, and I
have Open Office 2.
MALE SPEAKER: That's OK.
BRYAN O'SULLIVAN: OK, it continues to be a mime show
for the rest of the presentation, I'm afraid.
The final thing that's kind of interesting about
implementation techniques is that I
mentioned avoiding reads.
Well, why read at all if you can stat?
What if, instead of just spewing your
files out onto disk, you also stored the information as to
what the stat of each file was when you wrote the file?
Now, what do I mean by that?
I mean, you store the modification time, the size,
and the access time of each file, and the owner.
And then, when you're looking through a tree to see what was
the last thing that happened to my tree, instead of reading
the file, you do a stat of the file.
You compare it against the last time you stat'ed the file,
you see if they are different in any way, and only then do
you do a read to see if you really need to say yes, this
file has been modified, or no.
So what this means is, in a 40,000 file tree, Mercurial
does a stat 40,000 times, but it doesn't read and
reconstruct 40,000 files.
Again, this is a significant speed win when compared to
doing things by hand.
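A minimal sketch of that stat-caching idea, assuming a simplified cache of (size, mtime) pairs; the real dirstate records more fields and handles more edge cases:

```python
# Sketch of "stat before you read": remember (size, mtime) for each file
# when it was last written; on the next status check, a file whose stat
# hasn't changed doesn't need to be read at all. Illustration only.
import os
import tempfile

def record_stats(paths):
    """Snapshot (size, mtime) for each file, as written."""
    return {p: (os.stat(p).st_size, os.stat(p).st_mtime_ns) for p in paths}

def possibly_modified(paths, recorded):
    """Only files whose stat differs need an actual read and compare."""
    out = []
    for p in paths:
        st = os.stat(p)
        if (st.st_size, st.st_mtime_ns) != recorded.get(p):
            out.append(p)
    return out

def demo():
    with tempfile.TemporaryDirectory() as d:
        a = os.path.join(d, "a.txt")
        b = os.path.join(d, "b.txt")
        for p in (a, b):
            with open(p, "w") as f:
                f.write("hello\n")
        cache = record_stats([a, b])
        with open(a, "a") as f:   # modify only a.txt
            f.write("more\n")
        return [os.path.basename(p) for p in possibly_modified([a, b], cache)]

assert demo() == ["a.txt"]
```

The unchanged file never gets opened, which is the whole point: one cheap stat per file instead of one read per file.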
So I've mentioned that things are fast. Yes?
AUDIENCE: It sounds like you are designing for local disks,
not an NFS home directory.
BRYAN O'SULLIVAN: Things are not as fast over
NFS, this is true.
So an obvious place that you can make things fast over NFS
is by using an implementation technique like Perforce, where
you have to tell a server everything that you're doing.
That's a potentially less friendly thing to do.
I was able to live with p4 edit back when I had to use
Perforce, but having to not do it, and being able to do
things on a local disk happens to suit me pretty well.
So speed is an end in itself.
All other things being equal, it's nice to have something
fast instead of something slow.
But what speed also lets you do is it lets you do things
that are not straightforward using other tools.
So Mercurial was originally developed by kernel hackers,
and the kernel is a reasonably large tree.
It's not super large, it's like 20,000 files, or so.
But one of the things that kernel developers tend to have
to do a lot is deal with patches, because there are
certain gatekeepers for pieces of subsystems, and you may be
developing something that isn't ready to go off to
somebody else yet, so you maintain a pile of patches
that your software has to work with.
And working with patches is traditionally a kind of a
painful thing, because revision control tools don't
normally have a concept of patches.
But what if you did have a concept of patches?
Well, Mercurial has an extension
called Mercurial Queues.
If you've been working with open source tools that you
have to patch in order to get working, you may have come
across a tool called Quilt.
And Quilt basically lets you maintain a stack of patches on
top of a source tree.
It doesn't care what the source tree is, it has no
notion of there being an underlying revision control
system of any kind.
So Quilt has the nice property that it will
sit on top of anything.
It doesn't matter whether it's Perforce, CVS,
or an exploded tarball.
And what Quilt lets you do is it lets you push a patch onto
your stack, edit some files, refresh the patch, push
another patch on top of your stack, refresh that one,
et cetera, et cetera, et cetera.
You can pop and push, and work on the top of the stack.
Mercurial Queues works the same way, but it's integrated
with the revision control system.
So what this means is after you've pushed a pile of stuff,
patches, onto your stack, you can now use the regular log
commands, or the annotate command, to find out which
revision that has turned into a change set in the tree made
a particular change.
When you pop your patches, the change sets go away, and once
you're done, you can push the change sets off to somebody
else as regular Mercurial revisions, and they see them
as regular Mercurial revisions, and then they start
distributing your changes, and goodness occurs.
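The push and pop mechanics can be sketched as a simple stack. In this toy model a "patch" is just an appended suffix, which is only meant to illustrate applying and unapplying patches in order, not MQ's real mechanics:

```python
# Toy patch stack, a la Quilt / Mercurial Queues. Each patch in the
# series is (name, suffix); pushing applies the next patch, popping
# unapplies the top one. Illustration of the idea only.

base = "hello\n"
series = [("fix-typo", "world\n"), ("add-feature", "feature\n")]
applied = []   # the stack of currently applied patches

def current_text():
    """The working copy: the base plus every applied patch, in order."""
    return base + "".join(chunk for _, chunk in applied)

def qpush():
    applied.append(series[len(applied)])

def qpop():
    applied.pop()

qpush(); qpush()
assert current_text() == "hello\nworld\nfeature\n"
qpop()   # unapply the top patch; its effect vanishes
assert current_text() == "hello\nworld\n"
```

What MQ adds on top of this picture is that each applied patch is also a real change set, so log and annotate see it like any other revision.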
But the nice thing about this is the
integrated nature of it.
So for example, if you're doing a bug search, one of the
things that people have started using in the past year
or two with tools like Git and Mercurial is dichotomic search
of your revision graph.
Now what that means is you're essentially doing a
binary search.
I know I have a revision that was bad.
I know I have a revision that was good.
And I want to narrow my way down to the revision that was
the thing that caused the switch to flip, as
quickly as I can.
So if you can do this using your revision control tools
rather than having to worry about the join, the boundary,
between where my patches start and where my revisions end, if
everything is just seamless and you don't have to worry,
that's kind of a benefit.
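The bisection itself is ordinary binary search over revision numbers. A sketch, with a hypothetical is_bad predicate standing in for "build and test this revision":

```python
# Dichotomic ("bisect") search over a linear run of revisions: given a
# known-good and a known-bad revision, narrow to the first bad one in
# O(log n) tests. The is_bad predicate here is invented for illustration.

def first_bad(good, bad, is_bad):
    # invariant: `good` tested good, `bad` tested bad, good < bad
    while bad - good > 1:
        mid = (good + bad) // 2
        if is_bad(mid):
            bad = mid
        else:
            good = mid
    return bad

FIRST_BROKEN = 1337  # pretend the bug landed in this revision
tests = []
def is_bad(rev):
    tests.append(rev)          # count how many revisions we had to test
    return rev >= FIRST_BROKEN

assert first_bad(0, 5000, is_bad) == 1337
assert len(tests) <= 13        # roughly log2(5000) tests, not 5000
```

Real history is a DAG rather than a line, so the real tools have to pick bisection points more carefully, but the logarithmic payoff is the same.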
As an example of how well Mercurial Queues works, Andrew
Morton maintains a pile of patches against the Linux
kernel that is, I think it's about 1,500 patches at the
moment that are just a single Quilt patch series.
I can apply those on top of a Linux kernel repository in, I
think, something on the order of three minutes.
So I can create seven change sets per second for three
minutes to get all of these patches into my tree.
And then, as far as Mercurial is concerned, it's just a
regular old tree that I'm working in that has 1,500 new
revisions in there that I can deal with, that I can serve
up, that I can communicate with other people, but that I
can also modify after the fact by popping,
editing, and repushing.
And then I can continue to share those changes with a
Quilt user, because Mercurial Queues is Quilt compatible.
This is a fairly tremendous win, not just for dealing with
upstream software projects, but also for prototyping and
developing your own code.
Quite often, if I'm working on a feature, what I'll do is I
won't actually start committing changes, because
I'm kind of an idiot.
I don't tend to know where I'm going a lot of the time.
I'll start off, and I'll go down a blind alley, and I'll
go down another two blind alleys along the way, and
after a couple of days I'll have converged on something
that looks like a solution.
But along the way, what I'll have done is left a trail, in
terms of: here is the underpinnings of
my work, one patch.
Here is another thing that I need to have in
place, another patch.
And at each point when I'm refactoring my code, or I'm
moving something from one layer to another layer, I just
move hunks from one patch into another patch, and
I push and pop my context at various different times to be
working on a different patch at each point.
So this is not just for working with other people.
It's for collaborating with yourself as your clue
evolves over time.
A very good property to have.
So finally, there are a couple of things that are very
interesting to me about distributed revision control
that are not specific to Mercurial, but that might be
of interest to people who do things with free software and
open source tools in general.
I have this idea that choosing a particular revision control
tool is actually making a statement about how you want
your project to evolve, right?
So if you work in a large company, and everybody has
essentially a level playing field, everybody is more or
less likely to have commit access to much the same stuff.
And everybody can pull the same changes, everybody can
push the same changes, or integrate changes, or whatever
the particular tool's language lets you do.
But out in the open source world, that's really not the case.
If you're using a tool like CVS, or you're using a tool
like Subversion, there's a world of haves, and there's a
world of have-nots.
There are the people who have commit access to the one
central repository that everybody has to use, and then
there are the people who can maybe read from that
repository, they can check out a working copy,
but they can't commit.
And they may not be able to earn the right to commit until
they've proven themselves, over the course of a number of
patches that they've submitted and had to maintain.
Now if you've been in the position of having to maintain
a patch against an upstream source tree, it's kind of a
painful thing to have to do.
Tools like Quilt will make it more straightforward, but
really, what you would like to be able to do is work with
other people, speaking the same language,
using the same tools.
With a distributed tool, you can do that.
With a centralized tool, unless they're willing to
create a little sandbox that people can go wild in, which
the history of wikis and other collaborative commons has
shown to be a not necessarily very scalable thing, they
can't do that.
So you can think of it as being analogous to the ascent of
man.
In the days of RCS and SCCS, everybody crawled on all fours
and they had to be on the same machine in order to
get any work done.
And then, suddenly, everybody's tail fell off, and they
started walking around with hunched backs, and they were
able to talk to the central repository over the network.
But they had to be on the network in order to
get any work done.
If I unplug your workstation from the Perforce server,
there's nothing you can do.
If I unplug your workstation from a Subversion server, you
can run diff, and nothing else.
With the distributed tool, that's no longer the case.
I can work on the train, I can work on a mountaintop.
So long as I have history, and I can access my
hard disk, I'm set.
So I can work anywhere, I can contribute to any
project, I can work with anybody, with
a distributed tool.
The tools don't make the boundary; it's the social
norms of your project, which you explicitly choose, that
make it.
You choose the model yourself with a distributed tool.
You don't have it imposed on you by the technology.
So I would encourage you to give this stuff a try.
Mercurial, as I said, lets you work as a centralized
system, if that's what you prefer.
If you want to work in a distributed fashion, you can.
And one of the reasons that people have cited to me a few
times for using centralized tools is that
they're afraid of forks.
So Guido works here, for example, and he does not like
Stackless Python at all.
Stackless Python is this project that started about
eight years ago, that went off in a direction that was
fundamentally different to the way he wanted to
take Python itself.
And, for example, the GCC people have had the same
experience.
EGCS forked off from GCC many years ago, and they
eventually managed to reconcile their differences
and start working together.
But what's interesting about using a tool that has good
support for merging and good support for branching is that,
everybody forks all the time, right?
Forking is just what you do.
And merging is what you do when you're done forking.
So once somebody has decided that they want to play nice
again, and they want to cooperate with
you, you just merge.
Whereas with a central tool, what somebody's got to do is
they've had to suck all the history out of your central
repository.
They've had to shove it all back into another central
repository.
If you want to reconcile your differences, you've got a
serious problem on your hands all of a sudden, because
there's no way to make the two communicate.
That's just not the case with a distributed tool.
So, what would you say, it makes it easier to reconcile
your differences.
A final couple of comments that I have
before we finish off.
There's a number of people who work on Subversion here at
Google, and I've been very conscious along the way, as
I've been trying to shepherd people into sending patches
into Mercurial, and so on, of the great job the Subversion
people have done in terms of building a good community
around their tool.
If you're working on open source software, the only
things you have going for you are, A, technical merit, and,
B, community.
And Karl Fogel, and Jim Blandy, and Brian Fitzpatrick, and Ben
Collins-Sussman, and Garrett Rooney, and all those other
people who've worked on Subversion over the years,
they've done a very good job of making themselves
accessible, and making the Subversion community a
good place to contribute to.
You send a patch in, somebody's going to review it
and say, yeah, could you tweak this, yeah,
could you tweak that.
Those are nice properties to have.
And I've been very explicit in trying to emulate their
example as we've been building Mercurial, because it's always
nice to try and learn from somebody else's good examples,
rather than to try and blaze a trail of your own.
And that's, I think, stood us in good stead.
I actually ran a survey of our users a couple of months ago,
just to get a sense of where people thought we were.
It's very easy when you're a developer to stay down in the
trenches and look at the next line of code you need to
write, or look at the next patch you need to issue, or
the next bug that you need to fix.
But it's nice to get a sense of what your users think about
you, and people have been pretty complimentary about us.
People are also quite happy with the software, in terms of
the fact that it's easy to install, easy to use, not too
different from tools like CVS and Subversion.
And it's been very rewarding to actually be able to talk to
people and say, look, here's this nice shiny toy that we
have, that you can use for something, be it
small or be it large.
It'll scale, and it'll work for you across all of these
different sizes of thing.
That comes to the end of my prepared comments.
Thanks very much for bearing with me as I've had to
essentially hand wave my way through.
And I'd be happy to take any questions people have. Yeah?
AUDIENCE: So [INAUDIBLE PHRASE]
BRYAN O'SULLIVAN: So the question is, what's the
computational cost of a merge when one person does a large
amount of work, and another person does a
small amount of work?
And the answer is that there's not very much cost to it.
So bringing in changes from the outside is essentially a
linear operation in the number of changes that you've made.
So cloning a repository with all 20,000 revisions, or just pulling two changes, both have approximately linear cost.
So one costs about 10,000 times as much as the other.
So if you've done 10,000 things and I've done two, and
I pull in your 10,000 things, it takes me about 10,000 units
of time to process those changes.
The actual merge afterwards is primarily a matter of updating
the working directory.
Most of your changes are not going to conflict with most of
my changes, and then what happens with the few conflicts
that there are is really almost not a matter for the tool itself.
We have a couple of different merge strategies that people
can use when there are conflicts.
AUDIENCE: But can you tell the common
ancestor between the branches?
BRYAN O'SULLIVAN: Yes.
Yes, can you tell the original common ancestor?
Yes you can.
I'm sorry if I'm repeating your questions, but I'm just
trying to make sure that the folks back home can
hear what you say.
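The ancestor computation he's alluding to can be sketched simply: walk the history graph back from each head, intersect the ancestor sets, and keep only the candidates that aren't themselves ancestors of another candidate. This is an illustrative sketch over an invented toy graph, not Mercurial's actual implementation; note that with crisscrossed merges, more than one greatest common ancestor can qualify.

```python
def ancestors(dag, node):
    """All ancestors of node, inclusive, in a DAG mapping node -> parents."""
    seen = set()
    stack = [node]
    while stack:
        n = stack.pop()
        if n not in seen:
            seen.add(n)
            stack.extend(dag[n])
    return seen

def common_ancestors(dag, a, b):
    """Common ancestors of a and b that are not proper ancestors of
    another common ancestor (the merge-base candidates)."""
    common = ancestors(dag, a) & ancestors(dag, b)
    return {n for n in common
            if not any(n in ancestors(dag, m) - {m} for m in common)}

# An invented crisscross history: p and q each merge the other's work,
# so the two resulting heads m1 and m2 have TWO greatest common ancestors.
dag = {
    "r": [],
    "p": ["r"], "q": ["r"],
    "m1": ["p", "q"], "m2": ["q", "p"],
}
print(common_ancestors(dag, "m1", "m2"))  # both p and q qualify
```

A plain linear history gives a single answer, which is why simple three-way merge works fine until histories start crisscrossing.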
AUDIENCE: You talked about using patch sets as sort of a
local mechanism for yourself.
Why wouldn't you just clone the repository, and actually
use the real Mercurial revisions to commit, and then
remerge them [UNINTELLIGIBLE]?
BRYAN O'SULLIVAN: So the question was about using a set
of patches for doing development of your own stuff
rather than using regular Mercurial tools in order to
capture all of the history.
And what would you say, that's partly a style thing, and it's
partly a wanting to not clutter the history thing.
So you may have heard me allude to the fact that I go
down blind alleys?
Well, I don't necessarily want my idiocy caught on the permanent record if I can avoid it, right?
So it's nice, for example, particularly if you're dealing
with an environment like the Linux kernel, where there is
quite a high standard for your changes to meet in order for
them to get in.
You really want things to be packaged cleanly.
That's one environment where it makes sense to think in
terms of patches, because you want to submit the pristine
final thing, not the 45 different idiot things that
you did along the way.
For my own purposes as well, it helps me to think in terms
of patches because then, I'm both putting
layers on my thought.
I'm linearizing my thought in space and in time, right?
So each patch captures a layer that I'm worrying about.
I can actually revision control the patches
themselves, so I do capture the history of the changes,
but just not the actual repository that they're
eventually going to end up in.
But I also have the ability to go back and erase history and
make myself look better, which is very nice.
AUDIENCE: How well does it handle refactoring
[UNINTELLIGIBLE] and file renames and moving
code, lots of code?
BRYAN O'SULLIVAN: The question is how well Mercurial handles
refactoring, that handles renames,
and moving code around.
So there are three answers to that.
One is that it doesn't, the second is that it does it really well, and the third is that it'll all be there really soon.
And these are all true at the same time.
So right now, Mercurial has a shell script
that handles merging.
So it knows how to figure out the basis for doing a
three-way merge, for example.
And it will hand those off to the shell script.
Now, we have a couple of different
shell scripts in place.
One is a shell script that will run a
three-way merge tool.
But three-way merge is not very satisfying when you're
doing distributed development.
Because you frequently have cases where there are
crisscrossed merges, right?
I pull your changes at the same time that you've pulled my changes, and we both commit the results of our merges, and we end up with this thing where we have to merge again.
And that can iterate a few times.
In those kinds of cases, you would really like to have some
sort of more history sensitive merge that will cause us to
converge more quickly.
There is a branch of Mercurial that has
that facility available.
In terms of handling renaming, though, right now we track
rename information, and Matt Mackall, who is the guy who
wrote much of the Mercurial code, who originally started
the project, is working on actually having your changes
follow across renames.
And by the way, the question that is sort of implicit there
is if I make an edit to a file, and you've renamed the
file to a different name, you really want the changes to
show up under the different name, after we've resolved our merge.
So that's not there yet, but the actual machinery
underlying it that's necessary is there, and the future will
be present soon.
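The "changes follow across renames" behavior can be sketched as walking history from newest to oldest and switching the tracked filename whenever we hit the rename that created it. The record layout here is invented for illustration and is not Mercurial's actual storage format.

```python
def follow(history, filename):
    """history: newest-first list of changeset dicts, each with a 'files'
    list of touched files and an optional 'renames' map {new: old}.
    Returns (revision index, name-at-that-revision) pairs."""
    revisions = []
    name = filename
    for rev, cs in enumerate(history):
        if name in cs.get("files", ()):
            revisions.append((rev, name))
        # if this changeset created `name` by renaming, track the old name
        if name in cs.get("renames", {}):
            name = cs["renames"][name]
    return revisions

history = [                                  # newest first, invented data
    {"files": ["util.py"]},                                         # rev 0
    {"files": ["util.py"], "renames": {"util.py": "helpers.py"}},   # rev 1
    {"files": ["helpers.py"]},                                      # rev 2
]
print(follow(history, "util.py"))
# -> [(0, 'util.py'), (1, 'util.py'), (2, 'helpers.py')]
```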
Sorry, I'll give somebody else a chance, first. Yeah?
AUDIENCE: What platforms are supported?
BRYAN O'SULLIVAN: What platforms are supported?
Pretty much anything that python runs on, that has a
file system behind it that looks vaguely Unix-y.
So Windows, you name the brand of Unix, and Macs, and so on.
AUDIENCE: How do your shell scripts work on Windows?
BRYAN O'SULLIVAN: The shell scripts work on Windows in that they run as .bat files.
Of course, they're not actual shell scripts there, they're batch files.
But nevertheless, they work.
AUDIENCE: Do you support the partial [UNINTELLIGIBLE]
bring over, whatever?
Partially bringing over?
BRYAN O'SULLIVAN: Do we support partially bringing stuff over?
Somebody is working on that at the moment.
And there are two different kinds of partially bringing
stuff over, right?
One is, you want maybe only the last 2,000 revisions of
your project history, because you don't want to be carrying
around the [UNINTELLIGIBLE]
gigabytes of earlier stuff that was done ten years ago.
And the other thing is that perhaps you're only interested
in working in a certain portion of the tree, so you
don't have to check out the other ten gigabytes of stuff
that you don't care about.
And by the way, when I say gigabytes, some people are
actually using Mercurial to work on
multi-gigabyte source trees.
For example, the FreeBSD ports tree has a port that sits in
Mercurial instead of in Perforce.
And it contains, I think, something on the order of
150,000 files and 150,000 change sets.
So that's a reasonably large amount of stuff, and you
really want to be able to focus on a certain aspect of
it rather than do all of it.
It's in progress.
AUDIENCE: A generalization of the rename scenario is if you have a big file, let's say you start splitting it up into multiple files.
Will those scripts handle that as well?
BRYAN O'SULLIVAN: Yes.
So the question is, if you split a file up into multiple
files, will Mercurial handle that?
And I can't speak for Matt, because it turns out that I'm
not actually him.
But I believe that his plan of record is to, when a rename or
copy has been detected, do a merge into each of the
children, the descendants of the original file.
So if you copy one file into three different files, and
then you whack off the first third, middle third, and the
final third in those three different files, somebody else
makes an edit in the original file, the logically
appropriate thing ought to happen.
Now, I'm not actually doing the implementation.
So I can say yes, it will be a better world, and everybody
will be happy.
I don't know how hard it's going to be.
AUDIENCE: On a smaller scale, can you customize Mercurial so
that a small number of [INAUDIBLE PHRASE]
so that they can, when they do updates,
if people are writing to it, and people are reading what
other people have written, but they don't need to worry about
merge, or anything like that?
BRYAN O'SULLIVAN: Sure.
So the question is, can you extend Mercurial so that it
essentially behaves like CVS, so that when you do an update,
it does the logical equivalent of a pull, so that you don't
have to worry about merging so that you have a
straightforward, simple way for people to
get their feet wet.
And the answer is, yes, Mercurial is extensible in
those terms. No, nobody has explicitly done
the work to do that.
I do know that there's at least one other distributed
revision control system that does exactly what you
described, because they want to give people who have used
CVS essentially training wheels.
So it is possible to do that, and it would be quite straightforward.
Contributions of code to do things like that are always welcome.
AUDIENCE: Can [UNINTELLIGIBLE] push a
patch back into history?
BRYAN O'SULLIVAN: Can I push a patch back into history?
AUDIENCE: I released version 2.1, and now I'm
working on version 2.10.
And then I found a bug that's already existed in 2.1.
I pushed [UNINTELLIGIBLE PHRASE]
BRYAN O'SULLIVAN: OK.
So the question here is, let's say I'm working on a revision
2, and a revision 2.1.
And revision 2 I've frozen, because I've released it, and
there are CD-ROMs out in the wild, or tarballs, or whatever
the bits the kids use these days are.
And I found a bug in revision 2, I've fixed it in my 2.1
branch, and I want to backport that fix.
This is something that revision control weenies tend
to call cherry-picking.
I'm a self-labelled weenie, by the way.
This is not a pejorative term.
The answer is you can do it using a patch.
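Cherry-picking with a patch, as he describes, boils down to turning one change into a diff you can carry across branches. A minimal sketch using Python's difflib, with invented file contents; in practice you'd apply the resulting diff to the frozen branch with a patch tool.

```python
import difflib

# The fix as made on the 2.1 branch: file contents before and after.
# These contents are invented purely for illustration.
before = ["def size(path):\n", "    return os.stat(path).st_size\n"]
after  = ["def size(path):\n", "    return os.lstat(path).st_size\n"]

# A unified diff of just this change is the cherry-pickable artifact:
# it can be applied to the frozen 2.0 tree independently of history.
patch = "".join(difflib.unified_diff(before, after,
                                     fromfile="a/util.py",
                                     tofile="b/util.py"))
print(patch)
```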
As something that you would support as a first-class
operation, cherry-picking is a very difficult thing to do.
There are maybe two or three revision control systems that
handle it relatively well.
Perforce is one.
Another would be, actually, Subversion almost handles it
well, because Subversion has no notion of
merging at all, right?
Another one would be arch, which is explicitly built in
those terms, but I wouldn't recommend
that anybody use arch.
A final one would be Darcs, which is one of the
theoretically interesting but not practical ones, that is
written in Haskell, of all languages.
And Darcs has this wonderful quantum mechanical, I kid you
not, theory of patches that it is built up on, so that you
can talk in terms of patches commuting with each other, and
boundaries beyond which they cannot go, and
so on and so forth.
And it tends to go exponential in space and time quite easily.
So it's got some fundamental theoretical problems that are unsolved.
AUDIENCE: So now it seems with Mercurial that first, an
engineer does some work in his local area, passes it out, and
in a corporate setting, or even in a project setting, you
have to then publish your changes out to the world.
But since it's now a two-stage commit, where you have to make your changes and then you have to actually publish them, it seems like it's easier to accidentally forget to publish them.
Is there any way to make that less painful?
BRYAN O'SULLIVAN: The question is, is there an easy way to
publish your stuff with Mercurial, or presumably, by
extension, with other distributed tools so that
other people can find them?
And the answer is that right now, you have to do stuff by
hand, because we've been focusing on the core of the
software rather than on these larger usability questions.
That sort of thing, where you want to be able to see, oh, I haven't actually published this, even though I wanted to, or oh, this repository that I've based my changes on has diverged from me by this much.
You want to be able to tell those kinds of things without
having to explicitly do it by hand all the time.
Those are things I would really love to see, but they're not quite there yet, because we've been preoccupied with just getting the core functionality into 1.0 form so far.
Do remember that we've only been around for about 14
months, and that the set of core
developers is quite small.
Yes, more questions?
AUDIENCE: Do you have any ideas on how you would use
Mercurial to supplement other sorts of control systems?
BRYAN O'SULLIVAN: The question is, would it be possible to supplement an existing revision control system using Mercurial?
And the answer is there are various different ways that
you can do that.
So somebody has, for example, written an incremental
Perforce importer for Mercurial.
So there exists a proof that it is possible to do what you describe.
I don't know how well it works.
AUDIENCE: One-directional?
BRYAN O'SULLIVAN: I imagine that it is one-directional.
There is also a tool called Tailor, written by a guy in
Italy whose name is [? Emmanuel Guyfax ?].
And Tailor is sort of the Rosetta Stone of revision control tools.
It will convert between arbitrary revision control
tools up to a point.
It doesn't have a very good notion of branching or
merging, so it tends to lose information when talking
between distributed revision control tools.
But if what you're looking to do is an incremental
conversion, and then stuff some things back into a host
revision control tool, it is actually pretty nice, and it's
relatively straightforward to use.
AUDIENCE: [INAUDIBLE PHRASE]
BRYAN O'SULLIVAN: The question is whether, for maintaining patches and submitting them, Mercurial would be suitable for that.
In that kind of a case, probably the easiest thing to
do would be to use a Perforce importer to pull your stuff
into Mercurial, maintain things as patches, then commit
them back to the native revision control tool, perhaps
by hand or perhaps by automating it.
I don't speak Perforce very much anymore, so I can't say
whether it would be completely trivial or not.
My imagination tells me that would be a relatively small
amount of scripting to do.
MALE SPEAKER: For those of you who have 12:00 meetings, we're
running over our time right now, which is not a problem,
or currently not a problem in this room, but you may have
your own priorities.
BRYAN O'SULLIVAN: Yes?
AUDIENCE: So the storage format of Mercurial, is it python objects?
BRYAN O'SULLIVAN: The question is, is the Mercurial data
stream python objects?
The answer is no.
And the reason that it's not is that python has been
somewhat willing to change the data stream format, and that's
not a terribly good thing.
Also, it's not a very efficient storage mechanism.
Instead, what we do is we explicitly lay out the bytes
ourselves, using tools like struct.pack, and using string
operations, and just plain old write.
So we know exactly what the bytes are supposed to be.
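The explicit byte layout he describes might look like the following. The field names and widths here are invented for illustration, not Mercurial's actual revlog format; the point is that struct.pack pins down byte order and field sizes exactly, so the on-disk format can't drift the way a serialized object's format might.

```python
import struct

# A toy index entry with an explicit, fixed layout:
# offset (8 bytes, big-endian), length (4), base rev (4), link rev (4),
# then a 20-byte node hash. ">QIII20s" fixes byte order and widths.
ENTRY = struct.Struct(">QIII20s")

def pack_entry(offset, length, base, link, node):
    return ENTRY.pack(offset, length, base, link, node)

def unpack_entry(data):
    return ENTRY.unpack(data)

raw = pack_entry(0, 120, 0, 3, b"\x00" * 20)
print(len(raw))              # always 40 bytes
print(unpack_entry(raw)[1])  # 120
```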
AUDIENCE: How does it deal with authorization?
BRYAN O'SULLIVAN: How do we deal with
authorization is the question.
And there are two or three answers to that, depending on
how you want to look at it.
The first is that we have no notion of authorization at
all, because Mercurial doesn't care.
The second, which is a more satisfactory answer, is that
if you want to be able to share changes with other
people, you can push to a shared repository, which you
can use, using, for example, if you're all on the same file
system, Unix groups, or Windows permissions.
You can also tunnel over SSH, so that you can do that over
the insecure internet.
Somebody is in the process of adding support for pushing
changes over HTTP, which will use, I presume, some form of
user authentication, whatever the patch happens to provide,
and will be secured over SSL.
So Mercurial itself doesn't have to care, but it has
various different transports that do allow you to specify
things in different ways.
And for example, there is an extension to Mercurial
available that will let you lock down individual user
accounts, and put ACLs on the subtrees that people are allowed to push to.
So if you have changes that push stuff into a tree that
you're not allowed to push to, you will be forbidden from
doing that, and other people won't be able to pull those
changes, because they won't get in in the first place.
So there are various different ways that you
can lock things down.
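One way such an extension can enforce subtree permissions is a pure check over the file paths touched by incoming changes. This is a hedged sketch; the rule format and function name here are invented, not the actual extension's configuration syntax.

```python
def allowed(acl, user, path):
    """acl: list of (subtree_prefix, set_of_users) rules. A path is
    permitted only if every rule whose prefix matches it admits the
    user; paths that match no rule are open to everyone."""
    return all(user in users
               for prefix, users in acl
               if path.startswith(prefix))

# Invented example rules: only alice may touch kernel/, both may touch docs/.
acl = [("kernel/", {"alice"}), ("docs/", {"alice", "bob"})]

print(allowed(acl, "bob", "docs/readme.txt"))  # True
print(allowed(acl, "bob", "kernel/sched.c"))   # False
print(allowed(acl, "bob", "tools/build.sh"))   # True (no rule matches)
```

A push hook would run a check like this over every file in every incoming changeset and reject the whole push on the first failure.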
AUDIENCE: Is SSH tunneling built-in in Mercurial like it
is in Subversion, or is it manual to open up SSH?
BRYAN O'SULLIVAN: The question is, is SSH tunneling built-in,
and the answer is yes.
We use ssh:// blah-de-blah-de-blah URLs.
Any more questions?
OK, thank you all very much for listening.