So it's my pleasure to introduce to you Geoff Hinton, who is a pioneer in machine learning
and neural nets, and more recently [INDISTINCT] architectures. And then, I think that's going
to be the topic of today. So take it over. >> HINTON: Okay. So I gave a talk here a couple
of years ago. And the first 10 minutes or so will be an overview of what I said there,
and then I'll talk about the new stuff. The new stuff consists of a better learning module.
It allows you to learn better in all sorts of different things, like, learning how images
transform, learning how people walk, and learning object recognition. So the basic learning
module consists of some variables that represent things like pixels, and these will be binary
variables for now. Some variables that represent--these are latent variables, they're also going to
be binary. And there's a bipartite connectivity, so these guys aren't connected to each other.
And that makes it very easy if I give you the states of the visible variables to infer
the states of the hidden variables. They're all independent given the visible variables
because it's an undirected graph. And the inference procedure just says, the probability
of turning on hidden unit "hj" given this visible vector "v" is the logistic function
of the total input it gets from the visible units, so very simple for the hidden variables. Given
the hidden variables, we can also infer the visible variables very simply. And if we want
some--if we put some weights on the connections and we want to know what this model believes,
we can just go backwards and forwards, inferring all the hidden variables in parallel, then
all the visible ones. Do that for a long time, and then you'll see examples of the
kinds of things it likes to believe. And the aim of learning is going to be to get it to
like to believe the kinds of things that actually happen. So this thing is governed by an energy
function that is given by the weights on the connections. The energy of a visible plus
a hidden vector is minus the sum over all connections of the weight if both the visible and hidden
units are active. So for every pixel and feature that are both active, you add in
the weight, and if it's a big positive weight, that's low energy, which is good. So, it's
a happy network. This has nice derivatives. If you differentiate it with respect to the
weights, you get this product of the visible and hidden activity. And so that derivative
is going to show up a lot in the learning because that derivative is how you change
the energy of a combined configuration of visible and hidden units. The probability
of a combined configuration, given the energy function, is e to the minus the energy of
that combined configuration normalized by the partition function. And if you want to
know the probability of a particular visible vector, you have to sum over all the hidden vectors
that might go with it and that's the probability of the visible vector. If you want to change the
weights to make this probability higher, you always need to lower the energies of combinations
of that visible vector and the hidden vectors that would like to go with it and raise the energies
of all other combinations, so you decrease the competition. The correct maximum likelihood
learning rule, that is, if I want to change the weights so as to increase the log probability
that this network would generate the vector "v" when I let it just sort of fantasize the
things it likes to believe in, has a nice simple form. It's just the difference of two correlations.
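Written out, the quantities described so far look roughly as follows. This is a sketch that leaves out bias terms, as the spoken description does; the symbols Z for the partition function, \sigma for the logistic function, and the angle brackets for expected correlations are notational assumptions rather than something read off the slides.

```latex
\begin{align}
E(v, h) &= -\sum_{i,j} v_i \, h_j \, w_{ij} \\
p(v, h) &= \frac{e^{-E(v, h)}}{Z}, \qquad Z = \sum_{v', h'} e^{-E(v', h')} \\
p(v) &= \sum_{h} p(v, h) \\
p(h_j = 1 \mid v) &= \sigma\Big(\sum_i v_i \, w_{ij}\Big), \qquad \sigma(x) = \frac{1}{1 + e^{-x}} \\
\frac{\partial \log p(v)}{\partial w_{ij}} &= \langle v_i h_j \rangle_{\text{data}} - \langle v_i h_j \rangle_{\text{model}}
\end{align}
```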
So even though it depends on all the other weights, it shows up as this difference of
correlations. And what you do is you take your data, you activate the hidden units,
those are stochastic binary units, and then reconstruct, activate, reconstruct, activate.
So this is a Markov chain. You run it for a long time, so you forget where you started.
And then you measure the correlation there, and compare it with the correlation here. And what
you're really doing is saying, "By changing the weights in proportion to that, I'm lowering
the energy of this visible vector with whatever hidden vector it chose. By doing the opposite
here, I'm raising the energy of the things I fantasize." And so, what I'm trying to do
is believe in the data and not believe in what the model believes in. Eventually, this
correlation will be the same as that one. In which case, nothing will happen because
it will believe in the data. It turns out that you can get a much quicker learning algorithm
where you can just go on and [INDISTINCT] again, and you take this difference of correlations.
Justifying that is hard but the main justification is it works and it's quick. The reason this
module is interesting, the main reason it's interesting is you can stack them up. That
is, for complicated reasons I'm not going to go into, it works very well to train
the module, then take the activities of the feature detectors, treat them as though they were
data, and train another module on top of that. So the first module is trying to model what's
going on in the pixels by using these feature detectors. And the feature detectors would
tend to be highly correlated. The second module is trying to model the correlations among feature
detectors. And you can guarantee that if you do that right, every time you go up a level,
you get a better model of the data. Actually, you can guarantee that the first time you
go up a level. For further levels, all you can guarantee is that there's a bound on how
good your model of the data is. And every time we add another level, that bound improves
if we add it right. Having got this guarantee that something good is happening as we add
more levels, we then violate all the conditions of the mathematics and just add more levels in
sort of [INDISTINCT] way because we know good things are going to happen and then we justify it
by the fact that good things do happen. This allows us to learn many layers of feature detectors
entirely unsupervised, just to model the structure of the data. Once we've done that, you can't
get that accepted in a machine learning conference because you have to do discrimination to be
accepted in a machine learning conference. So once you've done that, you add some decision
units to the top and you learn the connections discriminatively between the top-layer features
and the decision units, and then if you want you can go back and fine-tune all of the connections
using backpropagation. That overcomes the limit of backpropagation which is there's
not much information in the label and it can only learn on labeled data. These things can
learn on large amounts of unlabeled data. After they've learned, then you add these
units at the top and backpropagate from this small amount of labeled data, and that's not
designing the feature detectors anymore. As you probably know at Google, designing feature
detectors is the art of things and you'd like to design feature detectors based on what's
in the data, not based on having to produce labeled data. So the idea of backpropagation
was: design your feature detectors so you're good at getting the right answer. The idea
here is: design your feature detectors to be good at modeling whatever is going on in the
data. Once you've done that, just ever so slightly fine-tune them so you're better at getting the right
answer. But don't try and use the answer to design feature detectors. And Yoshua Bengio's
lab has done lots of work showing that this gives you better minima than just doing backpropagation.
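The recipe described above (train one module with the quick difference-of-correlations rule, then treat its feature activities as data for the next module) can be sketched in NumPy. This is a minimal illustration under stated assumptions, not the code behind any result in the talk: biases are omitted, and the function names `train_rbm_cd1` and `train_stack`, the layer sizes, the learning rate and the toy data are all hypothetical. The discriminative fine-tuning with backpropagation is left out.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_rbm_cd1(data, n_hidden, epochs=10, lr=0.1, rng=None):
    """Train a binary RBM with a one-step reconstruction rule (biases omitted)."""
    rng = np.random.default_rng(0) if rng is None else rng
    n_visible = data.shape[1]
    W = 0.01 * rng.standard_normal((n_visible, n_hidden))
    for _ in range(epochs):
        # Up: hidden probabilities and a stochastic binary sample given the data.
        h_prob = sigmoid(data @ W)
        h_sample = (rng.random(h_prob.shape) < h_prob).astype(float)
        # Down: reconstruct the visible units from the sampled hidden states.
        v_recon = sigmoid(h_sample @ W.T)
        # Up again: hidden probabilities given the reconstruction.
        h_recon = sigmoid(v_recon @ W)
        # Difference of correlations: data term minus reconstruction term.
        W += lr * (data.T @ h_prob - v_recon.T @ h_recon) / len(data)
    return W

def train_stack(data, layer_sizes, **kwargs):
    """Greedy layer-wise training: each module models the features of the one below."""
    weights, layer_input = [], data
    for n_hidden in layer_sizes:
        W = train_rbm_cd1(layer_input, n_hidden, **kwargs)
        weights.append(W)
        # Feature activities are treated as though they were data for the next module.
        layer_input = sigmoid(layer_input @ W)
    return weights

# Toy binary data: 64 examples of 20 "pixels" (purely illustrative).
toy = (np.random.default_rng(1).random((64, 20)) < 0.3).astype(float)
stack = train_stack(toy, layer_sizes=[16, 8], epochs=5)
print([W.shape for W in stack])
```

After this unsupervised stage, decision units would be added on top and the whole stack fine-tuned on the labeled data.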
And what's more, the minima are in a completely different part of the space. So just to summarize this
section, I think this is the most important slide in the talk because it says, "What was
wrong with machine learning up to a few years ago." What people in machine learning
would try to do is learn the mapping from an image to a label. And now, that would be
a fine thing to do if you thought that images and labels arose in the following way:
there's stuff, and it gives rise to images, and then the images give rise to the labels. Given
the image, the labels don't depend on the stuff. But you don't really believe that.
You only believe that if a label is something like the parity of the pixels in the image.
What you really believe is that the stuff gives rise to images, and then the labels
go with images because of the stuff, not because of the image. So there's a cow in a field
and you say cow. Now, if I just say cow to you, you don't know whether the cow is brown
or black, or upright or dead, or far away. If I show an image of the cow, you know all
those things. So this is a very high bandwidth path, this is a very low bandwidth path. And
the right way to associate labels with images is to first learn to invert this high bandwidth
path. And we can clearly do that, because vision basically works. You just
look, and then you see things. And it's not like it might be a cow, it might be an elephant,
it might be an electric heater. Basically, you get it right nearly all the time. And so we
can invert that pathway. Having learned to do that, we can then learn what things are
called. But you get the concept of a cow not from the name, but from seeing what's going
on in the world. And that's what we're doing, and then later, as I say, learning from the label. Now,
I need to do one slight modification to the basic module which is I had binary units as
the observables. Now, we want to have linear units with Gaussian noise. So we just change
the energy function a bit. And the energy now says, "I've got a kind of parabolic containment
here." Each of these linear visible units has a bias which is like its mean. And it
would like to sit here, and moving away from that [INDISTINCT] energy. The parabola is
the negative log of the Gaussian [INDISTINCT]. And then the input that comes from the hidden
units, this is just vi, hj, wij, but the v's have to be scaled by the standard deviation of
the Gaussian there. If I ask--if I differentiate that with respect to a visible activity, then
what I get is hj wij divided by sigma i. And that's like an energy gradient. And
what the visible unit does when you reconstruct is it tries to compromise between wanting
to sit around here and wanting to satisfy this energy gradient, so it goes to the place
where these two gradients [INDISTINCT] opposite and you have--that's the most likely value
and then you [INDISTINCT] there. So with that small modification we can now deal with
real-valued data with binary latent variables and we have an efficient learning algorithm that's
an approximation of [INDISTINCT]. And so we can apply it to something. So there's a nice
speech recognition task that's been well organized by the speech people where there's an old
database called TIMIT, it's got a very well-defined task for phone recognition where what you
have to do is you're given a short window of speech, you have to predict the distribution,
the probability for the central frame of the various different phones. Actually, each phone
is modeled by a 3-state HMM, sort of beginning, middle and end, so you have to predict for
each frame is it the beginning, middle or end of each of the possible phones; there are
183 of those things. If you give it a good distribution there to sort of focus on the
right thing then all the post-processing will give you back where the phone boundaries
should be and what your phone error rate is, and that's all very standard. Some people
use tri-phone models. We're using bi-phone models which aren't quite as powerful. So
now we can test how good we are by taking 11 frames of speech. It's 10 milliseconds per
frame but each frame is looking at like 25 milliseconds of speech and predicting the
phone at the middle frame. We use the standard speech representation which is mel-cepstral
coefficients. There's 13 of those, and their differences and the differences of the
differences; and we feed them into one of these deep nets. So here's your input,
11 frames and 39 coefficients. And then--I was away when the student did this and he
actually believed what I said. So he thought adding lots and lots of hidden units was a
good idea. I've started it too. But he added lots of hidden units all unsupervised, so
all these green connections are learned without any use of the labels. He used a bottleneck
there, so the number of red connections will be relatively small. These are not--these
have to be learned using discriminative information. And now you're backpropagating the correct
answers through this whole net for about a day on a GPU board or a month on a core, and
it does very well. That is, the best phone error rate we got was 23%. But the important
thing is whatever configuration you use, how many hidden layers as long as there are plenty
and whatever widths and whether you use this bottleneck or not, it gets between 23% and
24%. So it's very robust to the exact details of how many layers and how wide they are.
And the best previous result on TIMIT for things that didn't use speaker adaptation is 24.4%
and that was averaging together lots of models, so this is good.
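The input described above, 11 frames of 39 coefficients each, amounts to a 429-dimensional vector per prediction. Below is a minimal sketch of building such windows, assuming the per-frame coefficients are already computed. The function name `stack_context` and the choice to pad the ends of an utterance by repeating the edge frames are assumptions, not details given in the talk.

```python
import numpy as np

def stack_context(frames, context=11):
    """Turn a (T, D) array of per-frame features into (T, context * D) windows
    centred on each frame. Edges are padded by repeating the first/last frame
    (an assumption made for this sketch)."""
    half = context // 2
    padded = np.pad(frames, ((half, half), (0, 0)), mode="edge")
    windows = [padded[t:t + context].reshape(-1) for t in range(len(frames))]
    return np.stack(windows)

# Illustrative stand-in for 100 frames of 39 coefficients each.
features = np.random.default_rng(0).standard_normal((100, 39))
inputs = stack_context(features)
print(inputs.shape)  # (100, 429)
```

Each row of `inputs` would be fed to the net, whose output layer predicts a distribution over the 183 HMM states for the central frame.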
>> So each of these layers that's four million weights?
>> HINTON: Yup, four million weights. So we're only training one, two, three, one, two, three,
we're training, you know, about 20 million weights. Twenty million weights is about 2%
of a cubic millimeter of cortex, I think. So, this is a tiny brain. That's, probably,
all you need for [INDISTINCT] recognition. >> Why did they start with the differences
and double differences of the MFCCs when you're going into a thing that could learn to do
that itself if it wanted to? >> HINTON: That's a very good question 'cause
you are sitting at the end. It's an extremely good question because the reason they put in the
differences and double differences is so they can model the data with a diagonal covariance
matrix--diagonal covariance model--and you can't model the fact that over time two things
tend to be very much the same without modeling covariances, unless you actually put the
differences into the data and you model the differences directly. So it allows you to
use a model that can't cope with covariances. Later on we're going to show a model that
can cope with covariances and then we are going to do what the client always said you
should do, which is throw away the mel-cepstral representation and use a better representation
of speech. >> I said that?
>> HINTON: Yes, you said that to me the last time [INDISTINCT].
>> Smart guy. >> HINTON: Okay, so the new idea is to use
a better kind of module. This module already works pretty well, right? You know, it does
well at phone recognition, it does well in all sorts of things. It can't model multiplicative
interactions very well. It can model anything with enough training data, but it's not happy
modeling multiplies. And multiplies are all over the place. I'll show you a bunch of places
where you need multiplies. Here's the sort of main example of why you need multiplies.
Suppose I want to start from a high-level description of an object: the name of the shape and its
pose (size, position, orientation). So, first of all, I want to generate the parts of an
object and I want them to be related correctly to each other. I could use a very accurate
top-down model that knows it's a square and knows the pose parameters, so I can generate each
piece in exactly the right position; that would require high bandwidth. Or I could be
sloppy and I could say, "I'm going to generate this side, and here's a sort of representation of a
distribution of where this side might be. And I'll generate corners and other sides
and they're all a bit sloppy." And if I picked one thing from each distribution, it wouldn't
make a nice square. But I could also top-down specify how these things should be pieced
together. In effect, I can specify a Markov random field: this is what goes with what.
And then I can clean this up knowing these distributions and pick a square like that.
Of course, I might sometimes pick a square that has slightly different orientation or
slightly different size, but it'll be a nice clean square because I know how they go together.
And so that's a much more powerful kind of generative model, and that's what we want
to learn to do, and so we are going to need hidden units up here to specify interactions
between visible units here, as opposed to just specifying input to visible units. There's
an analogy for this, which is, if I'm an officer and there's a bunch of soldiers and I want
them to stand in a square, I could get out my GPS and I can say, "Soldier number one,
stand at these GPS coordinates. And soldier number two, stand at these GPS coordinates."
Now, if I use enough digits, I'll get a nice neat rectangle, or I could say, "Soldier number
one, stand roughly around here. And then soldier number two, hold your arm out and stand this
distance from soldier number one." And that's a much better way to get a neat rectangle.
It will require far less communication. So what you're doing is you're downloading roughly
where people should stand and then how they should relate to each other. We have to specify
the relations not just where they should be. And that's what we'd like in a powerful [INDISTINCT]
model. So, we're going to aim to get units in one layer to say how units in the layer
below should laterally interact when you generate. It's going to turn out you don't need
to worry about these lateral interactions when you recognize. When you generate,
you do. To do that, we're going to need things called third-order Boltzmann machines, which
have three-way interactions. So Terry Sejnowski, he pointed out a long time ago that we have
an energy function like this where this was V and this was H, but these are just binary
variables. And we could perfectly well write down an energy function like this with three
things that interact; then we have a three-way weight. And if you think about these three
things now, the state of K is acting like a switch. When K is on, you effectively have
this weight between I and J. When K is off, this weight disappears. And it happens every
which way because it's symmetric. So using an energy function like this, we can allow
one thing to specify how two other things should interact. So each hidden unit can specify
a whole Markov random field over the pixels if you want. But that sort of begins to make
you worry because a Markov random field has a lot of parameters in it. And if you start
counting the indices here, if you have N of these and N of those and N of those,
you get N cubed of these parameters, which is rather a lot. If you're willing to use
N cubed parameters, you can now make networks that look like this. Suppose I have two images
and I want to model how images transform over time, and let's suppose I'm just moving random
dots around; I have a pattern of random dots and I translate it. Well, if I see that dot and
I see that dot, that's some evidence for a particular translation. And so if I put a
big positive weight there, this triangle is meant to represent that big three-way weight.
Then when this and this are on, they'll say it's very good to have this guy on. That would
be a nice low-energy state. If I also see this pair of dots, I'll get more votes
that this guy should be on, and I will turn this guy on. If however this pixel went
to here, that'll vote for this guy. And if this pixel also went to there but this guy--so
these guys are going to represent coherent translations of the image, and it's going
to be able to use these three-way weights to take two images and extract the units that
represent the coherent translation. It'll also be able to take the pre-image and the
translation, and compute which pixels should be on here. Now what we're going to do is
take that basic model and we're going to factorize it. We're going to say, "I've got these three-way
weights and I've got too many of them." So, I'm going to represent each three-way weight
as the product of three two-way things. I'm going to introduce these factors and each
factor is going to have this many parameters, which, per factor, is just a linear
number of parameters. If I have about N factors, I end up with only N squared of these weights.
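The parameter counting can be checked with a small sketch. Assuming N units in each of the three groups and F factors, each three-way weight is built as a sum over factors of a product of three two-way weights. The array names `A`, `B`, `C` and the sizes are hypothetical, chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
N, F = 8, 8  # hypothetical sizes: N units per group, F factors

# One two-way weight matrix per group of units.
A = rng.standard_normal((N, F))  # pre-image pixel i  -> factor f
B = rng.standard_normal((N, F))  # post-image pixel j -> factor f
C = rng.standard_normal((N, F))  # hidden unit k      -> factor f

# Each three-way weight is a sum over factors of a product of three two-way weights.
W = np.einsum("if,jf,kf->ijk", A, B, C)

full_params = W.size                         # N cubed
factored_params = A.size + B.size + C.size   # 3 * N * F, i.e. order N squared when F is about N
print(full_params, factored_params)  # 512 192
```

With F about equal to N, the factored form grows as N squared rather than N cubed, which is the saving described above.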
And if you think about how pixels transform in an image, they don't do random permutations.
It's not that this pixel goes there and that one goes here. Pixels do sort of consistent things,
so I don't really need N cubed parameters because I'm just trying to model these fairly
consistent transformations, of which there's a limited number, and I should be able to do it with many
fewer parameters. And this is the way to do it. So, that's going to be our new energy
function, ignoring the bias terms. One way of thinking about what I'm doing is: I
want this tensor of three-way weights. If I take an outer product of two vectors like
this, I'll get a matrix that has rank one. If I take a three-way outer product, I'll get a tensor
that has rank one. And if I now add up a bunch of tensors like that, so each factor now, each
F, specifies a rank one tensor, by adding up a bunch of them, I can model any tensor
I like if I use N squared factors. If I use only N factors, I can model most regular tensors
but I can't model arbitrary permutations, and that's what we want. If you ask how does
inference work now, inference is still very simple in this model. So here's a factor.
Here's the weights connecting it to, say, the pre-image. Here's the weights connecting
it to the post-image. Here's the weights connecting it to the hidden units. And to do inference
what I do is this. Suppose I only have that one factor. I would multiply the pixels by
these weights; add all that up so I get a sum at this vertex. I do the same here; I
get a sum at this vertex. Then I multiply these two sums together to get a message going
to the hidden units. And as that message goes to the hidden unit, I multiply
it by the weight on my connection. And so what the hidden unit will see is this weight
times the product of these two sums, and that is the derivative of the energy with respect
to the state of this hidden unit, which is what it needs to know to decide whether to
be on or off; it wants to go into whichever state will lower the energy. And all the hidden
units remain independent even though I've got these multipliers now. So this is much
better than putting in another stochastic binary unit here. If I put a stochastic binary
unit in here, the hidden units would cease to be independent and inference will get tough.
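The inference procedure with factors, as described above, can be sketched as follows. The array names and sizes are hypothetical and biases are omitted. The check at the end compares the factored computation with explicitly building the full three-way tensor.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def hidden_probs(pre, post, A, B, C):
    """Probability that each hidden unit is on, given a pre-image and a post-image.
    A: (n_pre, F), B: (n_post, F), C: (n_hidden, F) hold the two-way weights to the factors."""
    pre_sums = pre @ A                # one weighted sum per factor over pre-image pixels
    post_sums = post @ B              # one weighted sum per factor over post-image pixels
    messages = pre_sums * post_sums   # each factor multiplies its two sums together
    return sigmoid(C @ messages)      # each hidden unit weights the messages and adds them up

rng = np.random.default_rng(0)
n_pix, n_hid, n_fac = 6, 4, 5
A = rng.standard_normal((n_pix, n_fac))
B = rng.standard_normal((n_pix, n_fac))
C = rng.standard_normal((n_hid, n_fac))
pre = rng.integers(0, 2, n_pix).astype(float)
post = rng.integers(0, 2, n_pix).astype(float)

p = hidden_probs(pre, post, A, B, C)

# Reference computation that builds the full three-way tensor explicitly.
W_full = np.einsum("if,jf,kf->ijk", A, B, C)
p_full = sigmoid(np.einsum("i,j,ijk->k", pre, post, W_full))
print(np.allclose(p, p_full))
```

Because each hidden unit's input depends only on the two images, the hidden units can all be computed independently and in parallel.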
But this way, with a deterministic factor that's taking a product of these two sums, inference
remains easy. The learning also remains easy. So this is the message that goes from factor
F to hidden unit H, and that message is the product that we got at these two lower vertices;
the product of the sums it computes on the pre-image and the post-image. And the way
you learn the weight on the connection from factor F to hidden unit H is by changing the
weight so as to lower the energy when you're looking at data, and raise the energy when
you're constructing things from the model or just reconstructing things from the hidden
units you got from data. And those energy gradients, they just look like this. They're
just the product of the state of the hidden unit and the message that goes to it when
you're looking at data and the state of the hidden unit and the message that goes to
it when you're looking at samples from the model or reconstructions. So it's still a
nice pair-wise learning rule. So everything is pair-wise still, so it might fit into
the brain. Now, if we look at what one of these factors does when I show it random dot patterns
that translate, then we can look at the weights connecting it to the pre-image, and that's
a pattern of weights where white is a big positive weight, black is a big negative weight
and that will have learned a grating connecting it to the pre-image and this
will have learned a grating connecting it to the post-image. With a hundred factors,
I'll show you what Roland learned. So, those are the hundred factors connecting--these
are the receptive fields of the factors in the pre-image. And remember it's looking at
translating dots, and these are the factors in the post-image. And you see, it's basically
learned the Fourier basis and it's learned to translate things by about 90 degrees. And
that's a very good way of handling translation. Mathematicians say things like, "The Fourier
basis is a natural basis for modeling translation." I don't really know what that means, but it just
learned the Fourier basis like that. And if you give it rotations, it learns a different basis.
So this is the basis it learns for rotations. You see it learns about yin and yang here.
Oops [INDISTINCT]. Okay, that's the basis for rotations. One other thing you could do
is train it just on single dot patterns translating in a coherent way and then test it on two
overlaid dot patterns that are translating in different directions. It's never seen that
before. It's only been trained on coherent motion, but we're going to test it on what's
called transparent motion. In order to see what it thinks, since we train it unsupervised,
there's no labels anywhere, we never tell it what the motions are, we need some way
of seeing what it's thinking, so we add a second hidden layer that looks at the hidden
units representing transformations and it's fairly sparse. So the units on that second
hidden layer will be tuned to particular directions of motion. And then to see what it's thinking,
we take the directions those units like, weighted by how active those units are, and that will tell
you what directions it thinks it's seeing. Now when you show it transparent motion and
you look at those units in the second hidden layer, if the two motions are within about
30 degrees, it sees a single motion at the average direction. If they're beyond about
30 degrees, it sees two different motions and what's more they're repelled from each other.
That's exactly what happens with people, and so this is exactly how the brain works. Okay.
There's going to be a lot of that kind of reasoning in this talk. I'm going to go on to
time series models now. So, we'd like to model not just static images, for example, we'd like
to model video. To be [INDISTINCT] we're going to try something a bit simpler. When people
do time series models, you would nearly always like to have a distributed non-linear representation,
but that's hard to learn. So people tend to do dumb things like Hidden Markov Models
or Linear Dynamical Systems which either give up on the distributed or on the non-linear,
but are easy to do inference in. What we're going to come up with is something that has
the distributed and the non-linear and is easy to do inference in, but the learning algorithm
isn't quite right but it's good enough. It's just an approximation to maximum likelihood.
And the inference also is ignoring the future and just basing things on the past. So, here's
a basic module, and this is with just two-way interactions. This is the Restricted Boltzmann
Machine with visible units and hidden units. Here are the previous visible frames. These
are all going to be linear units. And so, these blue connections are conditioning the
current visible values on previous observed values in a linear way. So, it's called an
autoregressive model. The hidden units here are going to be binary hidden units; they're
also conditioned on previous visible frames, and learning is easy in this model. What you
do is you take your observed data, and then given the current visible frame and given
the previous visible frames, you get input to the hidden units, they're all independent
given the data, so you can separately decide what states they should be in. Once you've fixed
states for them, you now reconstruct the current frame using the input you're getting from
previous frames and using the top-down input you're getting from the hidden units. After you reconstruct,
you then activate the hidden units again. You take the difference in the pairwise statistics
with data here and the reconstructions here to learn these weights and you take the difference
in activities of these guys with data and with reconstructions to get a signal that you can
use to learn these weights or these weights. So learning is straightforward and it just
depends on differences, and you can learn a model like this. After you've learned it,
you can generate from the model by taking some previous frames. These inputs, the conditioning
inputs, in effect, fix the biases of these to depend on the previous frames. So, these
are the dynamic biases, and with these biases fixed, you just go backwards and forwards
for a while and then pick a frame there, and that's the next frame you've generated, then
you keep going. So, we can generate from the model once it's learned so we can see what it
believes. >> You always go back two steps in time or
is that just an example? >> HINTON: Sorry.
>> Oh, you were just going back only two steps in time?
>> HINTON: No, we're going to go back more steps in time.
>> Okay, and you let... >> HINTON: I just got lazy with the PowerPoint.
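Generation from the conditional model described above can be sketched as follows: the previous frames set dynamic biases, and the chain then alternates between hidden and visible units. This is a rough illustration under assumptions not stated in the talk: unit-variance linear visible units reconstructed without added noise, no static biases, a chain started at the most recent frame, and hypothetical names (`generate_next_frame`, `W`, `A`, `B`) and sizes. The weights here are random rather than learned.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def generate_next_frame(history, W, A, B, n_gibbs=20, rng=None):
    """history: (n_past, D) previous frames, most recent last.
    W: (D, H) visible-hidden weights.
    A: (n_past * D, D) autoregressive weights from past frames to the current frame.
    B: (n_past * D, H) weights from past frames to the hidden units."""
    rng = np.random.default_rng(0) if rng is None else rng
    past = history.reshape(-1)
    vis_bias = past @ A          # dynamic bias on the current visible frame
    hid_bias = past @ B          # dynamic bias on the hidden units
    v = history[-1].copy()       # start the chain at the most recent frame (an assumption)
    for _ in range(n_gibbs):
        h_prob = sigmoid(v @ W + hid_bias)
        h = (rng.random(h_prob.shape) < h_prob).astype(float)
        v = h @ W.T + vis_bias   # mean of the linear visible units given the hidden states
    return v

rng = np.random.default_rng(1)
n_past, D, H = 3, 50, 30         # hypothetical sizes
W = 0.01 * rng.standard_normal((D, H))
A = 0.01 * rng.standard_normal((n_past * D, D))
B = 0.01 * rng.standard_normal((n_past * D, H))

frames = [rng.standard_normal(D) for _ in range(n_past)]
for _ in range(5):
    history = np.stack(frames[-n_past:])
    frames.append(generate_next_frame(history, W, A, B, rng=rng))
print(len(frames), frames[-1].shape)
```

Each generated frame is appended to the history, so the model keeps going by conditioning on its own output.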
Now, one direction we could go from here is to higher level models. That is, having learned
this model where these hidden units are all independent given the data, we could say--well,
what I've done is I've turned the visible frames into hidden frames. And it
turns out you can get a better model if you take these hidden frames, and model what's going
on here, and now you put in conditioning connections between the hidden frames and more hidden
units that don't have conditioning here that don't interact with other hidden units. [INDISTINCT]
in this model. Then you can prove that if you do this right, then you'll get a better
model of the original sequences or you improve a bound on the model of the original sequences.
So you can [INDISTINCT] lots of layers like that. And when you have more layers, it generates
better. But I'm going to go in a different direction. I'm going to show you how to do
it with three-way connections. And we're going to apply it to motion-capture data, so you
put reflective markers on the joints, you have lots of infrared cameras, you figure
out where the joints are in space. You know the shape of the body so you go backwards
through that to figure out the joint angles and then a frame of data is going to consist
of 50 numbers, about 50 numbers which are joint angles and the translations and rotations
of the base of the spine. Okay. So, imagine we've got one of these mannequins you see in
art shop windows, we've got a pin stuck in the base of his spine and we can move him around
and rotate him using this pin and we can also wiggle his legs and arms. Okay. And what we
want him to do is as we move him around, we want him to wiggle his legs and arms so his
foot appears to be stationary on the ground and he appears to be walking. And he'd better
wiggle his leg just right as we translate his pelvis, otherwise his foot will appear
to skid on the ground. And we're going to model him; we can do a hierarchical model like
I just showed you or we can do a three-way model like this where we condition on six earlier
frames, this is the current visible frame, here's your basic Boltzmann machine, except that it's
now one of these 3-way things where these are factors. And we have a 1-of-N style variable.
So, we have data and we tell it the style when we're training it, so that's sort of
semi-supervised. It learns to convert that 1-of-N representation into a bunch of
real-valued features and then it uses these real-valued features as one of the inputs to a factor.
And what the factors are really doing is saying, these real-valued features are modulating the
weight matrices that you use for conditioning and also this weight matrix that you use in your
parallel linear model. So, these are modulating an autoregressive model. That's very different
from switching between autoregressive models; it's much more powerful. Yeah?
>> I missed what this 1-of-N is...? >> HINTON: So, we're going to have data of
someone walking in various different styles. >> Styles of walking.
>> HINTON: The style of walking. Yeah. >> So in your earlier diagram, when you
conditioned on history, it looked like there was nothing to keep track of the relative order
of the earlier direct delta direct link because... >> HINTON: Yes.
>> ... is there anything in the model that cares about that relative...?
>> HINTON: Yeah. Yeah. The weights on the connections will tell you which frame it's
coming from. Right. In the earlier model, there were two blue lines, they're different
matrices and they have different weights on them. >> There's nothing from two steps previous
to one step previous, right, it just skips all the way?
>> HINTON: It just skipped all the way, right. You just...
>> Will that continue to happen? >> HINTON: Yes. In other words there's direct
connections from all six previous frames to the current frame for determining the current
frame. >> Right. And then what links from the six
frames of the fifth earliest report? >> HINTON: Well, there were when you were
computing what the fifth frame was doing, right?
>> Okay. >> HINTON: But when we're computing this frame
we have direct connections from it. Okay. So, we're now going to train this model, it's
relatively easy to train especially on the GPU board, and then we're going to generate
from it, so we can see sort of what it learned and we can judge if it's doing well by whether
the feet slip on the ground. All right. >> [INDISTINCT]
>> HINTON: We'll get there. >> Sorry.
>> HINTON: Here's a normal walk. Maybe, at least they're willing. Okay. So I was generating
from the model--he's deciding which direction to turn in, and he's deciding, you know, he
needs to make the outside leg go farther than the inside leg and so on. If we--we have one
model but if we flip the style label to say, gangly teenager, he definitely looks awkward.
Right. We've all been there. I think this is a computer science student. My main reason
for thinking that is if you ask him to do a graceful walk, it looks like this. And that's
definitely C3PO. >> [INDISTINCT].
>> HINTON: Now, I think this was a student [INDISTINCT]--but he's very good. You can
ask him to walk softly like a cat. We're asking the model at present, right? The model looks
pretty much like the real data; in the real data, obviously, the feet are planted better, but
notice, he can slow down then speed up again. Autoregressive models can't do things like
that. Autoregressive models have a biggest eigenvalue: either it's bigger than one,
in which case they explode, or it's smaller than one, in which case they die, and the way to keep
them alive is you keep injecting random noise so that they stay alive, and that's
like making a horse walk by taking a dead horse and jiggling it, it's kind of--it's
not good. Now, he doesn't have any model of the physics so, in order to do these kinds
of stumbles, there had to be stumbles similar to that in the data, but when he stumbles and
which stumble he does, he's entirely determining. We could make him do a sexy walk but you're
probably not interested in that. >> I just order a chicken.
>> HINTON: You want dinosaur the chicken? Where's dinosaur the chicken?
>> And chicken, number five. >> At number five.
>> HINTON: Oh, no, that's dinosaur and chicken. That's a blend. Maybe a switch. He's got quite a lot of foot
[INDISTINCT] that's probably a blend. This is doing a sexy walk and then you flip the
label to normal and then you flip it back to sexy. It's never seen any transitions but
because it's all one model, it can do reasonable transitions.
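
[Editor's sketch] The remark a little earlier about linear autoregressive models, that without noise they either explode or die depending on whether the largest eigenvalue magnitude is above or below one, can be checked numerically. This is a minimal sketch using a first-order vector model x[t+1] = A x[t] + noise; the 2x2 scaled rotation matrix, the scales 0.95 and 1.05, and the noise level are made-up values chosen only for illustration.

```python
import numpy as np


def spectral_radius(A):
    """Largest eigenvalue magnitude of the transition matrix."""
    return float(np.max(np.abs(np.linalg.eigvals(A))))


def rollout_norms(A, x0, steps, noise_std=0.0, seed=0):
    """Run x <- A x + noise and record the norm of the state at each step."""
    rng = np.random.default_rng(seed)
    x = np.array(x0, dtype=float)
    norms = []
    for _ in range(steps):
        x = A @ x + noise_std * rng.normal(size=x.shape)
        norms.append(float(np.linalg.norm(x)))
    return norms


theta = 0.3  # rotation per step, so the state oscillates like a periodic gait
rotation = np.array([[np.cos(theta), -np.sin(theta)],
                     [np.sin(theta),  np.cos(theta)]])
x0 = [1.0, 0.0]

dying = rollout_norms(0.95 * rotation, x0, steps=200)       # radius 0.95 < 1
exploding = rollout_norms(1.05 * rotation, x0, steps=200)   # radius 1.05 > 1
jiggled = rollout_norms(0.95 * rotation, x0, steps=200, noise_std=0.1)
```

In this setup the noiseless norm is exactly the scale raised to the number of steps, so `dying` shrinks toward zero and `exploding` grows without bound, while `jiggled` stays alive only because noise is injected at every step. Higher-order autoregressive models behave the same way once written in companion-matrix form.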
>> So you have these hundred style variables, can you de-couple those from the 1-of-N
style and just make up new styles by playing with those...
>> HINTON: Yup. Yup. Now, you can also give it many more labels when you train, you can
give it speed, stride length all sorts of things then you can control it very well,
yeah. Okay. So, you can learn time series at least for 50 dimensional data and obviously
what we all want to do is apply that to video but we haven't done that yet. Except for some
very simple cases. The last thing I'm going to show is the most complicated use of these
3-way models. One way of thinking of it, so that it's similar to the previous uses, is
that we take an image and we make two copies of it but they have to be the same. And then we
insist the weights that go from a factor to this copy are the same as the weights that
go from the factor to this copy. So if i = j, w_if = w_jf. Inference is still easy; in fact inference
here will consist of--you take these pixels times these weights to get a weighted sum,
and then you square it because this is going to be the same weighted sum. So, inference
consists of--take a linear filter, square its output, and send it by these weights to the hidden
units. That's exactly the model called the [INDISTINCT] energy model, if you have the right kind
of linear filter. This has been proposed both by vision people, by Adelson and Bergen, a
long time ago in the '80s, and by neuroscientists. So, neuroscientists had tried to take simple
cells, my point vaguely about, and look at what polynomial their output is of their
input, and Yang Dan at Berkeley says it's between 1.7 and 2.3, and that means two.
So, this looks quite like models that were proposed for quite different reasons and it
just drops out of taking a 3-way imaging model and factorizing it. The advantage we have
is that we have a learning algorithm for all these weights now, and we have a generative model.
So now we can model covariances between pixels, and the reason that's good is--well, here's
one reason why it's good. Suppose I asked you to define a vertical edge. Most people
will say, "Well, a vertical edge is something that's light on this side and dark on that side.
Well no, maybe it's light on this side and dark on that side, but you know. Well, it
could be light up here and dark down there, and dark up here and light down there." Okay.
Or it could be a texture edge. It's getting--oh, it might actually be a disparity edge. Well,
then there might be motion this side and no motion that side. That's a vertical edge
too. So, a vertical edge is a big assortment of things, and what all those things have
in common is a vertical edge is something where you shouldn't do horizontal interpolation.
Generally in an image, horizontal interpolation works really well. A pixel is the average
of its right and left neighbors, pretty accurately almost all the time. Occasionally it breaks down,
and the place it breaks down is where there is a vertical edge. So, a real abstract definition
of vertical edge is breakdown of horizontal interpolation. And that's what our models
are going to do. A hidden unit is going to be putting in interpolation, and it's actually
going to turn off, sort of reverse logic: when that breaks down it's going to turn off, so
one way of seeing it is this. If this hidden unit here is on, it puts in a weight between
pixel i and pixel j that's equal to this weight times this weight, times this weight. Okay.
Since these--okay, that's good enough. So, these are controlling effectively the Markov
random field between the pixels, so we can model covariances nicely. Because the
hidden units are creating correlations between the visible units reconstruction is now more
difficult. We could reconstruct one image given the other image, like we did with motion,
but if you want to reconstruct them both and make them identical it gets to be harder.
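
[Editor's sketch] Two points made above, that inference amounts to "take a linear filter, square its output, and send it to the hidden units" and that a vertical edge is a breakdown of horizontal interpolation, can be combined in a toy example. This is a minimal sketch, not the model's actual parameterization: it uses a single hand-written three-pixel filter and one hidden unit, and the `bias` and `penalty` values are hypothetical.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


# Filter that measures horizontal interpolation error at the centre pixel:
# the centre value minus the average of its left and right neighbours.
interp_filter = np.array([-0.5, 1.0, -0.5])


def covariance_unit_prob(pixels, bias=2.0, penalty=50.0):
    """Probability that the hidden unit is on for three neighbouring pixels.

    bias and penalty are hypothetical values. The squared filter output
    enters with a negative sign, so the unit turns OFF when interpolation
    fails, which is the reverse logic described in the talk.
    """
    response = float(interp_filter @ np.asarray(pixels, dtype=float))
    return float(sigmoid(bias - penalty * response ** 2))


def induced_pairwise_weights(penalty=50.0):
    """Strength of the pixel-to-pixel coupling the unit puts in when it is on.

    Entry (i, j) is filter[i] * filter[j] * penalty: a product of three weights.
    """
    return penalty * np.outer(interp_filter, interp_filter)


smooth = [0.40, 0.50, 0.60]     # centre equals the mean of its neighbours
edge = [0.10, 0.10, 0.90]       # dark, dark, light: a vertical edge
reversed_edge = [0.90, 0.90, 0.10]
```

With these made-up numbers the unit stays on for the smooth patch and turns off for the edge, and because the filter output is squared it gives the same answer for the contrast-reversed edge.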
So, we have to use a different method called Hybrid Monte Carlo. Essentially you start where the
data was and let it wander away from where it was but keeping both images the same. And
I'm not going to go into Hybrid Monte Carlo, but it works just fine for doing the learning.
And the Hybrid Monte Carlo is used just to get the reconstructions and the learning algorithm
is just the same as before. And what we're going to do is we're going to have some hidden units
that are using these 3-way interactions to model covariances between pixels, and other
hidden units are just modeling the means. And so we call--for mean and covariance,
we call this mcRBM. Here's an example of what happens after it's learned on black and white
images. Here's an image patch. Here's its reconstruction of the image patch, if you
don't have noise, which is very good, from the mean and covariance units. Here's the
stochastic reconstruction which is also pretty good. But now we're going to do something
funny, we're going to take the activations of the covariance units. The things that are
modeling which pixels are the same with which other pixels and we're going to keep those.
But we are going to take the activations of the mean units, so we're going to throw those
away, and pretend that the means from the pixels look like this. Well, let's take this
one first. We tell it all the pixels have the same value, except these ones which are much
darker, and it now tries to make that information about means fit in with this information
about covariances, which is that these guys should be the same but very different from these
guys. And so, it comes up with a reconstruction that looks like that. Where you see it's
taken this dark stuff and blurred it across this region here. If we just give it four dots
like that, and the covariance matrix you've got from there, it'll blur those dots out
to make an image that looks quite like that one. So this is very like what's called the
kind of watercolor model of images, where you know about where the boundaries are and
you just sort of roughly sketch in the colors of the regions and it all looks fine to us,
because we sort of slave the color boundaries to the actual--where the edges are. If you
reversed the colors of these it would produce the reversed image because the covariance doesn't
care at all about the signs of things. If you look at the filters that it learns, the
mean units which are for sort of coloring in regions, learn these blurry filters--and
by taking some combination of a few dozen of those you can make more or less
whatever colors you like anywhere. So, they're very blurry--smooth, blurry, and multicolored
and you can make roughly the right colors. The covariance units learn something completely
different. So, these are what the filters learned and you'll see that those factors,
they learn high frequency black and white edges. And then a small number of them turn
into low frequency color edges that are either red-green or yellow-blue, and what's more,
when you make it form a topographic map using a technique I'll describe on the next slide,
you get this color blob, this low frequency color blob in with the low frequency black
and white filters. And that's just what you see in a monkey's brain, pretty much. If you
go into a monkey's brain you'll see these high frequency filters whose orientation changes
smoothly as you go through the cortex tangentially, and you'll see these low frequency color blobs.
Most neuroscientists thought that at least must be innate. What this is saying is, "Nope.
Just the structure of images, and the idea of forming a topographic map, is enough to
get this." That doesn't mean it's not innate, it just means it doesn't need to be. So the
way we get the topographic map is by this global connectivity from the pixels to the
factors. So the factors really are learning local filters. And the local filters start
off colored and gradually learn to be exactly black and white. Then there's local connectivity
between the factors and the hidden units. So one of these hidden units will connect to
a little square of factors and that induces a topography here and the energy function
is such that when you turn off one of these hidden units to say smoothness no longer applies,
you pay a penalty. And you'd rather just pay the penalty once. And so if two factors are
going to come on at the same time, it's best to connect them to the same hidden unit so
you only pay the penalty once. And so that will cause similar factors to go to similar
places in here and we get a topographic map. For people who know about modeling images,
as far as I know, nobody has yet produced a good model of patches of color images. That
is, a generative model that generates stuff that looks like the real data. So, here's
a model that was learned on 16x16 color images from the Berkeley database and here are these
generated from the model. And they look pretty similar. Now, it's partly a trick, the color
balance here is like the color balance there and it makes you think they are similar. But,
it's partly real. I mean, most of these are smooth patches of roughly uniform color as
are most of these. There are a few more of these that are smooth than those. But you also get these
things where you get fairly sharp edges, so you get smoothness then a sharp edge then
more smoothness, like you do in the real data. You even get things like corners here. We're
not quite there yet but this is the best model there is of patches of color images. And it's
because it's modeling both the covariance and the means, so it's capable of saying,
"What's the same as what?" As well as, "What are the intensities?" You can apply it for
doing recognition. So this is a difficult object recognition task where there are 80 million
unlabelled training images; not only of these classes but of thousands and thousands of classes.
They were collected by people at MIT. It's called the Tiny Images database. They're 32x32
color images. But it's surprising what you can see in a 32x32 color image. And since
the biggest model we're going to use has about a hundred million connections, that's about
0.1 of a cubic millimeter of cortex in terms of the number of parameters, and so we have
to somehow give our computer model some way of keeping up with the brain which has a lot
more hardware, right? And so we do it by giving it a very small retina. We say, "Suppose the
input was only 32x32, maybe we can actually do something reasonable there." So as you'll
see there are a lot of variations. If you look at birds, that's a close up of an ostrich,
this is a much more typical picture of a bird. And it's hard to tell the difference within
these tiny categories. Particularly things like deer and horse. We deliberately chose
some very similar categories like truck and car, deer and horse. People are pretty good
at these. People won't make very many errors. That's partly because these were hand-labeled
by people, so. But even people make some errors. We only have 50,000 training examples.
Five thousand of each class and ten thousand test examples, because we have to hand-label
them, but we have a lot of untrained--unlabelled data. So we can do all this pre-training
on lots of unlabelled data and then take our covariance units and our mean units and just
try doing multi-[INDISTINCT] on top of those, or maybe add another hidden layer and do it
on top of that. So, what Marc'Aurelio Ranzato actually did since he worked in Yann LeCun's
lab, he actually took smaller patches, learned the model, and then strode them across the
image and replicated them. So it's sort of semi-convolutional. And then took the
hidden units of all of these little patches and just concatenated them to make a great
big vector of 11,000 hidden units which are both the means and the covariances. And then
we're going to use that as our features and see how well we can do. And we're going to
compare it with various other methods. So the sort of first comparison, you just take
the pixels and do logistic regression on the pixels to decide on the ten classes. You get 36%
right. If you take GIST features which were developed by Torralba and the people at MIT,
which were meant to capture what's going on in the image quite well, but they're fairly
low dimensional, you get 54%. So they're much better than pixels. If you take a normal RBM
which has linear units with Gaussian noise as input variables and then binary hidden units,
and then you use those binary hidden units to do classification, you get 60%. If you use
one of these RBMs with both the units like these ones for doing the means, and then these
units with the three-way interaction for modeling covariances, you get 69%; as long
as you use a lot of these factors. And if you then learn an extra hidden layer of 8,000
units--so now it's just that times that is a hundred million, so there's an extra hundred
million connections you learn there. But that's fine because it's unsupervised then you just
learn it on lots of data. You get up to 72%. And that's the best result so far on this
database. One final thing, you can take this model that was developed for image patches and
the student who'd been doing phoneme recognition just took that code and applied it to log
spectrograms, which is sort of closer to what they would like to see, you're not
using all this mel-cepstral stuff, which is designed to throw away stuff you think
you don't need and get rid of lots of correlations. Instead you're going to take data that has
lots of correlations in but we've got a model that can deal with that stuff now. And the
first thing George tried on February the 20th, which was four layers of a thousand hidden
units on top of this, he got 22.7 percent--percent error; which was the record for phoneme
recognition on the TIMIT database where you're not trying to do a model adapted to each speaker.
And then a week later when he did that to TIMIT and used more frames, he was down to
21.6%. So this--all this stuff was designed to do vision. It wasn't designed to do phonemes.
And if we treat phoneme recognition as just a vision problem on a log
spectrogram, we can wipe out the speech class, at least on small vocabulary. Another student,
who is now at Microsoft, is seeing if this will work on big vocabulary as well.
>> [INDISTINCT] >> HINTON: Yes. Yes, right.
>> We can give them new better tools. >> HINTON: We can give them new and better
tools. So here's phoneme recognition over the years. Backprop from the 80's got 26.1
percent error. Over the next 20 years or so, they got that down to 24.4 percent, using
methods that weren't neurally inspired so we'll call them artificial. We then got down
to 21.6 percent; an estimate of human performance is about 15 percent. I don't know much about
how they did this estimate, I'm afraid. But we're about--we're nearly a third of the way
from artificial to human. And so we need two more ideas and we're there. Okay, I'm done.
I'm finished. >> Questions?
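
[Editor's sketch] The "nearly a third of the way" remark can be checked with the three figures quoted in the talk. This is a small worked calculation that treats all three numbers as phoneme error rates on the same scale, which is an assumption; the variable names are made up.

```python
artificial = 24.4   # percent, best earlier result quoted in the talk
new_model = 21.6    # percent, result quoted for the new model
human = 15.0        # percent, quoted estimate of human performance

# Fraction of the gap between the earlier result and the human estimate
# that the new result closes.
fraction_closed = (artificial - new_model) / (artificial - human)
```

Under that assumption the gap closed is 2.8 points out of 9.4, or about 0.30.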
>> HINTON: Yes? >> You mentioned YouTube recently announced
that the [INDISTINCT] have broken the world record on the MNIST data set of digit
recognition by simply using a seven layered feed forward network trained with backprop,
but doing it on a GPU with lots and lots of cycles.
>> HINTON: Yes, he did indeed announce that. What he didn't announce was--he's got a spectacular
result. He gets down to 35 errors. What he didn't announce was there's two tricks involved.
One trick is to use a big net with lots of layers on a GPU board. That trick by itself
wouldn't give you 35 errors. There's a second trick which was sort of pioneered by people
at Microsoft in fact, which is to put lots of work into producing distortions of the
data so you have lots and lots of labeled data. So you take a labeled image of a two
and you distort it in clever ways and make it still look like a two but be translated
so people can then get down to about 40 errors. >> I think they patented that already.
>> HINTON: Good. So Dick's already patented that. So you get down to--you can get down
to about 40 errors by doing these distortions. What he did was even better distortions, or
more of them, and a much bigger net on a GPU and he got from 40 to 35, which is impressive
because it is hard to make any progress there. But it won't work unless you have a lot of
labeled data. And what's--the disguised thing is the work went into--if you look in the
paper, it's all straightforward, it's just backprop, except when you get to the section
on how they generated all that extra labeled data where there's very careful things, like
if it's a one or a seven they'd only rotate it a certain number of degrees but if it's
something else they rotate it in more degrees. I'm actually the referee for this paper but
I don't mind him knowing. I think it's a very important work. But he should emphasize that
they have to have labeled data to do that, and they have to put work into distortions.
So for me the lesson of that paper is when we had small computers, you should put your effort
into things like weight constraints so you don't have too many parameters because you've
only got a small computer. As computers get bigger and faster, you can transfer your effort
from, instead of tying the weights together, like Yann was doing in the early days, put
your effort into generating more distortions so you can inject your prior knowledge in
the form of distortions and that's much less computationally efficient, but with big computers,
it's fine and it's more flexible. So I think that's the lesson of that paper.
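
[Editor's sketch] The idea of injecting prior knowledge by generating distorted copies of labeled images, with tighter rotation limits for digits such as one and seven, might look roughly like this. This is a minimal NumPy-only sketch and not the procedure from the paper being discussed: the rotation limits, the shift range, and every function name here are hypothetical, and rotation uses simple nearest-neighbour sampling.

```python
import numpy as np

# Hypothetical per-class rotation limits in degrees: tighter for 1 and 7,
# which can start to look like each other if rotated too far.
MAX_ROTATION = {1: 7.5, 7: 7.5}
DEFAULT_MAX_ROTATION = 15.0


def rotate_nearest(image, degrees):
    """Rotate a 2-D image about its centre using nearest-neighbour sampling."""
    h, w = image.shape
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    t = np.deg2rad(degrees)
    ys, xs = np.mgrid[0:h, 0:w]
    # Inverse mapping: for each output pixel, find the source pixel.
    src_x = np.cos(t) * (xs - cx) + np.sin(t) * (ys - cy) + cx
    src_y = -np.sin(t) * (xs - cx) + np.cos(t) * (ys - cy) + cy
    sx = np.rint(src_x).astype(int)
    sy = np.rint(src_y).astype(int)
    inside = (sx >= 0) & (sx < w) & (sy >= 0) & (sy < h)
    out = np.zeros_like(image)
    out[inside] = image[sy[inside], sx[inside]]
    return out


def distort(image, label, rng, max_shift=2):
    """Return one randomly rotated and shifted copy; the label is unchanged."""
    limit = MAX_ROTATION.get(label, DEFAULT_MAX_ROTATION)
    angle = rng.uniform(-limit, limit)
    dy, dx = rng.integers(-max_shift, max_shift + 1, size=2)
    rotated = rotate_nearest(image, angle)
    # np.roll wraps around, which is harmless when the image has blank margins.
    return np.roll(rotated, shift=(int(dy), int(dx)), axis=(0, 1)), label


def augment(images, labels, copies, seed=0):
    """Make `copies` distorted versions of every labeled image."""
    rng = np.random.default_rng(seed)
    out_x, out_y = [], []
    for image, label in zip(images, labels):
        for _ in range(copies):
            x, y = distort(image, int(label), rng)
            out_x.append(x)
            out_y.append(y)
    return np.stack(out_x), np.array(out_y)
```

The class-dependent limits are where the prior knowledge lives: nothing about the network changes, only the set of labeled examples it sees.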
>> I shouldn't even need to ask you a question, you answered it. Thank you.
>> HINTON: Any other long question? >> It seems like you've invented some kind
of a cortex here that has, you'd expect, the property that if it does vision it'll do sound.
>> HINTON: Yes. >> What other problems are you going to apply
it to? >> HINTON: Maybe it'd be quicker to say the
problems we're not going to apply it to. >> Okay.
>> HINTON: I can't think of any. I mean--okay, let me say what the main limitation of this
is for vision. We got at least 10 billion neurons for doing visual things; or at least
a billion anyway, probably, 10 billion. And even if we got that many neurons and about
10 to the 13 connections for doing vision, we still have a retina that's got a very small
fovea the size of my thumbnail at arm's length. And so we still take almost everything
and don't look at it. I mean, the essence of vision is not to look at almost everything
intelligently; and that's why you've got all these funny illusions where you don't see things.
We have to do that in these models. These models are completely crazy. And all of computer
vision is completely crazy, almost all of it. Because they take a uniform resolution
image, and quite a big one like a thousand by thousand, and they try and deal with it
all at once with filters all over the image. And if they're going to do selection, they
either do it by running their face detector everywhere, with no intelligence, or
they do sort of interest point detection at a very low level to decide what to attend
to. What we do is we fixate somewhere. Then on the basis of what our retina gives us,
with these big pixels around the edges and small pixels in the middle, we sort of decide
what we're seeing and where to look next and by the second or third fixation we're fixating
very intelligently and the essence of it is that vision is sampling, it's not processing
everything; and that's completely missing from what I said. Now in order to do that,
you have to be able to take what you saw and where you saw it and combine them, and
that's a multiply. So this module, it can do multiplies. It's very good at combining whats
and wheres, to integrate information over time. And that's one of the things, we're
working on that. But that's probably the biggest thing missing. But that is an example of having
a module that's quite good but it's never good enough, so you have to put it together
over time and use it many times. And that's what sequential reasoning and all this stuff
are. So basically, as soon as people become sequential we're not modeling that at all. We're modeling
what you can do in a hundred milliseconds. And so that's what's missing. But I believe that
to model that sequential stuff we need to understand what it is a sequence of: it's a
sequence of these very powerful operations. And we're in better shape now to try and
model sequential AI than we would be if we didn't know what a primitive operation is. So if your
sort of primitive operation was just deciding whether two symbols are the same, we're going
to be out of luck for understanding how people do sequential stuff. Yeah.
>> This is a [INDISTINCT] question as he said he wanted to do everything if it connects.
Are you going to do [INDISTINCT] logic like there exists a God and every girl has a boy
she loves? >> HINTON: Hang on, I'm still processing that.
Right. Right, I'm making the point that people find quantifiers quite difficult.
>> Oh, yeah. If you [INDISTINCT] quantifiers... >> HINTON: I would love to do that. I have
not got a clue how to do it. And you will notice that in old-fashioned AI they used
to point out to [INDISTINCT] people that you can't do quantifiers, so forget it. Nowadays,
when they all do graphical models, they don't mention that anymore because the graphical
models have difficulty with it too. Some people have got [INDISTINCT] some people do. Right.
Yeah, some people do. But most of the graphical models of, like, five years ago, they don't do quantifiers
either. And so, a pretty good division line would be what you can do without having to
deal with really sophisticated problems like that. I would love to know how we deal with
that, but I don't. >> Thank you.
>> HINTON: So, yeah, I'm going to give up on that right now.
>> So if you had 80 million labeled images and no extra unlabeled ones, would you do
your pre-training... >> HINTON: Yes. Yes
>> ...and then fine tuning to make us better? >> HINTON: In TIMIT, that's what we have.
In TIMIT, all the examples have labels. It's still a big win to do the pre-training.
>> But you didn't sneak this result I'm just hearing about? It seems to suggest...
>> HINTON: Well, the audience switched it but I haven't tried with all these distortions
during pre-training. Now, I have a student called [INDISTINCT] who just produced a thesis.
Well, he tries things like that. He tries distortions in earnest and he uses special
distortions of his own. And the fact is distortions helped a lot. But if you do pre-training,
that helps some more too. And [INDISTINCT] results, yes, [INDISTINCT] results, suggest
that pre-training will get you to a different part of the space even if you have all this
labeled data. So clearly, one thing that needs to be done is to try the pre-training and
combine with these labels. You don't have to have the pre-training, but I bet you, it
still helps. And I bet you, it's more efficient too. It's faster because the pre-training
is rather pretty fast, you always have to learn a very good model. You got lots of its
features. And starting from there, I think, you'll do better than he does just starting
from random, and faster. That's just a prediction. You might even get down to 34 out of this.
The problem with [INDISTINCT] you can't get significance. TIMIT is really nice that way.
They designed it well, so you get higher error rates. So you can see differences.
>> On the time series aspect, did you see anything that would you get inferences or
alterations that are beyond the size of the time window you're using?
>> HINTON: Sorry, I didn't understand the question. We have a limited time. We don't...
>> You have limited time, after training is there anything of a model that picks up...
>> HINTON: Nothing. >> Nothing.
>> HINTON: Nothing. It cannot deal with--it can't model host...
>> It has an internal state. It has an internal state.
>> HINTON: Right. But if sort of what happened 15 time steps ago really tells you what should
happen now, and it only tells you what should happen now. It doesn't tell you what
should happen in the intermediate 14 time steps. It just contains information across 15 time
steps without having a signature at smaller time scales. You can't pick up on that.
>> Okay. >> HINTON: Because it's not got a hidden forward-backward
algorithm. A forward-backward algorithm potentially could pick up a lot of load, actually can't.
>> So this one wouldn't pick up on things like object permanence, or a ball rolls behind
the box and comes out of the other side and they're not going to be able to...
>> HINTON: Not over a long time scale, no, no. Unless you say that there's a memory involved
when you go back to a previous--it gets more complicated, right? Now, it is true that when
you build the multilevel one, which you can do with the three interconnections as well
as with the three-way connections, at every level you're getting a bigger time span because
your time window, it's going further back into the past with each level. So you get
a bit high, but that's just sort of linear. >> Can you say--do you have any rules of thumb
of how much unlabeled data you need to train each of the different levels and how it would
change, like, is it just linear with the number of weights or as you go up levels the things
change? >> HINTON: I have one sort of important thing
to say about that, which is that if you're modeling high-dimensional data and you're
trying to build an unsupervised model of the data, you need many less trainings on [INDISTINCT]
than you would have thought if you're used to discriminative learning. When you're doing
discriminative learning, there are typically very few bits per training case to constrain
the parameters. You're going to constrain--the amount you constrain the parameters per training case
is the number of bits it takes to specify the answer, not the number it takes to specify
the input. So with MNIST, you get 3.3 bits per case. If you're modeling the image, the
number of bits per case is the number of bits it takes to specify the image, which is about
a hundred bits. So you need far fewer cases per parameter. In other words what I'm saying
is you're modeling much richer things, and so each case is giving you much more information.
So actually, we can typically model many more parameters than we have training cases. And
discriminative people aren't used to that. Many less parameters than we have pixels and
many more than training cases. And in fact, he used about two million cases for doing
the image stuff, and it wasn't enough, it was over fitting. He should have used more.
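
[Editor's sketch] The bits-per-case argument above can be made concrete with a little arithmetic. The 3.3-bit figure matches log2 of ten, so this sketch assumes a ten-way class label; the hundred bits per image, the two million cases and the hundred million parameters are the rough figures quoted in the talk, and the variable names are made up.

```python
import math


def label_bits(n_classes):
    """Most bits a class label can supply: log2 of the number of classes."""
    return math.log2(n_classes)


bits_per_label = label_bits(10)   # a ten-way label, assumed from the 3.3 figure
bits_per_image = 100.0            # rough figure quoted for specifying an image

# How many times more constraint one case gives a generative model
# than it gives a discriminative one, under these rough figures.
constraint_ratio = bits_per_image / bits_per_label

training_cases = 2e6              # cases quoted for the image model
parameters = 100e6                # parameters quoted for the image model
parameters_per_case = parameters / training_cases
```

On these numbers each case constrains a generative model about thirty times more than a label does, yet fifty parameters per case was still quoted as too many, which is consistent with the remark that more data should have been used.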
But he was fitting 100 million parameters. But the--basically, the only rule of thumb
is many fewer parameters than the total number of pixels in your training data,
but you can typically use many more parameters than the number of training cases. And you can't
do that with normal discriminative learning. Now, if you do do that, when you start discriminative
training, it quickly improves things and then very quickly overfits. So you have to stop
it early. Okay. >> Okay?
>> HINTON: Thanks. >> Let's thank the speaker again.
>> Thank you.