>>

So it's my pleasure to introduce to you Geoff Hinton, who is a pioneer in machine learning

and neural nets, and more recently [INDISTINCT] architectures. And then, I think that's going

to be the topic of today. So take it over. >> HINTON: Okay. So I gave a talk here a couple

of years ago. And the first 10 minutes or so will be an overview of what I said there,

and then I'll talk about the new stuff. The new stuff consists of a better learning module.

It allows you to learn better in all sorts of different things, like, learning how images

transform, learning how people walk, and learning object recognition. So the basic learning

module consists of some variables that represent things like pixels, and these will be binary

variables for now. Some variables that represent--these are latent variables, they're also going to

be binary. And there's a bipartite connectivity, so these guys aren't connected to each other.

And that makes it very easy if I give you the states of the visible variables to infer

the states of the hidden variables. They're all independent given the visible variables

because it's an undirected graph. And the inference procedure just says, the probability

of turning on hidden unit "hj" given this visible vector "v" is the logistic function

of the total input it gets from the visible units, so very simple for the hidden variables. Given

the hidden variables, we can also infer the visible variables very simply. And if we want

some--if we put some weights on the connections and we want to know what this model believes,

we can just go backwards and forwards, inferring all the hidden variables in parallel, then

all the visible ones. Do that for a long time, and then you'll see examples of the

kinds of things it likes to believe. And the aim of learning is going to be to get it to

like to believe the kinds of things that actually happen. So this thing is governed by an energy

function that is given by the weights on the connections. The energy of a visible plus

a hidden vector is the sum over all connections of the weight if both the visible and hidden

units are active. So you pick all the pairs of units that are both active, you add in

the weight, and if it's a big positive weight, that's low energy, which is good. So, it's

a happy network. This has nice derivatives. If you differentiate it with respect to the

weights, you get this product of the visible and hidden activity. And so, that derivative

is going to show up a lot in the learning because that derivative is how you change

the energy of a combined configuration of visible and hidden units. The probability

of a combined configuration, given the energy function, is E to the minus the energy of

that combined configuration normalized by the partition function. And if you want to

know the probability of a particular visible vector, you have to sum over all the hidden vectors

that might go with it and that's the probability of the visible vector. If you want to change the

weights to make this probability higher, you obviously need to lower the energies of combinations

of that visible vector and the hidden vectors that would like to go with it and raise the energies

of all other combinations, so you decrease the competition. The correct maximum likelihood

learning rule, that is, if I want to change the weights so as to increase the log probability

that this network would generate the vector "v" when I let it just sort of fantasize the

things it likes to believe in, has a nice simple form. It's just the difference of two correlations.
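
[Written out as a sketch, not taken from the slides: the quantities described so far, with bias terms omitted. The symbols sigma for the logistic function, Z for the partition function, and the angle brackets for averages over data and over samples from the model are notational assumptions.]

```latex
\begin{align}
E(v,h) &= -\sum_{i,j} v_i\, h_j\, w_{ij},
  &
  -\frac{\partial E(v,h)}{\partial w_{ij}} &= v_i h_j \\
p(h_j = 1 \mid v) &= \sigma\Big(\sum_i v_i\, w_{ij}\Big),
  &
  \sigma(x) &= \frac{1}{1 + e^{-x}} \\
p(v,h) &= \frac{e^{-E(v,h)}}{Z},
  &
  Z &= \sum_{v',h'} e^{-E(v',h')} \\
p(v) &= \sum_{h} p(v,h),
  &
  \frac{\partial \log p(v)}{\partial w_{ij}}
  &= \langle v_i h_j \rangle_{\text{data}} - \langle v_i h_j \rangle_{\text{model}}
\end{align}
```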

So even though it depends on all the other weights, it shows up as this difference of

correlations. And what you do is you take your data, you activate the hidden units,

that's stochastic binary units, and then reconstruct, activate, reconstruct, activate.

So this is a Markov chain. You run it for a long time, so you forget where you started.

And then you measure the correlation there, and subtract it from the correlation here. And what

you're really doing is saying, "By changing the weights in proportion to that, I'm lowering

the energy of this visible vector with whatever hidden vector it chose. By doing the opposite

here, I'm raising the energy of the things I fantasize." And so, what I'm trying to do

is believe in the data and not believe in what the model believes in. Eventually, this

correlation will be the same as that one, in which case nothing will happen because

it will believe in the data. It turns out that you can get a much quicker learning algorithm

where you can just go on and [INDISTINCT] again, and you take this difference of correlations.
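
[A minimal sketch of the quick learning procedure just described, assuming it is the one-step version usually called contrastive divergence (CD-1). Biases are omitted, and the function names, learning rate, toy data and layer sizes are illustrative assumptions rather than anything shown in the talk.]

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def energy(v, h, W):
    """E(v, h) = -sum_ij v_i h_j w_ij for one binary configuration (biases omitted)."""
    return -float(v @ W @ h)

def cd1_update(v0, W, rng, lr=0.1):
    """One CD-1 step on a batch v0 of shape (batch, n_visible)."""
    # Up: p(h_j = 1 | v) is the logistic of the total input from the visible units.
    ph0 = sigmoid(v0 @ W)
    h0 = (rng.random(ph0.shape) < ph0).astype(float)
    # Down: reconstruct the visible units from the sampled hidden states.
    pv1 = sigmoid(h0 @ W.T)
    v1 = (rng.random(pv1.shape) < pv1).astype(float)
    # Up again: hidden probabilities given the reconstruction.
    ph1 = sigmoid(v1 @ W)
    # Difference of correlations: <v h> on data minus <v h> on reconstructions.
    pos = v0.T @ ph0 / len(v0)
    neg = v1.T @ ph1 / len(v0)
    return W + lr * (pos - neg)

# Toy usage: two 4-pixel binary patterns, three hidden units (all assumed sizes).
rng = np.random.default_rng(0)
data = np.array([[1, 1, 0, 0], [0, 0, 1, 1]], dtype=float)
W = 0.01 * rng.standard_normal((4, 3))
for _ in range(200):
    W = cd1_update(data, W, rng)
```

[Running the chain for many up-down steps before measuring the second correlation would give the maximum likelihood rule; stopping after one reconstruction is the shortcut.]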

Justifying that is hard but the main justification is it works and it's quick. The reason this

module is interesting, the main reason it's interesting, is you can stack them up. That

is, for complicated reasons I'm not going to go into, it works very well to train

the module, then take the activities of the feature detectors, treat them as though they were

data, and train another module on top of that. So the first module is trying to model what's

going on in the pixels by using these feature detectors. And the feature detectors would

tend to be highly correlated. The second module is trying to model the correlations among feature

detectors. And you can guarantee that if you do that right, every time you go up a level,

you get a better model of the data. Actually, you can guarantee that the first time you

go up a level. For further levels, all you can guarantee is that there's a bound on how

good your model of the data is. And every time we add another level, that bound improves

if we add it right. Having got this guarantee that something good is happening as we add

more levels, we then violate all the conditions of the mathematics and just add more levels in

a sort of [INDISTINCT] way because we know good things are going to happen and then we justify

it by the fact that good things do happen. This allows us to learn many layers of feature detectors

entirely unsupervised, just to model the structure of the data. Once we've done that, you can't

get that accepted in a machine learning conference because you have to do discrimination to be

accepted in a machine learning conference. So once you've done that, you add some decision

units to the top and you learn the connections discriminatively between the top-layer features

and the decision units, and then if you want you can go back and fine-tune all of the connections

using backpropagation. That overcomes the limitation of backpropagation, which is that there's

not much information in the labels and it can only learn on labeled data. These things can

learn on large amounts of unlabeled data. After they've learned, then you add these

units at the top and backpropagate from this small amount of labeled data, and that's not

designing the feature detectors anymore. As you probably know at Google, designing feature

detectors is the art of things and you'd like to design feature detectors based on what's

in the data, not based on having to produce labeled data. So the idea of backpropagation

was: design your feature detectors so you're good at getting the right answer. The idea

here is: design your feature detectors to be good at modeling whatever is going on in the

data. Once you've done that, just ever so slightly fine-tune them so you're better at getting the right

answer. But don't try and use the answer to design feature detectors. And Yoshua Bengio's

lab has done lots of work showing that this gives you better minima than just doing backpropagation.

And what's more, minima in a completely different part of the space. So just to summarize this

section, I think this is the most important slide in the talk because it says, "What's

wrong with machine learning up to a few years ago." What people in machine learning

would try to do is learn the mapping from an image to a label. Now, that would be

a fine thing to do if you thought that images and labels arose in the following way:

there's stuff, and it gives rise to images, and then the images give rise to the labels. Given

the image, the labels don't depend on the stuff. But you don't really believe that.

You only believe that if a label is something like the parity of the pixels in the image.

What you really believe is that stuff gives rise to images, and then the labels

go with the images because of the stuff, not because of the image. So it's a cow in a field

and you say cow. Now, if I just say cow to you, you don't know whether the cow is brown

or black, or upright or dead, or far away. If I show you an image of the cow, you know all

those things. So this is a very high bandwidth path, this is a very low bandwidth path. And

the right way to associate labels with images is to first learn to invert this high bandwidth

path. And we can clearly do that because vision basically works. You just

look, and then you see things. And it's not like it might be a cow, it might be an elephant,

it might be electric theater. Basically, you get it right nearly all the time. And so we

can invert that pathway. Having learned to do that, we can then learn what things are

called. But you get the concept of a cow not from the name, but from seeing what's going

on in the world. And that's what we're doing, and then later associating the label. Now,

I need to do one slight modification to the basic module which is I had binary units as

the observables. Now, we want to have linear units with Gaussian noise. So we just change

the energy function a bit. And the energy now says, "I've got a kind of parabolic containment

here." Each of these linear visible units has a bias which is like its mean. And it

would like to sit here and moving away from that [INDISTINCT] energy. The parabola is

the negative log of the Gaussian [INDISTINCT]. And then the input that comes from the hidden

units, this is just vi, hj, wij, but Vs have to be scaled by the standard deviation of

the Gaussian there. If I ask--if I differentiate that with respect to a visible activity, then

what I get is hj, wij divided by sigma i. And that's like an energy gradient. And

what the visible unit does when you reconstruct is it tries to compromise between wanting

to sit around here and wanting to satisfy this energy gradient, so it goes to the place

where these two gradients [INDISTINCT] opposite and you have--that's the most likely value

and then you [INDISTINCT] there. So with that small modification we can now deal with real

value data with binary latent variables and we have an efficient learning algorithm that's

an approximation of [INDISTINCT]. And so we can apply it to something. So it's a nice

speech recognition task that's been well organized by the speech people where there's an old

database called TIMIT, it's got a very well-defined task for phone recognition where what you

have to do is you're given a short window of speech, you have to predict the distribution,

the probability for the central frame of the various different phones. Actually, each phone

is modeled by 3-state HMMs, sort of beginning middle and end, so you have to predict for

each frame is it the beginning, middle or end of each of the possible phones; there are

183 of those things. If you give it a good distribution there to sort of focus on the

right thing, then all the post-processing will give you back where the phone boundaries

should be and what your phone error rate is, and that's all very standard. Some people

use tri-phone models. We're using bi-phone models which aren't quite as powerful. So

now we can test how good we are by taking 11 frames of speech. It's 10 milliseconds per

frame but each frame is looking at like 25 milliseconds of speech and predicting the

phone at the middle frame. We use the standard speech representation which is mel-cepstral

coefficients. There are 13 of those, and their differences and the differences of the

differences; and we feed them into one of these deep nets. So here's your input,

11 frames and 39 coefficients. And then--I was away when the student did this and he

actually believed what I said. So he thought adding lots and lots of hidden units was a

good idea. I've started it too. But he added lots of hidden units all unsupervised, so

all these green connections are learned without any use of the labels. He used a bottleneck

there, so the number of red connections will be relatively small. These are not--these

have to be learned using discriminative information. And now you're back propagating the correct

answers through this whole net for about a day on a GPU board or a month on a core, and

it does very well. That is the best phone error rate we got was 23%. But the important

thing is whatever configuration you use, how many hidden layers as long as there are plenty

and whatever widths and whether you use this bottleneck or not, it gets between 23% and

24%. So it's very robust to the exact details of how many layers and how wide they are.

And the best previous result on TIMIT for things that didn't use speaker adaptation is 24.4%

and that was averaging together lots of models, so this is good.
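
[A rough sketch of the shapes involved, not the student's actual network. The 11 frames, 39 coefficients and 183 output states come from the talk; the hidden width of 2048 is an assumption chosen only because 2048 x 2048 is roughly the four million weights per layer mentioned in the discussion, and the two hidden layers, random weights and missing biases and bottleneck are simplifications.]

```python
import numpy as np

N_FRAMES, N_COEFFS, N_STATES = 11, 39, 183   # figures quoted in the talk
HIDDEN = 2048                                 # assumed width, about 4 million weights per layer

def softmax(z):
    z = z - z.max()                           # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def forward(window, weights):
    """window: (11, 39) array of frames; the last matrix in weights maps to the 183 states."""
    a = window.reshape(-1)                    # 11 * 39 = 429 inputs
    for W in weights[:-1]:
        a = 1.0 / (1.0 + np.exp(-(a @ W)))    # logistic hidden layers, biases omitted
    return softmax(a @ weights[-1])           # distribution over HMM states for the central frame

rng = np.random.default_rng(0)
sizes = [N_FRAMES * N_COEFFS, HIDDEN, HIDDEN, N_STATES]
weights = [0.01 * rng.standard_normal((m, n)) for m, n in zip(sizes[:-1], sizes[1:])]
probs = forward(rng.standard_normal((N_FRAMES, N_COEFFS)), weights)
```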

>> So each of these layers that's four million weights?

>> HINTON: Yup, four million weights. So we're only training one, two, three, one, two, three,

we're training, you know, about 20 million weights. Twenty million weights is about 2%

of a cubic millimeter of cortex. I think so, this is a tiny brain. That's, probably,

all you need for [INDISTINCT] recognition. >> Why did they start with the differences

and double differences of the MFCCs when you're going into a thing that could learn to do

that itself if it wanted to? >> HINTON: That's a very good question 'cause

you are sitting at the end. It's an extremely good question because the reason they put in the

differences and double differences is so they can model the data with a diagonal covariance

matrix--a diagonal covariance model--and you can't model the fact that over time two things

tend to be very much the same without modeling covariances, unless you actually put the

differences into the data and you model the differences directly. So it allows you to

use a model that can't cope with covariances. Later on we're going to show a model that

can cope with covariances and then we are going to do what the client always said you

should do, which is throw away the mel-cepstral representation and use a better representation

of speech. >> I said that?

>> HINTON: Yes, you said that to me the last time [INDISTINCT].

>> Smart guy. >> HINTON: Okay, so the new idea is to use

a better kind of module. This module already works pretty well, right? You know, it does

well at phone recognition, it does well at all sorts of things. It can't model multiplicative

interactions very well. It can model anything with enough training data, but it's not happy

modeling multiplies. And multiplies are all over the place. I'll show you a bunch of places

where you need multiplies. Here's the sort of main example of why you need multiplies.

Suppose I want to generate from a high-level description of an object: the name of the shape and its

pose, size, position, orientation. So, first of all, I want to generate the parts of an

object and I want them to be related correctly to each other. I could use a very accurate

top-down model that says, knowing it's a square and knowing the pose parameters, I generate each

piece in exactly the right position; that would require high bandwidth. Or I could be

sloppy and I could say I need to generate this side, and here's sort of a representation of the

distribution of where this side might be. And I'll generate corners and other sides

and they're all a bit sloppy. And if I picked one thing from each distribution, it would

make a nice square. But I could also top-down specify how these things should be pieced

together. In effect, I can specify a Markov random field: this is what goes with what.

And then I can clean this up knowing these distributions and pick a square like that.

Of course, I might sometimes pick a square that has slightly different orientation or

slightly different size, but it'll be a nice clean square because I know how they go together.

And so that's a much more powerful kind of generative model, and that's what we want

to learn to do, and so we are going to need hidden units up here to specify interactions

between visible units here, as opposed to just specifying input to visible units. There's

an analogy for this, which is, if I'm an officer and there's a bunch of soldiers and I want

them to stand in a square, I could get out my GPS and I can say, "Soldier number one,

stand at these GPS coordinates. And soldier number two, stand at these GPS coordinates."

Now, if I use enough digits, I'll get a nice neat rectangle, or I could say, "Soldier number

one, stand roughly around here. And then soldier number two, hold your arm out and stand this

distance from soldier number one." And that's a much better way to get a neat rectangle.

It requires far less communication. So what you're doing is you're downloading roughly

where people should stand and then how they should relate to each other. We have to specify

the relations not just where they should be. And that's what we'd like in a powerful [INDISTINCT]

model. So, we're going to aim to get units in one layer to say how units in the layer

below should laterally interact when you generate it. It's going to turn out you don't need

to worry about these lateral interactions when you recognize it. When you generate,

you do. To do that, we're going to need things called third-order Boltzmann machines, which

have three-way interactions. So Terry Sejnowski pointed out a long time ago that we have

an energy function like this where this was V and this was H, but these are just binary

variables. And we could perfectly well write down an energy function like this with three

things that interact, and then we have a three-way weight. And if you think about these three

things now, K, the state of K is acting like a switch. When K is on, you effectively have

this weight between I and J. When K is off, this weight disappears. And it happens every

which way because it's symmetric. So using an energy function like this, we can allow

one thing to specify how two other things should interact. So each hidden unit can specify

a whole Markov random field over the pixels if you want. But that sort of begins to make

you worry because a Markov random field has a lot of parameters in it. And if you start

counting the subscripts here, if you have N of these and N of those and N of those,

you get N cubed of these parameters, which is rather a lot. If you're willing to use

N cubed parameters, you can now make networks that look like this. Suppose I have two images

and I want to model how images transform over time, and let's suppose I'm just moving random

dots around: I have a pattern of random dots and I translate it. Well, if I see that dot and

I see that dot, that's some evidence for a particular translation. And so if I put a

big positive weight there, this triangle is meant to represent that big three-way weight.

Then when this and this are on, they'll say it's very good to have this guy on. It will

be a nice low-energy state. If I also see this pair of dots, I'll get more votes

that this guy should be on, and I will turn this guy on. If however this pixel went

to here, I'll go for this guy. And if this pixel also went to there but this guy--so

these guys are going to represent coherent translations of the image, and it's going

to be able to use these three-way weights to take two images and extract the units that

represent the coherent translation. It'll also be able to take the pre-image and the

translation, and compute which pixels should be on here. Now what we're going to do is

take that basic model and we're going to factorize it. We're going to say, "I've got these three-way

weights and I've got too many of them." So, I'm going to represent each three-way weight

as the product of three two-way things. I'm going to introduce these factors and each

factor is going to have this many parameters, which per factor is just a linear

number of parameters. If I have about N factors, I end up with only N squared of these weights.

And if you think about how pixels transform in an image, they don't do random permutations.

It's not that this pixel goes there and that one goes here. Pixels do sort of consistent things,

so I don't really need N cubed parameters because I'm just trying to model these fairly

consistent transformations, of which there's a limited number, and I should be able to do it with many

fewer parameters. And this is the way to do it. So, that's going to be our new energy

function given the bias terms. One way of thinking about what I'm modeling is: I

want this tensor of three-way weights. If I take an outer product of two vectors like

this, I'll get a matrix that has rank one. If I take a three-way outer product, I'll get a tensor

that has rank one. And if I now add up a bunch of tensors like that so each factor now, each

F, specifies a rank one tensor, by adding up a bunch of them, I can model any tensor

I like if I use N squared factors. If I use only N factors, I can model most regular tensors

but I can't model arbitrary permutations, and that's what we want. If you ask how

inference works now, inference is still very simple in this model. So here's a factor.

Here's the weights connecting it to, say, the pre-image. Here's the weights connecting

it to the post-image. Here's the weights connecting it to the hidden units. And to do inference

what I do is this. Suppose I only have that one factor. I would multiply the pixels by

these weights; add all that up so I get a sum at this vertex. I do the same here; I

get a sum at this vertex. Then I multiply these two sums together to get a message going

to the hidden units. And as that message goes to the hidden unit, I multiply

it by the weight on the connection. And so what the hidden unit will see is this weight

times the product of these two sums, and that is the derivative of the energy with respect

to the state of this hidden unit, which is what it needs to know to decide whether to

be on or off; it wants to go into whichever state will lower the energy. And all the hidden

units remain independent even though I've got these multipliers now. So this is much

better than putting in another stochastic binary unit here. If I put a stochastic binary

unit in here, the hidden units would cease to be independent and inference will get tough.
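
[A small sketch of the factored inference just described, checked against the full three-way tensor it stands for. The array names A, B and C (factor weights to the pre-image, the post-image and the hidden units), the sizes, and the omission of biases are all assumptions made for illustration.]

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def hidden_probs_factored(x, y, A, B, C):
    """x: pre-image (n,), y: post-image (n,), A and B: (n, F), C: (n_hidden, F)."""
    fx = x @ A              # one sum per factor at its pre-image vertex
    fy = y @ B              # one sum per factor at its post-image vertex
    message = fx * fy       # product of the two sums, one message per factor
    return sigmoid(C @ message)   # each hidden unit weights the messages it receives

def hidden_probs_dense(x, y, A, B, C):
    """Same computation through the explicit tensor w_ijk = sum_f A_if B_jf C_kf."""
    W = np.einsum('if,jf,kf->ijk', A, B, C)   # sum of F rank-one tensors
    return sigmoid(np.einsum('i,j,ijk->k', x, y, W))

rng = np.random.default_rng(0)
n, F = 8, 8
A, B, C = (rng.standard_normal((n, F)) for _ in range(3))
x = rng.integers(0, 2, n).astype(float)
y = rng.integers(0, 2, n).astype(float)

p_fact = hidden_probs_factored(x, y, A, B, C)
p_dense = hidden_probs_dense(x, y, A, B, C)

dense_params = n ** 3          # N cubed three-way weights
factored_params = 3 * n * F    # with F about N, this grows like N squared
```

[The two functions agree, but the factored one never builds the N-cubed tensor, and the hidden probabilities are computed independently of each other.]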

But this way, with a deterministic factor that's taking a product of these two sums, inference

remains easy. The learning also remains easy. So this is the message that goes from factor

F to hidden unit H, and that message is the product that we got at these two lower vertices;

the product of the sums computed on the pre-image and the post-image. And the way

you learn the weight on the connection from factor F to hidden unit H is by changing the

weight so as to lower the energy when you're looking at data, and raise the energy when

you're constructing things from the model or just reconstructing things from the hidden

units you got from data. And those energy gradients, they just look like this. They're

just the product of the state of the hidden unit and the message that goes to it when

you're looking at data and the state of the hidden unit and the message that goes to

it when you're looking at samples from the model or reconstructions. So it's still a

nice pair-wise learning rule. So everything is pair-wise still, so it might fit into

the brain. Now, if we look at what one of these factors does when I show it random dot patterns

that translate, then we can look at the weights connecting it to the pre-image, and that's

a pattern of weights where white is a big positive weight, black is a big negative weight.

So that will have learned a grating connecting it to the pre-image and this

will have learned a grating connecting it to the post-image. With a hundred factors,

I'll show you what Roland learned. So, those are the hundred factors connecting--these

are the receptive fields of the factors in the pre-image. And remember it's looking at

translating dots, and these are the factors in the post-image. And you see, it's basically

learned the Fourier basis and it's learned to translate things by about 90 degrees. And

that's a very good way of handling translation. Mathematicians say things like, "The Fourier

basis is a natural basis for modeling translation." I don't really know what that means, but it just

learned the Fourier basis like that. And if you give it rotations, it learns a different basis.

So this is the basis that it learns for rotations. You see it learns about yin and yang here.

Oops [INDISTINCT]. Okay, that's the basis for rotations. One other thing you could do

is train it just on single dot patterns translating in a coherent way and then test it on two

overlaid dot patterns that are translating in different directions. It's never seen that

before. It's only been trained on coherent motion, and we're going to test it on what's

called transparent motion. In order to see what it thinks, since we train it unsupervised,

there are no labels anywhere, we never tell it what the motions are, we need some way

of seeing what it's thinking, so we add a second hidden layer that looks at the hidden

units representing transformations and it's fairly sparse. So the units on that second

hidden layer will be tuned to particular directions of motion. And then to see what it's thinking,

we take the directions those units like, weighted by how active those units are, and that will tell

you what directions it thinks it's seeing. Now when you show it transparent motion and

you look at those units in the second hidden layer, if the two motions are within about

30 degrees, it sees a single motion at the average direction. If they're beyond about

30 degrees, it sees two different motions and, what's more, they're repelled from each other.

That's exactly what happens with people, and so this is exactly how the brain works. Okay.

There's going to be a lot of that kind of reasoning in this talk. I'm going to go on to

time series models now. So, we'd like to model not just static images; for example, we'd like

to model video. To be [INDISTINCT] we're going to try something a bit simpler. When people

do time series models, you would nearly always like to have a distributed non-linear representation,

but that's hard to learn. So people tend to do dumb things like Hidden Markov Models

or Linear Dynamical Systems which either give up on the distributed or on the non-linear,

but are easy to do inference in. What we're going to come up with is something that has

the distributed and the non-linear and is easy to do inference in, but the learning algorithm

isn't quite right but it's good enough. It's just an approximation to make some [INDISTINCT].

And the inference also is ignoring the future and just basing things on the past. So, here's

a basic module, and this is with just two-way interactions. This is the Restricted Boltzmann

Machine with visible units and hidden units. Here are the previous visible frames. These

are all going to be linear units. And so, these blue connections are conditioning the

current visible values on previous observed values in a linear way. So, it's called an

autoregressive model. The hidden units here are going to be binary hidden units; they're

also conditioned on previous visible frames, and learning is easy in this model. What you

do is you take your observed data, and then given the current visible frame and given

the previous visible frames, you get input to the hidden units; they're all independent

given the data, so you can separately decide what states they should be in. Once you've fixed

states for them, you now reconstruct the current frame using the input you're getting from

previous frames and using the top-down input you're getting from the hidden units. After you reconstruct,

you then activate the hidden units again. Then you take the difference in the pairwise statistics

with data here and reconstructions here to learn these weights, and you take the difference

in activities of these guys with data and with reconstructions to get a signal that you can

use to learn these weights or these weights. So learning is straightforward and it just

depends on differences, and you can learn a model like this. After you've learned it,

you can generate from the model by taking some previous frames. These inputs, the conditioning

inputs, in effect, fix the biases of these to depend on the previous frames. So, these

are the dynamic biases, and with these biases fixed, you just go backwards and forwards

for a while and then pick a frame there, and that's the next frame you've generated, then

you keep going. So, we can generate from the model once it's learned so we can see what it

believes. >> You always go back two steps in time or

is that just an example? >> HINTON: Sorry.

>> Oh, you were just going back only two steps in time?

>> HINTON: No, we're going to go back more steps in time.

>> Okay, and you let... >> HINTON: I just got lazy with the PowerPoint.
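
[A minimal sketch of generating from the conditional model described above, assuming unit-variance linear visible units whose noise-free mean is used as the reconstruction. The matrices A (previous frames to visible units), B (previous frames to hidden units) and W (visible to hidden), the number of back-and-forth steps, and all sizes are illustrative assumptions, and the weights here are random rather than learned.]

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def generate_next_frame(history, W, A, B, b_vis, b_hid, rng, n_gibbs=20):
    """history: previous frames concatenated into one vector of length order * n_vis."""
    # The conditioning inputs fix dynamic biases that depend on the previous frames.
    dyn_vis = b_vis + history @ A
    dyn_hid = b_hid + history @ B
    v = dyn_vis.copy()                  # start from the purely autoregressive prediction
    for _ in range(n_gibbs):            # go backwards and forwards with the biases fixed
        ph = sigmoid(dyn_hid + v @ W)
        h = (rng.random(ph.shape) < ph).astype(float)
        v = dyn_vis + h @ W.T           # mean of the linear visible units, no noise added
    return v

rng = np.random.default_rng(0)
n_vis, n_hid, order = 3, 5, 2
W = 0.1 * rng.standard_normal((n_vis, n_hid))
A = 0.1 * rng.standard_normal((order * n_vis, n_vis))
B = 0.1 * rng.standard_normal((order * n_vis, n_hid))

# Seed with `order` frames, then keep generating, feeding each new frame back in.
frames = [rng.standard_normal(n_vis) for _ in range(order)]
for _ in range(4):
    history = np.concatenate(frames[-order:])
    frames.append(generate_next_frame(history, W, A, B, np.zeros(n_vis), np.zeros(n_hid), rng))
```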

Now, one direction we could go from here is to higher level models. That is, having learned

this model where these hidden units are all independent given the data, we could say--well,

what I've done is I've turned the visible frames into hidden frames. And it

turns out you can get a better model if you take these hidden frames, and model what's going

on here, and now you put in conditioning connections between the hidden frames and more hidden

units that don't have conditioning here that don't interact with other hidden units. [INDISTINCT]

in this model. Then you can prove that if you do this right, then you'll get a better

model of the original sequences or you'll improve a bound on the model of the original sequences.

So you can [INDISTINCT] lots of layers like that. And when you have more layers, it generates

better. But I'm going to go in a different direction. I'm going to show you how to do

it with three-way connections. And we're going to apply it to motion-capture data, so you

put reflective markers on the joints, you have lots of infrared cameras, you figure

out where the joints are in space. You know the shape of the body so you go backwards

through that to figure out the joint angles and then the frame of data is going to consist

of 50 numbers, about 50 numbers which are joint angles and the translations and rotations

of the base of the spine. Okay. So, imagine we've got one of these mannequins you see in

art shop windows; we've got a pin stuck in the

and rotate him using this pin and we can also wiggle his legs and arms. Okay. And what we

want him to do is as we move him around, we want him to wiggle his legs and arms so his

foot appears to be stationary on the ground and he appears to be walking. And he'd better

wiggle his leg just right as we translate his pelvis, otherwise his foot will appear

to skid on the ground. And we're going to model him, we can do hierarchal model like

I just showed you or we can a three-way model like this where we condition six earlier of

frames, this is a current visible frame, here's basic bolts in the machine accept that it's

neither one of these 3-way things where these are factors. And we have a 1-of-N style variable.

So, we have data and we tell it the style when we're training it, so that's sort of

semi-supervised. It learns to convert that 1-of-N representation into a bunch of real-valued

features and then it uses these real-valued features as one of the inputs to a factor.

And what the factors are really doing is saying, these real-valued features are modulating the

weight matrices that you use for conditioning and also this weight matrix that you use in your

parallel linear model. So, these are modulating an autoregressive model. That's very different

from switching between autoregressive models; it's much more powerful. Yeah?
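A minimal sketch of what "modulating" rather than "switching" could look like, assuming a factored three-way form in which the style features set one multiplicative gain per factor. All names, sizes, and the random parameters are hypothetical, and the hidden units of the real model are left out; this only illustrates how real-valued style features can gate an autoregressive weight matrix.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes only (the talk mentions about 50 numbers per frame).
n_styles, n_feat, n_factors, n_dims = 4, 3, 8, 5

# Hypothetical parameters; a trained model would learn these.
style_to_feat = rng.normal(size=(n_styles, n_feat))  # 1-of-N label -> real-valued features
A = rng.normal(size=(n_dims, n_factors))             # past frame -> factors
B = rng.normal(size=(n_feat, n_factors))             # style features -> factors
C = rng.normal(size=(n_factors, n_dims))             # factors -> current frame


def one_hot(k):
    """1-of-N encoding of style index k."""
    v = np.zeros(n_styles)
    v[k] = 1.0
    return v


def effective_weights(style_features):
    """Autoregressive matrix gated by the style features: A @ diag(gate) @ C."""
    gate = style_features @ B          # one multiplicative gain per factor
    return (A * gate) @ C


feat_0 = one_hot(0) @ style_to_feat
feat_1 = one_hot(1) @ style_to_feat
W0 = effective_weights(feat_0)
W1 = effective_weights(feat_1)

# Because the features are real-valued, any blend of them is a valid input,
# which gives a new weight matrix rather than a hard choice between W0 and W1.
W_blend = effective_weights(0.5 * feat_0 + 0.5 * feat_1)

past_frame = rng.normal(size=n_dims)
predicted_frame = past_frame @ W_blend   # linear prediction of the current frame
```

In this simplified form the gate is linear in the style features, so a half-and-half blend of features yields the average of the two matrices; a switching model would only ever use one matrix or the other.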

>> I missed what this 1-of-N is...? >> HINTON: So, we're going to have data of

someone walking in various different styles. >> Styles of walking.

>> HINTON: The style of walking. Yeah. >> So you mean your earlier diagram when you

can't beat history, it looked like there was nothing to keep track in the relative order

of the earlier direct delta direct link because... >> HINTON: Yes.

>> ... is there anything in the model that cares about that relative...?

>> HINTON: Yeah. Yeah. The weights on the connections will tell you which frame it's

coming from. Right. In the earlier model, there were two blue lines, they're different

matrices and they have different weights on. >> There's nothing from two steps previous

to one step previous, right, it just skips all the way?

>> HINTON: It just skipped all the way, right. You just...

>> Will that continue to happen? >> HINTON: Yes. In other words there's direct

connections from all six previous frames to the current frame for determining the current

frame. >> Right. And then what links from the six

frames of the fifth earliest report? >> HINTON: Well, there were when you were

computing what the fifth frame was doing, right?

>> Okay. >> HINTON: But when we're computing this frame

we have direct connections from it. Okay. So, we're now going to train this model, it's

relatively easy to train especially on the GPU board, and then we're going to generate

from it, so we can see sort of what it learned and we can judge if it's doing well by whether

the feet slip on the ground. All right. >> [INDISTINCT]

>> HINTON: We'll get there. >> Sorry.

>> HINTON: Here's a normal walk. Maybe, at least they're willing. Okay. So I was generating

from the model--he's deciding which direction to turn in, and he's deciding, you know, he

needs to make the outside leg go farther than the inside leg and so on. If we--we have one

model but if we flip the style label to say, gangly teenager, he definitely looks awkward.

Right. We've all been there. I think this is a computer science student. My main reason

for thinking that is if you asked him do a graceful walk, it looks like this. And that's

definitely C3PO. >> [INDISTINCT].

>> HINTON: Now, I think this was a student [INDISTINCT]--but he's very good. You can

ask him to walk softly like a cat. We're asking to model at present, right? The model looks

pretty much like the real data; in the real data obviously the feet are planted better, but

notice, he can slow down then speed up again. Autoregressive models can't do things like

that. Autoregressive models have a biggest eigenvalue, and it's either bigger than one,

in which case they explode, or smaller than one, in which case they die, and the way to keep

them alive is by keep--you keep injecting random noise so that they stay alive and that's

like making a horse walk by taking a dead horse and jiggling it, it's kind of--it's

not good. Now, he doesn't have any model of the physics so, in order to do these kinds

of stumbles, there had to be stumbles similar to that in the data, but when he stumbles and which

stumble he does, he's entirely determining. We could make him do a sexy walk but you're

probably not interested in that. >> I just order a chicken.

>> HINTON: You want dinosaur the chicken? Where's dinosaur the chicken?

>> And chicken, number five. >> At number five.

>> HINTON: Oh, no, that's dinosaur and chicken. That's a blend. Maybe a switch. He's got quite a lot of foot

[INDISTINCT] that's probably a blend. This is doing a sexy walk and then you flip the

label to normal and then you flip it back to sexy. It's never seen any transitions but

because it's all one model, it can do reasonable transitions.
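The earlier remark that a linear autoregressive model either explodes or dies, depending on whether its biggest eigenvalue is above or below one in magnitude, can be checked numerically. The sketch below is an illustration only: it assumes a noise-free first-order model x_t = W x_{t-1}, and uses a scaled 2-D rotation as W so that the magnitude of both eigenvalues is known exactly.

```python
import numpy as np


def rotation(theta):
    """2-D rotation matrix; both eigenvalues have magnitude exactly 1."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s], [s, c]])


def spectral_radius(W):
    """Largest eigenvalue magnitude of W."""
    return np.max(np.abs(np.linalg.eigvals(W)))


def simulate(W, x0, steps):
    """Run x_t = W @ x_{t-1} with no injected noise and record the norm of x_t."""
    x = x0.copy()
    norms = []
    for _ in range(steps):
        x = W @ x
        norms.append(np.linalg.norm(x))
    return np.array(norms)


R = rotation(0.3)                 # a pure oscillation, like a periodic gait
x0 = np.array([1.0, 0.0])

dying = simulate(0.9 * R, x0, 200)       # biggest eigenvalue magnitude 0.9
exploding = simulate(1.1 * R, x0, 200)   # biggest eigenvalue magnitude 1.1
```

With these settings the norm shrinks by a factor of 0.9 per step in one case and grows by 1.1 per step in the other, so neither run can sustain a steady oscillation without noise being injected.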

>> So you have these hundred style variables, can you decouple those from the 1-of-N

style and just make up new styles by playing with those...

>> HINTON: Yup. Yup. Now, you can also give it many more labels when you train, you can

give it speed, stride length all sorts of things then you can control it very well,

yeah. Okay. So, you can learn time series at least for 50 dimensional data and obviously

what we all want to do is apply that to video but we haven't done that yet. Except for some

very simple cases. The last thing I'm going to show is the most complicated use of these

3-way models. One way of thinking of it, so that it's similar to the previous uses, is

that we take an image and we make two copies of it but they have to be the same. And then we

insist the weights that go from a factor to this copy are the same as the weights that

go from the factor to this copy. So if i = j, w_if = w_jf. Inference is still easy; in fact inference

here will consist of--you take these pixels times these weights to get a weighted sum,

and then you square it because this is going to be the same weighted sum. So, inference

consists of--take a linear filter, square its output, and send it by these weights to the hidden

units. That's exactly the model called the [INDISTINCT] energy model, with the right kind

of linear filter. This has been proposed both by vision people, by Adelson and Bergen, a

long time ago in the '80s, and by neuroscientists. So, neuroscientists had tried to take simple

cells, my point vaguely about, and look at what polynomial their output is of their

input, and Yang Dan at Berkeley says it's between 1.7 and 2.3, and that means two.
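The inference recipe just described (linear filter, square, weighted sum into the hidden units) can be sketched in a few lines. This is an illustration under assumptions, not the speaker's code: the names `C`, `P`, and `b`, the sizes, and the sign convention (non-negative `P` with a minus sign, so that a large squared response pushes a hidden unit off) are all choices made for the example.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def covariance_hidden_probs(v, C, P, b):
    """Probability that each hidden unit is on, given pixels v.

    v: (n_pixels,) image patch
    C: (n_pixels, n_factors) linear filters, shared by both copies of the image
    P: (n_factors, n_hidden) weights from factors to hidden units (assumed >= 0)
    b: (n_hidden,) hidden biases
    """
    filter_out = v @ C            # weighted sum of pixels, one value per factor
    squared = filter_out ** 2     # both copies give the same sum, so the product is a square
    return sigmoid(b - squared @ P)


rng = np.random.default_rng(0)
n_pixels, n_factors, n_hidden = 6, 4, 3
v = 0.5 * rng.normal(size=n_pixels)
C = rng.normal(size=(n_pixels, n_factors))
P = np.abs(rng.normal(size=(n_factors, n_hidden)))
b = np.ones(n_hidden)

probs = covariance_hidden_probs(v, C, P, b)
probs_negated = covariance_hidden_probs(-v, C, P, b)   # same as probs: squaring removes the sign
```

Because the filter outputs are squared, negating the patch leaves the hidden probabilities unchanged.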

So, this looks quite like models that were proposed for quite different reasons and it

just drops out of taking a 3-way image model and factorizing it. The advantage we have

is that we have a learning algorithm for all these weights now, and we have a generative model.

So now we can model covariances between pixels, and the reason that's good is--well, here's

one reason why it's good. Suppose I asked you to define a vertical edge. Most people

will say, "Well, a vertical edge is something that's light on this side and dark on that side.

Well no, maybe it's light on this side and dark on that side, but you know. Well, it

could be light up here and dark down there, and dark up here and light down there." Okay.

Or it could be a texture edge. It's getting--oh, it might actually be a disparity edge. Well,

there might be motion this side and no motion that side. That's a vertical edge

too. So, a vertical edge is a big assortment of things, and what all those things have

in common is vertical edge is something where you shouldn't do horizontal interpolation.

Generally in an image, horizontal interpolation works really well. A pixel is the average

of its right and left neighbors, pretty accurately almost all the time. Occasionally it breaks down,

and the place it breaks down is where there is a vertical edge. So, a real abstract definition

of vertical edge is breakdown of horizontal interpolation. And that's what our models

are going to do. A hidden unit is going to be putting in interpolation, and it's actually

going to turn off, sort of reverse logic; when that breaks down it's going to turn off. So

one way of seeing it is this. If this hidden unit here is on, it puts in a weight between

pixel i and pixel j that's equal to this weight times this weight, times this weight. Okay.
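The "this weight times this weight times this weight" statement can be written out as a small check. The sketch assumes the factored form used in this part of the talk, with hypothetical names: `C[i, f]` connects pixel i to factor f (the same for both copies of the image) and `P[f, k]` connects factor f to hidden unit k. It shows that summing the triple products over factors gives a pixel-to-pixel weight matrix whose quadratic form matches the squared-filter computation.

```python
import numpy as np

rng = np.random.default_rng(1)
n_pixels, n_factors, n_hidden = 6, 4, 3

# Hypothetical random parameters standing in for learned ones.
C = rng.normal(size=(n_pixels, n_factors))
P = rng.normal(size=(n_factors, n_hidden))


def pairwise_weights(C, P, k):
    """Effective weight between every pair of pixels when hidden unit k is on.

    W[i, j] = sum over factors f of C[i, f] * C[j, f] * P[f, k]
    """
    return (C * P[:, k]) @ C.T


v = rng.normal(size=n_pixels)
k = 0
W = pairwise_weights(C, P, k)

via_pairs = v @ W @ v                           # sum_ij v_i W_ij v_j
via_factors = np.sum(P[:, k] * (v @ C) ** 2)    # sum_f P_fk (filter_f . v)^2
```

The two quantities agree, which is why the pairwise interaction never has to be built explicitly: the factors compute the same thing with far fewer parameters.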

Since these--okay, that's good enough. So, these are controlling effectively the Markov

random field between the pixels, so we can model covariances nicely. Because the

hidden units are creating correlations between the visible units reconstruction is now more

difficult. We could reconstruct one image given the other image, like we did with motion,

but if you want to reconstruct them both and make them identical it gets to be harder.

So, we have to use a different method called Hybrid Monte Carlo. Essentially you start where the

data was and let it wander away from where it was but keeping both images the same. And

I'm not going to go into Hybrid Monte Carlo, but it works just fine for doing the learning.

And the Hybrid Monte Carlo is used just to get the reconstructions and the learning algorithm

is just the same as before. And what we're going to do is we're going to have some hidden units

that are using these 3-way interactions to model covariances between pixels, and other

hidden units are just modeling the means. And so we call--for mean and covariance,

we call this mcRBM. Here's an example of what happens after it's learned on black and white

images. Here's an image patch. Here's its reconstruction of the image patch, if you

don't have noise, which is very good, from the mean and covariance units. Here's the

stochastic reconstruction which is also pretty good. But now we're going to do something

funny, we're going to take the activations of the covariance units. The things that are

modeling which pixels are the same with which other pixels and we're going to keep those.

But we are going to take the activations of the mean units, so we're going to throw those

away, and pretend that the means from the pixels look like this. Well, let's take this

one first. We tell it all the pixels have the same value, except these, which are much

darker, and it now tries to make that information about means fit in with this information

about covariances, which is that these guys should be the same but very different from these

guys. And so, it comes up with a reconstruction that looks like that. Where you see it's

taken this dark stuff and blurred across this region here. If we just give it four dots

like that, and the covariance matrix you've got from there, it'll blur those dots out

to make an image that looks quite like that one. So this is very like what's called the,

kind of watercolor model of images, where you know about where the boundaries are and

you're just, sort of, roughly sketching in the colors of the regions and it all looks fine to us,

because we sort of slaved the color boundaries to the actual--where the edges are. If you

reversed the colors of these it produces the reversed image because the covariance doesn't

care at all about the signs of things. If you look at the filters that it learns, the

mean units which are for sort of coloring in regions, learn these blurry filters--and

by taking some combination of a few dozen of those you can make more or less

whatever colors you like anywhere. So, they're very blurry--smooth, blurry, and multicolored--

and you can make roughly the right colors. The covariance units learn something completely

different. So, these are what the filters learned and you'll see that, those factors,

they learn high frequency black and white edges. And then a small number of them turn

into low frequency color edges that are either red-green or yellow-blue, and what's more,

when you make it form a topographic map using a technique I'll describe on the next slide,

you get this color blob, this low frequency color blob in with the low frequency black

and white filters. And that's just what you see in a monkey's brain, pretty much. If you

go into a monkey's brain you'll see these high frequency filters whose orientation changes

smoothly as you go through the cortex tangentially, and you'll see these low frequency color blobs.

Most neuroscientists thought that at least must be innate. What this is saying is, "Nope.

Just the structure of images is, and the idea of forming a topographic map, is enough to

get this." That doesn't mean it's not innate, it just means it doesn't need to be. So the

way we get the topographic map is by this global connectivity from the pixels to the

factors. So the factors really are learning local filters. And the local filters start

off colored and gradually learn to be exactly black and white. Then there's local connectivity

between the factors in the hidden units. So one of these hidden units will connect to

a little square of factors and that induces a topography here and the energy function

is such that when you turn off one of these hidden units to say smoothness no longer applies,

you pay a penalty. And you'd rather just pay the penalty once. And so if two factors are

going to come on at the same time, it's best to connect them to the same hidden unit so

you only pay the penalty once. And so that will cause similar factors to go to similar

places in here and we get a topographic map. For people who know about modeling images,

as far as I know, nobody has yet produced a good model of patches of color images. That

is, a generative model that generates stuff that looks like the real data. So, here's

a model that was learned on 16x16 color images from the Berkeley database and here are patches

generated from the model. And they look pretty similar. Now, it's partly a trick, the color

balance here is like the color balance there and it makes you think they are similar. But,

it's partly real. I mean, most of these are smooth patches of roughly uniform color as

are most of these. There are a few more of these that are smooth than those. But you also get these

things where you get fairly sharp edges, so you get smoothness, then a sharp edge, then

more smoothness, like you do in the real data. You even get things like corners here. We're

not quite there yet but this is the best model there is of patches of color images. And it's

because it's modeling both the covariance and the means, so it's capable of saying,

"What's the same as what?" As well as, "What the intensities are?" You can apply it for

doing recognition. So this is a difficult object recognition task where there are 80 million

unlabelled training images; not only of these classes but of thousands and thousands of classes.

They were collected by people at MIT. It's called the Tiny Images database. They're 32x32

color images. But it's surprising what you can see in a 32x32 color image. And since

the biggest model we're going to use has about a hundred million connections, that's about

0.1 of a cubic millimeter of cortex in terms of the number of parameters, and so we have

to somehow give our computer model some way of keeping up with the brain which has a lot

more hardware, right? And so we do it by giving it a very small retina. We say, "Suppose the

input was only 32x32, maybe we can actually do something reasonable there." So as you'll

see there are a lot of variations. If you look at birds, that's a close up of an ostrich,

this is a much more typical picture of a bird. And it's hard to tell the difference between

these ten categories. Particularly things like deer and horse. We deliberately chose

some very similar categories like truck and car, deer and horse. People are pretty good

at these. People won't make very many errors. That's partly because these were hand-labeled

by people, so. But even people make some errors. We only have 50,000 training examples.

Five thousand of each class and ten thousand test examples, because we have to hand-label

them, but we have a lot of untrained--unlabelled data. So we can do all this pre-training

on lots of unlabelled data and then take our covariance units and our mean units and just

try doing multi-[INDISTINCT] on top of those, or maybe add another hidden layer and do it

on top of that. So, what Marc'Aurelio Ranzato actually did since he worked in Yann LeCun's

lab, he actually took smaller patches, learned the model, and then strode them across the

image and replicated them. So it's sort of semi-convolutional. And then took the

hidden units of all of these little patches and just concatenated them to make a great

big vector of 11,000 hidden units which are both the means and the covariances. And then

we're going to use that as our features and see how well we can do. And we're going to

compare it with various other methods. So the sort of first comparison, you just take

the pixels and do logistic regression on the pixels to decide on the ten classes. You get 36%

right. If you take GIST features which were developed by Torralba and the people at MIT,

which were meant to capture what's going on in the image quite well, but they're fairly

low dimensional, you get 54%. So they're much better than pixels. If you take a normal RBM

which has linear units with Gaussian noise as input variables and then binary hidden units,

and then you use those binary hidden units to do classification, you get 60%. If you use

one of these RBMs with both the units like these ones for doing the means, and then these

units with the three-way interactions for modeling covariances, you get 69%; as long

as you use a lot of these factors. And if you then learn an extra hidden layer of 8,000

units--so now it's just that times that is a hundred million, so there's an extra hundred

million connections you learn there. But that's fine because it's unsupervised then you just

learn it on lots of data. You get up to 72%. And that's the best result so far on this

database. One final thing, you can take this model that was developed for image patches and

the student that had been doing phoneme recognition just took that code and applied it to log

spectrograms, which is sort of closer to what they would like to see; you're not

using all this mel-cepstral stuff, which is designed to throw away stuff you think

you don't need and get rid of lots of correlations. Instead you're going to take data that has

lots of correlations in but we got a model that can deal with that stuff now. And the

first thing George tried on February the 20th, which was four layers of a thousand hidden

units on top of this, he got 22.7 correct--percent correct; which was the record for phoneme

recognition on the TIMIT database where you're not trying to do a model adapted to each speaker.

And then a week later when he did that to TIMIT and used more frames, he was down to

21.6%. So this--all this stuff was designed to do vision. It wasn't designed to do phonemes.

And if we treat phoneme recognition as just a vision problem on a log spectrogram,

we can wipe out the speech class, at least on small vocabulary. Another student,

who is now at Microsoft, is seeing if this will work on big vocabulary as well.

>> [INDISTINCT] >> HINTON: Yes. Yes, right.

>> We can give them new better tools. >> HINTON: We can give them new and better

tools. So here's phoneme recognition over the years. Backprop from the 80's got 26.1

percent correct. Over the next 20 years or so, they got that down to 24.4 percent, using

methods that weren't neurally inspired so we'll call them artificial. We then got down

to 21.6 percent; an estimate of human performance is about 15 percent. I don't know much about

how they did this estimate, I'm afraid. But we're about--we're nearly a third of the way

from artificial to human. And so we need two more ideas and we're there. Okay, I'm done.

I'm finished. >> Questions?

>> HINTON: Yes? >> You mentioned YouTube recently announced

that the [INDISTINCT] have broken the world record on the MNIST data set of phoneme

recognition by simply using a seven layered feed forward network trained with backprop,

but doing it on a GPU with lots and lots of cycles.

>> HINTON: Yes, he did indeed announce that. What he didn't announce was--he's got a spectacular

result. He gets down to 35 errors. What he didn't announce was there's two tricks involved.

One trick is to use a big net with a lot of layers on a GPU board. That trick by itself

wouldn't give you 35 errors. There's a second trick which was sort of pioneered by people

at Microsoft in fact, which is to put lots of work into producing distortions of the

data so you have lots and lots of labeled data. So you take a labeled image of a two

and you distort it in clever ways and make it still look like a two but be translated

so people can then get down to about 40 errors. >> I think they patented that already.

>> HINTON: Good. So Dick's already patented that. So you get down to--you can get down

to about 40 errors by doing these distortions. What he did was even better distortions, or

more of them, and a much bigger net on a GPU and he got from 40 to 35, which is impressive

because it is hard to make any progress there. But it won't work unless you have a lot of

labeled data. And what's--the disguised thing is the work went into--if you look in the

paper, it's all straightforward, it's just backprop, except when you get to the section

on how they generated all that extra labeled data where there's very careful things, like

if it's a one or a seven they'd only rotate it a certain number of degrees but if it's

something else they rotate it in more degrees. I'm actually the referee for this paper but

I don't mind him knowing. I think it's a very important work. But he should emphasize that

they have to have labeled data to do that, and they have to put work into distortions.

So for me the lesson of that paper is when we had small computers, you should put your effort

into things like weight constraints so you don't have too many parameters because you've

only got a small computer. As computers get bigger and faster, you can transfer your effort

from, instead of tying the weights together, like Yann was doing in the early days, put

your effort into generating more distortions so you can inject your prior knowledge in

the form of distortions and that's much less computationally efficient, but with big computers,

it's fine and it's more flexible. So I think that's the lesson of that paper.
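A minimal sketch of injecting prior knowledge as distortions, assuming only small random translations of a labeled image. The per-class limit stands in for the class-dependent rotation range mentioned above; the function names, the limits, and the use of shifts instead of rotations are all hypothetical choices for illustration, not the method from the paper being discussed.

```python
import numpy as np


def shift_image(img, dy, dx):
    """Translate a 2-D image by (dy, dx) pixels, filling exposed borders with zeros."""
    h, w = img.shape
    out = np.zeros_like(img)
    src = img[max(0, -dy):h - max(0, dy), max(0, -dx):w - max(0, dx)]
    top, left = max(0, dy), max(0, dx)
    out[top:top + src.shape[0], left:left + src.shape[1]] = src
    return out


def augment(img, label, n_copies, rng, max_shift_by_label=None, default_max_shift=2):
    """Make n_copies distorted versions of img that all keep the same label.

    max_shift_by_label lets some classes be distorted less than others
    (made-up limits; a real system would tune these with care).
    """
    limits = max_shift_by_label or {}
    m = limits.get(label, default_max_shift)
    copies = []
    for _ in range(n_copies):
        dy, dx = rng.integers(-m, m + 1, size=2)
        copies.append((shift_image(img, int(dy), int(dx)), label))
    return copies


rng = np.random.default_rng(0)
image = np.zeros((5, 5))
image[2, 2] = 1.0                      # a toy "digit": one bright pixel in the centre

shifted = shift_image(image, 1, -1)    # moves the bright pixel to row 3, column 1
extra = augment(image, label=2, n_copies=10, rng=rng)
gentle = augment(image, label=1, n_copies=10, rng=rng, max_shift_by_label={1: 1, 7: 1})
```

Each labeled example turns into many, which is how the distortions substitute for built-in weight constraints when compute is plentiful.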

>> I shouldn't even need to ask you a question, you answered it. Thank you.

>> HINTON: Any other long question? >> It seems like you've invented some kind

of a cortex here that has the expected property that if it does vision it'll do sound.

>> HINTON: Yes. >> What other problems are you going to apply

it to? >> HINTON: Maybe it'd be quicker to say the

problems we're not going to apply it to. >> Okay.

>> HINTON: I can't think of any. I mean--okay, let me say what the main limitation of this

is for vision. We got at least 10 billion neurons for doing visual things; or at least

a billion anyway, probably, 10 billion. And even if we got that many neurons and about

10 to the 13 connections for doing vision, we still have a retina that's got a very small

fovea the size of my thumbnail at arm's length. And so we still take almost everything

and don't look at it. I mean, the essence of vision is not to look at almost everything

intelligently; and that's why you get all these funny illusions where you don't see things.

We have to do that in these models. These models are completely crazy. And all of computer

vision is completely crazy, almost all of it. Because they take a uniform resolution

image, and quite a big one like a thousand by thousand, and they try and deal with it

all at once with filters all over the image. And if they're going to do selection, they

either do it by running their face detector everywhere, with no intelligence, or

they do sort of interest point detection at a very low level to decide what to attend

to. What we do is we fixate somewhere. Then on the basis of what our retina gives us,

with these big pixels around the edges and small pixels in the middle, we sort of decide

what we're seeing and where to look next and by the second or third fixation we're fixating

very intelligently and the essence of it is that vision is sampling, it's not processing

everything; and that's completely missing from what I said. Now in order to do that,

you have to be able to take what you saw and where you saw it and combine them, and

that's a multiply. So this module, it can do multiplies. It's very good at combining whats

and wheres, to integrate information over time. And that's one of the things, we're

working on that. But that's probably the biggest thing missing. But that is an example of having

a module that's quite good but it's never good enough, so you have to put it together

over time and use it many times. And that's what sequential reasoning and all this stuff

are. So basically, as soon as people become sequential we're not modeling that at all. We're modeling

what you can do in a hundred milliseconds. And so that's what's missing. But I believe that

to model that sequential stuff we need to understand what it's a sequence of; it's a

sequence of these very powerful operations. And we're in better shape now to try and

model sequential AI than we would be if we didn't know what a primitive operation is. If the

sort of primitive operation was just deciding whether two symbols are the same, we're going

to be out of luck for understanding how people do sequential stuff. Yeah.

>> This is a [INDISTINCT] question as he said he wanted to do everything if it connects.

Are you going to do [INDISTINCT] logic like there exists a God and every girl has a boy

she loves? >> HINTON: Hang on, I'm still processing that.

Right. Right, I'm making the point that people find quantifiers quite difficult.

>> Oh, yeah. If you [INDISTINCT] quantifiers... >> HINTON: I would love to do that. I have

not got a clue how to do it. And you will notice that in old-fashioned AI that you used

to point out to [INDISTINCT] people, then you can't do quantifiers, so forget it. Nowadays,

when they all do graphical models, they don't mention that anymore because the graphical

models have difficulty with it too. Some people has got [INDISTINCT] some people do. Right.

Yeah, some people do. But most of the graphical models of, like, five years ago, they don't do quantifiers

either. And so, a pretty good division line would be what you can do without having to

deal with really sophisticated problems like that. I would love to know how we deal with

that, but I don't. >> Thank you.

>> HINTON: So, yeah, I'm going to give up on that right now.

>> So if you had 80 million labeled images and no extra unlabeled ones, would you do

your pre-training... >> HINTON: Yes. Yes

>> ...and then fine tuning to make us better? >> HINTON: In TIMIT, that's what we have.

In TIMIT, all the examples have labels. It's still a big win to do the pre-training.

>> But you didn't sneak this result I'm just hearing about? It seems to suggest...

>> HINTON: Well, the audience switched it but I haven't tried with all these distortions

during pre-training. Now, I have a student called [INDISTINCT] who just produced a thesis.

Well, he tries things like that. He tries distortions in earnest and he uses special

distortions of his own. And the fact is distortions help a lot. But if you do pre-training,

that helps some more too. And [INDISTINCT] results, yes, [INDISTINCT] results, suggest

that pre-training will get you to a different part of the space even if you have all this

labeled data. So clearly, one thing that needs to be done is to try the pre-training and

combine with these labels. You don't have to have the pre-training, but I bet you, it

still helps. And I bet you, it's more efficient too. It's faster because the pre-training

is rather pretty fast, you always have to learn a very good model. You got lots of its

features. And starting from there, I think, you'll do better than he does just starting

from random, and faster. That's just a prediction. You might even get down to 34 out of this.

The problem with [INDISTINCT] you can't get significance. TIMIT is really nice that way.

They designed it well, so you get higher rates. So you can see differences.

>> On the time series aspect, did you see anything that would you get inferences or

alterations that are beyond the size of the time window you're using?

>> HINTON: Sorry, I didn't understand the question. We have a limited time. We don't...

>> You have limited time, after training is there anything of a model that picks up...

>> HINTON: Nothing. >> Nothing.

>> HINTON: Nothing. It cannot deal with--it can't model host...

>> It has an internal state. It has an internal state.

>> HINTON: Right. But if sort of what happened 15 time steps ago really tells you what should

happen now, and it only tells you what should happen now. It doesn't tell you what

should happen in the [INDISTINCT] 14 time steps. It just contains information across 15 time

steps without having a signature of smaller time scales. You can't pick up on that.

>> Okay. >> HINTON: Because it's not got a hidden forward-backward

algorithm. A forward-backward algorithm potentially could pick up a lot of load, actually can't.

>> So this one wouldn't pick up on things like object permanence, or a ball rolls behind

the box and comes out of the other side and they're not going to be able to...

>> HINTON: Not over a long time scale, no, no. Unless you say that there's a memory involved

when you go back to a previous--it gets more complicated, right? Now, it is true that when

you build the multilevel one, which you can do with the three interconnections as well

as with the three-way connections, at every level you're getting a bigger time span because

your time window, it's going further back into the past with each level. So you get

a bit high, but that's just sort of linear. >> Can you say--do you have any rules of thumb

of how much unlabeled data you need to train each of the different levels and how it would

change, like, is it just linear with the number of weights or as you go up levels the things

changed? >> HINTON: I have one sort of important thing

to say about that, which is that if you're modeling high-dimensional data and you're

trying to build an unsupervised model of the data, you need many less trainings on [INDISTINCT]

than you would have thought if you're used to discriminative learning. When you're doing

discriminative learning, there's typically very few bits per training case to constrain

the parameters. The amount of constraint you get on the parameters from a training case

is the number of bits it takes to specify the answer, not the number it takes to specify

the input. So with MNIST, you get 3.3 bits per case. If you're modeling the image, the

number of bits per case is the number of bits it takes to specify the image which is about

a hundred bits. So you need far fewer cases per parameter. In other words what I'm saying

is you're modeling much richer things, and so each case is giving you much more information.

So actually, we can typically model many more parameters than we have training cases. And

discriminative people aren't used to that. Many less parameters than we have pixels and

many more than training cases. And in fact, he used about two million cases for doing

the image stuff, and it wasn't enough, it was overfitting. He should have used more.
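The bits-per-case argument can be made concrete with back-of-envelope arithmetic. The sketch below uses only figures quoted in the talk (3.3 bits per label, which matches a ten-way choice, about a hundred bits per image, 50,000 labeled examples, about two million unlabeled cases, and 100 million parameters); combining them in this way, and the assumption of ten equally likely classes, is an illustration rather than a calculation from the talk.

```python
import math


def label_bits(n_classes):
    """Bits needed to specify one of n_classes equally likely answers."""
    return math.log2(n_classes)


bits_per_label = label_bits(10)           # about 3.3 bits for a ten-way choice
bits_per_image = 100                      # "about a hundred bits", as quoted

n_labeled = 50_000
n_unlabeled = 2_000_000
n_params = 100_000_000

labeled_constraint = n_labeled * bits_per_label        # bits available from labels alone
unlabeled_constraint = n_unlabeled * bits_per_image    # bits available from modeling the inputs

bits_per_param_discriminative = labeled_constraint / n_params
bits_per_param_generative = unlabeled_constraint / n_params
```

On these numbers the labels supply well under one bit per parameter, while modeling the inputs supplies about two bits per parameter, which is in line with the remark that two million cases were still not quite enough.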

But he was fitting 100 million parameters. But the--basically, the only rule of thumb

is many fewer parameters than the total number of pixels in your training data,

but you can typically use many more parameters than the number of training cases. And you can't

do that with normal discriminative learning. Now, if you do do that, when you start discriminative

training, it quickly improves things and then very quickly overfits. So you have to stop

it early. Okay. >> Okay?

>> HINTON: Thanks. >> Let's thank the speaker again.

>> Thank you.