Hello and welcome to another Beginner's Guide

to Machine Learning with ml5.js video.

This is a video.

You're watching it.

And I am beginning this journey to talk about, and think about,

and attempt to explain and implement

convolutional neural networks.

So this is something that I referred to in the previous video,

where I took the pixels of an image

and made those the inputs to a neural network

to perform classification.

And I did this in even earlier videos with pretrained models.

And I mentioned that those pretrained models included

something called a convolutional layer,

but my example didn't include a convolutional layer.

So ml5 has a mechanism for adding convolutional layers

to your ml5 neural network.

But before I look at that mechanism, what

I want to do in this video and in the next one

is just explain what are the elements

of a convolutional neural network,

how do they work, and then look at some code examples that

actually implement the features of that convolutional layer.

I'm not going to build from scratch

a full convolutional neural network.

Maybe that's some other video series that I'll do someday.

We're going to use the fact that the ml5 library just

makes that possible for you.

In the first part I will just talk

about from the zoomed out view, what a convolutional layer is,

then I will look at with code, this idea of a filter.

In the second part, I'll come back

and look at this other aspect of a convolutional layer called

pooling.

I hope you enjoy this and you find it useful.

And I'll see you--

I'll be back in this outfit at the end of the video.

Let me start by diagramming what the neural networks looked

like with ml5 neural network to date in the videos

that I've made.

So there's been two layers--

a hidden layer and an output layer--

and then also there's some data coming into the neural network.

And in this case, in the previous example,

it was an image, which was flattened.

So I used the example of 10 by 10 pixels, each with an R, a G,

and a B. So that made an array of 300 inputs.

All these pixel values, those are the inputs.

And those go into the hidden layer.
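
A minimal sketch of the flattening step described above, assuming the image is held as rows of [r, g, b] triples. The `flattenImage` helper and this data layout are hypothetical, chosen only for illustration, and are not part of ml5.

```javascript
// Hypothetical helper: turn rows of [r, g, b] pixels into one flat array.
function flattenImage(image) {
  const inputs = [];
  for (const row of image) {
    for (const [r, g, b] of row) {
      inputs.push(r, g, b);
    }
  }
  return inputs;
}

// Build a 10 by 10 image where every pixel is mid-gray.
const image = Array.from({ length: 10 }, () =>
  Array.from({ length: 10 }, () => [128, 128, 128])
);

const inputs = flattenImage(image);
console.log(inputs.length); // 10 * 10 * 3 = 300
```

Once the array is flat, nothing in it records which values were neighbors in the original grid.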

But just for the sake of argument,

let me simplify this diagram and I'm just

going to consider an example with four inputs.

I'm going to consider that example

as having five hidden nodes--

hidden units.

And then let's say, it's a classification problem

and there's three possible categories.

So when I call the function ml5.neuralNetwork,

it creates this architecture behind the scenes

and connects every single input to every hidden unit

and every hidden unit to each output.

[MUSIC PLAYING]

So this is what the neural network looks like.

Each one of these connections has

a weight associated with it.

Each unit receives the sum of all

of the inputs times the weights passed through an activation

function, which then becomes the output, which then all of those

with those weights are summed into the next layer,

and so on and so forth.
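
The weighted-sum-then-activation step can be sketched in plain JavaScript for the four-input, five-hidden-unit, three-output example. This is only an illustration: the `denseLayer` and `randomWeights` helpers, the sigmoid activation, and the random starting weights are assumptions, and the transcript does not say how ml5 implements this internally.

```javascript
// Assumed activation function for this sketch.
const sigmoid = (x) => 1 / (1 + Math.exp(-x));

// Each unit sums every input times its weight, then applies the activation.
function denseLayer(inputs, weights, activation) {
  return weights.map((unitWeights) => {
    let sum = 0;
    for (let i = 0; i < inputs.length; i++) {
      sum += inputs[i] * unitWeights[i];
    }
    return activation(sum);
  });
}

// One row of weights per unit, one weight per incoming connection.
function randomWeights(units, inputsPerUnit) {
  return Array.from({ length: units }, () =>
    Array.from({ length: inputsPerUnit }, () => Math.random() * 2 - 1)
  );
}

const inputs = [0.2, 0.9, 0.4, 0.7];       // four inputs
const hiddenWeights = randomWeights(5, 4); // five hidden units
const outputWeights = randomWeights(3, 5); // three categories

const hidden = denseLayer(inputs, hiddenWeights, sigmoid);
const outputs = denseLayer(hidden, outputWeights, sigmoid);
console.log(hidden.length, outputs.length); // 5 3
```

Every input connects to every hidden unit, so the hidden layer alone holds 4 * 5 = 20 weights.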

So this is what I have worked with before.

While in the previous example, I was

able to get this kind of architecture

to work with image input and get results that produced something

in the output, this can be improved upon.

There is information in this data that's

coming in that is lost when it is flattened

to just a single flat array.

And the information that's lost is

the relative spatial orientation of the pixels.

It's meaningful that these colors are near other colors.

Something in what we're seeing in the image

has to do with the spatial arrangement of the pixels

themselves in two dimensions.

In order to address that, we want

to add into this architecture--

I really spent a lot of time drawing this diagram,

which I'm now going to mostly erase--

we want to add something called a convolutional layer.

So in this video, I want to explain what are the elements.

There are units, nodes, neurons, so

to speak, in a convolutional layer, but what are they?

And the word that's typically used

is actually called a filter, which makes a lot of sense.

Now, convolutional neural networks

can be applied to lots of scenarios besides images

and there's a lot of research into different ways

that they can be used effectively,

but I'm going to stick with the context of working

with images because the word "filter" really fits with that.

We're filtering an image.

How is this layer filtering an image?

So the idea of a convolutional layer is not a new concept,

and it predates the era that we're

in now of so-called deep learning.

And if you want to go back and look at the origins

of convolutional neural networks,

you can find them in this paper called "Gradient-Based Learning

Applied to Document Recognition" from 1998.

Section two, convolutional neural networks

for isolated character recognition.

And here, we can see this diagram, which

I'm attempting to kind of talk through

and create my own version of over here on the whiteboard

itself.

This is also the original paper associated

with the MNIST dataset--

a dataset of handwritten digits that's

been used umpteen amounts of times in research papers

over the years related to machine learning.

I know I'm going back and forth a lot here,

but let's go back to thinking of the input

as a two-dimensional image itself.

So this two-dimensional image--

and let's not say it's 10 by 10.

Let's use what the MNIST dataset is, which

is a 28 by 28 pixel image.

And of course now, much higher resolution images are used.

And this is what is coming in to the first convolutional layer.

This image is being sent to every single one

of these filters.

A filter is a matrix of numbers.

And let's just, for example, let's have a 3 by 3 matrix.

Each one of these filters represents nine numbers--

a matrix that's 3 by 3.

You could have a 5 by 5 filter and so on and so forth,

but a sort of standard size or a nice example size for us

to start with is 3 by 3.

Each one of these filters is then applied to the image

through a convolutional process.

This by the way, is not a concept

exclusive to machine learning.

This idea of a convolutional filter to an image

has been part of image processing,

and computer science, and computer vision algorithms

for a very long time.

To demonstrate this, let me actually open up--

I can't believe I'm going to do this,

but I'm going to open up Photoshop.

So here I am in Photoshop and I've

opened this image of a kitten.

And there's a menu option called Filter.

This word is not filter by accident.

There's a connection.

So all of these types of operations that you might do--

for example, like blur an image--

these are filters-- convolutions applied to the image.

I'm going to go down here under Other and select Custom.

All of a sudden, you're going to see here,

I have this matrix of numbers.

This matrix of numbers in Photoshop

is exactly the same thing as this matrix of numbers

I'm drawing right here.

Each one of these filters in the convolutional layer

represents a matrix of numbers that

will be applied to the image.

So let me actually just put some numbers in here.

[MUSIC PLAYING]

This particular set of numbers happens

to be a filter for finding edges in an image.

And you can think of it as these are

all weights for a given pixel.

So for any given pixel, I want to subtract colors

that are to the left of it and emphasize

colors that are at that pixel and above and below.

This draws out areas of the image

where the neighboring pixels are very, very different.
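
The numbers typed into Photoshop are not captured in the transcript, so the values below are an assumed filter that matches the spoken description: negative weights to the left, positive weights at the pixel and above and below. The `applyAt` helper and the tiny grayscale grid are hypothetical, used here only to show the arithmetic.

```javascript
// Assumed values: -1 down the left column, +1 down the center column.
const edgeFilter = [
  [-1, 1, 0],
  [-1, 1, 0],
  [-1, 1, 0],
];

// A 5 by 5 grayscale grid: dark (0) on the left, bright (255) on the right.
const gray = [
  [0, 0, 255, 255, 255],
  [0, 0, 255, 255, 255],
  [0, 0, 255, 255, 255],
  [0, 0, 255, 255, 255],
  [0, 0, 255, 255, 255],
];

// Weighted sum of the 3 by 3 neighborhood centered on column x, row y.
function applyAt(img, x, y, filter) {
  let sum = 0;
  for (let row = -1; row <= 1; row++) {
    for (let col = -1; col <= 1; col++) {
      sum += img[y + row][x + col] * filter[row + 1][col + 1];
    }
  }
  return sum;
}

console.log(applyAt(gray, 1, 2, edgeFilter)); // flat dark area: 0
console.log(applyAt(gray, 2, 2, edgeFilter)); // dark-to-bright boundary: 765
console.log(applyAt(gray, 3, 2, edgeFilter)); // flat bright area: 0
```

The result is large only where a pixel differs from its left-hand neighbor, which is why the filter draws out edges.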

Interestingly enough, I could switch these to 0.

[MUSIC PLAYING]

Switching the filter to have the negative numbers on the top,

you can see now I'm still detecting edges,

but I'm detecting horizontal edges.

If you go back and look at the cat

that I had previously versus this one,

you can see vertical edges versus horizontal edges.
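
As a rough sketch of the vertical-versus-horizontal difference, the two filters can be treated as transposes of each other. The filter values, the `transpose` and `weightedSum` helpers, and the sample patch are all assumptions made for illustration.

```javascript
// Assumed vertical-edge filter: negative weights on the left.
const verticalEdges = [
  [-1, 1, 0],
  [-1, 1, 0],
  [-1, 1, 0],
];

// Swap rows and columns so the negative weights move to the top row.
function transpose(m) {
  return m[0].map((_, col) => m.map((row) => row[col]));
}

const horizontalEdges = transpose(verticalEdges);
// horizontalEdges is [[-1, -1, -1], [1, 1, 1], [0, 0, 0]]

// Multiply a 3 by 3 patch by a 3 by 3 filter, entry by entry, and add up.
function weightedSum(patch, filter) {
  let sum = 0;
  for (let r = 0; r < 3; r++) {
    for (let c = 0; c < 3; c++) {
      sum += patch[r][c] * filter[r][c];
    }
  }
  return sum;
}

// A patch that is dark on top and bright below (a horizontal edge).
const patch = [
  [0, 0, 0],
  [255, 255, 255],
  [255, 255, 255],
];

console.log(weightedSum(patch, horizontalEdges)); // 765
console.log(weightedSum(patch, verticalEdges));   // 0
```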

So there are known filters, which draw out

certain features of an image.

And that's exactly what each one of these filters does.

If all of the nodes of a neural network

can draw out and highlight different aspects of an image,

those can be weighted to indicate and classify

the image in certain ways.

The big difference between a convolutional layer

in a neural network and what I'm doing here

by hardcoding in sort of known filters

is that the neural network is not

going to have filters hardcoded into it.

It's going to learn filters that do a good job of identifying

features in an image.

This relates to the idea of weights, I think.

So if I go back to my previous diagram,

where every single input is connected

to each hidden neuron with a weight,

now the input image is connected to every single one

of these filters.

In a way, there are now nine weights for every single one.

Instead of learning a single weight,

it's going to learn a set of weights for an area of pixels

to identify a feature in the image.

All of these filters will start with random values, and then

the same gradient descent process--

the error backpropagating through the network,

adjusting all the dials, adjusting all the weights

in these matrices and all of these filters--

works in the same way.

So in the ml5 series, I haven't really

gone through and looked at the gradient descent learning

algorithm to adjust all the weights in detail.

I do have another set of videos that

do that if you're interested, but the same gradient descent

algorithm that is applied to these weights

is applied to all of the different values

in each one of these filters.

Incidentally, just to show a very common convolution

operation to blur an image, blurring an image

is taking the average of a given pixel and all of its neighbors.

So here, you can see if I give the same weight to a 5 by 5

matrix of pixels around a center pixel,

and then divide that scale-- let's divide by 25

because there's 25--

that's averaging all of the colors.

If I click on Preview, blurred, not blurred,

blurred, not blurred.
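
The averaging blur described above can be sketched as a 5 by 5 filter of equal weights with a scale of 25. The `blurAt` helper and the single-bright-pixel test grid are hypothetical, chosen to make the arithmetic easy to check.

```javascript
const size = 5;

// Every neighbor gets the same weight of 1.
const blurFilter = Array.from({ length: size }, () => new Array(size).fill(1));
const scale = size * size; // 25 weights in total

// Average the 5 by 5 neighborhood of a grayscale grid centered on (x, y).
function blurAt(img, x, y) {
  const half = Math.floor(size / 2);
  let sum = 0;
  for (let j = -half; j <= half; j++) {
    for (let i = -half; i <= half; i++) {
      sum += img[y + j][x + i] * blurFilter[j + half][i + half];
    }
  }
  return sum / scale;
}

// A 5 by 5 grid that is black except for one white pixel in the center.
const img = Array.from({ length: size }, () => new Array(size).fill(0));
img[2][2] = 255;

console.log(blurAt(img, 2, 2)); // 255 / 25 = 10.2
```

The single bright pixel is spread thinly across its neighborhood, which is what softens the image.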

Of course, there are other more sophisticated convolutions,

like a Gaussian blur.

You can take a look at a Gaussian blur.

There's different ways to pronounce it.

You can take a look and research what that is,

but again, I'm not going down the road

to look at common image processing convolutions.

Instead, talking about the concept of a convolution as

applied to an image in the process

of a convolutional neural network.

Just to take this a little bit further,

I'm going to demonstrate how to code the convolution

algorithm in p5.js.

In truth, ml5 and TensorFlow.js are

going to handle all of the convolution operations for us

and creating all the filters.

We're just going to configure a convolutional layer

from a high level.

But I think it's interesting to look

at how you might code an image processing algorithm in p5.

I have some videos that do things like this previously,

but let's look at it in this context.

So I took a low resolution 28 by 28 image of a cat.

This comes from the Quick Draw dataset, which I've made videos

about before and I will also use to see

if we can create a doodle classifier as part

of this series.

And all I want to do is apply a convolution to that image.

So first, I'm going to create a variable

and I'm going to call it filter.

So this is going to be our filter.

And I'm going to make it a two-dimensional array.

So let me just put all zeros in it to start.

So this is the filter.

And let's go with that one that looks for edges.

The cat image is actually quite low resolution,

just 28 by 28 pixels, but I'm drawing it at twice the size.

I want to write the code to apply this filter to the image

and draw the filtered image to the right.

I'm going to create a variable called dim for dimensions

and just call this 28.

And then I want another variable to store the filtered image.

And in setup, I can create that image.

This creates a blank image of the same dimensions

as the original cat drawing.

Then I can write a loop.

And this loop is going to look at every single pixel

for all the columns x and all of the rows y.

And I wrote int there because I'm half the time

programming in Java.

But one thing that's important here,

if we're going to take this 3 by 3 matrix

and apply it to every single pixel of the original image,

if we're applying it to that first pixel 0,0,

there's no pixel to the left and no pixel above it.

It doesn't have all of its neighbors.

So there's various ways around this.

I'm just going to ignore all the edge pixels.

So the loop will go from 1 to dimensions minus 1.
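
A small sketch of that loop, using the `dim` variable from the transcript. The `visited` counter is added here only to show how many pixels are left once the border is skipped.

```javascript
const dim = 28;
let visited = 0;

// Start at 1 and stop before dim - 1 so every visited pixel
// has a full set of eight neighbors.
for (let x = 1; x < dim - 1; x++) {
  for (let y = 1; y < dim - 1; y++) {
    visited++;
  }
}

console.log(visited); // 26 * 26 = 676
```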

Now, there's a lot more work to be done here just to apply

this filter to any given pixel.

I think a way that might make sense

to do this is to actually have a new function.

I would call the function filter--

let's just call it convolution.

I'm going to write a function called convolution.

It receives an image, an x and a y, and a filter,

and it returns a new color.

So the idea of this function is that it receives

all the things it needs.

It receives the original image, the filter to apply to it,

which particular pixel we want to process,

and then will return back a new RGB value

after that pixel is processed.

And the reason why I'm doing that in a separate function

is I need another nested loop to go over the filter.

So I need to go from 0 to 3--

0, 1, 2 columns in the filter, 0, 1, 2 rows in the filter.

And it would be getting to be quite a lot

if I had four nested loops right in here.

Now, I probably shouldn't have some

of this hardcoded in here-- the number 3

and that sort of thing-- but you can

imagine how you might need to use variables

if the filter size is flexible.

Now, we have a really sort of like sad fact, which

is true about most cases where you're doing image

processing with some framework.

And in this case, our framework is JavaScript, and canvas,

and p5.js.

And the sad fact is that even though all of this is built--

all of this discussion is built upon the fact

that we are retaining the spatial orientation

of the pixels.

We're thinking of it as a two-dimensional matrix

of numbers.

The actual data is stored in one array.

And so I've gone over this in probably countless videos,

but there's a simple formula to look at if I have a given x,y

position in a two-dimensional matrix,

how do I find the one-dimensional lookup

into that matrix, assuming that the pixels were counted

by rows--

0, 1, 2, 3, 4, 5, 6, 7, blah, blah, blah, next row, 28, 29,

30, blah, blah blah.

And that formula is let index--

oh, well, I need to do that before this nested loop

because right now, I just want the center pixel-- that x,y.

Let index equal x plus y times img.width.

But there's more, oh!

So this is the formula.

And if you think about it, it makes sense

because it's all the x's, and then the

offset along the y's is how many rows

times the width of the image.
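
A quick check of the formula, assuming a 28-pixel-wide image counted row by row. The `index1D` name is hypothetical, and this version does not yet account for color channels.

```javascript
const width = 28;

// All the x's along the row, plus one full row width for every row above.
function index1D(x, y) {
  return x + y * width;
}

console.log(index1D(0, 0)); // 0
console.log(index1D(7, 0)); // 7
console.log(index1D(0, 1)); // 28, first pixel of the next row
console.log(index1D(2, 1)); // 30
```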

But there's another problem, which

is that in JavaScript in canvas, for every single pixel

in this image, there are actually

four numbers being stored--

an R, a G, a B, and an alpha--

the red, green, and blue channels

and the alpha channels--

channel, singular.

So each pixel takes up four spots.

So this index actually needs to say times 4.

So guess what?

You know what's going to make a lot of sense?

I'm going to need this operation a lot.

Let's write a function for it.

I'll just call it index, and it receives an x, y, and a width,

and it returns--

you know what?

The width is never going to change in my sketch,

so I don't want to be so crazy as to have

to pass it around everywhere.

So we're just going to pull it from a global variable.

Return x plus y times img.width.

And that's not img, it's cat.width.

OK, so once again, this is terrible what I'm doing,

but I'm just saving myself a little bit of heartache

here and there.

So this index-- ooh, let's call this pixel.

Oh, and this should be times 4.

This pixel is that function index x,y.
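
A sketch of the index function with the times 4 included. The `cat` object here is only a stand-in for the loaded image (a width, a height, and a flat array of R, G, B, alpha values), so the block can run without p5.

```javascript
// Stand-in for the cat image: four values (R, G, B, alpha) per pixel.
const cat = {
  width: 28,
  height: 28,
  pixels: new Uint8ClampedArray(28 * 28 * 4),
};

// The width comes from the global image, as in the transcript.
function index(x, y) {
  return (x + y * cat.width) * 4;
}

const pixel = index(2, 1);
cat.pixels[pixel + 0] = 200; // red
cat.pixels[pixel + 1] = 100; // green
cat.pixels[pixel + 2] = 50;  // blue
cat.pixels[pixel + 3] = 255; // alpha

console.log(pixel); // (2 + 1 * 28) * 4 = 120
```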

Now, I have something I could do to simplify this,

but I might as well write the code for if this

were a full RGB image.

This is a grayscale image, but it has all the channels in it.

The thing that I need to do to perform this convolution

operation is to take all of the weights--

the numbers that are in the filter matrix--

and I need to multiply each one times the pixel value of all

of the neighbors and their corresponding locations,

add them all up together, and maybe divide by something

if I wanted to sort of, like, average it out.

But in this case, I actually don't want to divide

by anything.

I'm just going to leave the weights are the weights are

the weights are the weights.

And actually, this right here is irrelevant.

I need to do this inside the loop.

You'll see in a second.

I think it's going to make sense.

So I need sum.

I'm going to make a sum of all the R values,

a sum of all the green values, and a sum of all

the blue values.

All right, wait a sec, wait a sec, wait a sec.

Actually, I think this is going to make more sense.

Let's go from negative 1 to 2.

You'll see why.

I mean, I'll explain why.

And negative 1 to 2.

Let's do that instead.

And maybe it's more clear to say less than or equal to 1.

Less than or equal to 1 because--

and let me draw this diagram once again--

if this is pixel 0,0, this is pixel negative 1, negative 1.

This is 1,1.

This is 1,0.

This is 1, negative 1.

I guess I'll do them all.

So you can see that the neighboring

pixels are offset by negative 1 and 1, and negative 1 and 1.

So the pixel x value is x plus i.

The pixel y value is y plus j.

And then the pixel index comes from calling the index function,

which returns the actual index into that array

for pixel x and pixel y.

And actually, maybe it makes more sense for me

to just say that I don't necessarily

need separate variables.

It might actually be just as clear just

to put this right in here.

So now, I just need to add the red, green, and blue values

of this particular pixel to the sum.

So sumR plus equal img.pixels at that pixel index.

And then G and B. G is the next one,

and B, blue, is the next one.

And let's add a plus 0 here just to be consistent.

So ultimately, what I'm actually returning here is r is sumR,

g is sumB, and b is sum--

oh, sorry, g is sumG and b is sumB.

So this is the process now of adding up all the pixels.

I've gone through every single pixel in a 3

by 3 neighboring area and added up

all the reds, greens, and blues, and I'm returning those back.

But I'm missing the crucial component, which

is as I'm adding all the pixels up in that area,

I need to multiply each one by the value in the filter itself.

Incidentally, I should also mention

that the operation that this really is is the dot product,

and in an actual machine learning system,

all this would be done with matrix math,

but I'm doing it sort of like longhand just

to sort of see the process and look at it.

What should I call this in the filter, like the factor?

Now, I need to look up in the filter, i,j.

Only here's the thing--

because I decided to go from negative 1 to 1,

negative 1 to 1, the filter doesn't

have those index values.

It goes 0, 1, 2, 0, 1, 2.

So this has to be i plus 1, j plus 1.

So it's all six of one, half dozen of the other,

whether I go from 0 to 2 there and do

the offset in the pixels.

But the point is the pixel array,

I'm looking actually to the negative and positive

to the left and right, but the filter is just a 3

by 3 array starting with 0,0 on the top left.

So now, I should be able to multiply by factor.

And there we go.

I have the full convolution operation.

Now, I might have made a mistake here.

I think this is right.

When I run it, we'll find out if I made a mistake.

I'm summing up a 3 by 3 neighborhood of pixels,

all multiplied by weights that are in a 3 by 3 filter.
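
Putting the pieces together, the function dictated so far might look roughly like this. It follows the spoken steps, including the `i + 1`, `j + 1` filter lookup, but the `cat` stand-in object, the 3 by 3 test image, and the identity filter are assumptions added so the sketch can run outside p5.

```javascript
// Stand-in for the cat image: a flat array with R, G, B, alpha per pixel.
const cat = { width: 3, height: 3, pixels: new Uint8ClampedArray(3 * 3 * 4) };

function index(x, y) {
  return (x + y * cat.width) * 4;
}

function convolution(img, x, y, filter) {
  let sumR = 0;
  let sumG = 0;
  let sumB = 0;
  for (let i = -1; i <= 1; i++) {
    for (let j = -1; j <= 1; j++) {
      const pix = index(x + i, y + j);
      // Offsets run from -1 to 1 but the filter is indexed 0 to 2.
      const factor = filter[i + 1][j + 1];
      sumR += img.pixels[pix + 0] * factor;
      sumG += img.pixels[pix + 1] * factor;
      sumB += img.pixels[pix + 2] * factor;
    }
  }
  return { r: sumR, g: sumG, b: sumB };
}

// Color only the center pixel, then apply a filter that keeps only the center.
cat.pixels.set([10, 20, 30, 255], index(1, 1));
const identity = [
  [0, 0, 0],
  [0, 1, 0],
  [0, 0, 0],
];

console.log(convolution(cat, 1, 1, identity)); // { r: 10, g: 20, b: 30 }
```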

Oh, but I actually have to call that function here.

Now, it should be relatively easy because all of the work

was in there.

So if I say let I'm just going to call this rgb

equal convolution, the cat at the given x and y

with the filter, then the new image, which is called filter--

oh.

I have to look up.

It's OK.

No problem.

The pixel is index x,y, and then filter--

so I have to look up the one-dimensional location

in the new image, and then at .pixels at that pixel is

the rgb--

the red value that came back plus

0 plus 1 plus 2, green and blue.

And then if all goes according to plan,

I should be able to draw the filtered image

at offset to the right with the same size.

I did miss something kind of important,

which is that if I am working with pixels of an image in p5,

I need to call loadPixels.

So cat.loadPixels filtered.loadPixels.

And then I haven't changed the pixels of the original cat

image, but since I changed the pixels of the filtered image,

afterwards I need to call updatePixels.

And now is the moment of truth.

[DRUM ROLL]

Never good when I press the snare drum button.

I'm going to run the sketch.

Whoops.

All right, well, I've already got an error.

[SAD TROMBONE]

Cannot read property loadPixels.

Oh, filter, filter, filtered.

That should be filtered.

Also this isn't right-- createCanvas.

The size of the canvas is times 10 times 2 times 10.

Remember, the image is just 28 by 28.

Let's try this again.

[DRUM ROLL]

[SAD TROMBONE]

Well, a little bit better.

We didn't get any errors.

I don't see an image.

Do I need to give it a hardcoded transparency of 255?

Yes.

[BELL] Oops.

So it was fully transparent.

So I'm not pulling the transparency over.

I could pull it over, but I just know

I don't want it to be transparent.
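
The whole pass over the image, including the hardcoded alpha of 255, can be sketched as follows. Plain objects stand in for the p5 images so the block runs on its own; in the actual sketch, `cat.loadPixels()` and `filtered.loadPixels()` come before the loop and `filtered.updatePixels()` comes after it. The 4 by 4 size, the `makeImage` helper, and the identity filter are assumptions for illustration.

```javascript
const dim = 4;

// Hypothetical stand-in for an image: a flat array, 4 values per pixel.
function makeImage(w, h) {
  return { width: w, height: h, pixels: new Uint8ClampedArray(w * h * 4) };
}

const cat = makeImage(dim, dim);
const filtered = makeImage(dim, dim); // starts fully transparent (all zeros)

// Fill the source with opaque mid-gray.
for (let p = 0; p < cat.pixels.length; p += 4) {
  cat.pixels.set([100, 100, 100, 255], p);
}

function index(x, y) {
  return (x + y * dim) * 4;
}

function convolution(img, x, y, filter) {
  let sumR = 0, sumG = 0, sumB = 0;
  for (let i = -1; i <= 1; i++) {
    for (let j = -1; j <= 1; j++) {
      const pix = index(x + i, y + j);
      const factor = filter[i + 1][j + 1];
      sumR += img.pixels[pix + 0] * factor;
      sumG += img.pixels[pix + 1] * factor;
      sumB += img.pixels[pix + 2] * factor;
    }
  }
  return { r: sumR, g: sumG, b: sumB };
}

const identity = [[0, 0, 0], [0, 1, 0], [0, 0, 0]];

for (let x = 1; x < dim - 1; x++) {
  for (let y = 1; y < dim - 1; y++) {
    const rgb = convolution(cat, x, y, identity);
    const pix = index(x, y);
    filtered.pixels[pix + 0] = rgb.r;
    filtered.pixels[pix + 1] = rgb.g;
    filtered.pixels[pix + 2] = rgb.b;
    filtered.pixels[pix + 3] = 255; // without this, alpha stays 0
  }
}

console.log(Array.from(filtered.pixels.slice(index(1, 1), index(1, 1) + 4)));
// [ 100, 100, 100, 255 ]
console.log(filtered.pixels[index(0, 0) + 3]); // skipped edge pixel: 0
```

A `Uint8ClampedArray` clamps what it stores to the range 0 to 255, so negative sums from an edge filter end up as 0 in this sketch.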

Look at that.

Look at how it found the--

oh, oh, oh, oh.

Look at this.

That doesn't look like it's finding the vertical edges--

pixels that are different to the left.

It looks like it's finding horizontal edges.

Even though I've typed this out in a way

that visually, these negative 1's appear in a column,

those actually correspond not to the j index,

but to the i index.

So I think one way to fix that would just be to swap it here.

And maybe there's like a more elegant way of doing this,

but this now, if I run it this way,

you'll see, ah, look at those vertical edges.
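
The effect of the swap can be checked on a single 3 by 3 grayscale patch. The filter values, the patch, and the `sumWith` helper are assumptions for illustration. Here `i` is the x offset and `j` is the y offset, as in the transcript.

```javascript
// Assumed filter, typed so the -1's appear visually in the left column.
const filter = [
  [-1, 1, 0],
  [-1, 1, 0],
  [-1, 1, 0],
];

// A patch that is dark on the left and bright to the right.
const patch = [
  [0, 255, 255],
  [0, 255, 255],
  [0, 255, 255],
];

// Sum the patch around its center using a given way of reading the filter.
function sumWith(lookup) {
  let sum = 0;
  for (let i = -1; i <= 1; i++) {   // x offset
    for (let j = -1; j <= 1; j++) { // y offset
      sum += patch[1 + j][1 + i] * lookup(i, j);
    }
  }
  return sum;
}

const asFirstWritten = sumWith((i, j) => filter[i + 1][j + 1]);
const swapped = sumWith((i, j) => filter[j + 1][i + 1]);

console.log(asFirstWritten); // 0, misses the left-to-right change
console.log(swapped);        // 765, responds to it
```

The first index of a nested array picks the row as typed, so pairing it with the y offset makes the filter act the way it looks in the code.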

So now, we see how this convolution

is applied to the image.

The difference in the neural network here--

the convolutional neural network--

is we're not hardcoding in specific filters

that we know highlight things in an image.

The neural network is going to learn

what values for the filters highlight

important aspects of the image to help the machine learning

task at hand, such as classification.

So it might draw out, you know, cats

tend to have ears that appear a certain way and this kind

of filter, like, brings that out, and then leads

to the final layer of the network activating

with a high value for that particular classification.

So just to keep my example simulating the neural network

process a bit more, let's just every time I run it,

give it a random filter because that's what

the layer would begin with.

Just like a neural network begins with random weights

and learns the right weights, the filters

begin with random values and it learns optimal values.

So right here in setup, I'll write a nested loop

and give it a random value between negative 1 and 1.

In truth, there are other mechanisms and strategies

for the initial weights of a convolutional neural network,

but picking random numbers will work for us

right now just to see.
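
A minimal sketch of that nested loop. It uses `Math.random()` so it runs outside p5; inside a p5 sketch, `random(-1, 1)` would be the usual way to get the same range.

```javascript
const filter = [];

for (let i = 0; i < 3; i++) {
  filter[i] = [];
  for (let j = 0; j < 3; j++) {
    // Math.random() is in [0, 1), so this lands in [-1, 1).
    filter[i][j] = Math.random() * 2 - 1;
  }
}

console.log(filter); // nine different values on every run
```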

So every time I run it, you can see

we get a different resulting image that is filtering

the image in a different way.

OK, that was a lot and I think it

would be good to take a break.

So this was the first part of my explanation,

a long-winded attempt to answer the question, what is

a convolutional neural network?

So the first thing to look at is the convolutional layer.

It's made up of filters.

And so this video attempted to explain that.

And I think we could take a break, have a cup of tea,

talk to your pet, or friend, or plant, or something, meditate,

relax.

And then if you want--

if you want, you can come back and in the next video,

I'm going to look at the next piece--

the next component of the convolutional layer,

an operation called pooling or more specifically, max pooling.

And then I'll be able to tie a little ribbon

and put a little bow on this explanation

about convolutional neural networks

and move towards actually implementing one

with the ml5 built-in functionality.

All right, so maybe I'll see you in the future

and have a great rest of your day.

Goodbye.

[MUSIC PLAYING]