Hello and welcome to another Beginner's Guide
to Machine Learning with ml5.js video.
This is a video.
You're watching it.
And I am beginning this journey to talk about, and think about,
and attempt to explain and implement
convolutional neural networks.
So this is something that I referred to in the previous video,
where I took the pixels of an image
and made those the inputs to a neural network
to perform classification.
And I did this in even earlier videos with pretrained models.
And I mentioned that those pretrained models included
something called a convolutional layer,
but my example didn't include a convolutional layer.
So ml5 has a mechanism for adding convolutional layers
to your ml5 neural network.
But before I look at that mechanism, what
I want to do in this video and in the next one
is just explain what are the elements
of a convolutional neural network,
how do they work, and then look at some code examples that
actually implement the features of that convolutional layer.
I'm not going to build from scratch
a full convolutional neural network.
Maybe that's some other video series that I'll do someday.
We're going to use the fact that the ml5 library just
makes that possible for you.
In the first part, I will just talk,
from the zoomed-out view, about what a convolutional layer is,
and then I will look, with code, at this idea of a filter.
In the second part, I'll come back
and look at this other aspect of a convolutional layer called pooling.
I hope you enjoy this and you find it useful.
And I'll see you--
I'll be back in this outfit at the end of the video.
Let me start by diagramming what the neural networks looked
like with ml5 neural network to date in the videos
that I've made.
So there's been two layers--
a hidden layer and an output layer--
and then also there's some data coming into the neural network.
And in this case, in the previous example,
it was an image, which was flattened.
So I used the example of 10 by 10 pixels, each with an R, a G,
and a B. So that made an array of 300 inputs.
All these pixel values, those are the inputs.
And those go into the hidden layer.
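To make the flattening step concrete, here is a minimal sketch in plain JavaScript. The nested-array image and its color values are made up for illustration; they are not from the earlier example. It only shows how 10 by 10 pixels with three channels each become one flat array of 300 numbers.

```javascript
// Hypothetical 10x10 image stored as rows of [r, g, b] pixels.
// The color values are invented just so there is something to flatten.
const size = 10;
const image = [];
for (let y = 0; y < size; y++) {
  const row = [];
  for (let x = 0; x < size; x++) {
    row.push([x * 25, y * 25, 128]);
  }
  image.push(row);
}

// Flatten row by row, pixel by pixel, channel by channel.
const inputs = [];
for (const row of image) {
  for (const [r, g, b] of row) {
    inputs.push(r, g, b);
  }
}

console.log(inputs.length); // 300
```

Once the pixels are in this flat array, nothing in it says which values were next to each other in the image.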
But just for the sake of argument,
let me simplify this diagram and I'm just
going to consider an example with four inputs.
I'm going to consider that example
as having five hidden nodes--
And then let's say, it's a classification problem
and there's three possible categories.
So when I call the function ml5.neuralNetwork,
it creates this architecture behind the scenes
and connects every single input to every hidden unit
and every hidden unit to each output.
So this is what the neural network looks like.
Each one of these connections has
a weight associated with it.
Each unit receives the sum of all
of the inputs times the weights, and passes that sum through an activation
function. The result becomes that unit's output, and then all of those outputs,
with their own weights, are summed into the next layer,
and so on and so forth.
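The weighted-sum idea can be sketched in a few lines of plain JavaScript. This is not how ml5 is implemented internally; it is an illustration under some assumptions: the function names are hypothetical, sigmoid is picked as the activation function for both layers, and biases are left out to match the diagram.

```javascript
// One fully connected layer: every input feeds every unit.
// weights[u][i] is the weight on the connection from input i to unit u.
function denseLayer(inputs, weights, activation) {
  return weights.map((unitWeights) => {
    let sum = 0;
    for (let i = 0; i < inputs.length; i++) {
      sum += inputs[i] * unitWeights[i];
    }
    return activation(sum);
  });
}

// Assumed activation function for this sketch.
const sigmoid = (x) => 1 / (1 + Math.exp(-x));

// Random starting weights between -1 and 1.
function randomWeights(units, inputsPerUnit) {
  return Array.from({ length: units }, () =>
    Array.from({ length: inputsPerUnit }, () => Math.random() * 2 - 1)
  );
}

const inputs = [0.2, 0.9, 0.4, 0.7];      // four inputs
const hiddenWeights = randomWeights(5, 4); // five hidden units
const outputWeights = randomWeights(3, 5); // three categories

const hidden = denseLayer(inputs, hiddenWeights, sigmoid);
const outputs = denseLayer(hidden, outputWeights, sigmoid);

console.log(hidden.length, outputs.length); // 5 3
```

With four inputs, five hidden units, and three outputs, that is 4 times 5 plus 5 times 3, or 35 weights in total.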
So this is what I have worked with before.
While in the previous example, I was
able to get this kind of architecture
to work with image input and get results that produced something
in the output, this can be improved upon.
There is information in this data that's
coming in that is lost when it is flattened
to just a single flat array.
And the information that's lost is
the relative spatial orientation of the pixels.
It's meaningful that these colors are near other colors.
Something in what we're seeing in the image
has to do with the spatial arrangement of the pixels
themselves in two dimensions.
In order to address that, we want
to add into this architecture--
I really spent a lot of time drawing this diagram,
which I'm now going to mostly erase--
we want to add something called a convolutional layer.
So in this video, I want to explain what are the elements.
There are units, nodes, neurons, so
to speak, in a convolutional layer, but what are they?
And the word that's typically used
is actually called a filter, which makes a lot of sense.
Now, convolutional neural networks
can be applied to lots of scenarios besides images
and there's a lot of research into different ways
that they can be used effectively,
but I'm going to stick with the context of working
with images because the word "filter" really fits with that.
We're filtering an image.
How is this layer filtering an image?
So the idea of a convolutional layer is not a new concept,
and it predates the era that we're
in now of so-called deep learning.
And if you want to go back and look at the origins
of convolutional neural networks,
you can find them in this paper called "Gradient-Based Learning
Applied to Document Recognition" from 1998.
Section two, convolutional neural networks
for isolated character recognition.
And here, we can see this diagram, which
I'm attempting to kind of talk through
and create my own version of over here on the whiteboard.
This is also the original paper associated
with the MNIST dataset--
a dataset of handwritten digits that's
been used umpteen amounts of times in research papers
over the years related to machine learning.
I know I'm going back and forth a lot here,
but let's go back to thinking of the input
as a two-dimensional image itself.
So this two-dimensional image--
and let's not say it's 10 by 10.
Let's use what the MNIST dataset is, which
is a 28 by 28 pixel image.
And of course now, much higher resolution images are used.
And this is what is coming in to the first convolutional layer.
This image is being sent to every single one
of these filters.
A filter is a matrix of numbers.
And let's just, for example, let's have a 3 by 3 matrix.
Each one of these filters represents nine numbers--
a matrix that's 3 by 3.
You could have a 5 by 5 filter and so on and so forth,
but a sort of standard size, or a nice example size for us
to start with is 3 by 3.
Each one of these filters is then applied to the image
through a convolutional process.
This by the way, is not a concept
exclusive to machine learning.
This idea of applying a convolutional filter to an image
has been part of image processing,
and computer science, and computer vision algorithms
for a very long time.
To demonstrate this, let me actually open up--
I can't believe I'm going to do this,
but I'm going to open up Photoshop.
So here I am in Photoshop and I've
opened this image of a kitten.
And there's a menu option called Filter.
This word is not filter by accident.
There's a connection.
So all of these types of operations that you might do--
for example, like blur an image--
these are filters-- convolutions applied to the image.
I'm going to go down here under Other and select Custom.
All of a sudden, you're going to see here,
I have this matrix of numbers.
This matrix of numbers in Photoshop
is exactly the same thing as this matrix of numbers
I'm drawing right here.
Each one of these filters in the convolutional layer
represents a matrix of numbers that
will be applied to the image.
So let me actually just put some numbers in here.
This particular set of numbers happens
to be a filter for finding edges in an image.
And you can think of it as these are
all weights for a given pixel.
So for any given pixel, I want to subtract colors
that are to the left of it and emphasize
colors that are at that pixel and above and below.
This draws out areas of the image
where the neighboring pixels are very, very different.
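Here is a small sketch of what that filter does to one 3 by 3 patch of grayscale pixels. The exact numbers are an assumption chosen to match the description (negative weights to the left, positive weights at the pixel and above and below it), and applyToPatch is a hypothetical helper name.

```javascript
// Assumed edge filter: subtract what is to the left,
// keep what is at the pixel and directly above and below it.
const edgeFilter = [
  [-1, 1, 0],
  [-1, 1, 0],
  [-1, 1, 0],
];

// Multiply each pixel in the patch by the matching filter weight and add up.
function applyToPatch(patch, filter) {
  let sum = 0;
  for (let row = 0; row < 3; row++) {
    for (let col = 0; col < 3; col++) {
      sum += patch[row][col] * filter[row][col];
    }
  }
  return sum;
}

// A patch where every pixel is the same brightness.
const flatPatch = [
  [100, 100, 100],
  [100, 100, 100],
  [100, 100, 100],
];

// A patch with a dark column on the left and bright pixels beside it.
const edgePatch = [
  [0, 200, 200],
  [0, 200, 200],
  [0, 200, 200],
];

console.log(applyToPatch(flatPatch, edgeFilter)); // 0
console.log(applyToPatch(edgePatch, edgeFilter)); // 600
```

The uniform patch cancels out to zero, while the patch with a sharp left-to-right change produces a large value.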
Interestingly enough, I could switch these to 0.
Switching the filter to have the negative numbers on the top,
you can see now I'm still detecting edges,
but I'm detecting horizontal edges.
If you go back and look at the cat
that I had previously versus this one,
you can see vertical edges versus horizontal edges.
So there are known filters, which draw out
certain features of an image.
And that's exactly what each one of these filters does.
If all of the nodes of a neural network
can draw out and highlight different aspects of an image,
those can be weighted to indicate and classify
the image in certain ways.
The big difference between a convolutional layer
in a neural network and what I'm doing here
by hardcoding in sort of known filters
is that the neural network is not
going to have filters hardcoded into it.
It's going to learn filters that do a good job of identifying
features in an image.
This relates to the idea of weights, I think.
So if I go back to my previous diagram,
where every single input is connected
to each hidden neuron with a weight,
now the input image is connected to every single one
of these filters.
In a way, there are now nine weights for every single one.
Instead of learning a single weight,
it's going to learn a set of weights for an area of pixels
to identify a feature in the image.
All of these filters will start with random values, and then
the same gradient descent process--
the error backpropagating through the network,
adjusting all the dials, adjusting all the weights
in these matrices and all of these filters--
works in the same way.
So in the ml5 series, I haven't really
gone through and looked at the gradient descent learning
algorithm to adjust all the weights in detail.
I do have another set of videos that
do that if you're interested, but the same gradient descent
algorithm that is applied to these weights
is applied to all of the different values
in each one of these filters.
Incidentally, just to show a very common convolution
operation to blur an image, blurring an image
is taking the average of a given pixel and all of its neighbors.
So here, you can see if I give the same weight to a 5 by 5
matrix of pixels around a center pixel,
and then divide that scale-- let's divide by 25
because there's 25--
that's averaging all of the colors.
If I click on Preview, blurred, not blurred,
blurred, not blurred.
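The averaging blur can be sketched the same way: a 5 by 5 filter of equal weights, with the sum divided by 25. The patch values and the blurCenter helper are made up for illustration.

```javascript
// A 5x5 filter where every neighbor gets the same weight.
const size = 5;
const blurFilter = Array.from({ length: size }, () => Array(size).fill(1));
const scale = size * size; // 25 weights, so divide by 25 to get the average

// Returns the blurred value for the center pixel of a 5x5 patch.
function blurCenter(patch) {
  let sum = 0;
  for (let row = 0; row < size; row++) {
    for (let col = 0; col < size; col++) {
      sum += patch[row][col] * blurFilter[row][col];
    }
  }
  return sum / scale;
}

// A dark patch with a single bright pixel in the middle.
const patch = Array.from({ length: size }, () => Array(size).fill(0));
patch[2][2] = 250;

console.log(blurCenter(patch)); // 10
```

The single bright pixel gets spread across its neighborhood, which is what softens the image.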
Of course, there are other more sophisticated convolutions,
like a Gaussian blur.
You can take a look at a Gaussian blur.
There's different ways to pronounce it.
You can take a look and research what that is,
but again, I'm not going down the road
to look at common image processing convolutions.
Instead, talking about the concept of a convolution as
applied to an image in the process
of a convolutional neural network.
Just to take this a little bit further,
I'm going to demonstrate how to code the convolution
algorithm in p5.js.
In truth, ml5 and TensorFlow.js are
going to handle all of the convolution operations for us
and creating all the filters.
We're just going to configure a convolutional layer
from a high level.
But I think it's interesting to look
at how you might code an image processing algorithm in p5.
I have some videos that do things like this previously,
but let's look at it in this context.
So I took a low resolution 28 by 28 image of a cat.
This comes from the Quick Draw dataset, which I've made videos
about before and I will also use to see
if we can create a doodle classifier as part
of this series.
And all I want to do is apply a convolution to that image.
So first, I'm going to create a variable
and I'm going to call it filter.
So this is going to be our filter.
And I'm going to make it a two-dimensional array.
So let me just put all zeros in it to start.
So this is the filter.
And let's go with that one that looks for edges.
The cat image is actually quite low resolution,
just 28 by 28 pixels, but I'm drawing it at twice the size.
I want to write the code to apply this filter to the image
and draw the filtered image to the right.
I'm going to create a variable called dim for dimensions
and just call this 28.
And then I want another variable to store the filtered image.
And in setup, I can create that image.
This creates a blank image of the same dimensions
as the original cat drawing.
Then I can write a loop.
And this loop is going to look at every single pixel
for all the columns x and all of the rows y.
And I wrote int there because I'm half the time
programming in Java.
But one thing that's important here,
if we're going to take this 3 by 3 matrix
and apply it to every single pixel of the original image,
if we're applying it to that first pixel 0,0,
there's no pixel to the left and no pixel above it.
It doesn't have all of its neighbors.
So there's various ways around this.
I'm just going to ignore all the edge pixels.
So the loop will go from 1 to dimensions minus 1.
Now, there's a lot more work to be done here just to apply
this filter to any given pixel.
I think a way that might make sense
to do this is to actually have a new function.
I would call the function filter--
let's just call it convolution.
I'm going to write a function called convolution.
It receives an image, an x and a y, and a filter,
and it returns a new color.
So the idea of this function is that it receives
all the things it needs.
It receives the original image, the filter to apply to it,
which particular pixel we want to process,
and then will return back a new RGB value
after that pixel is processed.
And the reason why I'm doing that in a separate function
is I need another nested loop to go over the filter.
So I need to go from 0 to 3--
0, 1, 2 columns in the filter, 0, 1, 2 rows in the filter.
And it would be getting to be quite a lot
if I had four nested loops right in here.
Now, I probably shouldn't have some
of this hardcoded in here-- the number 3
and that sort of thing-- but you can
imagine how you might need to use variables
if the filter size is flexible.
Now, we have a really sort of like sad fact, which
is true about most cases where you're doing image
processing with some framework.
And the sad fact is that even though all of this--
all of this discussion is built upon the fact
that we are retaining the spatial orientation
of the pixels,
and we're thinking of it as a two-dimensional matrix,
the actual data is stored in one array.
And so I've gone over this in probably countless videos,
but there's a simple formula to look at if I have a given x,y
position in a two-dimensional matrix,
how do I find the one-dimensional lookup
into that matrix, assuming that the pixels were counted
0, 1, 2, 3, 4, 5, 6, 7, blah, blah, blah, next row, 28, 29,
30, blah, blah blah.
And that formula is let index--
oh, well, I need to do that before this nested loop
because right now, I just want the center pixel-- that x,y.
Let index equal x plus y times img.width.
But there's more, oh!
So this is the formula.
And if you think about it, it makes sense
because it's all the x's, and then the
offset along the y's is how many rows
times the width of the image.
But there's another problem, which
is that in this image, there are actually
four numbers being stored for each pixel--
an R, a G, a B, and an alpha--
the red, green, and blue channels
and the alpha channel.
So each pixel takes up four spots.
So this index actually needs to say times 4.
So guess what?
You know it's going to make a lot of sense.
I'm going to need this operation a lot.
Let's write a function for it.
I'll just call it index, and it receives an x, y, and a width,
and it returns--
you know what?
The width is never going to change in my sketch,
so I don't want to be so crazy as to have
to pass it around everywhere.
So we're just going to pull it from a global variable.
Return x plus y times img.width.
And that's not img, it's cat.width.
OK, so once again, this is terrible what I'm doing,
but I'm just saving myself a little bit of heartache
here and there.
So this index-- ooh, let's call this pixel.
Oh, and this should be times 4.
This pixel is that function index x,y.
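Written out, the lookup might look like this. It is a sketch in plain JavaScript where imageWidth is a stand-in global for cat.width, since the width of the 28 by 28 image never changes.

```javascript
// Stand-in for cat.width in the sketch.
const imageWidth = 28;

// Convert an x,y position into the index of that pixel's red value
// in a flat array that stores r, g, b, a for every pixel.
function index(x, y) {
  return (x + y * imageWidth) * 4;
}

console.log(index(0, 0));   // 0
console.log(index(1, 0));   // 4
console.log(index(0, 1));   // 112
console.log(index(27, 27)); // 3132
```

Green, blue, and alpha for the same pixel sit at that index plus 1, plus 2, and plus 3.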
Now, I have something I could do to simplify this,
but I might as well write the code for if this
were a full RGB image.
This is a grayscale image, but it has all the channels in it.
The thing that I need to do to perform this convolution
operation is to take all of the weights--
the numbers that are in the filter matrix--
and I need to multiply each one times the pixel value of all
of the neighbors and their corresponding locations,
add them all up together, and maybe divide by something
if I wanted to sort of, like, average it out.
But in this case, I actually don't want to divide.
I'm just going to leave the weights are the weights are
the weights are the weights.
And actually, this right here is irrelevant.
I need to do this inside the loop.
You'll see in a second.
I think it's going to make sense.
So I need sum.
I'm going to make a sum of all the R values,
a sum of all the green values, and a sum of all
the blue values.
All right, wait a sec, wait a sec, wait a sec.
Actually, I think this is going to make more sense.
Let's go from negative 1 to 2.
You'll see why.
I mean, I'll explain why.
And negative 1 to 2.
Let's do that instead.
And maybe it's more clear to say less than or equal to 1.
Less than or equal to 1 because--
and let me draw this diagram once again--
if this is pixel 0,0, this is pixel negative 1, negative 1.
This is 1,1.
This is 1,0.
This is 1, negative 1.
I guess I'll do them all.
So you can see that the neighboring
pixels are offset by negative 1 and 1, and negative 1 and 1.
So the pixel x value is x plus i.
The pixel y value is y plus j.
And then the pixel index is a call to the index function,
which returns the actual index into that array
for pixel x and pixel y.
And actually, maybe it makes more sense for me
to just say that I don't necessarily
need separate variables.
It might actually be just as clear just
to put this right in here.
So now, I just need to add the red, green, and blue values
of this particular pixel to the sum.
So sumR plus equal img.pixels at that pixel index.
And then G and B. G is the next one,
and B, blue, is the next one.
And let's add a plus 0 here just to be consistent.
So ultimately, what I'm actually returning here is r is sumR,
g is sumB, and b is sum--
oh, sorry, g is sumG and b is sumB.
So this is the process now of adding up all the pixels.
I've gone through every single pixel in a 3
by 3 neighboring area and added up
all the reds, greens, and blues, and I'm returning those back.
But I'm missing the crucial component, which
is as I'm adding all the pixels up in that area,
I need to multiply each one by the value in the filter itself.
Incidentally, I should also mention
that the operation that this really is is the dot product,
and in an actual machine learning system,
all this would be done with matrix math,
but I'm doing it sort of like longhand just
to sort of see the process and look at it.
What should I call this in the filter, like the factor?
Now, I need to look up in the filter, i,j.
Only here's the thing--
because I decided to go from negative 1 to 1,
negative 1 to 1, the filter doesn't
have those index values.
It goes 0, 1, 2, 0, 1, 2.
So this has to be i plus 1, j plus 1.
So it's all six of one, half dozen of the other,
whether I go from 0 to 2 there and do
the offset in the pixels.
But the point is the pixel array,
I'm looking actually to the negative and positive
to the left and right, but the filter is just a 3
by 3 array starting with 0,0 on the top left.
So now, I should be able to multiply by factor.
And there we go.
I have the full convolution operation.
Now, I might have made a mistake here.
I think this is right.
When I run it, we'll find out if I made a mistake.
I'm summing up a 3 by 3 neighborhood of pixels,
all multiplied by weights that are in a 3 by 3 filter.
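Put together, the function might look something like this sketch. To keep it runnable outside p5, img is a plain object with width, height, and an RGBA pixels array, standing in for a p5 image after loadPixels. The filter is looked up as row first, then column (j is the row offset, i is the column offset), which is an assumption about how the 3 by 3 array is laid out.

```javascript
// Index of a pixel's red value in a flat RGBA array.
function index(x, y, width) {
  return (x + y * width) * 4;
}

// Weighted sum of the 3x3 neighborhood around x,y, one sum per color channel.
function convolution(img, x, y, filter) {
  let sumR = 0;
  let sumG = 0;
  let sumB = 0;
  for (let i = -1; i <= 1; i++) {     // column offset
    for (let j = -1; j <= 1; j++) {   // row offset
      const pix = index(x + i, y + j, img.width);
      // The filter is an array of rows, so the row offset comes first.
      const factor = filter[j + 1][i + 1];
      sumR += img.pixels[pix + 0] * factor;
      sumG += img.pixels[pix + 1] * factor;
      sumB += img.pixels[pix + 2] * factor;
    }
  }
  return { r: sumR, g: sumG, b: sumB };
}

// Tiny 3x3 test image: left column black, everything else white.
const img = { width: 3, height: 3, pixels: new Uint8ClampedArray(3 * 3 * 4) };
for (let y = 0; y < 3; y++) {
  for (let x = 0; x < 3; x++) {
    const value = x === 0 ? 0 : 255;
    const pix = index(x, y, img.width);
    img.pixels[pix + 0] = value;
    img.pixels[pix + 1] = value;
    img.pixels[pix + 2] = value;
    img.pixels[pix + 3] = 255;
  }
}

const filter = [
  [-1, 1, 0],
  [-1, 1, 0],
  [-1, 1, 0],
];

console.log(convolution(img, 1, 1, filter)); // { r: 765, g: 765, b: 765 }
```

The sums are not divided or clamped here, so they can land outside the 0 to 255 range.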
Oh, but I actually have to call that function here.
Now, it should be relatively easy because all of the work
was in there.
So if I say let I'm just going to call this rgb
equal convolution, the cat at the given x and y
with the filter, then the new image, which is called filter--
I have to look up.
The pixel is index x,y, and then filter--
so I have to look up the one-dimensional location
in the new image, and then at .pixels at that pixel is
the red value that came back plus
0 plus 1 plus 2, green and blue.
And then if all goes according to plan,
I should be able to draw the filtered image
at offset to the right with the same size.
I did miss something kind of important,
which is that if I am working with pixels of an image in p5,
I need to call loadPixels.
So cat.loadPixels filtered.loadPixels.
And then I haven't changed the pixels of the original cat
image, but since I changed the pixels of the filtered image,
afterwards I need to call updatePixels.
And now is the moment of truth.
Never good when I press the snare drum button.
I'm going to run the sketch.
All right, well, I've already got an error.
Cannot read property loadPixels.
Oh, filter, filter, filtered.
That should be filtered.
Also this isn't right-- createCanvas.
The size of the canvas is times 10 times 2 times 10.
Remember, the image is just 28 by 28.
Let's try this again.
Well, a little bit better.
We didn't get any errors.
I don't see an image.
Do I need to give it a hardcoded transparency of 255?
So it was fully transparent.
So I'm not pulling the transparency over.
I could pull it over, but I just know
I don't want it to be transparent.
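The outer loop can be sketched in plain JavaScript too. The plain image objects, the applyFilter name, and the compact convolution helper are stand-ins for the p5 sketch, so loadPixels and updatePixels do not appear. What it shows is skipping the edge pixels, writing the red, green, and blue sums into the new image, and hardcoding alpha to 255.

```javascript
const index = (x, y, width) => (x + y * width) * 4;

// Compact 3x3 convolution returning [r, g, b] sums.
function convolution(img, x, y, filter) {
  const sum = [0, 0, 0];
  for (let i = -1; i <= 1; i++) {
    for (let j = -1; j <= 1; j++) {
      const pix = index(x + i, y + j, img.width);
      for (let c = 0; c < 3; c++) {
        sum[c] += img.pixels[pix + c] * filter[j + 1][i + 1];
      }
    }
  }
  return sum;
}

// Apply the filter to every pixel except the outer border.
function applyFilter(src, filter) {
  const out = {
    width: src.width,
    height: src.height,
    // Like the pixels array in p5, this clamps values to 0..255.
    pixels: new Uint8ClampedArray(src.pixels.length),
  };
  for (let x = 1; x < src.width - 1; x++) {
    for (let y = 1; y < src.height - 1; y++) {
      const [r, g, b] = convolution(src, x, y, filter);
      const pix = index(x, y, src.width);
      out.pixels[pix + 0] = r;
      out.pixels[pix + 1] = g;
      out.pixels[pix + 2] = b;
      out.pixels[pix + 3] = 255; // fully opaque, otherwise nothing shows up
    }
  }
  return out;
}

// 4x4 test image: left half black, right half white.
const cat = { width: 4, height: 4, pixels: new Uint8ClampedArray(4 * 4 * 4) };
for (let y = 0; y < 4; y++) {
  for (let x = 0; x < 4; x++) {
    const pix = index(x, y, 4);
    const value = x < 2 ? 0 : 255;
    cat.pixels.set([value, value, value, 255], pix);
  }
}

const filter = [
  [-1, 1, 0],
  [-1, 1, 0],
  [-1, 1, 0],
];
const filtered = applyFilter(cat, filter);

console.log(filtered.pixels[index(1, 1, 4)]); // 0
console.log(filtered.pixels[index(2, 1, 4)]); // 255
```

The border pixels of the new image are never written, so they stay at zero in every channel, including alpha.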
Look at that.
Look at how it found the--
oh, oh, oh, oh.
Look at this.
That doesn't look like it's finding the vertical edges--
pixels that are different to the left.
It looks like it's finding horizontal edges.
Even though I've typed this out in a way
that visually, these negative 1's appear in a column,
it's actually those correspond not to the j index,
but to the i index.
So I think one way to fix that would just be to swap it here.
And maybe there's like a more elegant way of doing this,
but this now, if I run it this way,
you'll see, ah, look at those vertical edges.
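The effect of the index order can be checked on two tiny grayscale patches. This is an illustrative sketch; the patches and the respond helper are hypothetical. Looking the filter up with the column offset first effectively transposes it, so the same numbers respond to the other kind of edge.

```javascript
// Typed out so the -1s visually form a column.
const filter = [
  [-1, 1, 0],
  [-1, 1, 0],
  [-1, 1, 0],
];

// i is the column offset, j is the row offset.
// lookup decides how those offsets index into the filter.
function respond(patch, lookup) {
  let sum = 0;
  for (let i = -1; i <= 1; i++) {
    for (let j = -1; j <= 1; j++) {
      sum += patch[j + 1][i + 1] * lookup(i, j);
    }
  }
  return sum;
}

const columnFirst = (i, j) => filter[i + 1][j + 1]; // acts like the transposed filter
const rowFirst = (i, j) => filter[j + 1][i + 1];    // matches how the array is typed out

// Dark on the left, bright on the right.
const verticalEdge = [
  [0, 255, 255],
  [0, 255, 255],
  [0, 255, 255],
];

// Dark on top, bright below.
const horizontalEdge = [
  [0, 0, 0],
  [255, 255, 255],
  [255, 255, 255],
];

console.log(respond(verticalEdge, columnFirst));   // 0
console.log(respond(horizontalEdge, columnFirst)); // 765
console.log(respond(verticalEdge, rowFirst));      // 765
console.log(respond(horizontalEdge, rowFirst));    // 0
```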
So now, we see how this convolution
is applied to the image.
The difference in the neural network here--
the convolutional neural network--
is we're not hardcoding in specific filters
that we know highlight things in an image.
The neural network is going to learn
what values for the filters highlight
important aspects of the image to help the machine learning
task at hand, such as classification.
So it might draw out, you know, cats
tend to have ears that appear a certain way and this kind
of filter, like, brings that out, and then leads
to the final layer of the network activating
with a high value for that particular classification.
So just to keep my example simulating the neural network
process a bit more, let's just every time I run it,
give it a random filter because that's what
the layer would begin with.
Just like a neural network begins with random weights
and learns the right weights, the filters
begin with random values and it learns optimal values.
So right here in setup, I'll write a nested loop
and give it a random value between negative 1 and 1.
In truth, there are other mechanisms and strategies
for the initial weights of a convolutional neural network,
but picking random numbers will work for us
right now just to see.
So every time I run it, you can see
we get a different resulting image that is filtering
the image in a different way.
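That nested loop might look something like this. The sketch uses Math.random so it runs outside p5, where the equivalent call would be random(-1, 1), and randomFilter is a hypothetical helper name.

```javascript
// Build a size-by-size filter filled with random weights between -1 and 1.
function randomFilter(size) {
  const filter = [];
  for (let i = 0; i < size; i++) {
    filter[i] = [];
    for (let j = 0; j < size; j++) {
      filter[i][j] = Math.random() * 2 - 1; // in p5: random(-1, 1)
    }
  }
  return filter;
}

const filter = randomFilter(3);
console.log(filter);
```

Every run prints a different 3 by 3 set of numbers, which is why the filtered image changes each time.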
OK, that was a lot and I think it
would be good to take a break.
So this was the first part of my explanation,
a long-winded attempt to answer the question, what is
a convolutional neural network?
So the first thing to look at is the convolutional layer.
It's made up of filters.
And so this video attempted to explain that.
And I think we could take a break, have a cup of tea,
talk to your pet, or friend, or plant, or something, meditate.
And then if you want--
if you want, you can come back and in the next video,
I'm going to look at the next piece--
the next component of the convolutional layer,
an operation called pooling or more specifically, max pooling.
And then I'll be able to tie a little ribbon
and put a little bow on this explanation
about convolutional neural networks
and move towards actually implementing one
with the ml5 built-in functionality.
All right, so maybe I'll see you in the future
and have a great rest of your day.