FABIO YEON: All right.
Let's get started, then.
This is IO-314.
My name is Fabio Yeon.
I'm one of the engineers on the Google Container Engine team
and tech lead on the team as well.
"Google Container Engine and the Path
to Cloud-Native Operations," or as I like to call it,
"the pi talk."
Wow, no math person in the room.
All right, so start with the basics--
How many people here have used or are deploying
containers in production?
Wow, that's a pretty good number.
How many of you have played around with containers?
All right, how many people have heard about containers?
So let's start with, then, some simple intro.
What are containers?
The most succinct explanation I've heard of a container
is that a virtual machine virtualizes hardware.
A container virtualizes a kernel.
So what happens, then, is if you have a bunch of containers
on the same machine, they're all sharing the same kernel.
But the container runtime provides enough abstraction
so each one of them thinks they have
the machine for themselves.
So, why are they useful?
Well, in this little box called a container,
you can put in just your application, its dependencies,
and you're done.
Compared to what you had to do to set up
a deployment on a VM or on bare metal,
you don't have to install an OS.
You don't have to configure your networking.
You don't have to configure and set up the dependencies.
It's just you, your application, your binaries,
in a little container image, and you're done.
In many ways, this is what I call getting serenity
in a sea of chaos.
All right-- whoops.
So once you start playing around with containers,
you start running into things that are a little more complicated.
And this is where Kubernetes comes in and helps you out.
Your little box-- if you have one or two of them,
sure, it's easy to get them up and running and deployed.
Once you start getting more of those,
then you start needing a little bit more help.
Kubernetes helps you ensure that there's
a place for your container to run,
makes sure that the proper number of them are running,
makes sure that the network packets intended
for your containers get routed properly through the cluster
and arrives at the appropriate place.
And, of course, if you use Kubernetes,
Google Container Engine provides to you
the best and purest Kubernetes experience you can get.
Beyond just allowing you to set up a Kubernetes cluster easily,
Google Container Engine strives to provide to you
management capabilities that make it easy
for you to maintain and keep your Kubernetes cluster
So after that short little intro, let's dive
in a little bit into what I'm going
to call cloud-native management capabilities
and show you some of the things that we do in Google Container
Engine to help you be more productive
and to care less about the machinations of your clusters
and go back to what you want to do more, which
is maintain your applications and your services.
"Change is the only constant in life."
I like this quote for many different reasons.
But that's because even though in the little container box
you have been able to provide a level of stability
for your applications and your services,
remember that around you, in the clusters,
things may be changing continually,
and you have to have a little strategy
to be able to manage some of those changes.
So just a quick overview of the agenda.
Gonna talk a little bit about Kubernetes, the Kubernetes
versions and why the version management
is kind of important for you.
Talk a little bit about node health--
what it takes to maintain your node health
and how to evaluate that.
There's a little bit about operations logs and then
talk about some strategies that we
have come up with for successful operations of a cluster.
So keeping up with Kubernetes.
So remember, Kubernetes is an open source project.
As such, it has its own ecosystem,
it has its own community, and it also
has its own release cadence.
Kubernetes strives to get one minor version
release a quarter.
Now [CHUCKLES] this is where a minor version is like a 1.4,
a 1.5, or sometime later this month,
I think a 1.6 is slated to come out.
It seems-- seemingly like a small thing.
So it's a small number change.
But because Kubernetes is still a fast-moving, fast-developing
platform, every quarterly release
tends to be a fairly large release, a fairly big release.
New features get launched, some existing features
get upgraded from alpha to beta, beta to GA, and so forth.
So every minor release of Kubernetes
comes in as a big package.
If you look at the release notes or change
logs for every minor release, it is very comprehensive.
It is very extensive.
Beyond that, every couple of weeks patch releases come out.
These may be bug fixes, security fixes, small improvements
that were missed, and so on.
In Google Container Engine, our goal
is to provide to you the latest version of Kubernetes
as quickly as possible.
On the order of days, if possible, if not sooner.
And we also provide you with a collection of versions,
because we understand that not everything is perfect.
And therefore, sometimes you want to hold back a little bit.
So we provide to you the latest version of Kubernetes--
one older version, plus the latest version
of every previous minor version that's been released.
Now, I know that's a little hard to parse, which
is why I put an example here.
Currently, the latest version of Kubernetes available is 1.5.3,
and that is available.
The previous version of that is 1.5.2,
which is the previous patch release of 1.5.
And then we try to provide the latest version of the 1.4
branch, which is 1.4.9, and the latest version from the 1.3
branch, which is 1.3.10.
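The selection rule just described (latest release, one older patch, plus the latest patch of each earlier minor branch) can be sketched in Python. This is only an illustration of the rule as stated in the talk, not Container Engine's actual logic; the function names and the input list of released versions are assumptions.

```python
def parse(version):
    """Turn '1.5.3' into (1, 5, 3) so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


def offered_versions(released):
    """Pick the latest version, one older patch of the same minor,
    and the latest patch of every earlier minor branch."""
    versions = sorted({parse(v) for v in released}, reverse=True)
    latest = versions[0]
    offered = [latest]

    # One older patch release from the same minor branch, if any.
    older_same_minor = [v for v in versions if v[:2] == latest[:2] and v != latest]
    if older_same_minor:
        offered.append(older_same_minor[0])

    # Latest patch of each earlier minor branch.
    seen_minors = {latest[:2]}
    for v in versions:
        if v[:2] not in seen_minors:
            seen_minors.add(v[:2])
            offered.append(v)

    return [".".join(str(n) for n in v) for v in offered]


# Hypothetical list of released versions, for illustration only.
released = ["1.3.9", "1.3.10", "1.4.8", "1.4.9", "1.5.1", "1.5.2", "1.5.3"]
print(offered_versions(released))  # ['1.5.3', '1.5.2', '1.4.9', '1.3.10']
```

Note that versions are compared as integer tuples, so 1.3.10 correctly sorts above 1.3.9.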
So why this collection of versions?
Why this seemingly arbitrary subset of minor versions and not others?
In Kubernetes, the master and the nodes
can be running different versions
of Kubernetes binaries.
And the official support states that--
by the community and so forth, is that the master and the node
can differ by, at most, two minor versions.
So if the Kubernetes master is running at 1.5,
the oldest version of Kubernetes that can be running on a node
and still be considered supported
is something in the 1.3 branch--
1.5, 1.4, 1.3.
When 1.6 comes out, that will go up to 1.4.
Anything beyond that, meaning any version skew
beyond those two minor releases is considered out of support
by both Kubernetes and Google Container Engine.
It goes into what is called best effort.
Your node may still be running, it may still work.
But the likelihood is that some functionality may or may not
work and may break in the future.
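The two-minor-version support band can be written down as a small check. This is a sketch under assumptions: the function names are hypothetical, it assumes master and node share the same major version, and it also rejects a node whose minor version is newer than the master's.

```python
def minor_of(version):
    """Return (major, minor) as integers from a string like '1.5.3'."""
    major, minor = version.split(".")[:2]
    return int(major), int(minor)


def within_support(master, node, max_skew=2):
    """True if the node is at most `max_skew` minor versions behind the master."""
    m_major, m_minor = minor_of(master)
    n_major, n_minor = minor_of(node)
    if m_major != n_major:
        return False  # assumption: cross-major skew is never supported
    return 0 <= m_minor - n_minor <= max_skew


print(within_support("1.5.3", "1.3.10"))  # True: two minor versions behind
print(within_support("1.6.0", "1.3.10"))  # False: three minor versions behind
```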
So Container Engine, remember, will automatically
upgrade your cluster masters.
This is one of the things that we offer to you
when you create a Kubernetes cluster on Google Container Engine.
Among the things that we do in managing
a master is to upgrade it to ensure that it is at the latest
version of Kubernetes.
So, as an example, if six months ago you created a Kubernetes
cluster on Container Engine at version 1.3, at some point,
your master was upgraded to 1.4.
And if you didn't do anything [INAUDIBLE] nodes at 1.3,
you were still OK.
It's one version difference.
At some point in the near past, it
was probably upgraded to 1.5.
So now your master is at 1.5 and your nodes remained at 1.3.
And when 1.6 comes out, at some point
in the future, when Google Container Engine upgrades
your masters to 1.6--
if you haven't upgraded your nodes, you're
now officially out of support.
So if [INAUDIBLE] to do anything, because
of the management of the master and the auto upgrade that
exists, you have to upgrade your nodes
at least once every six to nine months to at least stay
within the support band.
Of course, we highly recommend that you upgrade more often
to pick up some bug fixes and security
fixes that may have [INAUDIBLE] show up in Kubernetes.
So we come now to the very first feature
that we have on Google Container Engine
to kind of help you manage some of this change.
Node auto upgrade.
Just like the cluster master, where we manage it for you,
node auto upgrade will monitor the nodes
that you have in the clusters.
And if we detect that they are, shall we say, older
than the master, we will automatically trigger
an upgrade of the nodes.
It is the same exact logic that is applied if you
upgrade your nodes manually.
It can be enabled and disabled at any time.
And just briefly, you can actually do this--
whether the auto upgrade is enabled or not,
you can always upgrade your nodes manually first.
If you want to try out, test out things,
or try to get the latest version into your cluster sooner.
Now, we kind of try to do this also in a very judicious manner.
Patch upgrades, because they're typically bug fixes or security
fixes, tend to be applied much more quickly.
So if my nodes are at 1.5.0 or 1.5.1, for example,
and I enable this feature, I will probably be upgraded to
1.5.3 fairly quickly.
Because we understand that minor version changes on Kubernetes
can be more disruptive--
and even though Kubernetes and the community
try very hard to make sure that they are not
very disruptive and that backward compatibility
and all those things are taken care of--
because we understand that sometimes they
can be a little bit more disruptive,
minor version updates happen at a slightly slower pace.
So, for example, the 1.4 to 1.5 upgrades
will happen at a later cadence than the patch releases.
So how to enable this thing?
You can enable it in gcloud.
When you create a cluster, it is in the gcloud beta channel
set of commands.
The flag is --enable-autoupgrade.
You can also do it when you create a node pool,
and you can also do it in the UI at cluster creation time.
Like I mentioned, you can also enable or disable
this capability on a node pool at any given time.
One thing to keep in mind, if you disable this feature
while an auto upgrade is running,
it will run to completion.
It just means that the next time we do a scan of your clusters,
it will not auto upgrade it.
Now, let's peek a little bit behind the scenes
as to what Google Container Engine is doing for you when
this thing happens and go over the set of scenarios in case
you want to opt out of this and do it yourself.
So first thing, as I mentioned before, you can always
trigger upgrades manually-- does not affect the settings
for auto upgrades.
The first thing is that every week, Google Container Engine
does a push, and we publish release notes.
Among the things we publish in these release notes,
along with bug fixes and feature announcements,
is what versions of Kubernetes, if any,
are made available to you as a user.
We also provide the same functionality or notification,
both in the CLI, via gcloud whenever you list it,
or in the UI.
Notice that the two versions that
are listed in this particular output
have two different types of asterisks.
The more asterisks there are, typically the worse, the more
dangerous, the worse it is.
In this case, it is telling you that one cluster is
one version behind and the second one is
two versions behind. I, unfortunately,
did not have one that was three versions behind.
But you can imagine that
the warning would be even more severe in that case.
And in the UI, of course, I think
there the upgrade available notification
is available, as well, whenever an upgrade is
available for your nodes.
If you want something a little more programmatic,
gcloud container get-server-config will give you
all the current information about
the potential configurations of your Google Container Engine
for a particular zone.
In this particular output, you can
see that the default cluster version is 1.5.3,
but the valid masters are now 1.5.3 and 1.4.9,
which is the latest version from 1.4, which we still
are currently supporting at this time.
And then your nodes can be any number of these versions.
Notice that even though 1.2.7 is still available, it is only
supported if you actually have your master at the 1.4 branch.
If you have a master on the 1.5 branch of Kubernetes,
the oldest version of Kubernetes that
could be supported on the node is in the 1.3 branch.
So thinking a little bit back.
I mentioned that Google Container Engine manages your masters.
One of the things that we do is upgrade them.
You yourself can trigger an early upgrade of the master
before we get to it, if you so choose to do it.
And the option is from the command line.
It's gcloud container clusters upgrade, and then you
add the --master option.
This tells the system that you want to upgrade the master.
And then you can go ahead and trigger a manual upgrade
of your nodes with the same command, except this time,
you just take out the --master.
Why do we upgrade the master first?
It's because the node can, at most,
be at the same version as the master.
You cannot have a node that is a version higher,
even a patch version.
So in a hypothetical world, if 1.5.4 were to come out today
and be made available to Container Engine customers,
if you want to try out 1.5.4 before we upgrade your cluster,
you would have to upgrade your masters first to 1.5.4
and then upgrade your nodes in your node pools
to 1.5.4 manually.
And this is regardless whether you
have enabled auto upgrade for your nodes or not.
You can just do it yourself.
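The ordering constraint (master first, then node pools, because a node can never be newer than the master) can be sketched as a small planner. The function and pool names here are hypothetical, and the sketch only produces an ordered list of steps; it does not call gcloud.

```python
def as_tuple(version):
    """Turn '1.5.4' into (1, 5, 4) for numeric comparison."""
    return tuple(int(part) for part in version.split("."))


def plan_manual_upgrade(master, node_pools, target):
    """Return ordered (component, version) steps: master first, then node pools.

    node_pools maps a pool name to its current version.
    """
    steps = []
    # The master must reach the target before any node pool can.
    if as_tuple(master) < as_tuple(target):
        steps.append(("master", target))
    for pool, version in node_pools.items():
        if as_tuple(version) < as_tuple(target):
            steps.append((pool, target))
    return steps


# Hypothetical cluster state, for illustration only.
for step in plan_manual_upgrade("1.5.3", {"main": "1.5.3", "pre": "1.5.2"}, "1.5.4"):
    print(step)
```

Each step would correspond to one upgrade invocation, with the master step being the one that carries the master option.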
Ah, so we've started doing an upgrade
or we trigger an upgrade of the nodes,
and you detect that something is going awry,
something has gone wrong, and you need to stop.
We have two sets of commands that are currently
in the alpha channel of gcloud.
The first one-- actually, the second one on this list,
but I think the more important one-- is when you just
want the upgrade to stop.
So you call gcloud alpha container operations
cancel and give it the operation ID, which
you can get from listing the operations currently running.
The operation list is probably going
to say something like auto upgrade
or a manual upgrade, cluster upgrade.
Node upgrade, I think.
You can cancel the operation.
Cancelling an operation does not stop
an ongoing upgrade of a node.
So if a cluster has five nodes and you detected that
after the first one, something has gone south
and you want to stop.
If the second node upgrade has already started,
the cancelling of the operation will--
that upgrade will finish, but the subsequent upgrades
of the nodes in the node pool will not start.
The upgrade will be canceled at that point.
Now, your cluster is going to be in this kind
of mixed-version mode, where a couple
of nodes are on the new version and a couple of nodes
on the old version.
So then you can do what's called a rollback.
If you call gcloud alpha container node-pools rollback,
we will then take those nodes that were at the newer version
and roll them back to the previous version.
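The cancel and rollback semantics can be illustrated with a small simulation. This is a toy model with hypothetical names, not the real operation: a cancel that arrives while a node is being upgraded lets that node finish, stops the remaining nodes, and a rollback then returns the upgraded nodes to the previous version.

```python
def rolling_upgrade(nodes, new_version, cancel_during=None):
    """Upgrade nodes one at a time.

    nodes maps node name to current version. If a cancel arrives while
    `cancel_during` is being upgraded, that node still finishes, but no
    later node is started.
    """
    result = dict(nodes)
    for name in nodes:
        result[name] = new_version  # an in-flight node upgrade always completes
        if name == cancel_during:
            break  # cancellation takes effect before the next node starts
    return result


def rollback(nodes, new_version, old_version):
    """Return nodes that reached new_version back to old_version."""
    return {name: (old_version if version == new_version else version)
            for name, version in nodes.items()}


# Five hypothetical nodes; the cancel arrives during the second node's upgrade.
before = {"node-%d" % i: "1.5.2" for i in range(1, 6)}
mixed = rolling_upgrade(before, "1.5.3", cancel_during="node-2")
print(mixed)
print(rollback(mixed, "1.5.3", "1.5.2") == before)  # True
```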
Couple things to keep in mind, as well, as you're doing this.
This is a recreation of a VM.
So if you happen to have any local configuration
or local data stored on those VMs,
unfortunately, they do not survive an upgrade operation
at this time.
Healthy nodes, healthy clusters.
Keeping your nodes healthy can be a little tricky.
The nodes can go unhealthy for a number of reasons.
You can run out of resources.
Maybe you over provisioned stuff and you used
too much memory, too much CPU.
It can run out of local disk.
Maybe you have a configuration problem or a bug in Kubelet
that is causing it to crash.
Maybe your workload is triggering a kernel bug,
or, of course, network segmentation, network drivers
crashing, and so on.
Some of these are auto recoverable by Kubernetes,
specifically the first two, where you run out
of memory or resource.
Signals are provided by Kubernetes
to the cluster master, and the rescheduler controller
on the master is supposed to observe
these things happening and say, oh, this node is overloaded.
Let me just take away some pods from it
and scatter them elsewhere.
So they're self-healing for the most part.
As long as you have enough resources in your clusters,
then there's nothing you need to do.
But if your Kubelet starts crashing on you,
if your node becomes deadlocked
because you happened to trigger a kernel bug, what then?
This is where it starts getting a little more interesting.
Second thing to remember about node health
is that it is the master's evaluation of your node's
health that is important.
Because it is the cluster master that determines what pods
are scheduled where.
So if your master thinks that your node is unhealthy,
it may not schedule anything on it.
Or vice versa, if the node is giving a false signal
that it is healthy when it is not,
we may try to schedule work onto an unhealthy node.
So that's something to keep in mind.
It's kind of important, is that the self-evaluation
of the health of the node is less important here.
It is what the master thinks of your node's health
at that time that is important.
And, of course, repairs like upgrades
are typically limited to recreations of a node.
Oops, wrong button.
So this comes to a second [INAUDIBLE] container
on Google Container Engine to help you manage your cluster.
It is node auto-repair.
It is the semantic equivalent of the master repair
that we already do for your cluster today.
But this-- this is where we are observing and monitoring your nodes.
We are taking into account the master's view
of your node health state.
We are also evaluating other signals around it.
We take a look at the signals from the managed instance
group backing your node pool.
We also sometimes take a look at the signals
provided by the node itself.
With those signals, the sum of it all, we come to an evaluation
of whether your node is healthy or not.
Too many unhealthy signals will trigger a repair--
in this case, a recreation of the node, in
an attempt to get it back into a good state
and have it rejoin the cluster.
We also try to make sure that we rate limit this so that we're
not doing this too often.
Now, why rate limit?
It's because sometimes the node going unhealthy
may be triggered by your workload.
And if we start triggering this way too often,
we may be in a situation where the node never
is alive long enough to do any useful work.
So we have to do some rate limiting as well.
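The combination of "too many unhealthy signals triggers a repair" and "rate limit the repairs" can be sketched as follows. The talk does not give thresholds or intervals, so the class name, the threshold of three signals, and the one-hour minimum interval are all assumptions chosen purely for illustration.

```python
class RepairPolicy:
    """Decide whether to recreate a node, with a per-node rate limit."""

    def __init__(self, unhealthy_threshold=3, min_seconds_between_repairs=3600):
        # Both defaults are illustrative assumptions, not real service values.
        self.unhealthy_threshold = unhealthy_threshold
        self.min_seconds_between_repairs = min_seconds_between_repairs
        self._last_repair = {}  # node name -> time of last repair, in seconds

    def should_repair(self, node, unhealthy_signals, now):
        """Return True if the node should be recreated at time `now`."""
        if unhealthy_signals < self.unhealthy_threshold:
            return False
        last = self._last_repair.get(node)
        if last is not None and now - last < self.min_seconds_between_repairs:
            # Repaired too recently: give the node time to do useful work.
            return False
        self._last_repair[node] = now
        return True


policy = RepairPolicy()
print(policy.should_repair("node-1", unhealthy_signals=3, now=0))     # True
print(policy.should_repair("node-1", unhealthy_signals=3, now=600))   # False
print(policy.should_repair("node-1", unhealthy_signals=3, now=3600))  # True
```

Without the rate limit, a workload that keeps breaking its node would cause the node to be recreated in a tight loop.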
Currently in beta-- as a matter of fact,
I think it's rolling out as we speak,
and so fairly soon you should be able to have
this capability show up in the UI and in the CLI as well.
How do I enable it?
Just like auto-upgrade, the flag is
--enable-autorepair.
You can do this at cluster creation
time, at node pool creation time.
And, as a matter of fact, I would highly
recommend that you enable both of them--
auto-upgrade and auto-repair, to make your life a little easier
when you run Kubernetes on a Google Container Engine.
One more thing before I go into the next slide,
which I've [INAUDIBLE] here.
You can also enable and disable this at any given time.
It is just like for auto upgrades.
You do a gcloud beta container node-pools update,
and you can enable or disable auto-repair on your cluster--
on your-- sorry-- on your node pool
at any time that you want.
So let's now peek behind the scenes
and see a little bit of what Google Container
Engine is doing on your behalf.
And kind of give you an idea if you wanted to do this yourself,
what kind of things you have to do.
So I mentioned that it is the master's view
of your node health that matters.
So how does a master know what's going on in a node?
The Kubelet publishes to the master periodically
a bunch of information.
And one of them is its evaluation
of its node health plus what's going on in its internals.
All this information is available from the node object
from your Kubernetes master.
So from the command line, you yourself
can get the same data by calling kubectl get nodes.
Of course, in this output, it gives
a little more formatted, shortened version.
But if you see the JSON or YAML output,
you can actually get at the raw data.
And in a little bit, in the demo,
I'll kind of show you how I can look for this.
So you're listing your nodes, and you've determined
that one of them has gone unhealthy, not ready.
So what are the steps now for you
to try to get back to a good state?
Well, the first thing you should do is cordon off your nodes.
What this thing does is tell the scheduler,
hey, regardless of what's going on in this particular node,
don't schedule any further work on it.
Next thing we do, we attempt to do a drain,
try to do a graceful drain of all the pods,
make sure they all go away some place
so that the node becomes empty.
This may or may not succeed depending
upon the state of Kubelet and other things on your node.
If your node had a hard deadlock because of a kernel issue,
this may not work.
But at least we try.
The next thing is we try to list the managed instance group backing
your node pool.
And from that managed instance group,
we have to identify the specific node
and then try to tell the managed instance group, hey, please
recreate that node.
Because we have set up--
on Google Container Engine we have set up your cluster
using instance group templates, a simple recreation
of it using the existing template
gets a node back into your cluster.
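The manual repair sequence above (cordon, drain, find the managed instance group, recreate the instance) can be written out as an ordered list of commands. This sketch only builds the command lines and does not run them. The node, group, and zone names are placeholders, and the exact flags are an assumption about how one might perform these steps by hand, not what Container Engine runs internally.

```python
def manual_repair_commands(node, instance_group, zone):
    """Return the commands for a manual node repair, in the order described."""
    return [
        # 1. Stop the scheduler from placing new pods on the node.
        ["kubectl", "cordon", node],
        # 2. Try to evict the pods gracefully; this may fail on a hung node.
        ["kubectl", "drain", node, "--ignore-daemonsets"],
        # 3. List managed instance groups to find the one backing the node pool.
        ["gcloud", "compute", "instance-groups", "managed", "list"],
        # 4. Ask that group to recreate the specific instance from its template.
        ["gcloud", "compute", "instance-groups", "managed", "recreate-instances",
         instance_group, "--instances=" + node, "--zone=" + zone],
    ]


# Placeholder names, for illustration only.
for command in manual_repair_commands("example-node-1", "example-group", "us-central1-b"):
    print(" ".join(command))
```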
Operations log-- [CHUCKLING] as we
start getting more and more capabilities on Google
Container Engine to help you manage your cluster more
easily, a question that keeps coming up is, like, hey,
how do I find out when you guys have done something
on my cluster?
So the first thing that we have done right now
is that we are augmenting the operations log for Container
Engine to contain more information about when
we do automated operations on your clusters on your behalf.
So before we had repair clusters and upgrade masters available,
and we're adding auto-repair nodes and auto-upgrade nodes
as additional operation log types
so that you can find out whenever we have done
these things on your behalf.
How do you get to the operations log?
Well, gcloud container operations list
will give you an ordered list of all the operations that
have been done on your cluster.
Actually, I think it's beyond just
your cluster-- it's in your project across your clusters.
And then if you do a container operations describe
on an operation, it gives a little more detail
as to what we did and what the status was.
Were we successful?
Did we fail?
So on and on.
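Once the operations list is available as structured data, picking out the four operation types named above is a simple filter. In this sketch the identifier spellings, the record fields, and the sample records are assumptions made for illustration; check them against real output before relying on them.

```python
# Spellings assumed from the spoken names in the talk.
TALK_OPERATION_TYPES = {
    "REPAIR_CLUSTER",
    "UPGRADE_MASTER",
    "AUTO_REPAIR_NODES",
    "AUTO_UPGRADE_NODES",
}


def operations_of_interest(operations):
    """Keep only operations whose type is one of the four named in the talk."""
    return [op for op in operations
            if op.get("operationType") in TALK_OPERATION_TYPES]


# Made-up sample records, shaped like a parsed JSON operations listing.
sample = [
    {"name": "operation-1", "operationType": "CREATE_CLUSTER", "status": "DONE"},
    {"name": "operation-2", "operationType": "UPGRADE_MASTER", "status": "DONE"},
    {"name": "operation-3", "operationType": "AUTO_UPGRADE_NODES", "status": "RUNNING"},
]

for op in operations_of_interest(sample):
    print(op["name"], op["operationType"], op["status"])
```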
Node pools and resource management--
another favorite topic of mine.
How many people here have heard about node pools
or have used node pools?
So what are node pools?
Node pools-- the shortest definition is just a collection
of nodes managed together.
From a Google Container Engine perspective,
what it means is that you can have multiple node
pools in your cluster, each defining
a particular configuration.
And when I say configuration, I mean machine type,
machine resources, and Kubernetes versions,
scopes, and on and on.
Within a node pool, all the machines
have to have the same exact configuration.
They have to be the same exact machine type,
they have to have the same exact machine--
They must have the same exact machine resources.
So if I ask for an n1-standard-4, with, let's say,
two local SSDs, every single machine in that node pool
will have the same configuration.
But, of course, I can have multiple of these in a cluster,
so this gives me an additional level
of functionality-- flexibility, as well, where
I can have different machine types and different--
and mix and match these to meet my workload
and my resource needs more carefully.
And this is also a foundational unit for much of the management stuff
that I talked about. So when, before, I was talking
about cluster creation, enabling it
at cluster creation or node pool creation,
when you create a cluster on Google Container Engine, what
is actually happening is that we're
creating the master and so forth,
and we are also creating a default node
pool for your cluster.
So all the options available for creating a node pool
are available on your cluster creation as well.
So let's take a look at some things
that you can do with node pools that might be kind of interesting.
So can we switch to the demo machine, please?
So you go gcloud container clusters list.
And here I have my cluster that I created for this demo.
It is currently running on master 1.5.3, and oh, wow--
my master was upgraded since I ran this demo,
so my nodes, right now, are version 1.5.2.
So now I have a notification that a node is--
upgrade is available.
It also tells me that my current cluster is running
and, of course, my master IP address.
Let's take a peek underneath, then,
at my node pools for this particular cluster.
Of course, pedantically-- has to be plural.
I have three node pools configured for this cluster.
I have a main pool comprised of [INAUDIBLE].
I have something that I call a pre-pool of n1-standard-2s,
and something that I called a lowmem pool, which
is running the custom machine.
A custom-2-2048, which signals to me
that this is a two-core machine with only two gigs of RAM.
Let's then describe the main pool--
oops, let me pipe that.
n1 [INAUDIBLE] 4, authentication scopes defined.
The number of initial counts enabled.
And, of course, under management,
I can see that I have enabled both auto repair and auto
upgrade for this particular pool.
OK, I think-- interesting on this one.
But if I go to my pre-pool, among the things that you'll
notice on this particular listing
is that this thing is comprised of preemptible VMs.
Preemptible VMs, of course, are options provided to you
from GCP, where the costs for these are lower.
But in return, the VM may be preempted out from underneath you
with a moment's notice.
But if you have workloads that are preemption-friendly or can
be preempted easily, this is an easy way for you
to instantiate a couple node pools in your cluster to have
this and leverage
the lower cost that they provide.
And then, of course, the final one, which is my lowmem pool.
It is just like my main pool, except the difference here
is that I'm running custom-2-2048 machines.
So, like I said, very low memory,
high-CPU kind of workloads would benefit
from this kind of node pool.
So previously, I have already pre-deployed a number of pods
on this particular cluster.
So let's take a look at all of them,
make sure they're all running.
So I have deployments that are very high CPU.
Some deployments that are just regular, main workloads.
And I have something that I'm calling
preemptible workloads, workloads
that are eminently preemptible.
With a little bit of trickery--
and this is where I will cut and paste the command,
because I can never remember it from memory.
I call kubectl get pods and
provide a JSONPath that says,
hey, fish out these little fields from your particular listing
and then combine them to write this output.
So in this view, you can see that all my pods are listed
there like before, but now I'm also listing
which node they are running on.
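The same pod-to-node view can be produced from the JSON form of the pod listing instead of a JSONPath expression. This is a minimal sketch: the function name is hypothetical and the sample listing is made up, but metadata.name and spec.nodeName are the standard fields on a pod object.

```python
import json


def pods_with_nodes(pod_list_json):
    """Return sorted (pod name, node name) pairs from a JSON pod listing."""
    data = json.loads(pod_list_json)
    return sorted(
        # nodeName is absent while a pod is still unscheduled.
        (item["metadata"]["name"], item["spec"].get("nodeName", ""))
        for item in data["items"]
    )


# Made-up listing, shaped like the JSON output of a pod list.
sample = json.dumps({
    "items": [
        {"metadata": {"name": "main-abc"}, "spec": {"nodeName": "node-main-1"}},
        {"metadata": {"name": "highcpu-xyz"}, "spec": {"nodeName": "node-pre-1"}},
        {"metadata": {"name": "pending-123"}, "spec": {}},
    ]
})

for pod, node in pods_with_nodes(sample):
    print(pod, node or "<unscheduled>")
```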
And you can see they're scattered all over the place.
My high-CPU workloads are running on a variety of machines.
My main machines-- I have main workloads running
on a variety of other machines.
But more importantly-- and maybe not a good idea--
I have some workloads that are running on my preemptible VMs.
Should be OK, maybe, but certainly not ideal.
Because maybe they're not workloads
that you want to run on a preemptible VM.
So, how do we then take care of that?
Let's take a look, then, in a node
and see what I can find out.
So if I list the raw JSON object for a node,
you will notice a couple of things here among the metadata.
One is that there's a whole section for labels.
Some of these labels are applied on each node
by the Kubernetes runtime, and they are identified easily
by the kubernetes.io prefix.
So things like machine architecture, amd64.
Down here-- host name, the host name of the machine.
You'll also notice a few of these
that are actually set by Google Container Engine itself.
They are prefixed with cloud.google.com.
In this case by default, every single node
that we create on Google Container Engine
has, automatically, as part of it,
the node pool name associated with it.
And, of course, the other thing there is this other label.
On Google Container Engine, when you create a node pool,
you are allowed to provide to it a comma-delimited list
of key value pairs that become part of your node labels.
So you can, then, use these in your pod spec in Kubernetes
to ensure that those workloads, those pods,
end up on a node that has this label on it.
One more thing I want to show along here.
This is another node.
In this case, this happens to be a VM that is a preemptible VM.
When you create a preemptible VM--
a node pool, sorry-- in Google Container Engine,
we will automatically apply this label as well,
so you don't have to do it.
You can certainly add more, like I had done below,
where I've added the pre label as well, where you can actually
add the Google Cloud Comp.
You can use this label as part of a node selector on your pod
spec as well.
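How a node selector picks nodes by label can be shown with a few lines of Python. This is a sketch of the matching rule only (every key in the selector must be present on the node with the same value), not the scheduler's implementation. The node names and the custom "pool" label are hypothetical; the cloud.google.com label keys are the ones Container Engine is understood to set for the node pool name and for preemptible VMs.

```python
def selector_matches(node_labels, node_selector):
    """True if every key/value pair in the selector is present on the node."""
    return all(node_labels.get(key) == value
               for key, value in node_selector.items())


# Hypothetical nodes and labels, for illustration only.
nodes = {
    "node-main-1": {
        "kubernetes.io/hostname": "node-main-1",
        "cloud.google.com/gke-nodepool": "main-pool",
        "pool": "main",  # custom label supplied at node pool creation
    },
    "node-pre-1": {
        "kubernetes.io/hostname": "node-pre-1",
        "cloud.google.com/gke-nodepool": "pre-pool",
        "cloud.google.com/gke-preemptible": "true",
        "pool": "pre",
    },
}

# Fragment of a pod spec, written as a Python dict rather than YAML.
pod_spec = {"nodeSelector": {"cloud.google.com/gke-preemptible": "true"}}

eligible = [name for name, labels in nodes.items()
            if selector_matches(labels, pod_spec["nodeSelector"])]
print(eligible)  # ['node-pre-1']
```

Selecting on the automatically applied label rather than a custom one means the pod spec keeps working for any preemptible node pool, whatever it is named.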
Oh, before I go off--
I mentioned earlier about node condition.
If you want to explore a little bit more
about doing the node health evaluation yourself,
every single node contains this array of node conditions
under the conditions list.
And there are a variety of these.
Some of them talk about [INAUDIBLE]
health, disk, memory pressure.
And, of course, the bottom one is whether the Kubelet is ready.
This is part of the signal that is pushed from the node
to the masters periodically to let the master know
what the status of the node is.
And this is the signal that both the master and Google Container
Engine use-- well, one of the signals--
to determine whether the node is healthy or not.
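For anyone who wants to evaluate node health from the conditions array themselves, here is a minimal sketch in Python. The helper names and the sample node are made up for illustration; the condition types shown (Ready, MemoryPressure, DiskPressure, OutOfDisk) are standard Kubernetes node conditions of that era, and the status values are the strings "True", "False", or "Unknown".

```python
def node_is_ready(node):
    """True only if the node reports a Ready condition with status 'True'."""
    for condition in node.get("status", {}).get("conditions", []):
        if condition.get("type") == "Ready":
            return condition.get("status") == "True"
    return False  # no Ready condition reported at all


def active_problems(node):
    """List the non-Ready condition types that are currently asserted."""
    return [c["type"]
            for c in node.get("status", {}).get("conditions", [])
            if c.get("type") != "Ready" and c.get("status") == "True"]


# Made-up node object, trimmed to the fields used here.
sample_node = {
    "metadata": {"name": "node-main-1"},
    "status": {
        "conditions": [
            {"type": "OutOfDisk", "status": "False"},
            {"type": "MemoryPressure", "status": "True"},
            {"type": "DiskPressure", "status": "False"},
            {"type": "Ready", "status": "True"},
        ]
    },
}

print(node_is_ready(sample_node))    # True
print(active_problems(sample_node))  # ['MemoryPressure']
```

A status of "Unknown" on the Ready condition is treated as not ready here, which matches the idea that it is the master's view of the node that counts.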
So back to this demo.
I have, then, deployment YAML for my high-CPU workload.
Of course, I have commented out, for demo purposes,
the node selector.
So let's go ahead and delete the comment tags
and make sure that the node selector's there.
I'm gonna go to my other workloads as well.
And remove the node selector comments.
And notice that on this one, at least, while on the other ones
I've used my custom labels that I've set up on each node pool,
on this one, I just decided to use
the one that Google Container Engine sets
for preemptible VMs.
Might be kind of nice, because at this point,
I'm not relying upon a custom label that I've set
or the pool name or anything like that.
I'm just using an automatically set label that Google Container
Engine has put on it.
So it's available across any VM that is preemptible.
Let me go ahead and replace the deployment on Kubernetes.
kubectl get pods.
Bunch terminating, bunch running,
bunch container creating.
We'll let this run for a little bit,
let everything kind of settle down again.
Wow, taking a little bit longer.
OK, things seem to be settling in, settling down.
Things are getting scheduled.
Ah, everything is running again.
So if we go back to that somewhat complicated kubectl
command that I had before, what you'll find here now
is all of my high-CPU deployment--
it is now running on my low-memory pool.
My main workload is on my main pool.
But more importantly than anything else, my
preemptible-friendly workload is now all scheduled
and running on preemptible VMs.
So with this, I was able to, shall we say,
make sure that my high-CPU workloads end up on machines
that have high CPU and low amounts of memory--
saved me some money, ensured that the proper VMs
are assigned for those.
My main workload can be whatever.
My generic set of workloads can go on the main pool.
But more importantly, my preemptible
workloads that are low priority, low cost are all set up
and now are all running on my preemptible VMs,
exactly as I desired.
Couple more things to keep in mind--
there are more node selector options
becoming available to you as part of the Kubernetes 1.6 release.
So, for example, node selector right now is an exact match.
One of the things that is coming up
that might be kind of interesting, especially
in this kind of scenario, is the [INAUDIBLE].
You may have workloads where you don't really
care where they run, you just don't want them
to run on this type of VM--
a preemptible VM, for example.
So that's one of the options that will come online--
available in the 1.6 timeframe, I
believe, where you can just state that in your pod spec
and will be available when you deploy them.
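The name of the upcoming option is inaudible in the recording, so the sketch below does not claim to be it. It only illustrates the idea of "anywhere except this type of VM" using the match-expression operators that Kubernetes node affinity supports (In, NotIn, Exists, DoesNotExist). The label key is assumed to be the one Container Engine sets on preemptible nodes, and the node names are hypothetical.

```python
def matches(expr, node_labels):
    """Evaluate one match expression against a node's labels."""
    key, op = expr["key"], expr["operator"]
    values = expr.get("values", [])
    if op == "In":
        return node_labels.get(key) in values
    if op == "NotIn":
        # A node that lacks the label entirely also satisfies NotIn.
        return node_labels.get(key) not in values
    if op == "Exists":
        return key in node_labels
    if op == "DoesNotExist":
        return key not in node_labels
    raise ValueError(f"unsupported operator: {op}")


def eligible_nodes(expressions, nodes):
    """Return the names of nodes that satisfy every expression."""
    return [name for name, labels in nodes.items()
            if all(matches(e, labels) for e in expressions)]


# "Run anywhere except on preemptible VMs."
AVOID_PREEMPTIBLE = [{
    "key": "cloud.google.com/gke-preemptible",  # assumed label key
    "operator": "NotIn",
    "values": ["true"],
}]

if __name__ == "__main__":
    nodes = {  # hypothetical nodes
        "main-1": {"pool": "main"},
        "highcpu-1": {"pool": "highcpu"},
        "preempt-1": {"cloud.google.com/gke-preemptible": "true"},
    }
    print(eligible_nodes(AVOID_PREEMPTIBLE, nodes))
```

The contrast with a plain node selector is that a selector can only say "must have this label with this value", while an expression like the one above can say "must not".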
So can you go back to the slides, please?
So a couple of slides I put in here for you to reference.
Feel free to take a look at them, how to list some node pools,
how to create the node pools that I've-- whoops, sorry--
that I've used for this demo.
Notice the machine types, notice the --preemptible
option, and the node labels,
how to specify these on your node pools.
And, of course, in assigning pods to node pools,
it's the node selector option in your YAML file.
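The slides themselves are not part of this transcript, so as a rough sketch, the helper below assembles a `gcloud container node-pools create` command line with the three things called out: machine type, the `--preemptible` flag, and `--node-labels`. The cluster name, pool names, and label values are hypothetical, and the flags are the ones gcloud documents for node pool creation.

```python
import shlex


def create_node_pool_cmd(pool, cluster, machine_type,
                         labels=None, preemptible=False, num_nodes=3):
    """Assemble the argument list for creating a node pool with gcloud."""
    cmd = ["gcloud", "container", "node-pools", "create", pool,
           "--cluster", cluster,
           "--machine-type", machine_type,
           "--num-nodes", str(num_nodes)]
    if preemptible:
        cmd.append("--preemptible")
    if labels:
        # --node-labels takes comma-separated key=value pairs.
        pairs = ",".join(f"{k}={v}" for k, v in sorted(labels.items()))
        cmd += ["--node-labels", pairs]
    return cmd


if __name__ == "__main__":
    # Hypothetical cluster, pool names, and labels.
    pools = [
        create_node_pool_cmd("highcpu-pool", "demo-cluster", "n1-highcpu-8",
                             labels={"workload": "highcpu"}),
        create_node_pool_cmd("preemptible-pool", "demo-cluster", "n1-standard-2",
                             preemptible=True),
    ]
    for cmd in pools:
        print(" ".join(shlex.quote(arg) for arg in cmd))
    # Listing existing pools uses the matching "list" subcommand.
    print("gcloud container node-pools list --cluster demo-cluster")
```

The argument lists can be passed to `subprocess.run` as-is; printing them first makes it easy to review what would be created.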
OK, some final notes.
Upgrades can be disruptive.
So to help mitigate them, you can use--
the upgrades are per node pool.
So you can use them to segment
the upgrades of your cluster so that you can have either
a canary test kind of pool, where you can try out newer versions.
So if you have an existing workload that is running at version 1.5.1
and you want to try 1.5.3, you can
spin up another node pool at 1.5.3,
schedule some workload on that particular node pool,
make sure everything works fine before you
let the rest of your cluster be upgraded.
Consider having what I call the one-node slack
to ensure proper capacity.
When we do upgrades, we take down one node, do the upgrade,
bring it back up.
Go on to the next node, and so on.
While the one node is down and being upgraded,
it would be nice for you to have enough slack on your cluster
so that whatever pod was running there can find a new home
while the upgrade is occurring.
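The one-node slack idea can be checked with some simple arithmetic. This is a coarse sketch that assumes a single resource (CPU in millicores), uses hypothetical numbers, and ignores bin-packing, so passing it is necessary but not sufficient for every pod to find a home.

```python
def has_one_node_slack(node_capacities, pod_requests):
    """True if total pod requests still fit with the largest node removed.

    node_capacities: allocatable CPU per node, in millicores.
    pod_requests: CPU request per pod, in millicores.
    """
    if not node_capacities:
        return False
    # Worst case: the node being upgraded is the biggest one.
    remaining = sum(node_capacities) - max(node_capacities)
    return sum(pod_requests) <= remaining


if __name__ == "__main__":
    nodes = [4000, 4000, 4000]   # three hypothetical 4-CPU nodes
    pods = [500] * 14            # fourteen pods requesting 0.5 CPU each
    print(has_one_node_slack(nodes, pods))
```

The same check can be repeated for memory or any other resource your pods request.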
Google Container Engine limits you
to having one cluster operation running at any given time.
Now just to differentiate between the Container Engine
operation versus the Kubernetes operation, for example,
these are things like adding or deleting a node pool,
updating a cluster-wide configuration,
doing an upgrade, a repair, any kind of events.
All those are serial on Google Container Engine
for various logistic reasons.
So for the most part, this restriction
is not a big deal except if you're
upgrading a fairly large cluster or node pool.
And you can imagine that because we're
doing the upgrade sequentially, it
may take a little while for it to complete.
In the middle of it, if you have to do something
on your cluster, go ahead and cancel the upgrade.
It will stop at a good point.
You can do whatever operation you
have to do on your clusters, and then
re-issue the upgrade command.
It will just kind of pick up from where it was left,
scanning the nodes that you have,
skipping those that have already been upgraded, and just
keep on going with the rest of them.
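The cancel-and-resume behavior can be pictured with a small model. This is an illustration of the idea only, not how Container Engine implements upgrades; the node names and version strings are hypothetical.

```python
def upgrade_nodes(nodes, target, should_cancel=lambda: False):
    """Upgrade nodes one at a time, skipping any already at the target.

    nodes: dict of node name -> version, updated in place.
    should_cancel: checked before each node, so a cancel stops between nodes.
    Returns the names upgraded during this run.
    """
    upgraded = []
    for name in sorted(nodes):
        if nodes[name] == target:
            continue  # already upgraded in an earlier run
        if should_cancel():
            break     # stop at a clean point
        nodes[name] = target
        upgraded.append(name)
    return upgraded


if __name__ == "__main__":
    nodes = {"node-a": "v1", "node-b": "v1", "node-c": "v1"}

    # First run is cancelled after one node has been upgraded.
    calls = {"n": 0}

    def cancel_after_first():
        calls["n"] += 1
        return calls["n"] > 1

    print(upgrade_nodes(nodes, "v2", cancel_after_first))
    # Re-issuing the upgrade picks up only the nodes that are left.
    print(upgrade_nodes(nodes, "v2"))
```

Since each run decides what to do from the nodes' current versions rather than from a saved position, re-issuing the command is safe no matter where the previous run stopped.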
And, of course, we're truly trying
to make progress and try to minimize some disruptions.
And we have more capabilities that we're
going to be adding in the future for upgrades and repairs.
Some of the best practices that we have found--
pod disruptions are a reality.
Try to prepare your workload for them.
There are a number of capabilities
being added to Kubernetes that are gonna help you with that.
So please do take advantage of them.
And we'll keep integrating them into Google Container Engine.
Do use deployments or replica sets for your pods.
Make sure that you have sufficient redundancy.
The second point is actually kind of important.
In the cloud-native world, please
do not write set-once-and-forget configuration scripts.
The mindset you need to have is reconciler.
You need to write reconcilers that read the current state.
If it's not the desired state that you want,
do some action to try to get there.
And then once it's completed, go to sleep and then
start over from step 1 again.
This way, whether it's a repair event, an upgrade event,
or some other thing that's happening outside
of the container, if you have some configuration
that you absolutely need, you know
that at some point, if it goes out of alignment for whatever
reason, it is a temporary thing that hopefully
should autocorrect itself.
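The read, compare, act, sleep cycle described here can be sketched as a small loop. The function and parameter names are hypothetical, and `max_cycles` exists only so the example terminates; a real reconciler would run indefinitely.

```python
import time


def reconcile(read_current, desired, apply_change,
              interval_s=30.0, max_cycles=None):
    """Repeatedly drive the current state toward the desired state.

    read_current: callable returning the current state.
    apply_change: callable(current, desired) that tries to close the gap.
    """
    cycles = 0
    while max_cycles is None or cycles < max_cycles:
        current = read_current()             # step 1: read the current state
        if current != desired:               # step 2: compare with desired
            apply_change(current, desired)   # step 3: act to get there
        time.sleep(interval_s)               # step 4: sleep, then start over
        cycles += 1


if __name__ == "__main__":
    # Hypothetical node setting that a repair or upgrade has reset.
    state = {"max_open_files": 1024}
    desired = {"max_open_files": 65536}

    def fix(current, wanted):
        print(f"drift detected: {current} -> {wanted}")
        state.update(wanted)

    reconcile(lambda: dict(state), desired, fix, interval_s=0, max_cycles=3)
    print(state)
```

Unlike a one-shot script, the loop does not care why the state drifted; an upgrade, a repair, or a manual change all look the same on the next pass.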
And, of course, please use node pools.
Organize your resources more efficiently in your cluster.
They allow you to have [INAUDIBLE] configurations.
Leverage different types of machines, different types
of resources, and so on.
A quick recap-- change is constant.
Please be prepared.
Do enable auto-upgrade, and then try auto-repair.
Operations logs-- please look into them
to see if-- whenever we do something,
we'll try to make sure that they all end up in your operations
log for your clusters so nothing is hidden.
Use your node pools to organize your resources.
And, of course, think reconciler, not set and forget.
For those that may not be familiar,
here are a couple links for the Container Engine website,
the Google Groups where the release notes come out
whenever we do a push on Container Engine.
A lot of members of our engineering team
are on Stack Overflow, so if you have a question that
is container-engine specific, please use
the tag, the Google Container Engine tag in Stack Overflow.
And, of course, as always, the Kubernetes
GitHub repository is highly technical, kind of complicated.
But always looking for contributors for the Kubernetes
And while you're here at GCP Next,
I highly recommend going to other container talks
if you want to find out more about us.
IO205, I think is this afternoon and it's
a good intro to other capabilities and features
that are being added to Google Container Engine.
And, of course, I also want to highlight IO307,
even though the title says ABC is a--
Google Container Engine tips and best practices
is actually a talk mostly about monitoring with your Google
Container Engine cluster.
And with that, I want to say thank you for coming.