Lecture 31: Markov Chains | Statistics 110


So, we're gonna start on our very last topic, which is Markov chains.

And as you know, there's one more homework due on the last day of class,

which is next Friday.

Almost all the sections are still meeting today and tomorrow, because there's really way too much material left for just one more section.

So this week we'll focus on law of large numbers, central limit theorem,

multivariate normal, which we have been doing.

Then in the last section next week, the TFs will do examples of Markov chains, which is our last topic.

Okay, so I'll just start from the beginning of Markov chains.

There's also a handout I wrote on Markov chains which you can read at

some point if you haven't seen it already, just posted under the notes.

One other thing, not to scare anyone before Thanksgiving, but the final is on December 15th, so it's not really too soon to start thinking about it.

I did post a final review handout, which is 66 pages,

so there's a lot of stuff you can study.

The 66 pages consist of 23 pages of review, and then the rest consists of literally all the past 110 finals that have been given at Harvard since 2006.

I've been teaching 110 every fall, and I just don't really believe in withholding information, where some people get access to more exams than others cuz they have a friend from whatever.

This is the entire history of 110 finals; there are five full-length practice exams, and I think you should be starting, whenever you can, to practice, practice, practice.

And I posted the solutions to them as well, but

you should try your best to resist looking at the solutions,

except to check your answers or if you're really, really stuck.

So I would highly recommend doing at least two or three of those exams under conditions as close as you can get to exam conditions.

That is, time yourself strictly.

You have four pages of notes double-sided, that kind of thing.

Then after each one, you can go through it.

If you run out of time and you're not done, it's still a good idea to spend some time working on the problems, still without looking at the solutions.

And then finally look at the solutions, study your notes,

study everything to see where you can improve.

So we've covered everything now except for Markov chains, so if you're starting on those practice exams before we've gotten far enough in Markov chains, it's not too much of a problem.

Because in general the problems are not intended to be in order, where the first problem is the first thing we did in the course and so on; it's not really chronological like that.

But it has been true every year that I put Markov chains as the last problem and nowhere else.

So if you haven't gotten far enough on Markov chains yet, you can just skip the last problem of each year, until we get a little farther with Markov chains.

Everything else, we've covered the material, so the sooner you start practicing the better.

So you can assume that the style will be similar to those past exams, and that will give you a good sense of everything.

So Markov chains, what are they?

Well, they were invented, or discovered depending on your philosophical point of view, by Markov, who we've already heard of through Markov's inequality.

He introduced them just over a hundred years ago.

I think that's 1906.

Now I'll tell you a little about Markov's original reason for studying Markov chains, but the number of applications they've found since then has just been unbelievable, and it's gone in a lot of directions that Markov would not have anticipated.

So a Markov chain is an example of a stochastic process.

I'll tell you just very briefly what a stochastic process is.

It's one of the most important examples of stochastic processes.

With stochastic processes: originally we were dealing with one random variable at a time, then sometimes two, and with the central limit theorem and the law of large numbers we were looking at sequences of random variables. So sometimes we have sequences of random variables, and we're studying the sample mean and letting n go to infinity, stuff like that.

But when we're considering this sequence of random variables,

mostly we've been assuming that they're i.i.d.

The central limit theorem and law of large numbers, there are more general versions

of those theorems, but we've been assuming i.i.d which is a very strong assumption.

Independent identically distributed.

So a Markov chain is an example of a stochastic process, which basically just means random variables evolving over time.

So I'll just say examples of stochastic processes.

So, what does that mean?

I mean, it doesn't literally have to be time but

that's the most common interpretation.

So we have this sequence of random variables X0,

X1, X2, blah, blah, blah, blah, blah.

We've been looking mostly at the case where these are i.i.d.

So we can think of the subscript as time, but if they're i.i.d., then it's just like saying it's resetting every time to a fresh independent random variable.

So from a stochastic process point of view, what's more interesting is when it evolves in some way where there's some kind of dependence.

But if we allow these to be completely dependent in some

arbitrarily complicated way, that can be extremely complicated.

You have this and they can all relate to each other in some very complicated way.

So Markov chains are kind of a compromise.

It's basically going one step beyond i.i.d.

So they're gonna become dependent.

But it's in a very special way that I'm gonna tell you about.

That has very nice properties.

So, we're gonna think of

Xn as the state of a system.

So, it can be interpreted very generally.

You have any kind of system, we'll see examples,

that you're imagining some kind of particle or

whatever that's wandering around randomly from state to state.

It's jumping from state to state, and

that's at time n, which we're assuming is discrete right now.

That is, n is a non-negative integer.

So we can consider stochastic processes in continuous time or discrete time, and continuous space or discrete space, so there are basically four possibilities.

That is this index here, we just do X0, X1, X2.

If we want a continuous time, we could look at Xt,

where t is time, which could be continuous.

Then we'd have a collection indexed by a continuous variable.

Here we're just indexing it discretely, so this is the discrete time.

So in STAT 110, we're only gonna look at discrete time processes.

If you wanna go further into stochastic processes, I highly recommend STAT 171,

which is an entire course on stochastic processes.

We're gonna do Markov chains in discrete time, so this is the discrete-time case, and we're also doing discrete space.

Continuous space would mean that each of these Xs is allowed to

take values like let's say any real number or values in some continuous space.

But for our purposes we're gonna assume that each Xj takes values in the same discrete space, and actually we're gonna assume that it's a finite space to make things easier.

So we're gonna be looking at Markov chains where we have finitely many states and

each of these Xs is one of those states.

And we just have this process that's bouncing around randomly from

state to state, and then we wanna describe what's the Markov property.

Okay, so here's the Markov property. What does Markovian mean?

And then I'll tell you a little bit about why did Markov

introduce this and what are some examples.

Markov property is a certain conditional probability statement.

It's a statement about the next step: our interpretation is, think of Xn as the current state.

I mean I don't know how to define the word now but

we have some intuitive conception of now.

Now is now, that's the present, and then after now is the future.

And before now is the past, right?

So a lot of Markov chain stuff you can understand intuitively

if you think about past present and future.

So let's think Xn is right now and we wanna predict the future.

We wanna know what's gonna happen at time n plus 1, one step in the future, and we have Xn is today, Xn plus 1 is tomorrow.

Okay, we want to know: what's the probability that Xn plus 1 equals j? Here j is just some state; we're usually gonna assume that the states are numbered from 1 to capital M just as a convention, but there's no rule saying you have to do that.

We're assuming there's finitely many states.

So we're saying what's the probability that the state of the system tomorrow is

state J, given the entire past history of everything?

So we're given, well what's the state today?

Xn = i. And what was the state yesterday? Yesterday it was at Xn-1, and maybe that's in-1, and Xn-2 is in-2, all the way back to the very beginning. And X0, let's say it's i0.
So this is saying: what's the probability of a certain state tomorrow, given the entire past history of the system, right?

That looks very complicated, right?

It took a lot of time just to write all these things and all these dot dot dots,

and this is a big complicated thing.

In a general stochastic process, maybe you can't simplify this very much.

But this is too complicated.

The Markov structure means that only

the most recent information matters.

We can get rid of everything else.

So this just becomes P(Xn+1 = j given Xn = i); you can get rid of all the rest.

So all these concepts that we've been doing the entire semester come back here, like the difference between independence and conditional independence. We're not saying that the past is independent of the future, but we are saying that if the Markov property holds, then the past and future are conditionally independent given the present.

So that's kind of a handy intuitive way to remember it, so

past and future are conditionally independent given the present.

Okay, that's exactly what this statement says.

Given the present, that is, if we know the current value, Xn,

if we know that, then anything that's further back is just obsolete,

outdated information which we can just get rid of.

So I just crossed out everything further back in time, right?

Only the most recent one matters.

Note the conditioning: if we had Xn+1 given Xn-1, but we were not given Xn, that wouldn't mean we could get rid of Xn-1. It's conditional independence, not independence.

So that's the Markov assumption.

Whether that assumption is good or not for

a real example completely depends on the problem, okay?

But that's the Markov property.

And in particular for our purposes,

let's actually assume that this doesn't depend on n.

So we'll usually call this thing qij, or you can call it whatever you want.

Let's call it a transition probability.

So, sometimes this is called a homogeneous Markov chain.

But a lot of times people will not bother to say the word homogeneous, so you have

to be a little careful from the context, is it assumed to be homogeneous or not?

Homogeneous means that this doesn't depend on time, right?

So this is saying we're looking at the case where, if the system is at state i, the probability that at the next time point it's at state j is always qij. If that doesn't change with time, then we would say that it's homogeneous, or time-homogeneous.

And for our purposes we're just gonna be studying homogeneous Markov chains, so these qij's don't depend on n, cuz that's a nicer case to look at, okay?

So just to have a mental picture in mind, I just made up a simple little example.

It really helps to have pictures in mind, so here's one Markov chain.

You can make up your own, but here's one that I just made up.

So there are four states we can represent as ovals.

Let's number the states 1 through 4, okay?

And then to specify the Markov chain we just need to specify transition

probabilities, that is what's the probability going from state 1 to state 2,

all the different states.

So it may not be possible to go from any state to any state in one step,

in fact usually it's not possible.

So for example, that's just an example I made up,

let's say the system is at state 1.

It has probability one third of staying there, and it has probability two thirds of going to state 2.

I'm just drawing arrows and I'm putting probabilities on the arrows that

reflect the probabilities of those different transitions.

From state 2, it's equally likely to go back to state 1 or to go to state 3; I guess I'm putting the probabilities above the arrows. So from state 2 it either goes here or here, equally likely. From state 4, it either goes to state 3 with probability one quarter, or it goes back to state 1 with probability one half, or it stays there with probability one quarter. And then, just to make life a little bit easier, from state 3 it always goes to state 4, so I'll just put a 1 there.

You make your own example.

Anyway, it just helps to have a picture like this in mind,

and you can have a Markov chain that has infinitely many states, though then you can't actually draw all the states.

But we're only gonna be looking at the case where there are finitely many states.

So you can always, for our purposes,

you can always imagine a picture like this where we have some number of states.

And then you have a particle randomly bouncing around between states,

following the arrows, and the arrows may have different probabilities.

Okay, that's the mental picture.
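To make that mental picture concrete, here's a minimal simulation sketch in Python (the function name is my own; the probabilities transcribe the arrows of the four-state example, which also reappear later as the transition matrix):

```python
import random

# Arrows of the four-state picture: state -> (reachable states, probabilities).
# These numbers transcribe the example chain described above.
transitions = {
    1: ([1, 2], [1/3, 2/3]),
    2: ([1, 3], [1/2, 1/2]),
    3: ([4], [1.0]),
    4: ([1, 3, 4], [1/2, 1/4, 1/4]),
}

def run_chain(start, n_steps):
    """Simulate a particle bouncing randomly from state to state."""
    path = [start]
    for _ in range(n_steps):
        states, probs = transitions[path[-1]]
        path.append(random.choices(states, weights=probs)[0])
    return path

path = run_chain(start=1, n_steps=10)
```

Each call produces one random wandering through the states, following the arrows.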

And then to explain, in that picture, what does this property say?

It says that you are wandering around from state to state, like that.

And then what this says is, if I wanna know the future.

Or predict the future.

All I care about is the current state.

I don't care about the whole history of all of the long wanderings throughout time

to get there.

It doesn't matter how you got there.

It only matters where you are.

That's what the Markov property says intuitively.

So we have these transition probabilities q ij,

and we can write that as a matrix, that's called the transition matrix.

So we can always just draw a picture like this, but sometimes it's easier to

work with a matrix, rather than having to draw a picture every time.

So we can encode the same information as this picture just written in matrix form.

The transition matrix, we usually call it Q in this course, but you don't have to call it that.

That's just the matrix of all the qijs, all the transition probabilities.

So for this example it's gonna be a 4 by 4 matrix, and

then I'm just gonna write, what's the probability of going from 1 to 1?

That's one-third, because there's this loop there that has probability one-third.

Probability of going from 1 to 2 is two-thirds, from 1 to 3 or

1 to 4 is 0, we can't go there in one step.

We can go there in more than one step.

Then from state 2, it either goes back to state 1 or

it goes to state 3 with equal probabilities.

So this row will be 1/2, 0, 1/2, 0.

From state 3 it always goes to state 4, so it's gonna be 0 0 0 1.

I'm just writing down this q ijs as a matrix.

And then from state 4 it either goes back to state 1, that's probability 1/2, or it goes to state 3, that's 1/4, or it stays at state 4, that's 1/4.

So that would be the transition matrix.

Notice that every row sums to one.

And, again, it's just a convention, some people would use the transpose

of this matrix instead, and then the columns sum to one, just a convention.

We're going to take the convention that we write it this way, the i,

j entry is the probability of jumping from i to j in one step.

So therefore each row sums to 1.

Columns may or may not sum to 1, but the rows sum to 1.

We can also go the other way around and ask: what's a valid Markov chain transition matrix?

Well, we can just write down a blank square matrix, fill in

any numbers you want as long as they're non-negative and each row sums up to one.

And then you can easily see how you would then convert that to a picture like this.

Just draw arrows with probabilities on them.

So, that's what a transition matrix is.
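The same information, written in code rather than a picture; a quick sketch using 0-based indexing, so Q[0][1] is the probability of going from state 1 to state 2:

```python
# Transition matrix of the four-state example: Q[i][j] is the probability
# of jumping from state i+1 to state j+1 in one step (0-based rows/columns).
Q = [
    [1/3, 2/3, 0,   0  ],
    [1/2, 0,   1/2, 0  ],
    [0,   0,   0,   1  ],
    [1/2, 0,   1/4, 1/4],
]

# A valid transition matrix: entries non-negative, every row sums to 1.
assert all(p >= 0 for row in Q for p in row)
assert all(abs(sum(row) - 1) < 1e-12 for row in Q)
```

Columns may or may not sum to 1; only the row constraint is required under the convention here.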

By the way, there are two most important concepts we're gonna need other than the basic definition of the Markov property; one is the transition matrix.

Now you know what a transition matrix is, the other one's stationary distribution,

which we'll talk about later.

So, that's the Markov property.

So, to tell you a little bit about the history: first, how it's used now. It's being used in two sort of different ways.

One is as a model that might be used in various problems in the social sciences, physical sciences, and biology, where you actually, literally believe that you have a Markov chain.

Or using it as an approximation for some system evolving over time.

That's very, very useful.

But that's sort of limited, in the sense that it seems like a pretty strong conditional independence assumption.

Where you can forget the entire past as long as you have the present.

So there have been thousands of pages of debates about whether if you take

the stock price over time, is that Markovian or not?

Or is the weather Markov, things like that.

So those become empirical questions, which you can study.

One thing to point out, though, is that it's not as limiting as it looks, in the sense that this is sometimes called a first-order Markov chain, where it only goes one step back.

You can generalize this easily to the case

where the conditional independence is that it depends on Xn and Xn- 1.

And it turns out that to understand how that kind of thing works,

the right starting point is to be looking at this anyway.

So you can extend this and

consider, Markov chains where it goes, ten steps back or think things like that.

Anyway, that's still kind of an empirical question: is that useful for a real system or not, or is that too strong an assumption about conditional independence?

But the other way this is used is for Markov chain Monte Carlo, which essentially created a revolution in scientific computing and is being used everywhere.

And I like to challenge people to find an example of some major field of study where Markov chain Monte Carlo has never been applied, and no one has successfully met that challenge yet.

So say you're interested in French poetry; you can find Markov chains being applied to French poetry.

But why is that?

Do you really want to think of French poetry as actually following this model?

No, but, the idea of Markov Chain Monte Carlo is that you don't have

to worry about whether the actual process you're observing follows a Markov Chain.

Markov Chain Monte Carlo means that you synthetically construct your own

Markov Chain.

You then can't argue about whether it's Markov or not, because you built your own chain.

Why would you want to build your own chain?

Well, the idea, which in my opinion is one of the most useful and beautiful ideas of the 20th century, is that you can construct your own Markov chain that will converge to a distribution that you're interested in.

Cuz suppose you're trying to simulate some complicated system and

the computations are too hard to do everything analytically and explicitly.

There are some extremely clever ways to construct a Markov

chain synthetically, that will converge to the thing you're interested in.

And so then you just program your Markov chain on the computer. This is a very recent thing, cuz we need fast computers in order for this to be useful. You program your Markov chain on the computer, run the chain for a long time, and then use those results to study the distribution that you're interested in.

So that is the idea; I'm just gonna write the acronym, MCMC.

We don't really have time to talk more about MCMC than that.

But Markov chain Monte Carlo is just getting more and more popular and widely used every year, as a way of doing extremely complicated simulations and computations that would have been completely impossible, like, 50 years ago.

Or even 30 years ago.

Okay, that's Markov chain Monte Carlo.

So Markov chain Monte Carlo is when you build your own Markov chain, right?

And then the other application is just that you have this natural system that

you choose to model it, using the Markov chain.

So those are two important, but different uses of Markov chains.

Okay, so both of those ideas are very,

very different from what Markov himself originally had in mind.

When he introduced Markov chains, it was actually to help settle a religious

debate over the existence of free will.

Kind of a religious philosophical thing.

So at the time, this was the early 1900s,

the law of large numbers had recently been proven.

Like here, we recently proved the law of large numbers.

Okay, and I told you the story of the athlete who was very

upset about the law of large numbers because he said in the long run,

he'd always be average, and so he couldn't improve.

And so he hated statistics.

Which is kind of like ridiculous.

But that was actually very similar kind of a concern that

people had when the law of large numbers was first proven.

Which was that some of the philosophers were upset that

the law of large numbers might prevent us from having free will.

Because it says that in the long run, you converge to the mean.

Where's the scope for having free will then, right?

I'm saying it will converge to this, right?

Where's free will, okay?

So one of Markov's rivals tried to rescue free will by saying,

well the law of large numbers assumes IID.

And human behavior is not IID, therefore we're okay, right?

Okay, but you can see that that's not a very convincing argument because

we proved that if the random variables are IID, then the law of large numbers holds.

But we didn't show that if it's not IID,

then the law of large numbers doesn't hold and there are more general versions.

So what Markov wanted to do was to find a process that's one step in complexity beyond IID, right?

That's what this is, right?

The IID case would say you can just completely forget everything you're conditioning on.

And he wanted to go one step in complexity beyond

IID where you go one step backward, right?

So go one step forward in thinking by going one step backward in conditioning.

You're going one step beyond IID.

And then he proved a version of the law of large numbers

that applies to Markov chains.

And so then he said, well look you don't actually need IID for

the law of large numbers to hold, okay?

And the chain that he originally looked at, which might be the first Markov chain: he took a Russian novel and empirically looked at consonants and vowels.

And he said, okay.

There are two states: vowel and consonant, and

a vowel is either followed by a vowel or a consonant.

A consonant is either followed by a vowel or a consonant.

And then he kind of empirically computed the probability that a vowel is followed by a vowel, a vowel is followed by a consonant, and so on.

And he was just studying that simple chain.

So just like this picture, except even simpler.

Only two states, and I drew one with four states.

But conceptually, it's the same thing.

You're just bouncing around between states, okay?
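In the spirit of Markov's experiment, here's a rough sketch of empirically estimating the two-state vowel/consonant transition probabilities from a string (the sample text below is made up for illustration, not taken from the novel):

```python
def state(ch):
    """Classify a letter as vowel 'V' or consonant 'C' (treating y as a consonant)."""
    return "V" if ch in "aeiou" else "C"

# Made-up sample text; Markov used a real Russian novel.
text = "".join(ch for ch in "onegin is a novel in verse".lower() if ch.isalpha())

# Count the four kinds of transitions between consecutive letters.
counts = {("V", "V"): 0, ("V", "C"): 0, ("C", "V"): 0, ("C", "C"): 0}
for a, b in zip(text, text[1:]):
    counts[(state(a), state(b))] += 1

# Empirical transition probabilities, e.g. q[("V", "C")] estimates
# P(next letter is a consonant | current letter is a vowel).
totals = {s: counts[(s, "V")] + counts[(s, "C")] for s in "VC"}
q = {pair: counts[pair] / totals[pair[0]] for pair in counts}
```

Each row of the estimated two-state chain automatically sums to 1, just like a transition matrix should.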

So that's the whole idea of a Markov chain.

And if you keep these simple pictures in mind,

it will make things much, much, much easier.

Even more, we could have a Markov chain that has 10 to the 100 states, but you can still keep pictures like this in mind, okay?

So we've defined the transition matrix.

It's just this matrix of transition probabilities.

But I wanna show you how we actually use the transition matrix to get higher-order things. For example, we wanna answer the question: qij is the probability of going from i to j in one step, but what's the probability of going from i to j in two steps, or ten steps?

Things like that.

Those questions can be answered in terms of the transition matrix.

Okay, so let's do that now.

So suppose that at time n, which we're considering as the present, the chain Xn has distribution s, which is a row vector. In other words, we can think of it as a 1 by M matrix.

And all I mean by this is, we're taking the PMF. I'm assuming that there are M states, so here M = 4 in this example.

So we're assuming M states, and we're saying, okay.

The distribution, I just mean the PMF cuz it's discrete, but

since there are only finitely many values, you may as well just list them all out. So this thing is just the PMF listed out as a vector, right?

So it's the probability of being in state one, probability of being in state two,

and so on.

Just list out the probabilities, okay?

So in particular, we know that s is a vector and we're writing it as a row.

And the entries are non-negative and add up to one.

Okay, so let's assume that that's how things stand at time n,

and then we wanna know what happens at time n plus one.

So we wanna know what's the P (Xn + 1 = j).

So we wanna know the PMF, this is the PMF at time n.

It's just we're writing it as a vector because that's convenient.

That's the PMF at time n, now we want the PMF at time n + 1, one step in the future.

Okay, well, to do that it's not hard to guess that what we should do is

condition on the value now, right?

Condition on what we wish that we knew, it would be convenient if we

knew where the thing is now, that would be very useful, right?

So we're just gonna condition: we're gonna sum over all states i of P(Xn + 1 = j given Xn = i) times P(Xn = i), right?

Just the law of total probability.

But these are things we know.

Just by definition, this is qij.

That is the probability of going from i to j, by definition that's qij.

And this thing here is just si, right? Because that's just the probability of being at i, and that's just the ith value in this vector.

I'm gonna write it on the left, because it looks a little bit nicer to me.

So that's the answer as a sum, but it's easier to think of this in terms of a matrix.

This is in fact the jth entry of S times Q.

Remember Q is an m x m matrix, and S is a 1 x m matrix, so

it's valid to multiply these.

The inner numbers match up and

then the dimensions of this are the outer numbers, 1 by M.

So S times Q is a 1 by M matrix.

And you don't need a lot of matrix stuff in this class but

I'm assuming you can at least multiply matrices.

How do you multiply matrices?

Well, you take a row of this and do a dot product with a column of this.

And when you do that, it's exactly the sum, right,

you're just going row dot column and you're adding this up, okay?

So the interpretation of this sum is just that it's the jth entry of this vector times this matrix.

Therefore, S times Q is the distribution, the PMF written as a vector, at time n + 1.


So, that's now playing the role of S.

So, by the same argument,

If we do SQ squared, that's the distribution at time n + 2, two steps into the future, and SQ cubed is the distribution at time n + 3, and so on.

Because it's just the same thing again, right?

Because now S times Q is now playing the role of S, right?

And what this calculation says is: to go one step into the future, multiply on the right by Q.

So if we multiply SQ by Q we get SQ squared; multiply that by Q and you get SQ cubed, and so on, okay?

So what that says is that powers work: if we multiply the initial vector of probabilities by Q to a power, then that gives us the distribution that number of steps in the future.

Of course, this is gonna be pretty tedious to do by hand for most problems. But using a calculator or computer that can multiply matrices, you just take powers of Q; you don't have to think that hard each time, you just use powers of Q and you know how this is gonna evolve.
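As a sketch of that computation with plain Python lists (no matrix library; the helper name is my own), here is the one-step update s to sQ, iterated a few times:

```python
# One step of the chain in distribution: the jth entry of s*Q is
# sum over i of s[i] * Q[i][j], exactly the law-of-total-probability sum.
def row_times_matrix(s, Q):
    m = len(Q)
    return [sum(s[i] * Q[i][j] for i in range(m)) for j in range(m)]

# Transition matrix of the four-state example from earlier.
Q = [
    [1/3, 2/3, 0,   0  ],
    [1/2, 0,   1/2, 0  ],
    [0,   0,   0,   1  ],
    [1/2, 0,   1/4, 1/4],
]

s = [1.0, 0.0, 0.0, 0.0]   # start in state 1 for sure
for _ in range(3):         # advance to s*Q, then s*Q^2, then s*Q^3
    s = row_times_matrix(s, Q)

assert abs(sum(s) - 1) < 1e-12   # still a PMF after each step
```

Because each row of Q sums to 1, multiplying a PMF by Q always yields another PMF.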

Okay, so similarly, by definition Q gives the one-step probabilities: the probability that Xn + 1 = j given Xn = i is, by definition, just qij. Okay, but suppose we wanted the two-step thing, P(Xn + 2 = j given Xn = i), okay.

This is a similar calculation to what we just did.

This is the transition probabilities.

This is saying go one step ahead in time.

And now this says what happens if we only know Xn but

we want to predict two steps ahead?

Well it's not hard to figure what to condition on, right.

The missing link here is Xn + 1 that's what's missing so,

clearly we should condition on Xn + 1.

So let's say we condition on Xn + 1.

So this is the sum over all states k of P(Xn + 2 = j given Xn + 1 = k, Xn = i); I'm just conditioning on Xn + 1, right?

Cuz that's the missing intermediary that we want.

Times the P(Xn + 1 = k given Xn = i).

That's just the law of total probability, the conditional version where everything is given Xn = i.

Let's simplify that into something that's more interpretable.

First of all, the Markov property says that,

if we know the value at time n + 1 and the value at time n, this is obsolete.

That's irrelevant now, cuz of this conditional independence,

we only want the most recent value, all right?

So we can get rid of that.

So, by definition, now we're just saying we're going in one step from k to j.

By definition, that's just the transition probability qkj.

And then this one is also one step, right?

This is saying go from i to k in one step.

And again I'm gonna write this on the other side just cuz it looks nicer: qik qkj.

I just swapped the order of those two terms to make it look nicer.

We've expressed it in terms of one-step things, okay.

And now, something that looks like this, that's of this form,

it has a nice interpretation in terms of matrix multiplication again.

This is just the ij entry, row i, column j, of Q squared.

Because well, why is that true?

Well, just think of how you do Q squared: you do Q times Q. So you take a row of Q and dot product with a column of Q, and you'll get this, okay?

So similarly same idea, that we wanna know what's the probability,

just repeating the same argument.

We wanna know, say, what's the probability that, m steps in the future, Xn + m = j, given Xn = i.

So this is saying: it's at i now, and what's the probability that it'll be at j, m steps in the future? That's just gonna be the ij entry of Q to the m.

So it works out pretty nicely that the powers of the transition matrix actually give us a lot of information.

We don't need a separate study for every different transition and number of steps; just work with powers of the transition matrix. Very, very nice.
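A small code sketch of that fact: build Q to the m by repeated multiplication, and read off m-step transition probabilities as entries (helper names are my own):

```python
# Multiply two matrices given as nested lists.
def mat_mul(A, B):
    rows, inner, cols = len(A), len(B), len(B[0])
    return [[sum(A[i][t] * B[t][j] for t in range(inner)) for j in range(cols)]
            for i in range(rows)]

# Q to the power m, by repeated multiplication (m >= 1).
def mat_pow(Q, m):
    result = Q
    for _ in range(m - 1):
        result = mat_mul(result, Q)
    return result

# Transition matrix of the four-state example.
Q = [
    [1/3, 2/3, 0,   0  ],
    [1/2, 0,   1/2, 0  ],
    [0,   0,   0,   1  ],
    [1/2, 0,   1/4, 1/4],
]

# The (i, j) entry of Q^2 is the probability of going from i to j in two steps.
Q2 = mat_pow(Q, 2)
```

For example, Q2[0][2] is the chance of going from state 1 to state 3 in two steps; the only such path is 1 to 2 to 3, with probability (2/3)(1/2) = 1/3.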

Okay, so just a couple more quick concepts.

Well one that's not so quick is the notion of a stationary distribution.

And I'm gonna define it now, but we'll talk about it in detail next time.

But I just want to mention it now just cuz it's one of the most important ideas.

The transition matrix and the stationary distribution are the two most important ideas other than the basic definition.

What's a stationary distribution of a chain? It's also called a steady state, or long-run, or equilibrium distribution, that kind of thing.

It's supposed to capture the idea, so

one reason it's important is if we run the chain for a long time,

we wanna know, does it converge to some limiting distribution?

Okay, and under some very mild conditions, the answer is yes,

it will converge to something that we call a stationary distribution, okay?

So that's gonna be very important for describing how the chain behaves.

Especially in the long run, but it's also just useful for other things too.

So we say that S, which is a probability vector,

1 by M again.

So again we're just thinking of that as a PMF written as a row.

And the definition is that S is stationary for

the chain, for the Markov chain that we are considering,

if S times Q, which is, just to make sure this makes sense again,

a 1 by M times an M by M,

so this is gonna be a 1 by M matrix,

if S times Q equals S, okay?

So I'll tell you the intuition for this definition but first of all for

those of you who've seen eigenvalues and eigenvectors before,

this is pretty reminiscent of an eigenvalue eigenvector equation.

Usually you see eigenvalues and eigenvectors written the other way,

like a matrix times the vector but

if you want to write it that way just take the transpose of both sides.

So that's exactly an eigenvalue eigenvector equation, for

those of you who know what that is.

If you don't know what an eigenvalue is, well you should learn but

you don't need to know it for this class, okay?

Now for the intuition, we just showed a few minutes ago,

we just showed that if the chain at time n follows the distribution S,

then this is the distribution one step later in time, right?

So if SQ equals S, it says if it starts out according to S,

one step forward in time, it's still following S.

That's why it's called stationary, right?

Cuz it didn't change.

And we go another step forward in time, the distribution is still S.

And you go one more step, it's still S, right?

So if the chain starts out according to the stationary distribution,

it's gonna have the stationary distribution forever.

It doesn't change, that's why it's called stationary.
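Since SQ = S is exactly a left-eigenvector equation with eigenvalue 1, one way to find a stationary distribution numerically is through an eigendecomposition. A minimal sketch, again with a made-up 2-state chain:

```python
import numpy as np

# Made-up 2-state chain, for illustration only.
Q = np.array([[0.9, 0.1],
              [0.4, 0.6]])

# sQ = s says s is a left eigenvector of Q with eigenvalue 1,
# equivalently a right eigenvector of Q transpose.
eigvals, eigvecs = np.linalg.eig(Q.T)
idx = np.argmin(np.abs(eigvals - 1))  # the eigenvalue closest to 1
s = np.real(eigvecs[:, idx])
s = s / s.sum()                       # normalize so it's a PMF
print(s)  # ≈ [0.8, 0.2]
```

Dividing by the sum both makes s a probability vector and fixes the arbitrary sign that an eigenvector solver may return.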

Okay, well, there are a lot of questions we can ask, though.

I won't quite call this a cliffhanger but

there's some important questions we can ask, right?

One is does a stationary distribution exist?

That is, can we solve this equation?

Now, even if we solve this equation, if we got an answer that had like some negative

numbers and some positive numbers, that's not gonna be useful, right?

So we need to solve this for an S that is nonnegative and adds up to 1.

So does such a solution exist to this equation?

Does it exist?

Secondly, is it unique?

Thirdly, just now, I just kind of said intuitively that this

has something to do with the long run behavior of the chain, right?

But this definition doesn't seem like it has anything to do with any long run or

limits, right?

This is just S times Q equals S.

So does the chain converge to S in some sense?

And we'll have to talk more about what that means and whether it's true.

Okay and then there's a more practical question.

Assuming that it exists and it's this beautiful thing that tells us the long run

behavior, there's still the question of how can we compute it, right?

So how can we compute it?

So those are basic questions about

stationary distributions, okay?

So under some very mild conditions, the answer will be yes

to the first three of these questions.

There are a few technical conditions that we'll get into, but

under those mild technical conditions:

It will exist.

It will be unique.

The chain will converge to the stationary distribution.

So it does capture the long run behavior.

As for this last question though, how to compute it?

In principle, if you had enough time, you could just use a computer,

or you could even do it by hand.

Solve this equation, right?

Even if you haven't done matrices before.

You could write this out as some enormous linear system of equations.

And if a solution exists, in principle you can find it

by eliminating variables one by one, okay?

That might take hundreds of years, all right?

So, can we compute it efficiently without spending years doing tedious calculations?

Or even with a computer, it may get too nasty of a calculation.
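For a small chain, though, handing that linear system to a computer is painless. A minimal sketch (same made-up 2-state matrix) that stacks s(Q − I) = 0 with the constraint that s sums to 1 and solves by least squares:

```python
import numpy as np

# Made-up 2-state chain again, for illustration only.
Q = np.array([[0.9, 0.1],
              [0.4, 0.6]])
n = Q.shape[0]

# sQ = s, i.e. s(Q - I) = 0, plus the constraint sum(s) = 1,
# stacked into one overdetermined linear system A s = b.
A = np.vstack([Q.T - np.eye(n), np.ones((1, n))])
b = np.zeros(n + 1)
b[-1] = 1.0
s, *_ = np.linalg.lstsq(A, b, rcond=None)
print(s)  # ≈ [0.8, 0.2]
```

The sum-to-1 row is what picks out a single solution, since s(Q − I) = 0 alone only determines s up to scaling.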

And so the answer to that, in the very general case is that,

it may be an extremely difficult computational problem.

But we're gonna study certain examples of Markov

chains where we can not only compute it, we can compute it very quickly and

easily without actually having to even use matrices at all.

So there's one really, really nice case that we're gonna look at.

Not today, though.

Okay, so that's all for today, happy Thanksgiving, everyone.
