SAXually Explicit Images: Data Mining Large Shape Databases

MONICA: So today, Eamonn Keogh from UC Riverside is here to

talk about data mining time series and images.

And as usual, this talk is going to be archived on Google

Videos, so please don't say anything confidential.

EAMONN KEOGH: Thank you, Monica.

Thanks, [INAUDIBLE].

And welcome to the talk.

So a quick outline.

The first half is a little more of a review.

I'm going to talk about a quick review of time series

data mining.

And a way of looking at it, the

generic data mining algorithm.

And I'm going to show a symbolic representation of time

series data called SAX.

And again, this is all review.

And then I'm going to talk about in the next half of the

talk some new work we're doing in mining shape databases, and

in particular two problems, shape discord discovery,

that's finding unusual shapes, and shape motif discovery,

which is finding near-duplicate shapes.

And if I have time, I have a DVD bonus feature which I'll

try and get to also.

So let's get right to it.

First of all, what is a time series?

I apologize if--

oh, OK.

So what is a time series?

A time series is simply a collection of data collected

sequentially in time.

And so here's an example here.

And of course, the beauty of this is actually we can plot

this and we can now see it's actually a heartbeat, and

maybe even know what kind of a heartbeat this actually is,

what kind of [INAUDIBLE].

So for the rest of this talk, I'm going to actually tend to

show pictures more than equations and numbers.

Hopefully you'll go along with that.

So I want to spend 10 minutes making the following point--

that time series are ubiquitous.

They're really absolutely everywhere.

And the reason why this is the case is because people tend to

measure stuff.

We measure how many web clicks we get, we measure our

popularity in Google, we measure whatever it is, and

things change over time.

So for example, George Bush's popularity changes over time--

down, down, down, down, down--

and your Google stock changes over time--

hopefully up, up, up, up.

And so the world is full of these time series in medicine,

science, arts.

But the other point I want to make over the next five slides

is time series are ubiquitous for another reason--

because many types of data which are not time series can

be massaged to become a fake time series.

Let me show you five examples.

So here's example one.

I took the Bible as a long sentence and I took a window

of length 100 words and I slid it across the Bible.

And now I'm going to measure the local

frequency of certain words.

So the first example is God.

So for example, in Numbers, God is hardly mentioned.

And in Deuteronomy, it's all God all the time.

So this is like a pseudo time series, and I can use this for

various tasks.

So here's one task.

If I want to translate the Bible into Spanish, or if I

actually want to retrieve the Spanish text, I can look for a

time series similar to the blue one, and I find in

Spanish, I get the word Dios, which of course is the

translation for God.

Now, this is actually kind of a trivial example.

There's got to be better ways to translate text.

But this could be actually a bursty time series of, say,

weblog clicks.

And I could find certain words associated with

other kinds of queries.

AUDIENCE: [INAUDIBLE]?

EAMONN KEOGH: Oh, sorry.

AUDIENCE: What is the difference between the blue--

EAMONN KEOGH: OK, I shouldn't have glossed over that, but I

mean, I just made it for fun.

So actually, when I did this originally, I noticed actually

the match is overall quite good.

But in one local area, there's actually quite

a difference here.

And so the question is, why is that difference there?

So I subtracted the red from the blue, and I queried on the

difference of those two time series.

And the closest match is actually this thing right

here, which is El Senor, The Father, a synonym for God.

And so for some reason, El Senor is a synonym for God,

but it's locally bursty.

It doesn't happen in the Old Testament very much, but it

happens a lot in the New Testament,

particularly in Ezekiel.

So there's actually a nice extra bit of value we can get

out of this sometimes.

Here's an example we'll come back to later, but I'll

briefly show you now.

We can also take shapes and convert

shapes into time series.

And again, one reason for doing this is because we have

a big arsenal of tools that handle time series very well

that don't work as well on shapes directly.

So we can actually handle things more

efficiently in this space.

Yet another example--

this is actually every professor's worst nightmare, a

grad student with a gun.

So in this case, actually, on the top video, the grad

student is pointing over there.

And here's three examples over there.

Here I'm tracking her hand.

And in the bottom example, she's

pointing a gun at somebody.

And here are three examples of the gun.

And in this example, if you look at the time series, it's

easy to actually figure out which is which.

Because there's a little bump here you can see which

corresponds to the hand pausing over the holster and

then drawing the gun out.

So we can easily classify this in the time series space, but

it's actually hard to do in the original video space.

This is, of course, a very contrived example.

But for example, in Las Vegas, there are similar moves you

can make with your hand on a blackjack table that people

actually want to know about automatically.

Here's another example of a data set which is not time

series, but we can fake it to become a time series.

So here's a handwritten archive.

I'll just briefly mention, I know Google is interested in

doing this because I was in Dublin

giving a talk in November.

And halfway through the talk, the dean of the college kicked

the door in.

And he said, Google has given us the check.

And the whole room went wild.

So you're obviously giving these guys too much money.

Anyway, to actually index this, we can transform

it a little bit, remove the slant, and then we can make a

pseudo time series--

one above the word, and one below the word--

and then simply index the time series, which we can

do very, very well.

It's a very simple, naive thing, but it actually works much

better than Markov models and these other

tricks in this domain.

Yet another example.

I'll go very quickly over this.

But here, we actually have a four-dimensional brain volume.

And of course, it's not a time series.

But we can actually fill the volume with a Hilbert curve,

measure the local intensity at every part of the brain, and

then take the Hilbert curve and stretch it out--

as I've done here--

into a pseudo time series.

And once again, the idea is simply that we can't handle

efficiently this high-volume data, but we can handle

one-dimensional time series very efficiently.

So the conclusion from all this is that there really is

time series all over the world in every kind of domain.

And now let's do something interesting with it.

So what should we do with this time series data?

Well, we actually want to do everything with it.

We want to do clustering, maybe cluster customers with similar web

clicks, et cetera.

We want to do classification.

Is my heartbeat normal or abnormal?

We want to do motif discovery, perhaps.

So basically, are there any repeated patterns

in this time series?

We might want to do rule discovery.

So rule discovery might say, if ever you see a double peak

like this, then within 10 seconds we're going to see the

shape here with some support and confidence.

We may wish to do query by content.

Basically, like a Google for time series.

Many time series data sets are very, very large.

So we might actually want to visualize massive data sets.

And then finally, novelty detection.

So here's a heartbeat, and something strange and novel

happened here.

The person either died or the machine got unplugged.

So these are a diverse set of problems, but they all have

one thing in common.

To solve these problems, we need to actually figure out

how to measure similarity.

If we can do that in a meaningful way, all these

problems I claim are trivial.

But if we can't do this in a meaningful, useful way, then

all these problems actually are impossible.

So let's talk about similarity very quickly.

So the classic similarity measure

is the Euclidean distance.

And here's the equation, but the picture's

very simple to see.

We have two time series, C and Q. We put them on top of each

other by z-normalization.

And then we can simply measure the distance

between the i-th point of one series

and the i-th point of the other--

take that distance, square it, sum it up, square root of it,

that's the Euclidean distance.
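
[Editor's note: a minimal Python sketch of the z-normalized Euclidean distance just described, assuming two equal-length sequences; the function names are illustrative, not from the talk.]

```python
import numpy as np

def z_normalize(ts):
    """Shift to zero mean, scale to unit standard deviation, so two series overlay."""
    ts = np.asarray(ts, dtype=float)
    return (ts - ts.mean()) / ts.std()

def euclidean_distance(q, c):
    """Square the pointwise differences, sum them up, take the square root."""
    q, c = z_normalize(q), z_normalize(c)
    return np.sqrt(np.sum((q - c) ** 2))
```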

Euclidean distance is very simple, but

it's amazingly effective.

You have to do an awful lot of work to actually beat this.

This is incredibly simple and easy to do.

Again, we're doing a review here at this point.

We'll get to the new stuff in a moment.

So one thing about data mining is that most of the time in

data mining, you're actually going backwards

and forwards on disk.

CPU time is typically not even relevant.

It's all disk access time.

So to handle that, most data mining algorithms don't work on

the original raw data, whatever that is.

They work on some kind of abstraction of the data.

So here we have a time series in red, but we can't fit it

into main memory.

So we're going to put instead this blue abstraction of it,

and that will fit into main memory.

So here's one example of an abstraction.

I'll show you what it is in a moment.

One important idea is that if you want to measure the

distance, let's say, between Google stock and Yahoo stock

and we can't fit it all into main memory, we can actually

have the approximations in main memory, and we can

measure the distance approximately in this space.

Again, a simple and useful trick.

Now, of course, the piecewise linear approximation is only

one approximation we could have used.

We could have also used wavelets, Fourier transforms,

SVD, et cetera, et cetera.

So we actually have many, many choices of things we could do.

OK, now let me show you the generic data mining algorithm.

And this basically works in [UNINTELLIGIBLE] of anything

I've shown you--

indexing, classification, clustering, rule discovery, et

cetera, et cetera.

It's a basic generic paradigm for mining data.

The first thing we want to do is we want to create an

approximation of the data which actually

fits into main memory.

So we can't fit all of the data into main memory, we're

going to approximate it, and fit it into main memory.

And now, we're going to in main memory approximately

solve the problem at hand, whether it's rule discovery,

et cetera, et cetera.

At this point, though, we have an approximate solution which

might not be exact because we're working on an

approximation of the data.

So the next step, we're going to go to disk-- hopefully only

once or twice--

to either confirm the model or just slightly modify the model

a little bit.

And the hope is that we're going to make very few

accesses of the disk in this stage, rather than many, many

accesses we would do if we worked in the raw data.
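
[Editor's note: a hedged sketch of the three-step paradigm just described; the steps are passed in as callables because the actual approximation (PAA, wavelets, SAX, ...) and task (indexing, clustering, ...) vary. Names are illustrative.]

```python
# A sketch of the generic data mining paradigm described above.
def generic_data_mining(raw_data, approximate, solve_in_memory, verify_on_disk):
    approx = approximate(raw_data)               # 1. build an approximation that fits in RAM
    candidate = solve_in_memory(approx)          # 2. approximately solve the task in memory
    return verify_on_disk(candidate, raw_data)   # 3. confirm or adjust with few disk accesses
```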

So if you believe this model, then all you have to answer is

which approximation did you use?

And there's lots of approximations out there.

This actually is a complete taxonomy of them.

And don't worry if you can't read this.

It's not really important.

The point is, we have lots and lots of choices, and which one

should we pick?

There is one very, very important requirement for any

approximation we use, and this is lower bounding.

It turns out that if we do this generic framework and our

approximation obeys lower bounding--

which I'll explain in the next slide--

then when we do this, we can guarantee the answer we have

at this point is identical to the answer we would have

gotten had we actually looked at the raw data and done all

those accesses.

So this is actually a very nice result.

Again, all I'm saying is that if the approximation obeys

this lower bounding principle, I can do this very efficient

thing and end up with the same answer as I would have gotten

on the raw data.

So what is this lower bounding thing?

Well, we've seen we can measure, for example, the Euclidean

distance right here, but again, I can't fit all this

raw data into main memory.

So suppose, for example, I had a piecewise [INAUDIBLE]

approximation which does fit in main memory, and I measured

a distance approximately on that space here.

So here's a measure for it.

Lower bounding says that for any possible two objects, Q

and S, that the distance in the reduced space is going to

be less than or equal to the distance in the true space.

Hopefully, it's going to be very closely approximated, but

the important thing is it's always going to be less than

or equal to.
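
[Editor's note: stated as a formula, the lower-bounding requirement just described is simply the following, where the hats denote the in-memory approximations; the symbol names are the editor's, not from the slide.]

```latex
% For any two objects Q and S, the distance measured on the approximations
% never exceeds the true distance measured on the raw data.
D_{\mathrm{reduced}}\!\left(\hat{Q},\hat{S}\right) \;\le\; D_{\mathrm{true}}\!\left(Q,S\right)
\qquad \text{for all } Q, S
```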

If this is true, again, we can now use the

generic data mining paradigm.

Now, there's lots of approximations I mentioned.

This is not even a complete list. But

one of them is special--

symbolic.

And although there's lots and lots of ways of taking a time

series like this and converting it to symbols--

I have encountered at least 200 different

ways of doing that--

unfortunately, none of them lower bounds.

In contrast, I can lower bound wavelets, Fourier transforms,

SVD, et cetera, et cetera.

I can lower bound all of these.

But unfortunately, of the 200 ways out there, none of these

actually lower bound in a symbolic space.

So why do we care?

Well, just think about what you can do if you can actually

use symbols that you can't do in the real space.

So for example, I can hash symbols meaningfully, but you

can't really meaningfully hash real valued numbers.

I can use suffix trees on symbols, which again, you

can't really use on real value things.

Markov models, for example.

More generically, there are lots of people out there who

are experts in dealing with symbols--

people at Google, for example, or people in bioinformatics

who are used to dealing with [INAUDIBLE]

symbols all the time.

And they have wonderful ideas, algorithms, data structures,

et cetera, et cetera.

And if I can actually have a symbolic representation, I can

steal all of their wonderful ideas.

And again, I can't really do this in this real space.

So a contribution I want to talk about briefly is we've

created the first symbolic representation of time series

that actually allows lower bounding of

the Euclidean distance.

And this means we can do all kinds of

wonderful, clever things.

And I'm going to give you some examples in a little bit.

So our representation is called SAX.

And again, for clarity, all I'm claiming is this-- that

you can give me a time series like this one here, I can put

it into my little magic black box, and I'm going to spit out

a symbol string instead.

And furthermore, I can measure the distance between two sets

of symbols, and this will be a lower bound to the true

distance in the original raw space.

And if this is true, then I can use all these wonderful

algorithms and guarantee I get the right answer with my

generic data mining paradigm.

So the obvious question is, how do you get SAX?

It's quite simple.

We have the original time series here in blue.

And I'm going to simply take a window of a fixed size and

measure local areas, look at the average.

So it's the average of this section, the average of this

section, and so on and so forth.

It's a piecewise average approximation.

Once I do this, I'm going to take a little Gaussian curve

here, I'm going to divide it into three equal areas--

so the area here, the area here, and the area here, are

all equal--

and I'm going to extend some break points out.

And anything above this break point is a C, everything below

the break point is an A, and in the middle here, it's a B.

So I can take this real value time series in blue and

eventually spit out these symbols here.
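
[Editor's note: a minimal sketch of the two steps just described (piecewise averaging, then equiprobable Gaussian breakpoints), assuming the series has already been z-normalized; parameter and function names are illustrative.]

```python
import numpy as np
from scipy.stats import norm

def paa(ts, n_segments):
    """Piecewise aggregate approximation: the mean of each equal-width window."""
    ts = np.asarray(ts, dtype=float)
    return np.array([seg.mean() for seg in np.array_split(ts, n_segments)])

def sax_word(ts, n_segments=8, alphabet_size=3):
    """Map each PAA value to a symbol using equiprobable Gaussian breakpoints."""
    # Breakpoints split the standard normal into alphabet_size equal areas,
    # e.g. roughly -0.43 and +0.43 for the three symbols a, b, c.
    breakpoints = norm.ppf(np.arange(1, alphabet_size) / alphabet_size)
    symbols = "abcdefghijklmnopqrstuvwxyz"[:alphabet_size]
    return "".join(symbols[np.searchsorted(breakpoints, v)] for v in paa(ts, n_segments))
```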

So let me preempt a question people often ask.

So why a Gaussian?

It's true that for long global time series, they may be

actually non-Gaussian.

But for small, local sub-windows of virtually any

time series, they do tend to be approximately Gaussian.

And we've seen this actually in 128 length data sets in

very diverse domains.

So here's visual proof, if you like.

And again, I can show you this in many, many

different data sets.

So the idea is, by using this technique, we guarantee that

we have approximately equal symbols.

We have as many A's as B's as C's.

And most things we want to do, like Markov models and hashing

and suffix trees, have their best cases if you have an

equal number of symbols, equal [INAUDIBLE] symbols.

Sir?

AUDIENCE: Is it only [INAUDIBLE]?

EAMONN KEOGH: No, this is actually arbitrary.

And we can choose the number of symbols we want to use.

AUDIENCE: This function is just to discretize the data,

though, right?

EAMONN KEOGH: That's all it does so far, yes.

AUDIENCE: So it could be linear, it could be any other

distribution.

EAMONN KEOGH: It could be.

So actually, we thought of making this adaptive at some

point originally, but when we made it adaptive, it always

became a Gaussian basically.

So we simply hard-coded that fact.

I should say actually that if this actually is a

non-Gaussian time series, then the correctness of all the

algorithms doesn't change.

Everything is still correct.

It might be less efficient, but we're pretty safe there.

Here's a quick visual comparison.

Here's a time series and here's its SAX representation.

And visually, here's a comparison with these other

approximations.

Visually, we don't really lose anything, right?

The real reconstruction error is about the

same in most cases.

I can always find you some examples that make one of

these look better than the other, but on average, it

tends to be that SAX actually works as efficiently as the

other things in terms of reconstruction error.

Again, I told you there's a lower bound here.

I'm not going to actually prove it or derive it.

But I want to re-emphasize that it actually is true, that

the Euclidean distance between two time series is measured

like this, as we discussed.

And in the SAX space, this function here, the MINDIST

function, actually gives us a distance which approximates

this distance and will always lower bound.

And we can prove it by transitivity.

Basically, this is known to lower bound this, and we can

show that we can lower bound this.

So we now know that we have a lower-bounding representation.
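
[Editor's note: the derivation isn't reproduced in the talk; for concreteness, here is a sketch of the MINDIST lower-bounding distance as the editor understands it from the published SAX work. Variable and function names are the editor's; n is the original series length and the words must have equal length.]

```python
import numpy as np
from scipy.stats import norm

def mindist(word1, word2, n, alphabet_size=3):
    """Lower-bounding distance between two SAX words of the same length."""
    breakpoints = norm.ppf(np.arange(1, alphabet_size) / alphabet_size)
    idx = {ch: i for i, ch in enumerate("abcdefghijklmnopqrstuvwxyz"[:alphabet_size])}

    def cell(r, c):
        # Identical or adjacent symbols contribute zero; otherwise the gap
        # between the breakpoints that separate the two symbols.
        if abs(r - c) <= 1:
            return 0.0
        return breakpoints[max(r, c) - 1] - breakpoints[min(r, c)]

    w = len(word1)
    total = sum(cell(idx[a], idx[b]) ** 2 for a, b in zip(word1, word2))
    return np.sqrt(n / w) * np.sqrt(total)
```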

AUDIENCE: [INAUDIBLE]?

EAMONN KEOGH: There are two parameters in SAX.

One is the cardinality of the alphabet-- A, B, and C in

[INAUDIBLE] example.

And the other is the number of sections here.

And hard choices have to be made here, unfortunately.

Although, actually, these are not critically sensitive choices

in most cases.

OK, just a bit of the history of SAX and then we're going to

get into this new stuff with shapes.

So it was invented in 2002.

And the first thing we did is we showed that for most

problems, it basically works as well as Fourier transforms

and wavelets, et cetera, et cetera, for all the classic

but boring problems--

classification, clustering, and indexing.

And in 2003, we showed that actually SAX allows linear

time discovery of time series motifs.

So if you've got a time series and you have a repeated

pattern in it, we can actually find those in linear time.

It's been referenced 60 times, but actually most of these

references are actually applications of this.

People now use this technique for motion capture, they use

it for telemedicine, mining medical

data sets, and whatnot.

So people actually really do use this quite a lot.

And in 2004, we showed how to use SAX to do parameter-free

data mining.

Last year, we had a real explosion in things we're

doing with SAX.

So we do anomaly detection, visualization of

massive data sets, joining streams, et cetera, et cetera.

And recently--

for example, this year at KDD--

there were many papers, actually submitted at least,

that used SAX as wavelets and Fourier transforms. So I

really think that SAX is getting some critical mass and

I think it's actually going to take off quite a bit.

So this has all been review.

Today I'm going to show you that with SAX, we can actually

now do data mining on shape data sets and do very

interesting things in that domain.

AUDIENCE: [INAUDIBLE]?

EAMONN KEOGH: No, no.

Any distribution is fine.

AUDIENCE: So uniform is fine.

EAMONN KEOGH: Any distribution you want.

Uniform is fine.

But in general, Gaussian will give you equal probable

symbols and uniform will not, that the A's will be very

rare, and the B's will be more common, and then up the

alphabet, those symbols are less common.

AUDIENCE: But isn't it that you need to [INAUDIBLE].

This is the simple approach to 30% of [INAUDIBLE].

EAMONN KEOGH: Uniform actually, we can handle

uniform if we want.

But in general, virtually every time series has this

Gaussian property.

It's kind of hard to believe, and reviewers often don't

believe it.

So we actually built a web page.

And for 100 data sets, we showed it's true.

And we pointed our reviewers to that.

People's intuition happens to be wrong there.

Again, globally, most time series are not Gaussian.

But small, local windows virtually always are.

AUDIENCE: [INAUDIBLE]

intuition [INAUDIBLE]

when I'd like to present my time series as symbols, what I

will do is simply the same, maybe not this [INAUDIBLE]

but I would take the high values and name them A, B, C,

D, with uniform divisions.

EAMONN KEOGH: If you do that, you will have symbols with

different frequencies.

And you generally don't want that.

You generally want equal frequencies.

Because again, for Markov models, for hashing, for

suffix trees, it's always better to have

equal probable symbols.

But you could do that if you wanted to.

And SAX would allow that if you wanted it.

That would be fine.

Question?

AUDIENCE: I feel like you answered it, but why do you

want the equal probable symbols?

[INAUDIBLE].

EAMONN KEOGH: Empirically, that's true.

You can't prove it because there can always be a special

case it's not true for.

Empirically, that is the case that the lower bound is

maximally [INAUDIBLE] when that's true, yes.

AUDIENCE: Because also, this relies on the shape of the

time series.

Because the time series starts at very low values and then

grows larger and larger.

Do we need more symbols to represent this time series?

EAMONN KEOGH: The number of symbols is fixed as a user

choice with a user parameter.

So that isn't the case.

AUDIENCE: [INAUDIBLE]

the number of symbols to be only three, for example.

OK, I'm going to [INAUDIBLE] that the number of frequencies

would be similar.

But three is not enough for this type of series.

So is there a way to figure out which number is

[INAUDIBLE]?

EAMONN KEOGH: In general, we impose three or four as the

number of symbols.

But actually, you could try to learn it using, say, minimum

description [INAUDIBLE]

like that.

And we've done some work on that.

It's not being shown today, because we found basically

three symbols, for virtually any kind of problem, any kind

of data set, tends to work the best. So we did do the work to

make SAX adaptive in the [INAUDIBLE]

symbols, but really it becomes a waste of time.

The symbol thing just works amazingly well.

So I hinted at this earlier on, but I want to show it to

you again in more detail, that we actually can convert shapes

to time series.

So here we have a shape.

And all we do is we find the center point, the mean value

of the shape.

And now we're going to actually have a clock hand go

around 360 degrees, measure the local distance here, and

that local distance becomes the height on the y-axis.

So some observations about this.

First of all, there's other ways of doing this, but this

way I actually like quite a lot because first of all,

there are no parameters-- it's completely parameter-free--

it's very fast and easy, and it's very, very intuitive.
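
[Editor's note: a minimal sketch of the centroid-distance conversion just described, assuming the shape is supplied as an ordered list of boundary (x, y) points extracted elsewhere; names and the resampling resolution are illustrative.]

```python
import numpy as np

def shape_to_time_series(boundary_xy, n_samples=360):
    """Sweep a 'clock hand' around the shape's center and record the distance
    to the boundary at evenly spaced angles."""
    pts = np.asarray(boundary_xy, dtype=float)
    center = pts.mean(axis=0)                       # the pivot of the clock hand
    dx, dy = (pts - center).T
    angles = np.mod(np.arctan2(dy, dx), 2 * np.pi)  # angle of each boundary point
    radii = np.hypot(dx, dy)                        # distance from center to boundary
    order = np.argsort(angles)
    grid = np.linspace(0, 2 * np.pi, n_samples, endpoint=False)
    ts = np.interp(grid, angles[order], radii[order], period=2 * np.pi)
    return ts / np.linalg.norm(ts)                  # one simple way to remove scale
```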

So why do this?

Why take shapes and map them into time series?

Well again, there's a whole arsenal of tools we can use in

time series space.

Let me give you some quick examples.

This is the famous Brodtmann image of two

howler monkeys idealized.

Here's the real deal.

Here's some real howler monkeys.

And you actually can see that Euclidean distance here will

really be quite intuitive, that howler monkeys can be

very similar to a similar howler monkey, and Euclidean

distance can reflect that, and life is good.

So if we actually query with this, we'll probably find this

thing here.

Another tool in our arsenal is dynamic time warping.

So in some cases actually, the shapes might be

morphologically changed.

So for example, different gorillas might have a

different prominence of the brow ridge, or a different

prominence of the jaw, but in the time series space, that

simply becomes the fact that these peaks move backwards and

forwards in the x-axis.

And again, we can beautifully handle that within dynamic

time warping.

AUDIENCE: A question.

If there are phase shifts, so how do you align this stuff?

Because--

EAMONN KEOGH: So [INAUDIBLE]

great questions.

I think the question's rotation

invariance basically?

So if one skull is rotated around 90 degrees?

Lovely.

We actually can handle that, but let me just push that

aside for the moment.

I actually brought some papers to show you, or we could talk

about it offline.

This is actually the general question.

How do you make sure the two skulls are aligned in the

right direction?

And for the moment, I'm going to ignore that, but we

actually can handle that easily.

AUDIENCE: Why does the purple line cross the orange line?

EAMONN KEOGH: Oh, it looks weird but it's only because

you can't--

everything actually matches correctly, but because one's

above, one's below, basically.

[INAUDIBLE] actually changed around.

It looks weird at first, but if I took this one here and

pulled it down here, [INAUDIBLE] the

lines wouldn't cross.

AUDIENCE: Oh, I see.

EAMONN KEOGH: It's a bit visually distracting.

But if I took this and pulled it down here,

that will be good.

AUDIENCE: Question.

[INAUDIBLE]?

EAMONN KEOGH: Oh, sure.

Well, [INAUDIBLE]

we normalized these to all have unit length.

So the size and shape makes no difference.

Obviously, if we have a skull that's five times as big as

another skull, it's probably just that the photograph was

taken from a different distance.

So we're normalizing all unit length, basically.

And sometimes even dynamic time warping isn't enough.

This actually is the famous Skhul 5 skull.

And when you actually see it in universities, typically we

have this example here where they filled in the missing

parts with epoxy.

But the real actual skull that was found is basically missing

all the nose region.

Now, if I want to match this, it's problematic because if I

insist upon matching the missing nose region, I'm going

to get pathological results.

So once again, in the time series arsenal, we actually

have a tool, longest common sub-sequence, which basically

says, try to match this shape, but if it's too hard to match

something, basically give up.

And so here, it matched everything beautifully, but

just the nose part, it essentially gives up

in that part here.

It knows it's like a wild card.

I'm not going to match this.

So again, here's some examples that simply show that we can

actually take these shapes, convert to time series, and

then throw lots of different tools at them.

But I want to point out, actually, the Euclidean

distance tool works exceptionally well by itself.

So here's a collection of primate skulls based upon the

Euclidean distance.

And it's not the correct evolutionary tree.

So for example, the humans here should be closer to the

orangutans obviously.

But it is subjectively sensible.

And at the lower levels of the tree, all these red things

here, all these blue things here, et cetera, it is

actually giving us the correct sub-tree for the tree of life.

Notice by the way that the humans are an outlier here

because of their massive brain case.

So again, the point simply at this point is that we can take

shapes, convert the time series, and then we can

meaningfully measure similarity in that space.

Once you believe that, [INAUDIBLE]

actually do some real, original data mining on shapes.

And again, this is some ongoing work.

We're going to show some interesting things we can do

with some shapes.

So the first thing I want to look at,

we call image discords.

Let's say I give you a database of shapes, it could

be quite large, and the question is what's the most

unusual shape in the collection?

So you can look at this visually.

Probably most of you will grab this one here.

And this is actually the correct answer.

That was subjective, though.

So we actually have an objective definition for this.

We call it shape discord.

And the intuition is we want to find the shape which has

the maximum distance to its nearest neighbor.

So in terms of islands, you think of

Hawaii as being a discord.

Its nearest neighbor state is probably California or Alaska

which is very far away.

Whereas Nevada has a very close nearest neighbor.

So of all the possible shapes in here, which one has the

maximum distance to its nearest neighbor?

And it happens to be the shape right here.

So why is this useful?

Let me show you some examples.

So here's a fairly large collection of fruit fly wings.

And often people modify the genes and then see what

difference it makes in the phenotype of these insects.

And actually, if you get the wrong gene, like a [INAUDIBLE]

gene, you can actually make an antenna sprout out of the wing

[INAUDIBLE] for example.

So in this case actually, we mined this.

And this one popped out here.

It's clearly not a phenotype.

It's simply someone damaged a wing basically.

But the point is that we can actually find this unusual

shape quite efficiently, as we'll see.

AUDIENCE: Is it similar to outliers?

EAMONN KEOGH: It is somewhat similar to outliers, but there

are some kind of differences.

Outliers in Euclidean space are easy to

find relatively speaking.

But in this case, because we have a rotation-invariant

metric, we can't actually embed into Euclidean space directly,

as we'll see in a moment.

But it is somewhat similar to outliers, yes.

So here's another example.

Here we have a very large collection of petroglyphs--

basically, ancient graffiti.

And we want to actually find are there any unusual shapes

in this data set?

This of course is a subset of it.

And so we found the number one discord here is the shape

right here.

And so the question is why?

If I take a different shape, let's say this one here, and

look at its nearest neighbor, which happens to be this one

here, you can see actually, they match quite well.

Some small local differences, but they're

basically very similar.

And this is true for almost all of them.

They have some similar shape in the database.

But this one here doesn't.

Why is that?

So here's actually the red one and the blue one together.

There's a local difference here.

And the explanation is that this one uniquely has an arrow

sticking into it.

It's a big horn sheep, and someone stuck an arrow into it

which caused this large increase in the time series

space here.

As it happens, on visual inspection, this is the

only one that actually has this arrow in it.

Two more quick examples.

Here's a collection of projectile points or

arrowheads.

And it's hard to estimate how many of these things there

actually are.

At UCR alone, we have 1 million of these things.

Unfortunately, most of them are not

photographed or indexed.

But with a small subset of these, a few thousand of them,

we run the algorithm and it pops out this one here.

And actually, I was quite surprised to know that

arrowheads can be asymmetric.

And finally, some kind of special fishing

tools look like this.

And this example right here is a subset of a large cell we

looked at, a large slide we looked at.

And most blood cells tend to be round, which is like a flat

line in time series space, or a figure-eight shape.

This basically corresponds to either two cells put together

or a cell dividing.

And again, this isn't unusual because it looks like this,

and that's not unusual because it looks like this, and so on

and so forth.

But one of them, this one right here which comes from

here, is unusual.

Because actually it's tear drop shaped.

And it has no near match in the data set.

And as it happens, I think tear drop shapes are

indicative of a kind of anemia basically.

So after these examples right now, hopefully you believe

that this definition can be useful for looking at large

data collections.

I guess the question is now, how do we find these things

efficiently?

Well, first of all, how do we find them in general?

Here's the actual MATLAB code.

It's very simple to find them, but unfortunately,

it's very, very slow.

So let me explain the algorithm by

looking at this matrix.

We don't need to build a matrix, we can actually simply

compute things on the fly.

But it's easy to understand if I actually look at the matrix.

So the code basically is two nested for loops which simply

walk across each column.

And then within each column, visit every single row.

And they look for the smallest value

apart from the diagonals.

So here I come down, I find that in this column here,

1.1's the smallest value.

That's my best so far.

And I keep looking for those.

At this point, I'll find here that 7.5 is the smallest value

in this column, and it's actually the

largest smallest value.

So in this case, this is actually the

discord in column six.

So the good news is it's a very

simple, trivial algorithm.

The bad news is, of course, it's quadratic.

It's actually constant space, because I don't have to keep

this in memory.

But I actually have to visit all the cells here, so I have

a quadratic algorithm.
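
[Editor's note: the slide shows the MATLAB code; the following is not that code, just an equivalent Python sketch of the brute-force search described above, with whatever distance function is in use passed in as a callable. Names are illustrative.]

```python
def brute_force_discord(series_list, dist):
    """For every shape, find its nearest-neighbor distance, and return the
    shape whose nearest neighbor is furthest away (the largest smallest value)."""
    best_idx, best_nn_dist = None, -float("inf")
    for i, candidate in enumerate(series_list):          # outer loop: columns
        nn_dist = min(dist(candidate, other)             # inner loop: rows
                      for j, other in enumerate(series_list) if j != i)
        if nn_dist > best_nn_dist:
            best_idx, best_nn_dist = i, nn_dist
    return best_idx, best_nn_dist
```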

And the question is, can we do better than that?

And the answer is, of course, yes.

So one simple observation is going to help us here, which

is suppose I'm searching across here.

So I've found in this column 1.1.

That's my best so far.

And then here, for example, I find 2 which

is my best so far.

When I get to the next column, as I scan down, if I find the

value here, apart from the diagonal one, which is less

than my best so far, I can stop.

When I find the 1.2 here, I know there's no point in

continuing on because this is not going to be the discord.

And so I don't have to visit this cell or this cell.

So a very simple pruning technique can

help us quite a bit.

AUDIENCE: [INAUDIBLE]?

EAMONN KEOGH: So in this simple example, [INAUDIBLE]

constructed.

But of course we don't really [INAUDIBLE] make this.

We build it on the fly, measure these

things on the fly.

OK, so this actually potentially could help a lot.

And there's one other trick that could

help us quite a bit.

We can actually have heuristics, called outer and

inner, which tell us the order in which

to visit these columns.

So I'm assuming a for loop right now that goes from left

to right and from top to bottom.

But I could simply permute that and go in any

permutation.

I could visit this one, then that one, then that one.

And then within each column, I can visit this one, then that

one, then this one, et cetera.

And if we're very careful, this can make a huge

difference.

If we actually visit them in a special order, things get a

lot better.

How much better?

Let's imagine a magic heuristic which is

given to us by God.

Well, what would a heuristic actually do?

It would tell us that we should visit the columns in

order of discord value.

In other words, we should actually jump to the real

discord first, and then the second best discord second,

and so on and so forth.

If that happened, then we would find a large value here,

7.5, early on, and we could do a very good pruning.

And likewise, for the inner heuristic, if we were magical,

what we would do is that when we got a column, we would jump

to the smallest value right away.

We'd go straight to the nearest neighbor, and it would

be less than our best so far.

And we could stop.

We wouldn't have to look at the rest of the

things in the column.

AUDIENCE: What's [INAUDIBLE]?

EAMONN KEOGH: Yeah, so the analysis basically is this.

If we had this magical algorithm, the first column we

went to, we'd actually visit every single thing--

that's order of n.

And then an order of n number of times, we make one

[INAUDIBLE] move.

We'd jump to the nearest neighbor and we'd stop.

So basically, it'd be 2n, which is linear.

AUDIENCE: But we don't know the nearest neighbor--

EAMONN KEOGH: No, no.

I agree.

So this is a magic algorithm.

This is what God can do.

We can't do this.

OK, so if we were magical and God has given us the

heuristics to order these things, we've actually gone

from quadratic to linear.

So linear potentially is very good, but the question is can

we actually do this?

So the answer is we can't actually do it, but we can

approximate it very closely.

So the two observations are this-- that although we'd like

to visit the columns in this magical order, as long as we

actually get a reasonably high value early on, it really

makes no difference.

So early on in our first few attempts in the outer loop, we

want to find a pretty high value here.

And if we do that, we can prune quite well.

And the other observation is that for the inner loop, we

don't really have to jump to the nearest neighbor in

constant time as long as we jump to a near enough

neighbor, a neighbor that's actually less than the current

[INAUDIBLE] so far.

So if it's the fifth nearest neighbor or the 20th nearest

neighbor, it might be near enough to stop.

So based on these two observations, we're going to

try to approximate the magic algorithm.

AUDIENCE: So is this basically a [INAUDIBLE] then?

Or similar to [INAUDIBLE], where you're just picking a

random value and then trying to jump around the space to

find a [INAUDIBLE]?

EAMONN KEOGH: No, we do it better than random

[INAUDIBLE].

So the answer to this problem is actually going to

be based upon SAX.

What else?

We take our shape, convert it to a time series, convert to

SAX, and now we're going to put these SAX words, c-a-a,

into this little matrix here.

So this is the first shape, c-a-a, and so c-a-a goes here.

And I guess the next shape was c-a-b.

And that goes here.

And so on and so forth.

So I'm going to populate this [INAUDIBLE] matrix here.

As I'm building this, I'm also going to insert these words

into a suffix tree.

So for example, image one was c-a-a.

And I put a 1 here because image one came from there.

And then for image two, c-a-b, c-a-b, I put a 2 here.

And for image three, it's also c-a-a.

So c-a-a I have an overspill.

I have a 3 here.

OK, so hopefully you believe me that actually I can build

these in linear time, and basically constant space, in

fact, a very small constant space.

So [INAUDIBLE]

built these, and I'm going to use these

to approximate magic.

So the first question is, I want to visit the outer loop

where I'm going to find a likely discord very early on.

And how do I do that?

Well, I have this count here--

1, 2, 3-- and I can map it back into here.

So everything that is here--

c-a-a, this one here gets a value of three.

This one here gets a value of three.

This is simply a count of the number of shapes that happen to

map to the same SAX word.

So the intuition is that if it had an unusual shape, it

probably had an unusual time series, which probably had an

unusual SAX approximation.

So its count here is going to be very low.

In contrast, if I have a very common shape, it's going to

have a very similar time series, very similar SAX

words, and the count probably will be very high.

So my outer magical heuristic is going to be this.

Find the smallest value in this column, and whatever that

is, I'm going to jump to that value first. The value's

typically one.

And if I have some ties, I'll jump to them.

I'll take all the ties and go to those first.

So my simple approximation for the magic outer heuristic is

find a small value here and then go to those values first.

In the inner loop, the same intuition works here.

So as I'm going through my search, and I want to see if

this is the discord.

Or if not, I want to find out very quickly.

The intuition is I want to find similar things early on

to just be able to prune.

And how am I going to do this?

Well, similar things to this, c-a-a, are likely to be either

1 or I guess 731.

Because they map to the same SAX word, they're probably

similar, they probably have a very small nearest neighbor

distance, and I can actually abandon search early on.

So this is my heuristic for the inner loop.

And my hope is with those two simple heuristics, I can

actually approximate the magical algorithm.
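
[Editor's note: a hedged sketch in the spirit of the heuristics just described (published as the HOT SAX idea), combining the earlier early-abandoning trick with the two orderings. It assumes one SAX word per shape and a distance callable; all names are illustrative.]

```python
from collections import defaultdict

def heuristic_discord(series_list, sax_words, dist):
    """Order the outer loop by SAX-word rarity, visit same-word items first in
    the inner loop, and abandon a column once a distance below the best-so-far
    discord value is seen."""
    buckets = defaultdict(list)                 # SAX word -> indices of shapes
    for i, w in enumerate(sax_words):
        buckets[w].append(i)
    # Outer heuristic: rare words first, so a likely discord is seen early.
    outer = sorted(range(len(series_list)), key=lambda i: len(buckets[sax_words[i]]))
    best_idx, best_nn_dist = None, -float("inf")
    for i in outer:
        same = set(buckets[sax_words[i]])
        # Inner heuristic: probable near neighbors (same SAX word) first.
        inner = [j for j in buckets[sax_words[i]] if j != i]
        inner += [j for j in range(len(series_list)) if j not in same]
        nn_dist = float("inf")
        for j in inner:
            d = dist(series_list[i], series_list[j])
            nn_dist = min(nn_dist, d)
            if d < best_nn_dist:                # early abandon: i cannot be the discord
                break
        if nn_dist > best_nn_dist:
            best_idx, best_nn_dist = i, nn_dist
    return best_idx, best_nn_dist
```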

So this is ongoing work.

I don't have very detailed results here right now.

But on a small data set, like 200 arrowheads, we find that

we actually get within 22% of the magic algorithm.

And the good news is for larger and larger data sets,

we actually get better and better and better.

We get asymptotically close to the magical algorithm.

Essentially, we're solving a quadratic

problem in near-linear time.

Now, one small problem is this is actually empirically true

in all that we try.

But in the worst case, it could still be quadratic.

If you gave me a database where every shape is identical, I'm

in trouble, right?

But for realistic data sets with some kind of variance, we

actually get basically a linear time algorithm.

OK, let me show you a complementary problem.

And I'll do this at kind of a high level, skipping details, to be a

little bit faster.

Now we have a data set.

And the question is are there any duplicates

in this data set?

So trivially here, I guess we have this and this duplicate.

And then the question is again, how can we find these

duplicates and find them efficiently?

And once again, we have a reasonable

definition for this.

Basically, we're trying to minimize the Euclidean

distance between the pair of objects.

And once again the question is, is this useful?

And I'm going to just be very brief and show

you this one example.

So here we have a data set of about 4,000 different

petroglyphs from all over North America, including

Canada and Mexico.

And the question is, are there any similar ones?

And subjectively, there are similar ones, but what are the

most similar pair in this data set?

And actually, here's the answer.

And it's actually quite interesting.

First of all, notice actually we had to be rotation

invariant to find this.

So they're similar, but one's rotated 90 degrees.

What's also interesting is actually that these two came

from locations which are only 150 miles apart.

So potentially, there's actually some

meaning to this, basically.

Although, I actually don't know at this point.

OK, so image motifs are quite interesting in this example.

But how can we find them?

And again, for time reasons, we're just going

to gloss over this.

But basically, the idea is, once again, SAX.

We can use a SAX representation.

And as it happens, repeated patterns in DNA strings are

very, very important.

So we're going to steal someone else's algorithm, take

our shapes, convert them to SAX, and then find DNA

repeated patterns essentially, and then, because of its lower

bounding property, we can go back to the raw data with the generic

algorithm, we do some cleanup work, and in basically linear

time, we can find the repeated patterns using someone else's

[INAUDIBLE].

And I think the important thing to point out here is

that you can really only do this with SAX.

It's hard to imagine how you'd do this with wavelets or

Fourier transforms or anything else.

Because it relies upon the ability here of hashing in the

random projection algorithm.
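
[Editor's note: a deliberately simplified illustration of why the symbolic form matters here. The approach described in the talk uses the random projection algorithm over SAX words plus a cleanup pass; this sketch only shows the underlying collision-and-verify idea, with illustrative names.]

```python
from collections import defaultdict
from itertools import combinations

def candidate_motif_pair(series_list, sax_words, dist):
    """Bucket shapes by SAX word, verify only colliding pairs with the true
    distance, and return the closest verified pair."""
    buckets = defaultdict(list)
    for i, w in enumerate(sax_words):
        buckets[w].append(i)
    best_pair, best_d = None, float("inf")
    for indices in buckets.values():
        for i, j in combinations(indices, 2):   # only same-word collisions
            d = dist(series_list[i], series_list[j])
            if d < best_d:
                best_pair, best_d = (i, j), d
    return best_pair, best_d
```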

Actually, I'm going to conclude here.

But I have a five minute bonus feature which I will try to

get-- is that OK?

AUDIENCE: I think so.

So you can [INAUDIBLE].

You've got 13 minutes.

EAMONN KEOGH: Lovely.

So conclusions, and again, I'll go on for

another five minutes.

But the first conclusion is-- and you probably already know

this basically--

the representation is everything.

With the right representation, you can solve any problem

efficiently.

And for time series data mining, SAX is probably the

best answer for most problems. But the nice new result here

hopefully is that SAX is also probably the best answer for

all kinds of problems with shapes too.

And I really can't overemphasize this.

I really believe that SAX actually is the way to go for

virtually all kinds of problems in time series and

shape data mining.

You can just do things when you have symbolic

representations that you can't do in the real space.

And you can simply use so many clever tricks.

Again, I will go on after this, but my conclusion was

going to be, if I had enough time, SAX appeal.

I'm always interested in new collaborators and I'm always

interested in new problems, new data sets, and of course,

I'm always interested in money.

OK, so here's the five minute bonus DVD

feature, if you like.

We can actually use SAX to visualize very

massive data sets.

And here's how we're going to do that.

Let's start with a digression.

Here's some DNA from two different animals.

And the question is, are they similar animals?

And obviously, there's no way you could tell

by looking at this.

And of course, this is actually a small subset of a

genome which is about 3 billion

base pairs in a mammal.

Let's try a little trick.

Let's build a 2x2 matrix, and let's label them ATCG

arbitrarily.

And let's actually count how often we see those letters in

the DNA sequence.

So in this case, 20% of the time we saw a C, and 24% of

the time, we saw a T, and so on and so forth.

I can actually make more complicated matrices, by

basically recursively doing a little trick here, and then

count every possible combination of two.

C-C, C-T, T-C, et cetera, et cetera.

And I can do this, of course, at any level I

want to, if I want.

And now I count how often I see C-C's or C-T's,

et cetera et cetera.

Having done this, I now have a matrix of numbers.

And I can map those to colors.

I can number all the colors arbitrarily from 0 to 1.

And in this case, I see that this pixel here, for example,

occurs 7% of the time.

So I can look up 7% here and paint it red.

Once I've done that, I have converted the DNA string into a

colored bitmap.
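
[Editor's note: a minimal sketch of the counting step just described. The cell layout here is a plain raster grid rather than the recursive quad ordering shown on the slide, and all names are illustrative.]

```python
from itertools import product
import numpy as np

def sequence_bitmap(seq, alphabet="ACGT", k=2):
    """Count the frequency of every length-k substring over the alphabet and
    arrange the normalized counts in a square grid; each cell's value (0 to 1)
    can then be mapped to a color."""
    kmers = ["".join(p) for p in product(alphabet, repeat=k)]
    counts = {km: 0 for km in kmers}
    for i in range(len(seq) - k + 1):
        sub = seq[i:i + k]
        if sub in counts:
            counts[sub] += 1
    freqs = np.array([counts[km] for km in kmers], dtype=float)
    freqs /= max(freqs.sum(), 1.0)              # fraction of all k-mers seen
    side = int(round(np.sqrt(len(freqs))))      # 2x2 for k=1, 4x4 for k=2, ...
    return freqs.reshape(side, len(freqs) // side)
```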

And I guess the question is now, so what?

Let's actually look at some files of DNA, under Microsoft

Windows, the classic thing, Notepad I guess.

And here are the icons for Homo sapiens, Loxodonta

Africana, et cetera, et cetera.

And looking at this, you have no way of knowing are these

things very similar.

But let's actually do this now with this trick

I've just shown you.

Let's replace the normal icon with the icon I've created

with this approach I've shown you in the previous slides.

When I do this, I see this.

And it's strongly suggested that these two things look

alike and those two things look alike.

And as it happens, this is the Indian elephant, the African

elephant, and this is the chimpanzee and the human.

So it's sort of intuitive, it makes some sense.

You can look at these bitmaps and figure out, these are a

pair and these are a pair.

OK, that was a lot of digression because I'm talking

about time series.

And the question is, can I do something

similar for time series?

And so what?

Would it be actually useful?

And the quick answer is, of course, yes.

We have SAX in our arsenal of tools.

And with SAX I can make a string, a pseudo DNA

string, if you like.

And once I have a pseudo DNA string, I can map it into a

matrix, I can get the colors, and life is good.

So I can do this very fast. The question is, is it useful?

So here's the first example.

Here are some brain waves from rats.

And if you look at this, it's sort of like the Sesame Street

game of which one of these things is unusual?

And the answer, I guess, is this one looks different from

those ones.

As it happens, these are all epileptic rats,

and this one isn't.

OK, let's try this.

I can actually add value to this by mapping them into

Euclidean space based upon multi-dimensional scaling.

I'm basically going to find these similarities and cluster

them in two-dimensional space.

So this data set is actually monthly power demand for

Italian electrical power.

When I do this, I see that actually all my winter months

are over here and all my summer months are over here.

But one month, August, is completely unique.

It's out there.

So why does it look different, and why is it far from these

things here?

And actually, if you're Italian or you're European,

you might know this.

In Italy, in August, the entire country closes down.

[INAUDIBLE] closes, the government closes, and

everyone goes to the beach.

And actually, we can see this if you plot

the data right here.

Now, of course, in this case we could actually see this

anomaly if we plotted the data.

But I guess the utility of this is you can see anomalies

without actually opening MATLAB or Excel, whatever

tool you actually have. And in any case, this anomaly here,

or this cluster here, is too subtle for my eye to see in

this space here.

But we can actually see it better in

this space right here.

Here's another example.

This is a famous data set, at least in data mining, for

testing [INAUDIBLE].

And it allegedly contains 70 heartbeats.

Here's a subset of them.

Once again, in the classic view, nothing of real value

is added here.

But in the bitmap view, we see this.

There appear to be two strong clusters--

these guys here and those guys there.

So what's going on?

Why are these guys so different from these guys?

As it happens, they're not heartbeats.

A grad student many years ago made a mistake.

So these five things here are not heartbeats.

They're what's called pacemaker cells--

but natural pacemakers, not electronic ones.

So no one has noticed this, even though many people have

actually used this data set to test algorithms. And again,

the value here is that we could actually find this if we

plotted the data and looked at it potentially.

But we get it here for free, essentially.

When we copy or paste or edit or move files around, for

free, we can actually see what's going on.

This is my final example, then I'm all done for the day.

My hope actually is to use this for basically almost like

a Google toolbar tool for query by content.

So I can search for documents based upon the date it was

created, its file size, its type, et cetera.

And I can also search based upon keywords.

But let's think about a more general search, a search upon

structure, where you simply say, find me a file that's

similar to this one.

Here's an example of that.

It's completely contrived, but it makes sense.

Here we have a data set of large mammals in the Americas.

And I right click on the brown bear and I ask

for its nearest neighbor.

And so we search quickly through an index, and the

nearest neighbor is the polar bear.

And you can kind of tell visually, actually they look

similar by these two icons here, or

bitmaps here and here.

As it happens last night on CNN, they announced that

actually they had found the first polar bear/brown bear

hybrid in the wild, which was an interesting coincidence.

So I've shown you actually I can take time series

and map into this.

And I've shown you how I can take DNA and map it into this.

And right now, I'm working on doing this for text.

So using [INAUDIBLE] index [INAUDIBLE] something else, we

convert text into bitmaps.

And it actually works quite well.

If you look at my papers, they all look kind of similar,

because I only do one thing, basically.

If you look at Shakespeare, it tends to look a bit different,

and so on and so forth.

And so the question is, can we actually use this to do a fast

search by type?

Basically, find me similar documents to this document

right here.

Great.

I thank you for your indulgence.

And now I really am finished, and so I'll be happy to answer

any questions you might have.

AUDIENCE: So what are some specific application areas

where you see this being applied?

I guess you did refer to [INAUDIBLE].

EAMONN KEOGH: You mean for the shape thing?

Well, one [INAUDIBLE] you could probably see is

anthropology.

So anthropologists typically have lots of data.

As I mentioned, a million arrowheads.

In one part of Nevada alone, in 10 square miles, there's

evidently about 100,000 petroglyphs.

And amazingly, people don't know

anything about these things.

It surprised me.

So anthropology's one possible example.

Definitely looking at things like cells in biology.

It's very tedious to actually look at every single cell.

The fruit fly wings actually makes a lot of sense too.

Because [INAUDIBLE] make lots and lots of data like this.

So those are some potential applications.

I'm hoping people will suggest more when this work gets out

there a little bit.

And in this example what we're actually doing was we're

looking at long video sequences of,

say, martial arts.

And the idea is, we are trying to find two video sequences

where the person actually has the same pose.

If we have that kind of a collision, then we can

actually run the video and then jump from one video

stream to a different video stream.

So basically we can make fake-- it's called video

textures, where we can combine different

moves, kicks, and punches.

People have tried this in motion capture data.

And it's very, very hard to do.

In the simple example with shapes-- of which I don't

have any with me right now-- it actually works

exceptionally well.

So basically, for video games, it's quite useful to do this.

We actually have a kick and a punch, and we have

a kick and a fall.

And we can actually combine those-- any combination of

those, basically.

AUDIENCE: It seems like you're focusing on silhouettes also,

instead of using more complex algorithms like over curves in

2-D space, or something like that.

EAMONN KEOGH: So we are looking at silhouettes

obviously for simplicity.

For many shapes, you can look at a silhouette and you can

actually tell what it is.

That's virtually always the case.

Certain people's faces, it's very easy to recognize

people's faces based on that.

For some shapes, of course, the color, the texture, or

internal features are important.

And currently, we're not looking at those.

But we have to do something to initially start this.

But that is potentially future work.

Any more questions?

AUDIENCE: I wonder if you have any thoughts on applying your

anomaly detection to surveillance [INAUDIBLE]

applications.

EAMONN KEOGH: We do.

For one reason, it's because it's very heavily funded.

And people like that kind of stuff.

So we actually are looking at them [INAUDIBLE] detection in

that kind of domain.

So you could think about unusual

shapes and match shapes.

So for example, even things like matching suitcases, based

upon the shape, they figure out if it's been transferred

between the streams and whatnot.

So it's something that we're actually looking

at at a high level.
