
Statistical Aspects of Data Mining (Stats 202) Day 1

>> MEASE: So, I'll try and give you guys some background about what's going on here if you

haven't figured it out from the e-mail. So, this is a class called, "Statistical Aspects

of Data Mining." I'm teaching it at Stanford this summer. Today was the first day at Stanford

just like today is the first day here. Basically, I just had the idea that I'm teaching the

course at Stanford anyway, it seems like something that might be of interest to people here at

Google. So, I work here at Google and I just teach the class at Stanford. So, I thought,

well, since I'm going to come to work everyday after I teach the class, why don't I just

go ahead and teach it here? And so people that are interested can sort of sit here at

Google and take the class here. Because of that, sort of the slides and everything you

see is basically the exact same as the slides that I'm presenting when I'm at Stanford.

So, some of the things like, for example, number four up here, the fourth thing we did

at Stanford today was I took pictures of the students so I could have pictures of them.

Obviously, I'm not going to take pictures of you because your pictures are on MoMA,

et cetera. So there's going to be certain things that obviously don't apply to you

and you just sort of have to be a little bit patient with that. But for the most part,

I'm going to follow exactly the script from what I do at Stanford. And to that end, the

outline for today is basically to go over the information on the course webpage. Run

through chapter one on the textbook--I'll talk about the textbook--and then talk about

the software that we're going to be using for this class and how you can go about getting

it. So, if there--unless there's any sort of pressing questions before we begin, I'm

going to start going through these slides. Question in the back?

>> [INDISTINCT] hear you back here. >> MEASE: Okay. So, is that better?

>> Yes. >> MEASE: Okay. I'll try and stay close to

the microphone and I'll just try and speak loudly when I walk over to the board. If you

can't hear me, let me know. >> You have a lapel mic. Maybe...

>> MEASE: Yes, but they said this is just for recording, for the video conferencing.

So maybe we can figure out how to get that to work over the speaker. Yes?

>> Textbook? >> MEASE: Textbook, yes, we'll talk about

that right off the bat. Okay. So, textbook, okay. This is available on Amazon for about

80 bucks. It's called--well, I'll show you on the webpage. It's called "Introduction

to Data Mining." Authors are Tan, Steinbach, Kumar. We're trying to order some copies for

you guys but there's no way we're going to have enough. So, if you come next time, I

may have some to give you via some lottery system. But you can go ahead and order it

online if you haven't already. It's a good book to get if you want to just sort of, you

know, hope your officemate will order one and, you know, you guys can share. And a good--it's

a good book to get and we are going to be following it quite closely and I'll talk more

about it in a second. >> Is it first edition or second edition?

>> MEASE: There's only one edition and I've been told there may be a paperback available

if you're trying to save money. But there's only one edition as far as I know.

>> What are the author's names? >> MEASE: Pardon me?

>> Can you repeat the authors' names? >> MEASE: Yes, the authors are Tan, Steinbach,

Kumar. And I'll show you where that's up on the webpage. Okay, so the first thing to talk

about is the webpage. It's www.stats202.com. If you ever forget this, if you remember my last

name and look up my last name on Google, you'll find my homepage and then there's a link from

my homepage. But basically, stats202.com will have all the information. Now, again, this

is the information for the students at Stanford but it will be relevant to you. And so, if

you just go see www.stats202.com and so, this is what the webpage looks like. Now, some

things are not relevant to you. You don't care about the current grades, right? Those

are the grades for students at Stanford. Homework and exam solutions you might care about, right?

Because I'm going to give homework assignments, if you want to play along and do them, you

know, the solutions then will be up there. Obviously, we're not going to grade them but

there might be, you know, something extra you can do. And then the homework assignments,

like I said, they'll be linked from here, if you want to do them, it's up to you and

I will be posting solutions for the students at Stanford. Lecture 1 is linked here; these

are just the PowerPoint slides I have, so those will be up there. And then probably

the most important thing on the stats202.com webpage is the course information. If you

click there, you go to this pink page and it has--so, you know, don't use this e-mail,

use my Google email, of course. Also, you know the e-mail. Let me--in case, any of you

don't know it, let me write it down. This is datamining, one word, no underscore, 07@google.com.

That's the e-mail I set up. You should have got an e-mail to that already if you signed--I

signed up basically everyone on the Trix spreadsheet, so you should have got an e-mail to that already.

If you're not, it's public; you can go to mailman and add yourself to that. Phone, that's

my cell phone, it's in [INDISTINCT] office hours don't pertain to you. TA, don't bother

the TA, webpage, stats202.com, okay. Okay. Yeah, he would get really confused. Okay.

So, the textbook, we were talking about there are the authors' names right there, Tan, Steinbach,

Kumar, "Introduction to Data Mining." Like I said, I think it retails somewhere between

$60 and $80. So, go ahead and get yourself a copy of that or find someone else who's

going to have a copy of it and agree to share it. So we are going to try and get some. But

it will have to be a lottery because there's no way we have enough for everyone that's

going to be in the room. Course description. Okay. So this is the Stanford generic catalog

description, so "Data mining is used to discover patterns and relationships in data. Emphasis

is on large, complex data sets such as those in very large databases or through web mining."

Topics are going to be decision trees. We will talk about neural networks. We'll talk

about association rules which if you're coming from a stats background like I am, that's

something new. We will talk about it. Clustering, you've seen before, no doubt. Case-based methods

and data visualization. And then we're going to basically follow the textbook pretty closely.

So, first chapter is introduction, just sort of a soft introduction, I'm going to go over

that today. Second chapter is on data, basically types of data, importing data, caveats about

data. We'll talk about that for about two lectures. Chapter three is exploring data

and for those of you who know me, I love to make plots of data and so I think that's very

important even though a lot of people think it's trivial. So, we'll spend, I think, at

least three lectures on chapter three talking about different ways of summarizing data through

graphs and tables and chapter 6, then association analysis, basic concepts and algorithms. I

have a break right there because that's when the students at Stanford are going to be taking

a midterm. If you want I can, you know, e-mail you guys the midterm if you want, you know,

to sort of quiz yourself. What it might mean practically for us is before chapter--between

chapter six and four, we might have a day where we don't--where we don't meet or we

might use it as a catch up if, for some reason, we don't get through everything because we

are only meeting for an hour whereas, at Stanford, they're meeting for an hour and 15 minutes.

Chapters four and five are both on classification. That's sort of one of my favorite areas so

we're going to spend, you know, good amount of time on chapters four and five. And then

finally we'll finish with chapter eight which is the cluster analysis. Evaluation, you don't

care about either. The late assignments, you don't care about; technology, you do care

about. So, basically we're going to be using R and Excel. Okay? So, if you have a PC that

sort of makes your life easy because Excel is probably installed on your PC and R is,

of course, a free download that's available for PC, it's available for Mac. And, you know,

there is an R user's e-mail list; maybe I'll send that around to you and with a link for

how to install R depending on what your Linux platform is. I don't really keep up

with it because I tend to use R more on my Windows machine. But I know that we have installations

for Linux and I just--I haven't really kept up with it. So, maybe I'll try and send around

a pointer to you guys for that. I'll run through today briefly how to install R on Windows,

and then maybe from there, you can sort of extrapolate and figure out how to install

it on Linux. But mainly, we're going to be using R with a little bit of Excel, which

Excel, for those of you who aren't familiar, is just a real simple spreadsheet application

that works for all the small data. Academic honesty, you don't care about. So that's

all the--that's all the information on the webpage. So, go ahead and use, you know, stats202.com

as your reference for things in this class. Just remember that the webpage is designed

for the students of Stanford so the obvious things, you know, don't pertain to you. And,

you know, for example, right, don't e-mail me at stanford.edu, e-mail me at google.com.

I think that's all I wanted to say about the webpage. Are there any questions about anything

I said so far about the webpage? Yes. >> This is an undergraduate class?

>> MEASE: It's a master's level class but it's an intro class and there's upper--there's

a higher level class. There's a 300 class for those of you who are familiar with the

Stanford curriculum. So, it sort of necessarily keeps this at an intro level, which is--which

I think is good for us because a lot of us are sort of, you know, this is our first time

seeing some of this stuff. If this isn't your first time seeing some of this stuff,

that might, you know, you might think, "Well, this might be too basic for me." So sort of

pick and choose when you come or what lectures you watch. The lectures are being videotaped;

they are available on Fish. So, those of you who don't want to sit here, would rather

just sit on their PC and watch it there, sit on your machine and watch it there, then you

know they are going to be up on Fish. Any other questions about anything I said so far?

Okay. So moving on, so the textbook, again, we'll start with chapter one, it's just a

real soft introduction to what we're going to be doing in this class. Well, this is sort

of interesting. I--when I said I was going to teach this class on data mining, the first

thing my officemate asked me, you know, he said, "Well, what is data mining?" I said,

"Well, I'll be able to tell you that by the time I'm done teaching the class." Well, hopefully,

you know, by the end of today, we'll be able to say something intelligent about what is

data mining. So, the definition in your textbook, it says, "The process of automatically discovering

useful information in large data repositories," and there's many other definitions. So, let

me just sort of dissect this a little bit and sort of, you know--I come from stats.

The question is how is data mining different from statistics? Well, I think the easiest

one, right, is the notion of a large, right? The fact that the data set is large. So, one

way you could define data mining is, well, it's statistics with large data sets. Okay.

But there's more than that, right? There's this idea that I'm automatically discovering

useful information. You know, again, what does automatically mean? Well, you know, you're

not going to write a script that's going to do all the analysis for you and tell you,

"Hey, I looked at your data, and you know, you should be aware that, you know, there

is problem with this variable," or "There's something strange going on here," right? It's

not going to be completely automatic, but it's sort of more automatic than statistics,

right? So in statistics, you might say, you know what? I really want to analyze these

two variables and see, you know, what the correlation is between them, blah, blah, blah.

In data mining you might say, look, I have a thousand predictors in this data set; I

want to look at all pairwise correlations and I want to get an automatic e-mail every

time two of those correlations goes above a certain value, for example. So, on some

level, it's more automated and--than stats but, of course, it's not like a magic thing

that does all the work for you. And then the final aspect I was just going to mention is

discovering useful information, right? I mean, obviously, there's a lot of data out there

and the goal of data mining is to see if there's anything useful there. And actually, one last

aspect to this definition I want to mention is the last part where it says, "The data

is in large data repositories," right? So, it doesn't just say large data sets, it says

large data repositories. So, you think of a repository as some place where data just

accumulates, right? You didn't necessarily collect it. It's just there, right? So, web

logs are an example, right? I mean, the data is just there; whether or not you're going

to get any use out of it, it's up to you. Like, a whole bunch of other examples will

have credit card transactions, supermarket data. The data isn't really being collected

for any specific reason, but it's sort of hard to not collect it. The data just sort

of accumulates naturally. So, the question is, given that all this data is there, can

we find any useful information in it? And that's quite different from statistics where,

in statistics, you often say, you know what? I'm going to go out and collect data specifically

to answer a specific question. Whereas in data mining, you're accumulating the data

and the question is, "Can I find anything useful in this--in this data?" Okay. So then

I say there are many other definitions. And on the next slide I say, so find a different

definition and see how it compares to the previous slide. So, this is sort of a fun

exercise to sort of look through and see what other people say is data mining. And the first

thing that you'll notice is that, you know, the authority, right, Wikipedia, it doesn't

give one definition; it gives two, which already suggests to you that there's some, you know,

non-uniform standard for what is data mining. So, the first definition is "Nontrivial extraction

of implicit, previously unknown, and potentially useful information from data," and the other

is "The science of extracting useful information from large data sets or databases." So, that

second definition which--I think that's a stat reference, right? Yeah, David Hand. So,

that's very similar to what we had. The idea is that you're looking to see if there's anything

useful. The data set is large and, you know, basically, it's the art or the science of

extracting that. The first definition isn't too different. It just says potentially useful

information from the data, a little bit of an admission that sometimes we're not going

to find anything useful. It does say here, the first one, nontrivial, and I'll talk about

that in a second. There's a lot of tasks that you could say, look, I'm extracting useful

information from data in an automated way, but it's sort of trivial, right? So data mining

deals with what we'll call nontrivial. And I'll give you some examples, in a second,

of what I would consider trivial and nontrivial, and your textbook talks about those. Other

definitions, so you can sort of see--I think there's a few I clicked on earlier. What is

data mining? It says here, "Generally, data mining, sometimes called data or knowledge

discovery, is the process of analyzing data from different perspectives and summarizing

it into useful information," sort of not that good of a definition. I think this one was

pretty close to what we were talking about. Let's see. There's a "What is data mining?"

somewhere down here. Yeah, "Data mining or knowledge discovery is the computer-assisted

process of digging through and analyzing enormous sets of data and then extracting the meaning

of the data." So you see, the digging through is sort of carrying on the mining analogy.

There are a couple more. Maybe I'll show you one more of these that I thought was pretty

good. What does this one say? Data mining is what? "Analytic process designed to explore

data, usually large amounts of data, typically business or market related, in search of consistent

patterns and/or systematic relationships between variables." I think this little parenthetic

statement, typically business or market-related is telling. I mean, we're looking at it from

a point of view of, you know, we're--most of us are computer scientists and so, we're

looking at it from more of a science point of view. But it's really things in industry

and the market that have driven data mining. That's really where the phrase comes from

and it's--one of these, you know, catchy, trendy words that like, "Oh, that company,

you know, my competitor is doing data mining and I'm not, so they're going to beat me."

So that's, you know, that's where a lot of it comes from. You know, if you're cynical,

it really is just, "Well, it's statistical techniques or it's machine-learning techniques

or it's, you know, these techniques with a new word put on them." But, you know, it's

sort of--that's how things in business and market get popularity; someone attaches a

word to them. And so this is basically the word that's been attached to--again, as your

textbook says--process of automatically discovering useful information in large data repositories.
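
As a rough illustration of the "automatically discovering" part of that definition, the pairwise-correlation alert mentioned earlier could be sketched as below. This is only a sketch under stated assumptions: the column names, the toy values, and the 0.9 threshold are made up for illustration, and printing stands in for the automatic e-mail. The course itself uses R and Excel; Python is used here only to keep the example self-contained.

```python
import itertools
import math


def pearson(x, y):
    """Pearson correlation of two equal-length numeric lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a constant column has no defined correlation; treat as 0
    return sxy / math.sqrt(sxx * syy)


def flag_correlated_pairs(columns, threshold=0.9):
    """Check every pair of columns and return those with |r| >= threshold."""
    flagged = []
    for (name_a, a), (name_b, b) in itertools.combinations(columns.items(), 2):
        r = pearson(a, b)
        if abs(r) >= threshold:
            flagged.append((name_a, name_b, r))
    return flagged


# Hypothetical data set with three predictors (made-up values).
data = {
    "x1": [1, 2, 3, 4, 5, 6, 7, 8],
    "x2": [2.1, 3.9, 6.2, 8.0, 9.8, 12.1, 14.0, 16.2],  # roughly 2 * x1
    "x3": [5, 1, 7, 2, 8, 3, 6, 4],                     # unrelated to x1
}

alerts = flag_correlated_pairs(data, threshold=0.9)
for name_a, name_b, r in alerts:
    # In the lecture's scenario this would be an automatic e-mail.
    print(f"ALERT: {name_a} and {name_b} have correlation {r:.3f}")
```

With a thousand predictors the loop would visit roughly half a million pairs, which is the sense in which the check is automated rather than something you would do by hand.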

Now, I'll say this on the side. So, you know, like I said, I come from statistics and my

officemate has the same background as me, and he said, "You know, all you really have

here is two ingredients, you know, to make a disaster, right? You have a lot of data

and you don't know what you're looking for." He said, "You're only--you're only going to

get yourself in trouble." Well, you know, you can and you cannot, so we'll talk about

some caveats that you have to be careful about. But generally, this is the feel of it, you

have a large data set and you're just looking to see if you can find some useful information

there. And what he warns about getting yourself in trouble is you need to make sure it really

is useful and you're not just telling yourself some story that's completely artificial. So

I mentioned to you some data mining tasks aren't really data mining tasks, right? Sometimes

you think you're extracting useful information from a large data repository but it's not

really considered data mining, and that's because it's too trivial, right? So here on

the left side of the screen, you have some data mining tasks. On the right, you have

some non-examples, right? So, for example, looking up a phone number in a phone directory;

well, that's extracting useful information. If you want the phone number, it's useful

to you. The phone directory is a large data set, so you're extracting useful information

from a large data set, but it's not considered data mining; it's too trivial, right? An example

of something that would be data mining would be suppose you have the phonebook and you

start to look for relationships that you previously didn't know. So, for example, it says here

you see names like O'Brien, O'Rourke, O'Reilly occurring more in the Boston area. And you

say, "Oh, I didn't really, you know, know that but it makes sense to me now because

I sort of know, you know, how the different, you know, people settled in the United States

and I know there's a lot of people, you know, of certain descent in this area so it makes

sense to me." And if you say, "Well, that's--you know, I knew that already," imagine, you know,

giving it the phone book from India or from, you know, Brazil or a country that you're

not familiar with and you don't know any of these names and you start to see how they

cluster in the different regions, you know, you're learning something about the data

without really knowing what you're looking for going in; you start to see this grouping.

So, that would be an example of data mining. On the right here, the second one, Query,

a web search engine for information about Amazon. Okay. I'm getting useful information

from a large data set, right? The web is large data set. But again, that's not data mining;

something that would be data mining here on the left, group together similar documents

returned by a search engine according to their context. For example, if you thought about,

you know, drawing a picture of all the web pages that come back for query Amazon, and--sorry,

if you can't hear me--you start to see two groups, right? You start to see--here's a

group over here, and here's a group over here. Okay? And you say, you know, how are these

groups--how are these groups, right? Well, maybe, you know, users that query Amazon go

to these pages or they go to these pages, but there's very few users who query Amazon

that go to a page here and a page here, right? So these are connected and these are connected

but they're very split. So what have you learned? Well, you've learned that maybe Amazon has

two different dominant interpretations. So presumably one is the retail site and the

other one is the river. And you say, "Well, I knew that already. Hang on a second. I knew

that already," you know. But imagine doing it in a language that you didn't know already

or imagine having some automated process that would tell you one query has two dominant

interpretations or another one only has one main interpretation. What was your question?

>> Just when you write on the board, if you had a black marker, it would be easier.

>> MEASE: Yes, if someone can toss me one. I don't really have one.

>> There's one right under the podium. >> MEASE: Where? See? I have to search for

it. Okay. Okay. Not that ornery. All right. So imagine those are black. Okay. So, that

is--that would be an example of data mining. And that's actually clustering, and we'll

talk about that specifically as an example of clustering. Okay. So why mine data? So there's

the scientific point of view. Now, I'm going to talk about the scientific point of view and

the commercial point of view. Both of these basically have this flavor like I'm collecting

the data anyway, so there might be some useful information in it. And from a scientific point

of view, you're collecting lots of data. Examples would be a satellite that has sensors on it,

telescopes that look across the sky, microarrays. You know, gene expression data

was sort of trendy a few years back. Generally, you know, any simulation you do, you can generate

lots and lots of data. I don't need to tell, you know, you guys about collecting lots of

data. Traditional techniques are infeasible, and so data mining might be helpful in sort

of classifying and segmenting data or in forming hypotheses. So that's sort of the scientific

point of view. The commercial point of view, you know, again, the commercial point of view

is really sort of driving the data mining on some level. Data there is being collected

from web data, from e-commerce, right, any time you use a search engine, I don't have

to tell you, any time you buy something from a site online, any time you go to a department

or a grocery store, any time you use your--a bank or a credit card. So the data is just

there. Computers are cheap, so you can't say, "Oh, we can't afford to store that much

data." No, storage is cheap. "Oh, we can't afford, you know, to analyze that." No, the,

you know, processing is cheap. And then your competitor is doing it, right? If you don't

want to do data mining, well, you can be certain that your competitor is--and if it's giving

them any edge, well, you know, you're going to get beat out eventually. Now, the one thing

I wanted to talk about on this slide which I always thought was interesting, the grocery

store example is sort of the very--it's like the classic defining example of data mining,

which is that when you go to the grocery store, they have a record of, you know, you bought

eggs and you also bought diapers or you bought milk and you also bought beer, you bought

chips and you also bought salsa. So they have a record of that. Now, the funny thing is--you

should think about is how do they have a record of that? Right? So if I go to the grocery

store today and I bought chips and I pay cash, right? Suppose I pay cash, I don't use my

credit card; I'm trying to sort of be off the grid, right? So I pay cash and I get the

chips. Then tomorrow, I go, "You know, I forgot the salsa," so tomorrow I go and I buy salsa

and I pay cash again. They don't just want to know that one person bought chips

yesterday and one person bought salsa today. They want to know that it's me. They don't

just want to know what I bought. They want to know who I am. So, the question I'll ask

you is how do they know who I am if I don't pay with my credit card?

>> [INDISTINCT] >> MEASE: Yes. You have the little, you know,

your Safeway, save six cents on gas, right? You have your little Safeway cards. So, you

can think about it, you know, that sure, they don't get the data for free but all they have

to do is give you a little card and let you save three cents on every purchase or whatever

it is, and now they get all the data in the world, right? And, you know, you can opt in

or opt out. You don't have to use the card. If you really want to--you know, don't want

people spying on you, you can just not use the card. But it's not hard for them to get

the data. And once they do something like that, they have the data. So that type of

supermarket data where they know each customer, at least their ID, and what they bought is

sort of one of the classic examples of data mining, data sets, you know, where they use

that data to discover relationships between, you know, people who buy this product usually

buy this product. Now, what does that mean for the grocery store? Well, you know, use

your imagination. If they often buy these two products at the same time, maybe they

should put them in the same aisle. Better yet, maybe they can say, "Look, let's close

down the whole supermarket and just sell these two products because we know, you know, if

we just stock those, we can make this much money, things like that." So, anyway, they

have--they have that data because they give you a little discount card. And they give

you discount card for other reasons too. So, this is sort of a fun exercise. I knew I was

going to give this one today so I started thinking about it as soon as I woke up. So

I'll give you four examples. It says here, give an example of something you did yesterday

or today which resulted in data which could potentially be mined to discover useful information.
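
Going back to the loyalty-card example, a minimal sketch of why the customer ID matters is given below. It assumes a hypothetical transaction log of (card ID, day, item) rows; the card IDs, days, and items are all made up. Purchases are first merged per card, so chips bought one day and salsa bought the next count as the same customer, and then item pairs are counted across customers. This is plain pair counting, not the association-rule algorithms covered in chapter 6 of the textbook.

```python
from collections import Counter, defaultdict
from itertools import combinations

# Hypothetical transaction log: (loyalty card ID, day, item). Made-up data.
transactions = [
    ("card_17", "mon", "chips"),
    ("card_17", "tue", "salsa"),   # same customer, different day
    ("card_22", "mon", "chips"),
    ("card_22", "mon", "salsa"),
    ("card_22", "wed", "milk"),
    ("card_31", "tue", "eggs"),
    ("card_31", "tue", "milk"),
    ("card_40", "wed", "chips"),
    ("card_40", "thu", "salsa"),
]


def items_by_customer(rows):
    """Merge all purchases made with the same card into one set of items."""
    baskets = defaultdict(set)
    for card_id, _day, item in rows:
        baskets[card_id].add(item)
    return baskets


def pair_counts(baskets):
    """Count, for each pair of items, how many customers bought both."""
    counts = Counter()
    for items in baskets.values():
        for pair in combinations(sorted(items), 2):
            counts[pair] += 1
    return counts


counts = pair_counts(items_by_customer(transactions))
top_pair, top_count = counts.most_common(1)[0]
print(top_pair, top_count)
```

Without the card ID, the chips-then-salsa purchases of card_17 and card_40 would look like unrelated single-item sales, and the pairing would only be visible for card_22.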

Okay, so in black, I will write here the four things that I thought of and see if I can

get you guys to give me some others that I haven't thought of. So, I just literally went

from the time I woke up--actually, I went from the time I woke up to the time that I started

teaching and came up with some examples. So, the first thing when I wake up, the door on

my apartment doesn't have a key lock. It has this card, right? Little light card and it

goes beep when you--when you open it. So you think, "Well, they're not going to keep that

data, right?" What do they want--why would they want to know that data? Why would they

want to spy on you that much? Well, actually when I moved in, they told me. They said,

"Don't use your card to try and open someone else's apartment door because we'll let--we'll

come after you, you know. We'll get mad at you." And I thought right away, I thought,

"That's kind of weird, right? I mean, what if I just take the elevator to the wrong floor

and, you know, I'm half awake, right?" But, you know, presumably, they're keeping that

data around or at least they have some sort of alert system. So, you wouldn't think it,

but--you know, I'll call that my apartment door. You know, presumably, that data is sitting

around somewhere where they know what tenants tried to open what doors at what time. And

if there's any useful information there, you know, they can use it. What would you use

that information for? I don't know. Maybe they want to hire a security guard and they

want to know what sort of traffic, whether people are coming and going. Maybe they do

really want to spy on you. I mean, they can use the data for whatever they want, right?

And you're consenting to it because you're the one using the card to open and shut your

door. You're the one living in their apartment. Okay. So then after I open the door, what

do I do? Well, I go and I, you know, I hit the elevator button. Now, that one, I'm not

really sure if they're keeping that data around. But I kind of--you know, I wish they would

because maybe if they had some intelligent system, I wouldn't have to wait so long for

the elevator because, you know, what's it doing down there in the basement when everyone's

sleeping? They know that, you know, it should be sitting up at the top. Okay, after I go

in the elevator, then I--the parking garage has another thing but that's the same as the

apartment door. As soon as I get on Guadalupe Expressway, there's metering lights. And I

wish that they would use the traffic sensors, you know, to do something better about the

traffic, right? So presumably, they could know that they don't need to turn the metering

lights on 87 when 101 is moving so quickly. So, you know, they could mine that data too.

They know who's driving on the highway--well, they don't know who's driving. They know

how many cars are driving on the highway at what time. They don't really know who you

are, although if you had, like, the FasTrak going over the bridge, they would know who

you are, right, because it's your FasTrak. And then finally, when I get to Stanford,

they don't give me, like, a nice parking pass, so I have to put money in the--in the

pay parking machine. And how do they know who I am? I use my credit card and I use the

same credit card every time. So, these are--you know, none of these are related to Internet

applications. I'm trying to, you know, be a little bit creative. All cases where I'm

producing data that someone could be using to do data mining, you know, and they're not

trying to spy on me; it's just I'm giving them the data. It's freely available for them

to use for whatever purpose they want. So, this is what? In-class exercise number two,

I call it. So I gave you four. How about you guys give me four? Yeah, Charles?

>> [INDISTINCT] stuff down Micro Kitchen. >> MEASE: The Micro Kitchen. They run out

of food, right? So they have to restock the Micro Kitchen. They run out, so they know

what we're eating and what building we live in, right? Okay. So, that's--yeah. So presumably,

someone is looking at this data in the Micro Kitchen. Okay. One more. In green. Sorry.

>> Yeah, the government's tracking wherever you go through your cell phone.

>> MEASE: Cell phone, right? Not only--not only--right? Yes. So you can turn that off,

right? But not only do they know who you are, who you called, what time you called, they

also now know where you are because they have that little, you know, GPS location sensor

in there. And, you know, I don't know. If you're a paranoid person, this isn't a good

exercise for you. But, you know, the data is there. You know, they could ignore it if

they want, but it's there and they might find information in it. Okay. Another one in the

front. >> Google badge [INDISTINCT]

>> MEASE: My badge, right? So someone asked me this one time. So let me--let me just say

this is my badge, right, which is--oops, B-A-D-G-E, which is similar to--similar to my apartment

door but this is, you know, an employer, right, who might have a little more interest in who

I am and where I'm going at what time. And I have--when people ask me--I don't know

if they ask you this, when you tell them you work at Google, they say, "Well, what time

do you start work?" And you say, "Well, it depends what time I wake up," and things like

that. And they say, "Well, what time does your boss tell you have to be there?" Well,

you know, whatever. And then they say, "But certainly, you know, they know when you scan

your badge in and they keep track of that," and I'll go, "I guess they could," right?

But, you know, knowing Google, they're likely to use that data but not, you know, to spy

on us; more so to sort of just keep statistics and just know when they should stock the Micro

Kitchens, right, or know when they should serve breakfast. Okay. So let's get one more,

one more. Yeah. >> That's [INDISTINCT] probably know, we know...

>> MEASE: Yes. >> ...where you are and all [INDISTINCT]

>> MEASE: Yes. >> [INDISTINCT]

>> MEASE: Yes. So... >> What data you are transferring.

>> MEASE: Laptop. Yes, one time--well, that doesn't say P. Laptop. One time I was using

a computer somewhere in an office at the university, and the guy called me and he said, "Why are

you VPN? Why are you using a VPN connection?" And I thought, "Who are you to ask me why

am I..." but he was the administrator of the network, right? So, yes, any time you use

a computer, people are getting lots of information about you. And you know we have web logs or

Google--Stats202.com, we have web logs from that. So one of the things we're going to

be doing is playing with the logs for that and it'll be cute because we can see certain

spikes when certain events happen and I can see what webpage you go to. I don't know

who you are but I know your IP address. So anyway, you know, you can think of loads of

examples here of different cases where you're producing data that, if someone wants to,

they can mine, they can use it to get information about and help them to make different decisions.

Okay. So where does data mining come from? So, you know, this--you can tell this book

is sort of a statistical book because you see statistics and then you see everything

else. What's everything else? Well, you have artificial intelligence, you have machine

learning, you have pattern recognition, and some people sort of put databases and things

like that and information retrieval there too. But, you know, it is sort of like we're

teaching--or I'm teaching this course from a statistics point of view, but it's not just

statistics, of course. It's borrowing ideas from artificial intelligence, machine learning,

pattern recognition and all those. Traditional techniques, when we say traditional

techniques in the second bullet, it's that traditional statistical techniques would be unsuitable.

Why? The data is large, not just large like a lot of observations, but large, it's high

dimensional and heterogeneous and distributed, right? So, there are sort of new challenges

for statistics. We coined this phrase "data mining" but we're borrowing information from

or we're borrowing ideas from all these other areas too. Okay. So, the book breaks down

into two types of data mining tasks, and this dichotomy is a little bit forced in some cases

but I'll walk through it. So, they differentiate between predictive methods and descriptive

methods. And let me sort of write the shorthand version of these. The one thing to remember

is I guess descriptive methods don't really have one right answer. You sort of know if

you found something useful because it's useful but you really never know exactly what you're

going after whereas predictive methods, you're going to look at your classification accuracy

or your precision and your recall, so those are straightforward. So predictive methods,

this is predictive. What do we have here? It says "Some variables to predict unknown

or future values of other variables," right? So basically we're trying to predict the future

in some sense, right? We're trying to use some inputs to predict the future or classification

of some output. Whereas, descriptive methods, descriptive, for that, we're just basically

trying to find patterns in the data. Okay. Find patterns. Okay. So, you know, the way

to remember this right is sort of this is the supervised learning and the unsupervised

learning, if you will, right? So the example I would--I would give you if you think about

the Amazon, right, with Amazon I found a pattern, right? There were two distinct types of pages

about Amazon. There is like the commercial Amazon and there was the river Amazon, so

I found the pattern. Okay. Suppose, alternatively, that I already--so that would be descriptive.

I described the pattern, I described that there were two groups of the Amazon webpages.

Predictive would be more like, I know that there's two types of Amazon webpages and I

know there's like the--one's about commercial site and I know there's one about the river.

Okay. I know that there's two groups, but can I predict given a new one, given a new

webpage, right, can I have a computer algorithm that will predict which one of these two classes

it falls in? And the way I am going to measure success there is how accurate am I going to

be able to predict this. Right? What is my misclassification rate? Am I going to get

90% of them correct, 95% of them correct, et cetera? Now, it's really easy for a human

to read the webpage and say, "Oh, this is about, you know, Amazon the retailer or this

is about Amazon the rainforest." But, you know, can a computer use the human labeled

observations to get a pretty accurate rule? That's predictive data mining, whereas, again,

I told you descriptive data mining was just finding the fact that there's two groups in

the first place. So the topics that we're going to cover fall into these two categories

as follows. So, the book talks about classification and regression as both being predictive. So

let me make this--so we'll put here classification and regression. Both of these as being predictive.

Now, classification, we're going to cover in chapters four and five. Regression, we're

not going to cover in this course. However, if you take a regress--oh, sorry, if you take

a stats course, they're going to cover regression. Really, the main difference between these

is classification, you're trying to say, you know, I said I'm 90% accurate, right? I have

two classes; it's either Amazon the rainforest or Amazon the retailer, right? Or you could have

three classes or four classes or any number of classes and you're trying to see how accurate

you are. Regression is analogous to that but instead of trying to predict the class, you're

generally trying to predict a continuous attribute, right? So let me give you an example, right?

So, just to change from a web application: this is a book, this is an eraser. You can

tell these apart, right? Suppose you send like sonar signals, right, and bounced the sonar

signals off the book and the sonar signals off the eraser. Well, they're going to look

different, right? And so classification would be to use those sonar signals and some labeled

instances--some labeled instances of the book, some labeled instances of the eraser, and

predict for new cases whether it's a book or an eraser, right? That's classification.

I'm trying to predict is it the book class or is it the eraser class, right? Just like,

is it the Amazon rainforest class or the Amazon web--commercial class? Regression would be

more like, can I use the sonar to predict the size of the book? Right? So you don't

just either get it right or wrong. If you say the book is 11 inches tall and it's really

10.5 inches tall, you're off by exactly .5. So, in classification, you're basically trying

to predict what class it is, whereas regression, you're trying to predict some continuous attribute.
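
To make the contrast concrete, here is a minimal sketch in R, the language used later in the course. The book and eraser labels are made-up illustrations rather than course data, and the variable names are assumptions; only the 11-inch prediction and the 10.5-inch actual height come from the example above.

```r
# Classification: each prediction is simply right or wrong.
# These labels are made up for illustration only.
actual    <- c("book", "eraser", "book", "eraser", "book")
predicted <- c("book", "eraser", "eraser", "eraser", "book")

# Fraction of cases where the predicted class differs from the actual class.
misclassification_rate <- mean(actual != predicted)

# Regression: a prediction can be close without being exactly right.
# Heights from the lecture example: predicted 11 inches, actual 10.5 inches.
predicted_height <- 11
actual_height    <- 10.5

absolute_error <- abs(predicted_height - actual_height)   # L1 loss
squared_error  <- (predicted_height - actual_height)^2    # squared error loss

print(misclassification_rate)   # 0.2  (1 of 5 wrong)
print(absolute_error)           # 0.5
print(squared_error)            # 0.25
```

The classification measure only counts hits and misses, while the two regression measures report how far off the prediction was.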

And so you would measure your performance a little bit differently. Classification,

you might use recall, precision or misclassification rate. Regression, you might use like squared

error loss, L1 loss, some sort of loss function like that that measures how close you are

to the target. Where--again, we're not going to cover regression. Classification is more

common in data mining, but regression gets a lot of attention in classical stats courses

and also a lot of the classification techniques you can extend to regression; we're just not

going to get into them. Okay. Then for descriptive, the visualization we're going to cover in

chapter three, association analysis in six, clustering in eight and anomaly detection

is in chapter ten, although we're not going to get to it. So, let me just say a few words

about these. So, visualization is in chapter three. Visualization, as I said before, it's

one of the most important things you're going to do. If you think about writing a report

or doing some study, people are going to remember the picture, right? If you can't tell it with

one simple picture, you probably haven't really said anything interesting. And there's

sort of an art in making good pictures and making pictures that people can see clearly. And

so, we're going to spend a fair amount of time in chapter three just talking about different

ways to visualize data. And visualization can serve two purposes, right? One is to present

to someone. Okay. You know what you want to say and this is just a good way to present

it and the other is to learn something yourself. You don't know what you're looking for. If

you're just going to look at a bunch of pictures or if you only look at one picture that's

going to tell you what's going on so you can discover. Both of those are visualization

tasks. They're both descriptive, and we'll talk about those in chapter three. Association

analysis. Association analysis, this is something that doesn't really make it into mainstream

stats too much. We're going to talk about that in chapter six. This one is the classic

supermarket one, right? The people that bought--the people that bought diapers often bought beer,

right? The people that bought chips often bought salsa. This is a type of association

analysis and we're going to talk about that in chapter six, in particular, that market-basket

example I just talked about. Clustering, chapter eight. Clustering. Okay. The clustering example--the

canonical example there would be like the Amazon search engine versus the Amazon Rainforest,

right? You see two distinct groups emerge, you know something is going on and, of course,

that example is trivial, but suppose I give you a query in a language you don't know.

Can you tell me, you know, what sort of patterns there are in those web pages? Are there two

main interpretations? Is there one dominant interpretation and one slightly less common

interpretation? So you're just sort of looking for patterns in the data, and grouping is

one pattern, and that type of grouping is called clustering. News stories, right? Can

you--can you group together different news stories? These are about sports, these are

about politics. Can you see different groups emerging in the data even without having labels

on them? So, it's unsupervised. It's clustering. And then finally, anomaly detection, we're

not going to have time to get to but you might want to read about it in--only one L, right?

You might want to read about it in chapter 10. When we do chapter three, we'll do some

of it because when we make pictures of data, sometimes that's exactly what we're looking

for, things that are strange. Anomaly detection is probably, you know, as I say, association

analysis is one of the classic examples of data mining. Anomaly detection is one of

the ones that gets all the press because--shoot, I had a news story, I don't know if I can

find it. That, you know, you always see data mining in the news because they're using it

for purposes of, you know, credit card fraud detection and they're using it to find terrorists,

right? And both of these things are anomalies, right? So, what is credit card fraud, right?

Someone--you know, you have your credit card and all of a sudden, you spent a whole bunch

of money in a place that you've never been before. That's an anomaly. The credit card

company flags it. The sooner they can flag it, the more money they can save. So, that's anomaly

detection through credit cards. With respect to terrorists, what are they looking for?

Strange behavior, right? Something that's indicative of a terrorist. Now, you could

argue, well, in that case, maybe this should be up here because you're trying to see how

accurately you can, you know, spot the terrorist. But, you know, that's why I said this line

is a little bit blurry. But your textbook tends to classify anomaly detection as descriptive

because you don't really know exactly what you're looking for. Okay. That's all chapter

one notes I want to talk about. Now I'm going to talk a little bit about the software. But

let me stop and see if there's questions. Question?

>> So, we end up giving yet more data to the data mining mill because every time we're planning

a trip, we have to inform every credit card company that we will be spending stuff abroad.

>> MEASE: Right. Yeah. So the question is, if you don't want your credit card to sort

of, you know, call you and cancel your card because they see something weird, some people

will call the credit card company ahead of time and tell them, "Look, I'm going to be traveling

overseas," but then, the point is that they can also use that data to, you know, feed

into the data mining framework. Yeah, some people will do that. Some people, every time

they're going to travel, they'll let their credit card company know ahead of time because

they don't want any problems. Actually, I have a friend who is a pilot who uses cash

only which is surprising in this day and age. But for that very reason, he doesn't want

to call the credit card company every time he goes somewhere. And, you know, he does

flights that look as though they're anomalies, right? So, anyway, are there any other questions

on anything I said so far before I talk about software? Question?

>> So what is the difference with clustering, classification [INDISTINCT]

>> MEASE: Okay. So what--your question is what's the difference between clustering and

classification? Okay. So, they're very similar. Clustering is unsupervised. Classification

is supervised. So the thing is--let's see. Let's go with the web page example, right,

with the Amazon, Amazon, all right. In one case, I have all the label--all the two instances

labeled. This one is about the rainforest. This one is about the e-commerce company.

And I'm trying to predict for a new observation which one it is and I'm going to measure how

accurately I'm doing. That's classification, that's predictive. I want to see how well

I can predict a new observation into these two classes. Okay. Clustering, on the other

hand, is the act of actually discovering that there are two classes, because I didn't know

that ahead of time. I was just looking at a bunch of different queries, looking at how

things grouped together, and I saw two distinct groups emerge for Amazon. Now, the clustering

is you don't really know you're right. You know, are there really two groups. Well, in

this case, you do. But you're just sort of trying to discover a relationship. So, does--is

that good? Is there anyone else that can give a better definition than I just gave? Because

there are a lot of people in this room that are experts on this, machine learning, and they can tell

you supervised learning, unsupervised learning and--but that's sort of my take on it. It's

a little--it can be a little blurry especially after you do the clustering if you say, "Oh,

I really did learn something that was right." That, you know, it tends to have a little

bit of a classification feel, but that should help. Okay. Other questions? Yes.
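
A minimal R sketch of that distinction, under loud assumptions: the one-dimensional `score` feature and its values are invented purely for illustration, and the nearest-average rule is just one simple stand-in for a classifier, not a method from the course. `kmeans` is base R's k-means function.

```r
# Made-up 1-D feature: low for one kind of page, high for the other.
score <- c(1.0, 1.2, 0.9, 1.1, 5.0, 5.2, 4.9, 5.1)

# Clustering (unsupervised): no labels are given; we only ask for two groups.
set.seed(1)
clusters <- kmeans(score, centers = 2, nstart = 5)$cluster
print(table(clusters))   # two groups of four; the group numbers are arbitrary

# Classification (supervised): a human has labeled every page.
label <- c(rep("rainforest", 4), rep("retailer", 4))

# Simple illustrative rule: assign a new page to the class whose
# average score is closest.
centroids <- tapply(score, label, mean)
classify  <- function(x) names(centroids)[which.min(abs(centroids - x))]

# Because we have labels, we can measure how accurate the rule is.
new_scores <- c(1.3, 4.7)
truth      <- c("rainforest", "retailer")
predicted  <- sapply(new_scores, classify)
accuracy   <- mean(predicted == truth)
print(accuracy)
```

The clustering half never sees a label and has no accuracy to report, while the classification half exists only because the labels do.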

>> This anomaly, it's actually the same as the [INDISTINCT]

>> MEASE: Yes, to a large degree. To a large degree. And there are some subtle differences

and you can--you can read about that. But, yeah, generally an anomaly is an outlier and

vice versa, generally speaking. Yeah. An outlier, you can sort of see the word outlier, something

that lies out from--away from the rest of the data, an outlier. So in some space,

an anomaly is an outlier, but the key might be to find that space. Other questions about

things that I have said so far? >> Is an anomaly like unsupervised learning

an outlier? An outlier pertaining any actual [INDISTINCT] cluster.

>> MEASE: Right. Right. So Charles was making a point about the relationship between outliers

and anomalies. I don't want to get too much into that distinction, but yeah, there's a

relationship and some subtle difference. Other questions on what I've said, anything I've

said so far? Okay. So, let me see. How are we doing here? So, okay, we're doing good

on time. So, like I said, Stanford, this is an hour and 15 minutes, but here, we're trying

to stay under an hour, obviously, so I go a little bit faster and skip a few things

that don't matter for you guys. Okay. So, what are we using in here? We're using Excel

and we're using R. Now Excel, if you have a PC, you're in good shape. If you don't--you

know, I don't know. Trix probably won't give you everything you need. You know, no offense,

you know, but it's just not the same product. Open Office might give everything you need

but if you have a PC, you're in good shape. If you have Mac, I'm sure Excel is installed

on there. If you don't have either of these, if you're sort of a--just a strict Linux user,

I don't know. I could--I can't really speak to Open Office but it might--it might get

you through most of it, but we are going to be using Excel, though not primarily, right? Excel isn't

very powerful, it doesn't handle large data, it's very slow, it's--you know. But for some

purposes, it's good. And it's--it is good to sort of have a spreadsheet application

sort of that you're comfortable with because sometimes you can do things very quickly that,

you know, you don't really want to take the time to script up. So we're doing some things

in Excel, and so you should have access to that. But then primarily, we're going to be

using R, which is free. If you have Windows, I'll go through how to install it on a Windows

machine right now. The same installation instructions will generally hold for Mac. And for Linux,

like I said, I'm going to try and send you guys a link to something on the--something

I can get from the R users which talks about installation. But different people do different

installations depending on what they're doing and so, I haven't really kept up with it.

But let me take you through R and give you a little preview of that and show you how

to get it installed on your Windows machine. I have a Windows machine right here, obviously.

So, I'm going to--as I go through examples, I'll be doing it on the Windows machine. So,

you know, you might, if you have a Windows laptop, just install R on that and use that

for going through examples. Also, it's easy, you can bring it with you and you can sort

of play along as you're sitting here. Okay. So, how do I download R? So you go to this

web page, right? It's sort of a little bit tricky. For a while on Google if you just,

you know, queried R, you wouldn't get it. I think you can now, but let's just go to

CRAN. Let's see. Here we go. Okay. So, we're going to be good with--you know, I have a

Windows machine, so I just hit Windows 95 or later and then, Base. So, it's open source

and different people contribute different packages. If we need anything special, I'll

let you know. But for now, Base is going to do the trick. And then if I go down to this

one here, this is a self-extracting EXE file. It says I have to get it from a [INDISTINCT].

But it turns out if I click on this right now, the behavior is it's just going to give

me one and I can just save it. Then you double click, go through all the defaults; everything

as it is is going to be fine and it will get you R. And once you do all that, you can see

what it looks like. Here. Let me--let me just run through those screenshots again. So these

are up on the PowerPoint slides if you're--you sort of forget what I said. So, go to cran.r-project.org.

And for me, I would click on Windows 95 and later. Just click here on Base, and then 2.5.0.

I think I have a 2.4 version. The version shouldn't matter too much if it's, you know,

within the last year. You just save this to your machine and then it will just install

itself and all the defaults are pretty good. Once you do that, okay, you get something

that looks like this. So here I have actually 2.4.1 on my machine. And it's sort of command

line, right? It's not a spreadsheet app; it's sort of command line. Let me see if I

can make this a little bit bigger for you so you can see. Let's change the font size

from 10 to--let's try 20. That's too big probably, right? It looks like a cartoon. Okay. So,

you know, it's sort of my online, right? I can do 10 + 1 and figure out that that's

11. Okay. You have functions in here, right? So, let's see. Let's think about an easy function.

exp is the exponential; e to the zero is one. Okay. So, those are your functions. You can look for help

on the functions. So, I want to, like, get help on the exp function. If I type a question mark

and the function name, it brings up a window and tells me, you know, that log computes

natural logarithms, log10 computes--okay. And it gives you some examples. So, the help

is pretty good. And you can look things up online because there's a lot of documentation.
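
For reference, the commands from this part of the demo can be typed at the R prompt roughly as follows. This is a sketch of the session rather than an exact capture of what was on screen; the `log` and `log10` calls are added here only to show the functions that the help page describes.

```r
10 + 1        # arithmetic at the prompt: 11
exp(0)        # e to the zero: 1
log(exp(1))   # log is the natural logarithm: 1
log10(100)    # log10 is the base-10 logarithm: 2

# In an interactive session, a question mark followed by a function name
# opens its help page:
# ?exp
```

R prints each result with a leading `[1]`, which is just the index of the first value on that line.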

So, it sort of has command line. You can write your own functions, you can, you know, sort

of use it as a little bit of a scripting language. But also, it's really good for plotting, right?
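
As a minimal sketch of what plotting looks like in base R, the lines below plot the integers 1 through 10 against themselves plus 10. The particular symbol, color, and labels are arbitrary choices made here for illustration.

```r
x <- seq(1, 10)   # the integers 1 through 10 (same as 1:10)
y <- x + 10

# pch sets the plotting symbol, col the color; main, xlab and ylab set the text.
plot(x, y,
     pch  = 19,            # solid circle
     col  = "blue",
     main = "y = x + 10",
     xlab = "x",
     ylab = "x + 10")
```

Run interactively, this opens a graphics window; run as a script, R writes the plot to a PDF file instead.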

So, if you type like, okay, well, seq(1:10) is the integers 1 through 10. So, if I made

a plot of--well, here. Here, here. I'll show you. So, let's x--let x--seq(1:10). And if

I plot, suppose, like, x, let's say, x+10, then I get a plot, right? And you get to change

sort of almost everything you can change on this plot. You can change the plotting symbol,

you can change the font. You can change the color. You can make it clickable, you can

label things. So, the plotting in R is really good and it gives you a lot--a nice tool for

making plots very quickly. And we'll go through a lot of that when we get through chapter

three. But I think in the meantime, go ahead and make sure that you have a machine where

you can--you can use R and get it working. And next time, when I get into chapter two,

I'm going to go through some example datasets and we'll talk about, you know, manipulating

them in R and Excel a little bit. But let me just take, you know, the last minute here

and see if there's sort of any questions. So, I'll just run through in case you missed it at

the beginning. The whole point here is that this class, you know, I'm teaching it at Stanford,

so it's not too much extra effort for me to come here and teach it here. It is being videotaped,

so you can watch all these, you know, on your--on your machine at your desk. It's an hour--even

though a Stanford class is an hour, 15 minutes. And, you know, you can sign up on Mailman;

you're welcome to come to any lectures you want. Everyone is invited. This is the textbook.

We'll try and, you know, distribute some of these next time, but there won't be enough.

We'll have to do it by lottery. So, go ahead and buy one. Go to www.stats202.com for all

the information about the course and make sure you're subscribed to the datamining07@google.com.

And so, I think, that's all the organizational information. We're going to meet Tuesdays

and Fridays from 1:00 to 2:00. I can't think of anything else that I might have said. Let

me just stop and ask if there are any organizational questions or anything. Yes, question.

>> Can you get a larger room? >> MEASE: No, actually. I mean, I asked about

this and they said like, "Well, you can, but you have to go through another process." So

this is the biggest the Google EDU folks had and so...

>> The machine learning EDU talks had a lot of attrition like for the first lectures [INDISTINCT]

>> MEASE: Yes. So, we hope that a lot of you--we hope that...

>> Why are you looking at me when you say that?

>> MEASE: Because you're--because you're sitting on the floor.

>> [INDISTINCT] to that second group? >> MEASE: That's--I can talk to the Google

EDU folks about that. I mean, they've been extremely helpful. It was sort of a challenge

to estimate the attendance and we knew the Trix was an overestimate and didn't know

with the video conferencing how many people would actually want to come. And I think I

still don't know how many people actually want to come. I think we'll know more on Friday.

I think if it's still overflowing on Friday, then we can--we can really try and see if

we can do something better than having people sit on the floor. Other organizational questions?

No other organizational questions? Okay. There's free lunch in the cafeteria.

The Description of Statistical Aspects of Data Mining (Stats 202) Day 1