MALE SPEAKER: It's great to have Raj Reddy here.
Now, I was doing a Google search of all things, and I
came across an interesting fact.
In 2004, Gross, Politzer and Wilczek won the Nobel Prize in
physics for the work on strong force that
binds quarks together.
Now, I also discovered that Raj Reddy had nothing to do
with that group, and he has never won the
Nobel prize in physics.
Although he has won just about every prize in computer
science that counts, including the most important one, the
Turing Award, as well as the Okawa Prize, the Honda Prize
and the French Legion of Honor award.
Now, so why did I mention that stuff about the quarks?
It's because I personally have this theory that Raj is the
strong force that binds together the
computer science community.
And more than anybody else, I think he's the one who has
tried to make computer science get out into the lives of
people in the developing world and help them, and help us,
help bring computing and communication
technology to them.
He has also run the world's premier
robotics lab for two decades.
And the decade after that, he ran one of the world's premier
computer science departments at CMU.
I'm from Berkeley, so I have to say one of rather than the.
He's the founder of the PCtvt project, which I'm sure he'll
tell you about, and the Million Book Project.
And for the Million Book Project, he was the first, and
I believe only, person in the history of the world to get
the presidents of India and China to write a letter and
put their signature on the same letter to get them to
agree to a project.
So he is truly an irresistible force, and we're glad to have
him on our side.
And we're happy to welcome him to Google.
RAJ REDDY: Thank you, Peter.
So I'm not sure I need the microphone, but
it's kind of noisy.
I'll try to use it.
So the talk today is not about the emerging world or Million
Book or a number of other things we
could have talked about.
But I came here with a mission.
The mission I have is, Google is not doing enough in
organizing the information about the world.
And there are lots of things you're leaving behind, and I
want to tell you about them.
And then hope that some of you, the younger generation,
will actually pick them up and transform the way we access
and use information.
So this particular talk was inspired by my colleague,
Carbonell, Jaime Carbonell, who is the head of the
Language Technologies Institute and is one of the
leading figures in language processing.
For those of you who don't know ancient history, the first web
crawler and search engine was built by one of Jaime's students,
Fuzzy Mauldin, called Lycos.
The rest is history as they say.
And that was in 1994.
And so one of the things that Jaime said was when he was
trying to explain to someone what is the purpose of the
Language Technologies Institute, what will it do or
what does it do?
He said it is to provide the right information to the right
people in the right timeframe in the right language in the
right level of granularity.
And he had a few other rights, which over the period of time
we jokingly used to call the bill of rights of the
But I kept thinking about it, and I would bug him and say,
why aren't you doing all the research that needs to be done
in this thing?
He says, that's too big.
We don't have enough money to do all the things.
Not even Google has all the money to do all the things.
So the important issue is that when you look at the world's
information with this prism and say, what does it mean in
the first place?
What research has to be done to get there, and
will we ever get there?
And if you do, how will we know whether we have
succeeded or not?
So this is the challenge.
And so I don't think I have the answer.
I don't even think I'll be presenting all
the facets of it.
The purpose of my coming here and in giving this talk is to
stimulate each of you and perhaps even have a dialogue.
I don't think we can in this kind of forum, but we might
have a dialogue afterwards.
If you come to Carnegie Mellon, which is right next
door here, Carnegie Mellon West in the NASA park, we can
have a brainstorming session.
So that's the purpose of this talk.
And there are more than five here.
You have the right information to the right people at the
right time, in the right language, in the right level
of detail, in the right medium.
And the last one is an issue that's very important if you
happen to be an illiterate person in a village.
You don't know how to read or write.
Especially, you don't know anything about English.
And so you better be able to provide the information I'm
looking for in my local language, but more importantly
using audio and video, not text, OK?
And so the right medium also becomes an important part of
the research agenda that one needs to look at.
OK, so as soon as you use those words, they imply
certain kinds of research, search engines, classification
of information into the right chunks and right formats, and
right timeframe and support for the analysis.
Machine translation, summarization, speech input
output, all of these things, or video input output too.
But they're part of the solution, but
not the entire solution.
The right information just doesn't mean searching
something and displaying a list of all the possibilities.
In the end, it may be the right information without
And the question of how do you get only the right
information, not everything else under the sun perhaps,
may be one way of looking at it.
And another way of saying it is, if you're a kid going to
school, the right information to you is knowing about the
things that were discussed in the class today.
Or if you're about to get married, knowing the
information where to get all the relevant wedding planners
and all kinds of the things or solutions for
that particular situation.
So the right information has many contexts and many
connotations to different people at different times.
So the right is very difficult to define, I'm finding.
And I don't know what the right answer is, but I'm just giving
you examples of this.
The right people is again something that you can define.
If you're about to get hit by a tsunami, it doesn't matter
if you broadcast it to the whole country of Thailand.
It's the people on the shore that are sun bathing or
fishing or something that are going to be impacted.
They are the ones that need to get the information.
And by the way, they don't have
days or hours or something.
And furthermore, it is a push technology, not a pull
So you have to be able to figure out, how do you get the
information to them in ways that are not usually done
where I go and type in, is there a tsunami coming?
And then it tells you something about it.
So there are lots of very complicated, complex issues
that this set of rights that Jaime Carbonell proposed
seemed to raise.
And we don't yet have, as a community--
the language processing community--
a broad research agenda that covers the whole
spectrum of these issues.
That's the thesis of my talk basically.
So now what I'll try to do is take you through quickly what
we are doing at CMU, which is less than, as I was telling my
colleague, less than 5% of what we need to be doing.
And that's rightfully so because we don't have the
resources to do everything that needs to be done.
So in search engines, the right information from the
future search engines.
How to go beyond just relevance to the query and
popularity I'll talk about a little bit more.
A second issue is eliminating massive redundancy.
For example, if you, say, type into Google, "web-based
email," it should not result in links to various Yahoo
sites promoting their email, nor even to non-Yahoo sites
discussing just Yahoo email.
What it should do is just link to Yahoo email, Gmail,
MSN mail, and perhaps a comparison between them.
That's much more what we call relevance rather than a
massive redundancy into the thing.
The second issue is harder to do, and we need to somehow
organize the world's information this way.
What information is trusted, and what information is just
pure marketing, and what information is snake oil?
Unfortunately, the web-publishing paradigm
permits all of them to have equal weight.
So we need to figure out how to begin to go towards trusted
At one point, I proposed to my friend, Bob Kahn, who is one
of the inventors of the IP protocol and the internet,
that maybe we should have a Triple A web--
authentic, archival and always available information.
So how do you make sure it's authentic?
That means you have to register your information with
some trusted society.
Maybe Google could set up such a thing.
You submit your information.
You get a checksum or something.
And then every time you access this information, if you
changed anything, it'll say, no, it's not authentic anymore
but some other information.
You're not getting the original
information that is certified.
So if anybody had changed it, including yourself, then it's
no longer the original information that was certified
This is where the peer-reviewed journals are
different from web sources of information.
Peer-reviewed journals cannot be changed.
They are printed and published, and they are
reviewed, and somebody certifies that this is
And then it is there forever in that form.
They may be wrong or right.
It doesn't matter, it's there, right?
So we don't have trusted sources of
information at this point.
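The checksum registration idea described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the in-memory `registry` dictionary stands in for the trusted society, the names `register` and `is_authentic` are hypothetical, and SHA-256 is one reasonable choice of checksum (the talk does not name one).

```python
import hashlib


def register(registry, doc_id, content):
    """Record the checksum of `content` (bytes) under `doc_id`.

    `registry` is a plain dict standing in for a trusted third party;
    that stand-in is an assumption made for this sketch.
    """
    digest = hashlib.sha256(content).hexdigest()
    registry[doc_id] = digest
    return digest


def is_authentic(registry, doc_id, content):
    """Return True only if `content` matches the registered checksum."""
    expected = registry.get(doc_id)
    if expected is None:
        return False  # never registered, so nothing certifies it
    return hashlib.sha256(content).hexdigest() == expected


registry = {}
register(registry, "paper-1", b"The original certified text.")
print(is_authentic(registry, "paper-1", b"The original certified text."))  # True
print(is_authentic(registry, "paper-1", b"The original certified text!"))  # False
```

Any change to the content, even by its own author, produces a different digest, which is the property the Triple A web proposal relies on.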
So I'll talk about these three subjects, maximum marginal
relevance, and novelty detection and named entity
extraction as three narrow topics within this right
information, which is what we are doing.
So most query systems simply retrieve everything that's
relevant to the keywords that you type.
Sometimes, that's not enough.
Things like novelty, timeliness, appropriateness,
validity, comprehensibility, density, medium--
all kinds of things might also be appropriate, rather than
just purely relevance to the thing.
Novelty is non-redundancy if you want to call it that.
If there are 20 different hits or 100 hits that are more or
less saying the same thing from different sources at
different newspapers at different locations, you ought
to be able to say, that's all one thing.
But let me give you five things that are really
different from each other.
And that would be a desirable outcome for the search engine.
So and there is more detail here.
I don't want to go into the detail, but I think you
So if you have a large number of things in the central
cluster there, and there are lots of outliers, you want to pick
one out of each of those clusters and present that
because then it'll be maximally non-redundant, right?
And that's what you're looking for.
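A minimal sketch of maximal marginal relevance selection, assuming simple bag-of-words vectors and cosine similarity. The function names, the `lam` trade-off value, and the toy documents are assumptions made for illustration. At each step the candidate maximizing lam * sim(doc, query) - (1 - lam) * max sim(doc, already selected) is chosen, so a near-duplicate of something already shown is penalized.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two term-count Counters."""
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def mmr(query, docs, k=2, lam=0.5):
    """Return indices of k documents chosen by maximal marginal relevance."""
    q = Counter(query.lower().split())
    vecs = [Counter(d.lower().split()) for d in docs]
    selected = []
    candidates = list(range(len(docs)))
    while candidates and len(selected) < k:
        def score(i):
            relevance = cosine(vecs[i], q)
            # Similarity to the closest document already selected.
            redundancy = max((cosine(vecs[i], vecs[j]) for j in selected),
                             default=0.0)
            return lam * relevance - (1 - lam) * redundancy
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected


docs = [
    "web email yahoo mail",
    "web email yahoo mail",      # exact duplicate of the first hit
    "web email gmail compare",
]
print(mmr("web email", docs))  # [0, 2]: the duplicate is skipped
```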
So novelty detection is another idea that's usually
relevant in newspaper stories.
Usually, one of the things that happens in newspapers is
out of the blue, a word "Katrina" appears.
It was never there in the vocabulary before, and
suddenly it takes on a meaning of its own.
It goes on and on and on.
For a few days or a few weeks it's the top story.
And then it slowly goes away.
And it hasn't gone away yet, but it will one of these days.
So the detecting of a new event turns out to be an
important aspect of getting the right information
available in ways it is not otherwise.
Then there is a whole set of issues about how you do that,
cosine similarity, TF-IDF and things like that are examples
of how you might do this.
And so there are issues on how you do the first story
detection, FSD sometimes it's called.
And then categorize these topics, these words into
topics, and then see if these terms are maximally
differentiating in the topics.
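One way the cosine similarity and TF-IDF ingredients mentioned above could fit together for first story detection is sketched below. It is a toy under stated assumptions: whitespace tokenization, a smoothed IDF recomputed over the stories seen so far, and an arbitrary `threshold` of 0.2. The function names are hypothetical, and a real system would need incremental statistics rather than recomputing every vector.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sparse term-weight dictionaries."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def tfidf(tokens, doc_freq, n_docs):
    """TF-IDF weights with a smoothed inverse document frequency."""
    counts = Counter(tokens)
    return {t: c * (math.log((1 + n_docs) / (1 + doc_freq[t])) + 1.0)
            for t, c in counts.items()}


def first_story_flags(stories, threshold=0.2):
    """Flag each story True if nothing earlier in the stream resembles it."""
    doc_freq = Counter()  # number of stories containing each term
    seen = []             # token lists of all stories so far
    flags = []
    for text in stories:
        tokens = text.lower().split()
        doc_freq.update(set(tokens))
        seen.append(tokens)
        vectors = [tfidf(toks, doc_freq, len(seen)) for toks in seen]
        newest = vectors[-1]
        closest = max((cosine(newest, old) for old in vectors[:-1]),
                      default=0.0)
        flags.append(closest < threshold)
    return flags


stream = [
    "hurricane katrina hits new orleans",
    "katrina floods new orleans levees",
    "steelers win super bowl",
]
print(first_story_flags(stream))  # [True, False, True]
```

The second story is linked to the first because of the shared terms, while the third shares no vocabulary with anything earlier and is flagged as a new event.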
And the second way of doing this is to use situated named
entities like "Sharon as a peacemaker."
So link detection is--
once you have the first story, you have detected it.
Now you need to track it over a period of time.
Why is this important?
It turns out that at any one point in time when you look at
a story like the number of people that died in Katrina--
on day one it might say seven people.
On day 17, it might say 3,000 people.
And then it maybe goes down or up just like the World Trade
Center, and the total number of people that actually died
changes every day.
So if you were trying to find out a fact of how many people
died in Katrina or the World Trade Center disaster, you
can't simply pick one of those things or
present all of them.
You need to find a way of summarizing or finding the
And sometimes it's called non-monotonic reasoning where
what was true one day becomes not true the following day.
And that becomes not true the next day.
At some point, it stabilizes, and you know a fact.
And at some point, it disappears.
You no longer have that fact, right?
So this becomes an interesting set of issues.
Another problem that occurs in language processing of this
kind is named entity extraction.
And this is important for the usual query type systems. Who
was mentioned, for example?
Peter was talking about the Okawa prize or Honda prize.
When I got it, I didn't know what this prize was.
I got a letter saying you've been selected.
I said, what is this prize?
I've never heard this before.
But it turns out to be a very important prize in Japan, but
here in the USA, we never hear about it as much.
So finding these things like who was mentioned, what
locations, what companies, what products, turns out to be
an important aspect of finding the right information.
So here is a story.
And what you are trying to do is named entity
And there are a lot of names, and there are a lot of roles
they are playing.
And the issue is, can we have a system which identifies
different named entities and the relationships between
them, what it is that they're doing and how
they're doing it.
And so if you count, there are many new
techniques for doing it.
And the finite-state transducers and statistical
learning techniques are two of many.
So here are the people that were mentioned in that story,
Clinton, Kantor, Peng, Suzuki, Langford
and so on, and places.
How did you know these were people, and how did you know
these were places?
And it turns out this is not a big deal.
Once you have this thing categorized in your knowledge
base and with appropriate labels, you can do this.
But you don't always have all the relevant information.
You may come across a name that you've never seen before
or a name of a place that you've never seen before, and
it's not uncommon for many of us, when presented with a
foreign name, not to know whether it's a man or a woman.
And many times, we make the mistake of calling a he a she
when the opposite is actually true.
So the roles that they play turn out to
be important also.
You wanted to say, who participated in this meeting?
Who is the host country?
Or who was the host?
What was discussed?
Who was absent at this meeting?
These kinds of questions come up all the time.
And it is not exactly stated in the story.
It is not explicitly there.
You have to infer it from the meaning of the language.
Most of us don't have any problem inferring those
relationships, but systems do.
So there are a number of emerging methods.
And here there is "who does what to whom" relation
And this is very useful if you're going from unstructured
data to semi-structured data like in a database sense.
If you wanted to organize the information into your table,
then you need certain tools of going from unstructured
information to structured information or
So those give you two or three topics that we are working on
at CMU, named entity extraction, novelty detection,
first story detection and maximal
marginal relevance.
So the next topic is "the right people." Again, "the
right people" is a very complicated phrase to define.
If a seven year old is working on a school project, and he or
she searches for "heart care" or "heart attack" or whatever,
then the kind of stories you might want to retrieve for
this right person would be very different than if you're
a doctor and asking the same question.
So each person, based on their context and situation and so
on, needs to be provided with a very different set of
Once you see the example, it's obvious.
But the issue of how the system or individuals--
in other words people, how would they know what the right
And most of us have the benefit of knowing this person
is a doctor and that person is a kid.
And therefore, you might be able to provide the right
information, but it's not always obvious.
And then, there's a whole set of other affiliations that
also force you to select or sub-select information in a
If you're a family group or an organization group or
stockholder group or whatever, then the information that
you'll be looking for and that you might be provided would be
very different from each other.
So the right people, right information
to the right people--
and so the basic tool that seems to be needed here is to
somehow, given a text, classifying it, categorizing
it, and saying what is this about and how
does it group together?
Is it for a kid potentially?
Or is it for an expert?
So there may be many different ways you would organize this
And those ways of organizing information
are not in the content.
You can't search for it.
It's inferred information, and that's where the
complexity comes in.
So categorization, Yahoo for a long time, in the beginning,
used to do manual assignment of categories.
And Reuters, for example, used for a long
time hand-coded rules.
And a lot of the rest of us use various machine learning
techniques, and they're getting more and more
sophisticated as we go on.
And then the issue of whether you can do a hierarchical
event classification becomes an interesting issue.
So the right timeframe.
Now, I used to think right timeframe is, as soon as you
know the information, send it to me.
That turns out to be the wrong strategy.
I only want the information when I want it, just in time
If you send it well in advance, it'll sit there and
I'll even lose it.
That's the trouble with our education today.
You go to college, and you are taught calculus because 30
years later you might need it.
By that time, you have forgotten it so you have to go
back and relearn it.
So if I have to relearn it, why don't you just teach me
calculus in one week instead of one year, and then say, if
and when you need it, go to Wikipedia or something and
you'll learn the rest. Or, here is a learning by doing
method of learning calculus, and you can learn it
when you need it.
So that's just-in-time information versus information being provided
well in advance of when you might need it.
But this is not just that type of thing.
All of us get information about a seminar or about
something I'm supposed to do two, three weeks in advance.
If you're like me, which you probably are, you're probably
being bombarded with hundreds of emails every day.
And you just cannot keep up with it.
My senior colleague, and the only Nobel prize winner in our
field, Herb Simon, the late Herb Simon, used to say we
have the wealth of information but
scarcity of human attention.
We don't have enough lifetimes, and we, as human
beings, don't follow Moore's Law, right?
Our capacity has been constant for the last thousands of
years, and it's not going to change any time soon purely by
So the right timeframe--
defining it even becomes a very difficult issue.
So getting the information to the user exactly when it is
needed, immediately when it is requested, or prepositioned: this
is the whole issue of anticipatory information.
So for example, one of the areas of research some faculty
members have studied at CMU is what is called
We cache results all the time.
Computers designed going back to the '60s, the
Stretch for example, would pre-compute lots of things and
then pick one of the right ones.
And that type of caching of the results is common in
So the issue is if I know there are only 40 things I can
do and 38 of them are trivial and can be done
instantaneously, two of them require search or something
that may take a minute of time, and I don't want you to
wait a minute, you could pre-compute it.
But you don't present it.
You pre-compute it and you keep it.
And when I ask for it, you then give it to me.
So that's the issue here.
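The pre-compute-and-hold pattern just described can be sketched with Python's standard `concurrent.futures` module. The `Prefetcher` class, its method names, and the toy actions are hypothetical; the sketch assumes the caller already knows which of the possible actions are slow.

```python
from concurrent.futures import ThreadPoolExecutor


class Prefetcher:
    """Start slow actions early, but hand over a result only when asked."""

    def __init__(self, actions, slow):
        # `actions` maps an action name to a zero-argument callable.
        # `slow` lists the names worth computing ahead of time.
        self._actions = actions
        self._pool = ThreadPoolExecutor(max_workers=2)
        self._pending = {name: self._pool.submit(actions[name])
                         for name in slow}

    def request(self, name):
        """Return the result for `name`, using the precomputed one if any."""
        future = self._pending.get(name)
        if future is not None:
            return future.result()  # waits only if still running
        return self._actions[name]()  # trivial actions run on demand

    def close(self):
        self._pool.shutdown()


actions = {
    "add": lambda: 1 + 1,                 # trivial, computed on request
    "search": lambda: "search results",   # pretend this takes a minute
}
prefetcher = Prefetcher(actions, slow=["search"])
print(prefetcher.request("search"))  # search results
print(prefetcher.request("add"))     # 2
prefetcher.close()
```

Nothing is shown to the user until `request` is called, which matches the point that precomputed results are kept, not presented.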
So things like push technologies, alerts, and
reminders and breaking news, are also
often considered harmful.
Just because the Steelers won the Super Bowl, I don't need
to know it the instant that a certain event has happened in
the Super Bowl.
I may be in the middle of a meeting.
I don't want the alert at that point.
And if and when I am ready to get it, I should
be able to see it.
So there is a whole set of things that alerts are not
always the right things to do, and that becomes an important
part of the right information.
The right language, there are lots of--
at CMU and many other places, including Google--
there are a lot of language translation tools and
And unfortunately, all of them tend to be not that good.
And it's a hard problem.
It's been a hard problem.
It'll continue to be a problem, but there's a lot
that is known about how to deal with it.
And so one of the reasons you might want to do the
translation is for multilingual search.
There it doesn't matter if the translation is not perfect
because you're not asking the people to read the
It's mainly being used as a way of indexing and to various
materials such as trans-lingual research,
Language identification turns out to be a
very interesting problem.
One of the things we're working on is this Million
Book digital library project, right?
That is well before Google print going back
to five years ago.
But China and India are partners with the US in this project,
and they are doing the scanning.
And they enter all the meta-data.
And the scanning is done by high school graduates at not
minimum wage, at some wage that I don't know about
because the government of China and India
are paying for it.
And I was looking at one book.
It was obviously a French title, but the guy didn't know
that and typed it in and said German.
I said, how could you be so stupid?
How could you make such a mistake?
So this young man was obviously super bright, who
should be hired by Google, came to me and brought me a
couple of books.
And one of them was in Gujarati and another one was
So it happens that even though I'm from India, I don't know
either of those languages.
I couldn't tell the difference between the two.
So he says, you're a PhD, you should know this, right?
I said, sorry, you're right.
So it turns out that just because you're educated or
literate, it doesn't mean you know all the languages and you
know all the scripts.
But it's a very simple problem.
All that you need to see are one or two words of the title,
and then you can tell, most of the time, what language it is.
It is a sparse language machine learning problem.
Most people that do language detection go and look at a
whole page of text and then do a frequency count.
And then say, oh, this is this language.
That's an easy problem.
The hard problem is, supposing I only give you three words of
the title, can you tell what language it is?
Turns out you can.
It requires a little bit more work.
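For the case in the anecdote, where two books are printed in different scripts, a couple of title words are enough to tell them apart by looking only at the characters. The sketch below is a heuristic under stated assumptions: it treats the first word of each character's Unicode name (as returned by the standard `unicodedata.name`) as the script, and the function name `dominant_script` is hypothetical. It identifies the script, not the language, so it does not separate French from German, which share the Latin script and need statistical models over short strings.

```python
import unicodedata
from collections import Counter


def dominant_script(text):
    """Guess the script of a short string from Unicode character names.

    Heuristic assumption: the first word of a letter's Unicode name
    (for example GUJARATI, TELUGU, LATIN) names its script.
    """
    scripts = Counter()
    for ch in text:
        # Count letters and combining marks; skip digits, spaces, punctuation.
        if ch.isalpha() or unicodedata.category(ch).startswith("M"):
            name = unicodedata.name(ch, "")
            if name:
                scripts[name.split()[0]] += 1
    return scripts.most_common(1)[0][0] if scripts else None


print(dominant_script("ગુજરાતી"))          # GUJARATI
print(dominant_script("తెలుగు"))           # TELUGU
print(dominant_script("Le Petit Prince"))  # LATIN
```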
So there's the issue of regular translation like the
one that you can get from Google.
And there's also what we called reading assistant or
Let us say you know a little bit of Italian but not a lot.
And you are starting to read, but you come up against, in an
email, some word or phrase that you don't understand.
You should be able to just select it and get immediately
a translation, but one that is much better than simple
translation because now it's in context.
So most of us have this problem, especially if you're
a non-native English speaker and you come across a
word or a phrase, or even if you're an English speaking
If you're reading Shakespeare, and there are lots of phrases
there that are not used by us today, and you need to be able
to have a translation assistant or reading
assistance which will give you that.
It's an interesting sub-problem to solve and can
be solved for all the different languages.
Transliteration also turns out to be an important problem in
the right language because many languages
have different scripts.
So it is very interesting, especially coming from India.
It turns out, in India, the sounds of all these languages
are the same no matter what language, but
the letters are different.
But in Europe, the letters are the same, but
the sounds are different.
So when Unicode was invented by the people in Xerox PARC,
they thought in all the world, all you had to do was make
sure that for the same letter, no matter in what language it
is, you have the same 8-bit code or
whatever out of the 16-bit.
But that turned out to be exactly the wrong thing for
You want to be able to say, no matter what language, it is ka
or ke or whatever the syllable happens to be.
And that's not easily identifiable in the Unicode
system as it's currently done.
OK, so there are lots of issues on right language.
There are many different techniques that have been
tried, statistical translation, knowledge-based
translation, example-based translation.
I'm not going to give you a tutorial on all of these, and
you can quickly find out about them.
Where things are going now is these so called multi-engine
It turns out no one of these techniques is perfect.
They work some of the time.
They don't work other times.
So rather than depending on the best technique using only
one, you say, can you actually build a hybrid system that
gets translations from all of them, and then figure out how
to use them.
That's the multi-engine machine translation, and here
is an example of different kinds of translation, EBMT and
semantic, sentence, syntactic [INAUDIBLE]
and so on.
So here is an EBMT example, example-based machine
And multi-engine translation.
I won't go through all of this.
You get three different translations, and it turns out
you need to figure out a way of
selecting the right ones.
That itself turns out to be a hard problem.
How do you know this is the right set of things?
And that is determined by looking at the naturalness of
There are ways--
if you have ever worked on natural language generation as
opposed to analysis, this is one of the issues that you
spend time on.
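Selecting among candidate translations by naturalness can be illustrated with a very small language model. This sketch assumes an add-one-smoothed bigram model trained on a toy English corpus as a stand-in for the much larger models a real multi-engine system would use; the function names and the example sentences are hypothetical.

```python
import math
from collections import Counter


def train_bigram_model(corpus):
    """Count bigrams and their left contexts in a whitespace-tokenized corpus."""
    tokens = corpus.lower().split()
    bigrams = Counter(zip(tokens, tokens[1:]))
    contexts = Counter(tokens[:-1])
    vocab_size = len(set(tokens))
    return bigrams, contexts, vocab_size


def naturalness(sentence, model):
    """Average add-one-smoothed bigram log-probability of a sentence."""
    bigrams, contexts, vocab_size = model
    tokens = sentence.lower().split()
    pairs = list(zip(tokens, tokens[1:]))
    if not pairs:
        return float("-inf")
    log_prob = sum(
        math.log((bigrams[pair] + 1) / (contexts[pair[0]] + vocab_size))
        for pair in pairs
    )
    return log_prob / len(pairs)


def pick_most_natural(candidates, model):
    """Return the candidate the language model finds most fluent."""
    return max(candidates, key=lambda s: naturalness(s, model))


model = train_bigram_model(
    "the cat sat on the mat . the dog sat on the rug ."
)
candidates = [
    "cat the on sat rug the",   # word salad from one hypothetical engine
    "the cat sat on the rug",   # fluent output from another
]
print(pick_most_natural(candidates, model))  # the cat sat on the rug
```

Fluency alone does not guarantee the translation is faithful to the source, which is part of why choosing among engines remains a hard problem.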
OK, what we can do in new languages.
What we cannot do yet is an important set of things.
And at CMU we work on what we call orphan languages, obscure
languages that Google may not be interested in or most
commercial organizations would not be interested in.
The reason we work on these is we get funded by DARPA.
DARPA wants to know how to translate Pushtu into English
or Swahili or some other language.
And that's not something that you would normally consider as
a research topic.
OK, the next interesting aspect of Bill of Rights is
the right level of detail.
And there are a number of aspects of level of detail, and I will
present some of them here.
One of them is this issue of if you have a document, if I
have a report that came up on a Google search, and I want to
read it, I don't really have the time to
go through 300 pages.
If there was an automatic summarization technology that
would produce a summary for me, that would be very useful.
And there are such tools and techniques.
They're not good, but they're OK.
And people stop at that point.
It's really not adequate.
What you need is a hierarchical summarization.
You need a one line statement, a headline saying, what is his
And you need a paragraph, an abstract or something, and an
executive summary, and a regular summary.
And so you can have four or five levels of detail of the
And so one way we characterize it is we say, 10 word summary,
100 word summary, 1,000 word summary,
a 10,000 word summary.
And so information structuring here is not 10 words.
It's 20 I guess.
Headline, abstract, summary and document.
And the scope of the summary can also be different in
You can see what the issues are.
And this two by two matrix is very illustrative of the kind
of problems that you run into.
One of them is when you just go do a generic query like you
do on Google.
And there, if you're asking the question, do I want to
read any further?
What you want is a short abstract in the right column,
And if I want a summary where I don't have the time to read
a 100 page book or a 300 page book, give me a summary of 10
pages or 20 pages, then those are the executive summaries.
But neither of them are appropriate if you're actually
looking for a highly focused response to a question where
you, say, want to filter the search engine results by
clustering them or grouping them to determine whether you
want to read any further at all.
And the query-relevant summarization for busy people
is, I don't want to read this article.
I want an answer.
And it turns out--
I used to think this was only important for busy people--
it's also relevant to an
illiterate person in a village.
So if this person goes to a computer or pays somebody to
do a search for him and says, I have AIDS.
What do I do?
What they don't want is 10,000 or 100,000 links to AIDS
They are not competent to read and summarize and analyze and
integrate the information that is there.
They are illiterate, or semi-literate.
So what you want is something that will actually understand
that knowledge and say, oh, you have AIDS?
You need this cocktail.
And then you say, where do I get it?
And you get another answer.
So it turns out the kinds of information that's needed when
you query with a highly focused requirement is essentially
you want the computer to solve the problem.
And so some of the types of work that Ask Jeeves and
others tried to do earlier is related to this.
And I think Google Answers might be related to it.
For whatever reason, Google Answers has never taken off,
or at least I haven't seen a lot of buzz about it, but
probably people use it still.
So maybe if we remove the cost angle, it'll be more
But solving problems is different than just providing
One of these is passive knowledge, and the other one
is active knowledge, right?
So in the right medium,
finding information and providing information in
non-textual media turns out to be very important also.
And there are a number of research issues in these areas
that many of us are working on.
I'm sure there may be work here too.
So, in conclusion, the purpose of this presentation was to
highlight the kind of research problems and concerns some of
us in academia are grappling with.
No doubt, there are people at Google research and Google
engineering that are also working on these problems. And
the issue is, how do we make substantial progress?
And the only way that will happen, I think, is for us to
begin to have professional conferences in each of these
topics, maybe organized by Google.
A one day or two day or three day symposium that will
actually bring all the people working in this thing in a
public disclosure, without having to sign an NDA or
something, where you can actually present the ideas and
And that's one of the things I think we need, especially in
areas where things are still very fuzzy as to what the right
way of approaching the problems is.
MALE SPEAKER: OK, any questions?
Otherwise we can go have coffee.
I had a question.
You were talking about trying to get some diversity of
results that [INAUDIBLE]
from the main cluster.
AUDIENCE: But if you do that without having some good
[INAUDIBLE] capability, then people lose the idea that most
people are talking about this thing in the middle, and all
the fringe concepts are considered the same way as the
You need some way to balance that.
RAJ REDDY: So the way to look at it is grouping together
relevant sub-topics into clusters doesn't imply you
have to throw away the details.
It might all be there.
But it's like Google News does, right?
It gives one headline, and one may be the main story, and
then it says 74 other stories.
And so if somebody wants to see how the same story was
reported in multiple different locations, you can get it.
But the interesting thing is in that scenario, we never
have the patience or time to go read all the 74 stories.
What I would like to see are those three stories that
provide completely different
perspectives of the same story.
You take any of these stories when there are controversies
in Israeli or Palestinian newspapers, the same story
will have a very different way of presentation.
And that's not just there.
It can be between here and Mexico, it can be any place
where there are differences of opinion.
Between Google and Microsoft, the same story would be
reported slightly differently.
And what I want to see are those specific differences.
In what way are they different?
Because I don't have the time to read hundreds of pages to
figure out for me what the differences are.
If I had a human assistant, I'd say, you go figure it out
and tell me what the differences are.
And what we're looking for is the equivalent of [INAUDIBLE].
MALE SPEAKER: OK, thank you.