Uncrawled URLs in search results


>> CUTTS: Okay. I wanted to talk to you today about robots.txt. One complaint that we often

hear is, "I blocked Google from crawling this page in robots.txt and you clearly violated

that robots.txt by crawling that page because it's showing up in Google search results."

A very common complaint, and so, here's how you can debug that. We've had the same robots.txt

handling for years and years and years. And we haven't found any bugs in it for several

years, and so, most of the time, what's happening is this. When someone's saying, "I blocked

example.com/go in robots.txt," it turns out that the snippet that we return in the search

results looks like this. And you'll notice, unlike most search results, there's not some

text here. Well the reason is that we didn't really crawl this page. We did abide by robots.txt.
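
As a rough illustration of what "abiding by robots.txt" means for a crawler, here is a minimal sketch using Python's standard urllib.robotparser. The robots.txt rule, the "Googlebot" user-agent string, and the may_fetch helper are assumptions made up for this example; this is not Google's actual crawler logic.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for example.com; the rule is an assumption
# chosen to match the example.com/go case mentioned in the talk.
ROBOTS_TXT = """\
User-agent: *
Disallow: /go
"""


def may_fetch(url: str, user_agent: str = "Googlebot") -> bool:
    """Return True if the robots.txt rules above allow fetching the URL."""
    parser = RobotFileParser()
    # parse() accepts the file's lines, so no network access is needed.
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, url)


blocked = may_fetch("http://example.com/go")
allowed = may_fetch("http://example.com/about")
print("fetch /go allowed:", blocked)
print("fetch /about allowed:", allowed)
```

A crawler that follows this check never downloads the blocked page, which is why it has no page text to build a snippet from. It can still know that the URL exists from links on other pages.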

You told us this page is blocked so we did not fetch this page. Instead, this is an uncrawled

URL. It's a URL reference. We saw a link to it, but we didn't fetch the page itself. And

so, because we didn't fetch the page itself, that's why you don't see a description or

some sort of snippet right in here. So it's kind of interesting because people often ask,

"Well, why do you show uncrawled URLs? What's the possible use case for that?" And let me

take you over here. At one point, the California Department of Motor Vehicles, which is www.dmv.ca.gov,

had a robots.txt that blocked all search engines. Now, these days, pretty much every site

is savvy enough, you know. At one point, the New York Times and eBay and a whole bunch

of different sites would use robots.txt. So if someone comes to Google and they type in

California DMV, there's pretty much one answer and this is what you want to be able to return.

So even though they were using robots.txt to say, "You're not allowed to crawl this

page," we still saw a lot of people linking into this page and they have the anchor text

California DMV. So if someone comes to Google and they do the query, California DMV,

it makes sense that this is probably relevant to them. And we can return it even though

we haven't crawled the page. So that's the particular policy reason why we can sometimes

show an uncrawled URL, because even though we didn't fetch the URL itself, we still know

from the anchor text of all the people that point to it that this is probably going to

be a useful result. Now the interesting thing is suppose you have a site like Nissan. For

a long time, Nissan, and also Metallica, used robots.txt and had blocked their sites from being crawled.

This was years and years and years ago. Again, what we found is that we can go and find information

in the Open Directory Project, where Nissan and metallica.com were both mentioned.

And so sometimes, you'll see a snippet that looks almost like

it was crawled. But this description does not really come from crawling the page. It

comes from something like the Open Directory Project. So we are able to return

something that can be very helpful to users without violating robots.txt, since we are not crawling

that page. Now if you truly don't want a page to show up, one of the best things that you

can do is let us crawl it and then use a "noindex" meta tag at the top of the page. When

we see a "noindex" tag, we'll drop it from our search results completely. Another option

you have is to use the URL removal tool. So if you block a site completely

in robots.txt, then you can use the URL removal tool and remove an entire site from Google's

index. And then it will never show up that way, either. But it turns out that, for users, being

able to return these uncrawled URLs can be very useful. That's the reason why we do it

and most of the time, probably 90% of the time, when someone says, "You're violating my robots.txt.

You've clearly crawled these pages," what's really happening is we're able to return that

uncrawled URL reference. And so that's what's going on. It's not that we've crawled

those pages. So there are a couple of easy ways to keep your site or

your page from showing up: you can block us in robots.txt and use the URL removal tool, or, on all the

different pages, you can use a "noindex" tag. And then once we crawl that page and

see the "noindex" tag, we'll drop that page from our index completely.
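
To make the "noindex" option concrete, here is a small sketch of how a crawler might detect the tag once it has been allowed to fetch a page. It uses Python's standard html.parser. The sample page, the RobotsMetaParser class, and the has_noindex helper are hypothetical names invented for this illustration, not a description of how Google parses pages.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("name", "").lower() != "robots":
            return
        # The content attribute may hold several comma-separated directives.
        for token in attr.get("content", "").split(","):
            token = token.strip().lower()
            if token:
                self.directives.add(token)


def has_noindex(html: str) -> bool:
    """Return True if the page asks not to be indexed."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return "noindex" in parser.directives


# Hypothetical page, used only for this example.
PAGE = (
    "<html><head>"
    '<meta name="robots" content="noindex">'
    "</head><body>Private page</body></html>"
)

print("noindex requested:", has_noindex(PAGE))
```

The tag can only be seen if the page is actually fetched, so a page that relies on "noindex" must not also be blocked in robots.txt.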
