Tuesday, September 16, 2008

How Does Google Index and Crawl?

Building an Index
To build an index, we "invert" the crawl data: instead of scanning every document for each word, we rearrange the data so that each word maps to the list of documents that contain it. For example, the word "civil" might occur in documents 3, 8, 22, 56, 68, and 92, while the word "war" might occur in documents 2, 8, 15, 22, 68, and 77.
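To make the "inversion" concrete, here is a minimal Python sketch. The function name build_inverted_index, the toy documents, and the whitespace-based word splitting are all assumptions for illustration; a real search engine's indexer is far more elaborate.

```python
from collections import defaultdict

def build_inverted_index(documents):
    """Map each word to the sorted list of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        # Naive tokenization (an assumption): lowercase and split on whitespace.
        for word in text.lower().split():
            index[word].add(doc_id)
    # Sort the IDs so each word's list is in ascending document order.
    return {word: sorted(ids) for word, ids in index.items()}

# Hypothetical toy corpus: document ID -> text.
documents = {
    1: "the civil war began",
    2: "war and peace",
    3: "civil engineering handbook",
}

index = build_inverted_index(documents)
print(index["civil"])  # [1, 3]
print(index["war"])    # [1, 2]
```

Looking up a word is now a single dictionary access instead of a scan over every document.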

Once we've built our index, we're ready to rank documents and determine how relevant they are. Suppose someone comes to Google and types in civil war. In order to present and score the results, we need to do two things:

  1. Find the set of pages that contain the user's query somewhere
  2. Rank the matching pages in order of relevance

The list of documents that contain a word is called a "posting list," and finding the documents that contain both words is called "intersecting" the posting lists. In the example above, documents 8, 22, and 68 are common to both lists.
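A posting-list intersection can be sketched in Python as follows. This assumes both lists are sorted by document number and uses a simple two-pointer merge; the function name intersect is made up for this example, and this is not necessarily how Google implements it.

```python
def intersect(list_a, list_b):
    """Return the document IDs present in both sorted posting lists."""
    i = j = 0
    common = []
    while i < len(list_a) and j < len(list_b):
        if list_a[i] == list_b[j]:
            # Same document in both lists: it matches the whole query.
            common.append(list_a[i])
            i += 1
            j += 1
        elif list_a[i] < list_b[j]:
            i += 1  # advance whichever list has the smaller ID
        else:
            j += 1
    return common

# Posting lists from the example in the text.
civil = [3, 8, 22, 56, 68, 92]
war = [2, 8, 15, 22, 68, 77]

print(intersect(civil, war))  # [8, 22, 68]
```

Because both lists are sorted, each one is walked only once, so the work grows with the combined length of the two lists.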

Ranking Results 
Now we have the set of pages that contain the user's query somewhere, and it's time to rank them in terms of relevance. Google uses many factors in ranking. Of these, the PageRank algorithm might be the best known. PageRank evaluates two things: how many links there are to a web page from other pages, and the quality of the linking sites. With PageRank, five or six high-quality links from websites such as www.cnn.com and www.nytimes.com would be valued much more highly than twice as many links from less reputable or less established sites.
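The idea that a link counts for more when it comes from a highly ranked page can be illustrated with the simplified textbook form of PageRank, computed by repeated iteration. This is only a sketch: the damping value of 0.85, the iteration count, the function name pagerank, and the three-page toy graph are assumptions, and Google's actual ranking combines many more signals.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Simplified PageRank. `links` maps each page to the pages it links to."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}  # start with equal rank

    for _ in range(iterations):
        # Every page receives a small baseline share of rank.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page splits its rank evenly among the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its rank over all pages.
                share = damping * rank[page] / n
                for other in pages:
                    new_rank[other] += share
        rank = new_rank
    return rank

# Hypothetical toy web: "a" and "b" both link to "c"; "c" links back to "a".
toy_links = {"a": ["c"], "b": ["c"], "c": ["a"]}
ranks = pagerank(toy_links)

for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 3))
```

In this toy graph, page "b" has no incoming links and ends up with the lowest score, while "a" scores well with just one incoming link because that link comes from the well-linked page "c".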



1 comment:

Deepak said...

Useful basic info. Good work malai..