Building an index. To do this, we "invert" the crawl data; instead of having to scan for each word in every document, we juggle our data in order to list every document that contains a certain word. For example, the word "civil" might occur in documents 3, 8, 22, 56, 68, and 92, while the word "war" might occur in documents 2, 8, 15, 22, 68, and 77.
Once we've built our index, we're ready to rank documents and determine how relevant they are. Suppose someone comes to Google and types in civil war. In order to present and score the results, we need to do two things:
- Find the set of pages that contain the user's query somewhere
- Rank the matching pages in order of relevance
Now we have the set of pages that contain the user's query somewhere, and it's time to rank them in terms of relevance. Google uses many factors in ranking. Of these, the PageRank algorithm might be the best known. PageRank evaluates two things: how many links there are to a web page from other pages, and the quality of the linking sites. With PageRank, five or six high-quality links from websites such as www.cnn.com and www.nytimes.com would be valued much more highly than twice as many links from less reputable or established sites.
1 comment:
Useful basic info. Good work malai..
Post a Comment