Information Retrieval Part 2 Sissi 11/17/2008
Information Retrieval cont. • Web-Based Document Search • Page Rank • Anchor Text • Document Matching • Inverted Lists
Page Rank • $PR(A) = d + (1 - d) \sum_{i} \frac{PR(T_i)}{C(T_i)}$ • PR(A): the page rank of page A. • C(T): the number of outgoing links from page T. • d: minimum value assigned to any page. • $T_i$: a page pointing to A.
Algorithm of Page Rank • Step 1: Use the PageRank equation to compute the PageRank of each page in the collection, using the latest PageRank values of the pages that link to it. • Step 2: Repeat step 1 until no PageRank changes significantly.
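Example • Link structure implied by the equations: A links to B and C; B and C each link only to A. • Initial values: PR(A)=PR(B)=PR(C)=1, d=0.1 • First iteration: • PR(A)=0.1+0.9*(PR(B)+PR(C)) =0.1+0.9*(1+1) =1.9 • PR(B)=0.1+0.9*(PR(A)/2) =0.1+0.9*(1.9/2) =0.955 • PR(C)=0.1+0.9*(PR(A)/2) =0.1+0.9*(1.9/2) =0.955 • After repeated iterations the values settle at approximately PR(A)=1.48, PR(B)=0.76, PR(C)=0.76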
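The iteration in this example can be sketched in a few lines of Python. This is a minimal sketch, not a production implementation: it assumes the update rule PR(A) = d + (1 - d) * (sum of PR(T)/C(T) over every page T that links to A), which matches the arithmetic in the example, and it assumes every page appears as a key in the link table. The names `pagerank`, `links`, `tol`, and `max_iter` are illustrative and do not come from the slides.

```python
def pagerank(links, d=0.1, tol=1e-6, max_iter=1000):
    """Iterate the PageRank update until no value changes by more than tol.

    links maps each page to the list of pages it links to (hypothetical format).
    """
    pages = list(links)
    pr = {p: 1.0 for p in pages}  # initial value 1 for every page, as in the example

    # For each page, collect the pages that point to it.
    inbound = {p: [] for p in pages}
    for src, outs in links.items():
        for dst in outs:
            inbound[dst].append(src)

    for _ in range(max_iter):
        biggest_change = 0.0
        for p in pages:
            new = d + (1 - d) * sum(pr[t] / len(links[t]) for t in inbound[p])
            biggest_change = max(biggest_change, abs(new - pr[p]))
            pr[p] = new  # updated in place, so later pages use the latest values
        if biggest_change < tol:
            break
    return pr


# Link structure assumed from the example: A -> B, C; B -> A; C -> A.
links = {"A": ["B", "C"], "B": ["A"], "C": ["A"]}
first = pagerank(links, max_iter=1)  # values after a single pass
final = pagerank(links)              # values after convergence
```

With `max_iter=1` the sketch reproduces the first-iteration arithmetic above, and the converged values land close to the final figures quoted on the slide.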
Anchor Text • The anchor text is the visible, clickable text in a hyperlink. • For example: • <a href="http://www.wikipedia.org">Wikipedia</a> • The anchor text is Wikipedia; the complex URL http://www.wikipedia.org/ displays on the web page as Wikipedia, contributing to clean, easy-to-read text.
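To make the distinction between the URL and its anchor text concrete, here is a minimal sketch that pulls both out of an HTML fragment using Python's standard `html.parser` module. The class name `AnchorTextParser` and the sample markup are illustrative assumptions; a real crawler would need to handle nested tags and malformed HTML more carefully.

```python
from html.parser import HTMLParser


class AnchorTextParser(HTMLParser):
    """Collect (href, anchor text) pairs from <a> elements."""

    def __init__(self):
        super().__init__()
        self.links = []    # finished (href, text) pairs
        self._href = None  # href of the <a> currently open, if any
        self._text = []    # text pieces seen inside the open <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


parser = AnchorTextParser()
parser.feed('<p>See <a href="http://www.wikipedia.org">Wikipedia</a> for details.</p>')
links_found = parser.links
```

Here the URL is what the browser follows, while the second element of each pair is the text a reader sees and a search engine can index.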
Anchor Text • Anchor text usually gives the user relevant descriptive or contextual information about the content of the link’s destination. • The anchor text may or may not be related to the actual text of the URL of the link. • The words in the anchor text can influence the ranking that search engines give the linked page.
Common Misunderstanding • Webmasters sometimes tend to misunderstand anchor text. • Instead of turning appropriate words inside of a sentence into a clickable link, webmasters frequently insert extra text.
Example • Today our troops have liberated another country from tyranny. To know more, click here. (Only “click here” is the link.) • The more concise way of coding that would be: Today our troops have liberated another country from tyranny. (The link is placed on appropriate words within the sentence itself.)
Anchor Text • This proper method of linking is beneficial not only to users, but also to the webmasters as anchor text holds significant weight in search engine ranking. • Most search engine optimization experts recommend against using “click here” to designate a link.
Google Bomb • In September 2000, the first Google bomb was created by Hugedisk Men’s Magazine, a now-defunct online humor magazine. • It linked the text “dumbmotherfucker” to a site selling George W. Bush-related merchandise. • A Google search for this term would return the pro-Bush online store as its top result. • After a fair amount of publicity, the merchandise site retained lawyers and sent a cease-and-desist letter to Hugedisk, thereby ending the Google bomb.
Existing Google Bombs • A search for “more evil than Satan” returns the Microsoft home page. • A search for “miserable failure”, “worst president”, or “unelectable” returns the biography of George W. Bush on the White House website. • A search for “out of touch executives” or “out of touch management” returns the Google home page. • Other commercial uses
Document Matching • An arbitrarily long document is the query, not just a few key words. • But the goal is still to rank and output an ordered list of relevant documents. • The most similar documents are found using the measures described earlier.
Generalization of searching • Matching a document to a collection of documents looks like a tedious and expensive operation. • Even for a short query, comparison to all large documents in the collection implies a relatively intensive computation task.
Example of document matching • Consider an online help desk, where a complete description of a problem is submitted. • That document could be matched to stored documents, hopefully finding descriptions of similar problems and solutions without having the user experiment with numerous key word searches.
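The help-desk scenario can be sketched as follows. The slides refer to similarity measures "described earlier" that are not part of this section, so this sketch substitutes cosine similarity over raw word counts as an assumption; the function names and the stored documents are hypothetical and serve only as an illustration.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two word-count vectors (Counters)."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def rank_documents(query_text, stored):
    """Return names of stored documents, most similar to the query document first."""
    q = Counter(query_text.lower().split())
    scored = [
        (cosine(q, Counter(text.lower().split())), name)
        for name, text in stored.items()
    ]
    # Keep only documents that share at least one word with the query.
    return [name for score, name in sorted(scored, reverse=True) if score > 0]


# Hypothetical stored problem descriptions.
stored = {
    "printer": "printer will not print after driver update",
    "network": "laptop cannot connect to wireless network",
    "billing": "invoice shows wrong amount",
}
query = "my printer stopped working after the latest driver update and will not print"
ranking = rank_documents(query, stored)
```

Note that this version compares the query with every stored document, which is exactly the cost the previous slide warns about.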
Summary • Search engines and document matchers are not focused on classification of new documents. • Their primary goal is to retrieve the most relevant documents from a collection of stored documents.
Inverted Lists • What is an inverted list? • Instead of documents pointing to words, a list of words pointing to documents is the primary internal representation for processing queries and matching documents.
Example • Suppose the query contains words 100 and 200. • First process the inverted list W(100), adding to the similarity score S(i) of each document i on that list: S(1)=0+1 S(2)=0+1 … • Then process W(200) in the same way: S(2)=1+1 …
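The scoring procedure in this example can be sketched as below. This is a minimal illustration under stated assumptions: each matching word adds 1 to a document's score, as in the slide's arithmetic, and the document contents and function names are hypothetical, chosen so that word 100 occurs in documents 1 and 2 and word 200 occurs in document 2.

```python
from collections import defaultdict


def build_inverted_lists(docs):
    """Map each word to the list of document ids that contain it."""
    inverted = defaultdict(list)
    for doc_id, words in docs.items():
        for w in sorted(set(words)):  # count each word once per document
            inverted[w].append(doc_id)
    return inverted


def score(inverted, query_words):
    """Add 1 to S(i) for every query word whose inverted list contains document i."""
    s = defaultdict(int)
    for w in query_words:
        for doc_id in inverted.get(w, []):
            s[doc_id] += 1
    return dict(s)


# Hypothetical collection: document id -> word ids it contains.
docs = {1: [100, 300], 2: [100, 200], 3: [400]}
inverted = build_inverted_lists(docs)
scores = score(inverted, [100, 200])  # S(1) = 1, S(2) = 2; document 3 is never touched
```

Only the documents on the lists for the query words are visited, so documents that share no words with the query cost nothing.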
Summary • The inverted list is the key to the efficiency of information retrieval systems. • The inverted list has helped make nearest-neighbor methods a practical option for prediction.
Conclusion • Information retrieval methods are specialized nearest-neighbor methods, which are well-known prediction methods. • IR methods typically process unlabeled data and order and display the retrieved documents. • The IR methods have no training and induce no new rules for classification.