Explore the capabilities of search engines beyond Google, including web crawling, information retrieval systems, document ranking, and PageRank algorithm. Discover if there are alternatives that can surpass Google.
Search Engine Technology: Can We Do Better Than Google? Weiyi Meng, Department of Computer Science, State University of New York at Binghamton, meng@cs.binghamton.edu, www.cs.binghamton.edu/~meng/meng.html, October 2004
Some Facts about Google • One of the largest search engines on the Web, with 4.28 billion Web pages and 880 million images indexed. • 44% of English-language searches are conducted through Google. • 550 million searches are performed every day. • Google is powered by tens of thousands of computers (Linux clusters).
Some Facts about Google • Google was founded in 1998 by two Stanford University graduate students, Larry Page and Sergey Brin. • Held its IPO in August 2004. • Google is currently valued at about 40 billion dollars. • Larry Page and Sergey Brin are the newest members of the billionaires' club. • Has close to 2,000 employees, 60% of them new millionaires.
How Does Google Work? Google, like any other search engine, consists of two major components: • Index engine: Gathers (crawls) Web pages from all over the Web. • The program is often called a Web crawler, a Web robot, or a Web spider. • Retrieval engine: Retrieves Web pages that match user queries from the crawled Web page collection. • The science behind this component comes from the computer science discipline known as information retrieval (IR).
How Does a Web Crawler Work? A Web crawler is a program for fetching Web pages from the Web. Main idea: • Place some initial URLs into a URL queue. • Repeat the steps below until the queue is empty: • Take the next URL from the queue and fetch the Web page using HTTP. • Extract new URLs from the downloaded Web page and add them to the queue.
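The queue-driven loop on this slide can be sketched in Python. To stay runnable without network access, the sketch crawls a small in-memory dictionary standing in for the Web. `TOY_WEB`, `fetch_page`, and `crawl` are illustrative names, and the `seen` set (which keeps a URL from being queued twice) is an assumption that the slide does not state.

```python
from collections import deque

# Hypothetical stand-in for the Web: URL -> URLs linked from that page.
TOY_WEB = {
    "http://a.example/": ["http://b.example/", "http://c.example/"],
    "http://b.example/": ["http://c.example/"],
    "http://c.example/": ["http://a.example/", "http://d.example/"],
    "http://d.example/": [],
}

def fetch_page(url):
    """Pretend to fetch a page over HTTP and return the URLs it links to."""
    return TOY_WEB.get(url, [])

def crawl(seed_urls):
    """Return the URLs visited, in the order they were fetched."""
    queue = deque(seed_urls)   # the URL queue from the slide
    seen = set(seed_urls)      # assumption: never queue the same URL twice
    visited = []
    while queue:               # repeat until the queue is empty
        url = queue.popleft()  # take the next URL from the queue
        visited.append(url)
        for link in fetch_page(url):  # extract new URLs from the page
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return visited

order = crawl(["http://a.example/"])
```

A real crawler would replace `fetch_page` with an HTTP request plus HTML link extraction, and would normally also honor politeness rules, which the slide leaves out.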
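How Does an IR System Work? [Figure: IR system diagram. Documents are pre-processed into a document representation stored in a database; the user's query is pre-processed into a processed query; the system produces ranked doc IDs and returns the retrieved documents to the user.]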
Document Pre-processing Document indexing is a process of representing the contents of each document in a document collection so as to facilitate efficient and accurate retrieval of desired documents from the collection. Two major tasks: • Determine index terms: Determine the set of terms to be used to represent each document. • Term weighting: Determine the weight/significance of each term in representing a document.
Term Weighting The weight/importance of a term in a document d is determined by two statistics: • Term frequency (tf): the number of times the term appears in the document. • Document frequency (df): the number of documents in the collection that contain the term.
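The slide names the two statistics but not how they are combined. One common scheme, assumed here rather than taken from the lecture, is tf × log(N / df), where N is the number of documents: a term counts for more when it is frequent in the document and rare in the collection. The function name `term_weights` and the toy documents are illustrative.

```python
import math
from collections import Counter

def term_weights(docs):
    """Return {doc_id: {term: tf * log(N / df)}} for a small collection."""
    tokenized = {doc_id: text.lower().split() for doc_id, text in docs.items()}
    n_docs = len(tokenized)
    df = Counter()                      # document frequency of each term
    for terms in tokenized.values():
        df.update(set(terms))           # count each term once per document
    weights = {}
    for doc_id, terms in tokenized.items():
        tf = Counter(terms)             # term frequency within this document
        weights[doc_id] = {t: tf[t] * math.log(n_docs / df[t]) for t in tf}
    return weights

toy_docs = {
    "d1": "apple banana apple",
    "d2": "banana cherry",
    "d3": "banana date",
}
weights = term_weights(toy_docs)
```

In this toy collection "banana" occurs in every document, so its weight is zero everywhere, while "apple" (twice in `d1`, nowhere else) gets the largest weight in `d1`.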
Organizing Pre-processed Documents • The organization is term-oriented, not document-oriented. • For each term, the following may be stored: • Which documents have the term and the term weights • How many documents have the term • Where the term appears in each document • Special data structures (e.g., inverted file index, hash table, etc.) are used to store the data for efficient access.
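A minimal sketch of a term-oriented organization, assuming a plain Python dictionary is an acceptable stand-in for an inverted file: each term maps to the documents that contain it and the positions where it appears. Document frequency then falls out as the number of entries for the term. The names `build_inverted_index` and `sample_docs` are illustrative.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to {doc_id: [positions of the term in that document]}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for position, term in enumerate(text.lower().split()):
            index[term].setdefault(doc_id, []).append(position)
    return index

sample_docs = {
    1: "web search engine",
    2: "web crawler fetches web pages",
}
index = build_inverted_index(sample_docs)

postings = index["web"]        # which documents have the term, and where
doc_frequency = len(postings)  # how many documents have the term
```

Term weights could be stored next to the position lists in the same structure; they are omitted here to keep the sketch short.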
Ranking Documents for Queries For a given query, a document is typically ranked based on the following information: • How many of the query terms the document contains. • How important those terms are in the document. • How close together the terms appear in the document. • The length of the document. • Where the terms appear in the document.
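As a rough illustration, the sketch below scores documents using three of the signals listed: which query terms a document contains, how important they are (a tf × idf style weight, assumed here), and document length. Term proximity and term location are left out. The scoring formula and the name `rank_documents` are assumptions made for this example, not the ranking function of any particular search engine.

```python
import math
from collections import Counter

def rank_documents(query, docs):
    """Return doc ids sorted by a toy relevance score, best first."""
    tokenized = {d: text.lower().split() for d, text in docs.items()}
    n_docs = len(tokenized)
    df = Counter()
    for terms in tokenized.values():
        df.update(set(terms))
    scores = {}
    for d, terms in tokenized.items():
        tf = Counter(terms)
        score = 0.0
        for q in set(query.lower().split()):
            if tf[q]:
                # Rarer terms count more; the +1 keeps every match positive.
                score += tf[q] * (math.log(n_docs / df[q]) + 1.0)
        # Length normalization so long documents do not win by size alone.
        scores[d] = score / math.sqrt(len(terms)) if terms else 0.0
    return sorted(scores, key=scores.get, reverse=True)

pages = {
    "p1": "binghamton university computer science",
    "p2": "computer science search engine research",
    "p3": "cooking recipes",
}
ranked = rank_documents("search engine", pages)
```

Only `p2` contains the query terms, so it is ranked first; the other two pages score zero.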
What is Special about Google? • Google’s document ranking algorithm incorporates the PageRank of each document. • The PageRank of a page is a quantitative measure of the weighted popularity of the page among all Web page authors. • PageRank is a measure of a web page’s global importance based on the backlinks of web pages. • A web page is more globally important if it is pointed to by more pages and/or by more important pages.
Computing PageRank Example: Suppose the Web graph is: [Figure: Web graph over pages A, B, C, and D.] Suppose initially the PageRank of each page is 0.25. After 30 iterations: PageRank(A) = PageRank(B) = 0.176, PageRank(C) = 0.332, PageRank(D) = 0.316.
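The iterative computation can be sketched as below. The graph drawn on the slide did not survive extraction, so `HYPOTHETICAL_GRAPH` is an assumed four-page graph, and the damping factor of 0.8 is likewise an assumption. The update rule is the commonly described one: each page receives (1 - d)/N plus d times the rank shared equally among the outlinks of the pages pointing to it.

```python
def pagerank(links, damping=0.8, iterations=30):
    """Iteratively compute PageRank for a {page: [pages it links to]} graph."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}   # every page starts at 1/N
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, outlinks in links.items():
            if outlinks:
                share = damping * rank[page] / len(outlinks)
                for target in outlinks:
                    new_rank[target] += share
            else:
                # A page with no outlinks spreads its rank over all pages.
                for target in pages:
                    new_rank[target] += damping * rank[page] / n
        rank = new_rank
    return rank

# Hypothetical graph: A and B point to C, C points to D, D points to A and B.
HYPOTHETICAL_GRAPH = {
    "A": ["C"],
    "B": ["C"],
    "C": ["D"],
    "D": ["A", "B"],
}
ranks = pagerank(HYPOTHETICAL_GRAPH)
```

With this hypothetical graph and a damping factor of 0.8, the ranks come out within roughly 0.001 of the values quoted on the slide, though that does not establish that this is the graph the slide showed.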
Why Is PageRank Useful? Suppose we type in “Yahoo” as a query. • Without PageRank, a short document with relatively more occurrences of “Yahoo” is likely to be retrieved. • With PageRank, the homepage of Yahoo is more likely to be retrieved.
Can We (Others) Beat Google? • Historically, several famous search engines were developed earlier than Google: • WebCrawler, Lycos, AltaVista, Yahoo, … • But Google has become a phenomenon (Have you googled it?) and is now the largest pure search engine company. • It is going to be extremely difficult to beat Google! • However, others are not discouraged and are trying: • Yahoo has bought AltaVista, Fast Search, Overture, … • More are trying: Amazon, Microsoft, Teoma, Vivisimo, …, Webscalers
Can We Do Better Than Google? Question: Can we do better than Google? Answer: Hard to say, because Google is also improving constantly. Question: Can we do better than the current Google? Answer: Definitely! Because Google is not perfect, not even close! Unfortunately, “can do better” does not always translate to “can beat”.
How To Do Better Than Google? We can try to do better than Google in two aspects: • Be more accurate in finding useful documents for each query. • Have wider coverage of the Web.
How To Do Better Than Google? How to be more accurate than Google? • Add more personalization • Not all “apples” are comparable. • Different perspectives on AIDS between the public and the scientists • Add searcher-based recommendation • Collaborative filtering: If many people say something is good, it probably is good (www.directhit.com) • Most search engines use only content-based recommendation • Google also uses author-based recommendation
How To Do Better Than Google? How to be more accurate than Google? (Cont.) • Consider subject-specific popularity (www.teoma.com) • Authoritative pages in relevant Web communities are likely to be good pages • Organize the search results better (www.vivisimo.com) • Clustering allows faster identification of relevant results
How To Do Better Than Google? How to be more accurate than Google? (Cont.) • Improve the quality of user queries • Add related terms to a query: “aids” and “hiv” • Understand user queries better • Differentiate the senses of query terms: “bank fraud” versus “bank of Susquehanna” • Retrieve information at an appropriate level of detail • Most current search engines just return Web pages • Sometimes it is better to return only extracted passages • It is even better to return answers to specific questions
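Adding related terms to a query can be sketched with a lookup table. The table `RELATED_TERMS` is hypothetical and hand-written for this example; a real system would have to derive such relations from a thesaurus or from usage data, which the slide does not specify.

```python
# Hypothetical table of related terms, hand-written for illustration.
RELATED_TERMS = {
    "aids": ["hiv"],
    "hiv": ["aids"],
}

def expand_query(query, related=RELATED_TERMS):
    """Return the query terms followed by any related terms not already present."""
    terms = query.lower().split()
    expanded = list(terms)
    for term in terms:
        for extra in related.get(term, []):
            if extra not in expanded:
                expanded.append(extra)
    return expanded

expanded_terms = expand_query("AIDS treatment")
```

Here the query "AIDS treatment" gains the extra term "hiv", so documents that mention only "hiv" can still match.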
How To Do Better Than Google? How to have wider coverage than Google? • Surface Web versus Deep Web • Surface Web: Web pages that are on the Web (i.e., have URLs). • Bow Tie theory about the Web (figure from IBM) • Deep Web: Documents and database records that are not on the Web but are searchable through the interfaces of some Web-based search systems.
How To Do Better Than Google? How to have wider coverage than Google? • The entire Web is estimated to contain more than 500 billion documents and database records. • The Surface Web contains only about 1% of the entire Web.
How To Do Better Than Google? • One way to have wider coverage than Google is to build a metasearch engine on top of many search engines on the Web. • A metasearch engine is a system that provides unified access to multiple existing search engines. • Metasearch engines can combine the coverage of multiple search engines. • Many metasearch engines exist on the Web: Dogpile, Mamma, ProFusion, Inquirus, …
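Metasearch Engine [Figure: metasearch engine architecture. The user sends a query to the metasearch engine's user interface and receives the result; the interface forwards the query to search engine 1, search engine 2, ..., search engine n, which search text source 1, text source 2, ..., text source n respectively.]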
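The architecture can be sketched as a function that forwards one query to several engines and merges their ranked lists. The three engine functions below are hypothetical stubs returning fixed results, and the merging rule (summing 1/rank across engines) is one simple choice assumed for illustration, not the method of any metasearch engine named in these slides.

```python
from collections import defaultdict

# Hypothetical component engines: each returns a ranked list of URLs.
def engine_one(query):
    return ["http://example.com/a", "http://example.com/b", "http://example.com/c"]

def engine_two(query):
    return ["http://example.com/b", "http://example.com/d"]

def engine_three(query):
    return ["http://example.com/c", "http://example.com/b"]

def metasearch(query, engines):
    """Send the query to every engine and merge the ranked lists into one."""
    scores = defaultdict(float)
    for engine in engines:
        for rank, url in enumerate(engine(query), start=1):
            scores[url] += 1.0 / rank   # higher positions contribute more
    # Sort by descending score; break ties by URL for a stable order.
    return sorted(scores, key=lambda url: (-scores[url], url))

merged = metasearch("search engines", [engine_one, engine_two, engine_three])
```

The merged list covers every URL returned by any engine, which is the coverage benefit described for metasearch, and URLs ranked well by several engines rise to the top.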
Research at Binghamton University • We have been collaborating with researchers from the University of Illinois at Chicago, the University of Louisiana at Lafayette, and Webscalers on advanced search and metasearch technologies. • WebScales project: Develop an extremely large-scale metasearch engine that aims to connect to all search engines on the Web. • SELEGO: Software for creating metasearch engines on the fly. • AllInOneNews: A metasearch engine for searching news.
Research at Binghamton University • Personalized search technology. • Improve the quality of user queries. • Differentiate the senses of query terms. • Integrate Web databases automatically.