Search Engine Algorithms: Now and the Future
Vincent Ng (cstyng@comp.polyu.edu.hk)
Department of Computing, Hong Kong Polytechnic University
What are search engines?
• Search engines are huge databases of web page files that have been assembled automatically by machine.
• A search engine is also a program that searches documents for specified keywords and returns a list of the documents in which the keywords were found.
Some History
• 1994: WWWW (World Wide Web Worm), 110,000 pages
• 1997: Yahoo, 2,000,000 pages
• 2000: Search Engine Watch, 2 billion pages
Types of Search Engines
• Individual
  • Individual search engines compile their own searchable databases on the web.
• Meta
  • Meta-searchers do not compile databases. Instead, they search the databases of multiple sets of individual engines simultaneously.
Meta-search Engines
• Do not crawl the web compiling their own searchable databases.
• They search the databases of multiple sets of individual search engines simultaneously, from a single site and using the same interface.
• Meta-searchers provide a quick way of finding out which engines are retrieving the best results for you in your search.
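As a rough illustration of the meta-search idea, the sketch below queries several engines in parallel and merges their ranked lists. The two engine functions, their URLs, and the reciprocal-rank merge rule are all hypothetical stand-ins chosen for the example; the slides do not say how any real meta-searcher combines results.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-ins for individual engines: each maps a query to a ranked URL list.
def engine_a(query):
    return ["a.example/1", "shared.example/x", "a.example/2"]

def engine_b(query):
    return ["shared.example/x", "b.example/1"]

def meta_search(query, engines):
    """Query every engine in parallel and merge the ranked lists."""
    with ThreadPoolExecutor(max_workers=len(engines)) as pool:
        # Submit all queries first so that they run simultaneously.
        futures = {name: pool.submit(fn, query) for name, fn in engines.items()}
        results = {name: future.result() for name, future in futures.items()}

    scores = {}
    for urls in results.values():
        for rank, url in enumerate(urls, start=1):
            # Assumed merge rule: sum of reciprocal ranks across engines.
            scores[url] = scores.get(url, 0.0) + 1.0 / rank
    merged = sorted(scores, key=lambda url: (-scores[url], url))
    return merged, results

merged, per_engine = meta_search("search engines", {"A": engine_a, "B": engine_b})
print(merged)
```

Returning the per-engine lists alongside the merged list keeps it possible to see which engine contributed which result.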
Subject Directories
• Unlike search engines, subject directories are created and maintained by human editors, not electronic spiders or robots.
• The editors review and select sites for inclusion in their directories on the basis of previously determined selection criteria.
• Directories tend to be smaller than search engine databases, typically indexing only the home page or top-level pages of a site.
Developing a search engine
• No database, real time search
• Use a database (e.g. MSSQL, Oracle)
• Build indices of key words
• Simple matching
• Use a database in a server or multiple servers (server farms)
• Develop search indices based on key words or meta-information
• Develop a search structure
[Diagram labels: indexer, spider, alg]
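The "build indices of key words" and "simple matching" steps can be sketched as a small in-memory inverted index. This is only a minimal illustration under assumptions: the page paths and texts are made up, the tokenizer is deliberately naive, and a real engine would keep the index in a database or on a server farm as the slide suggests.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Lower-case the text and split it into alphanumeric keywords."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(pages):
    """Map each keyword to the set of pages that contain it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in tokenize(text):
            index[word].add(url)
    return index

def search(index, query):
    """Simple matching: return the pages that contain every query keyword."""
    words = tokenize(query)
    if not words:
        return set()
    hits = set(index.get(words[0], set()))
    for word in words[1:]:
        hits &= index.get(word, set())
    return hits

# Hypothetical pages standing in for what a spider would have fetched.
pages = {
    "/a": "Search engines index the web",
    "/b": "A spider crawls the web",
    "/c": "Relational database engines",
}
index = build_index(pages)
print(search(index, "web engines"))
```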
Searching on the Web
[Diagram: client, query screen, search engine, DB, indexer, spider, web]
Three Algorithms
• A document is represented by
  • Occurrences of a keyword
  • Hyperlink structures
• Different ranking algorithms
  • Boolean Spread Activation
  • Most-Cited
  • TFxIDF
Boolean Spread Activation
• Based on the occurrence or absence of keywords in a document
• $R_{i,q} = \sum_{j=1}^{M} C_{i,j}$
• A better approach
• $R_{i,q} = \sum_{j=1}^{M} I_{i,j}$
[Diagram labels: $P_i$, $C_{i,j}$, link factor]
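A small sketch may help make the two scores concrete. It assumes that C for a page and keyword is 1 when the keyword occurs in the page and 0 otherwise. The slide does not define I, so the version here (a weight `c1` when the page itself has the keyword, a smaller weight `c2` when only a linked page has it) is an illustrative assumption, as are the weights, pages, and links.

```python
def boolean_score(page_words, query):
    """Sum of C over the query keywords: count the keywords present in the page."""
    return sum(1 for term in query if term in page_words)

def spread_score(page, pages, links, query, c1=2, c2=1):
    """Assumed I: c1 if the page has the keyword, c2 if only a linked page has it, else 0."""
    total = 0
    for term in query:
        if term in pages[page]:
            total += c1
        elif any(term in pages[neighbor] for neighbor in links.get(page, ())):
            total += c2
    return total

# Hypothetical pages (as keyword sets) and links (page -> pages it is linked with).
pages = {
    "p1": {"search", "engine"},
    "p2": {"search"},
    "p3": {"ranking"},
}
links = {"p2": ["p1"], "p3": ["p1"]}
query = ["search", "engine"]

simple = {p: boolean_score(words, query) for p, words in pages.items()}
spread = {p: spread_score(p, pages, links, query) for p in pages}
print(simple, spread)
```

With the plain Boolean score, p3 gets nothing; with the spreading variant it picks up some credit from its link to p1.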
Most-Cited
• Takes advantage of information about hyperlinks between web pages
• $R_{i,q} = \sum_{k=1, k \neq i}^{N} \left( Li_{i,k} \sum_{j=1}^{M} C_{k,j} \right)$
[Diagram labels: $Li_{i,k}$, no link]
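The Most-Cited score can be sketched as follows, assuming that the link term is 1 when page k links to page i and 0 otherwise, and that C is 1 when a keyword occurs in a page. Under those assumptions a page is credited with the query keywords found in the pages that cite it. The pages and links are hypothetical.

```python
def most_cited_scores(pages, links, query):
    """Score each page by the query keywords occurring in the pages that link to it."""
    scores = {}
    for i in pages:
        total = 0
        for k in pages:
            if k == i or i not in links.get(k, ()):
                continue  # link term is 0: page k does not link to page i
            total += sum(1 for term in query if term in pages[k])
        scores[i] = total
    return scores

# Hypothetical pages (as keyword sets) and outgoing links (page -> pages it links to).
pages = {
    "p1": {"search", "engine"},
    "p2": {"search"},
    "p3": {"ranking"},
}
links = {"p1": ["p3"], "p2": ["p1"], "p3": ["p1"]}

scores = most_cited_scores(pages, links, ["search", "engine"])
print(scores)
```

Note that p3 ranks highest here even though it contains neither keyword, because the page citing it matches both.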
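TFxIDF
• Based on the vector space model
• $R_{i,q} = \sum_{Q_j \in q} \left( 0.5 + 0.5 \, \frac{TF_{i,j}}{TF_{i,\max}} \right) IDF_j$, where $TF_{i,j}$ is the term frequency of $Q_j$ in $P_i$ and $TF_{i,\max}$ is the maximum term frequency of a keyword in $P_i$
• $R_{i,q} = R_{i,q} / \text{normalization factor}$
• $IDF_j = \log \left( N / \sum_{i=1}^{N} C_{i,j} \right)$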
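A minimal sketch of a TFxIDF ranking along these lines is given below. Several details are assumptions rather than something the slide states: the normalization factor is taken to be the Euclidean length of the page's weight vector, each term weight is multiplied by its IDF, a term that is absent from a page contributes 0, and the sample pages are made up.

```python
import math
from collections import Counter

def tfidf_scores(pages, query):
    n = len(pages)
    counts = {p: Counter(words) for p, words in pages.items()}
    # Document frequency: the number of pages that contain each term.
    df = Counter(term for c in counts.values() for term in c)
    idf = {term: math.log(n / d) for term, d in df.items()}

    def weight(c, term):
        if c[term] == 0:
            return 0.0  # assumption: absent terms contribute nothing
        return (0.5 + 0.5 * c[term] / max(c.values())) * idf[term]

    scores = {}
    for p, c in counts.items():
        raw = sum(weight(c, term) for term in query if term in idf)
        # Assumed normalization factor: Euclidean length of the page's weight vector.
        norm = math.sqrt(sum(weight(c, term) ** 2 for term in c))
        scores[p] = raw / norm if norm else 0.0
    return scores

# Hypothetical pages given as word lists.
pages = {
    "p1": ["search", "search", "engine"],
    "p2": ["search", "ranking"],
    "p3": ["ranking", "web"],
}
scores = tfidf_scores(pages, ["search", "engine"])
print(scores)
```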
More about Google
• Much more accurate than most other search engines
• But run the same search on Yahoo (look for web pages) and, surprise, you will get the same results
• Because Yahoo is powered by Google!
Google Internal
• Makes use of the link structure of the Web to calculate a quality ranking of each web page
• PageRank
• Utilizes links to improve search results
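PageRank
• It can be thought of as a model of user behaviour
• The probability that a random surfer visits a page
• $PR(A) = (1-d) + d \left( \frac{PR(T_1)}{C(T_1)} + \dots + \frac{PR(T_n)}{C(T_n)} \right)$
• Here $T_1, \dots, T_n$ are the pages that link to page $A$, $C(T)$ is the number of links going out of page $T$, and $d$ is a damping factor between 0 and 1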
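The PageRank formula can be evaluated by simple repeated substitution. The sketch below is an illustration, not Google's implementation: the three-page link graph is made up, the damping factor 0.85 and the iteration count are assumed values, and every page is assumed to have at least one outgoing link.

```python
def pagerank(links, d=0.85, iterations=100):
    """Iterate PR(A) = (1 - d) + d * sum(PR(T) / C(T)) over the pages T linking to A."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    pr = {p: 1.0 for p in pages}  # start every page at rank 1
    for _ in range(iterations):
        new = {}
        for a in pages:
            incoming = sum(
                pr[t] / len(links[t])  # C(T): number of links going out of T
                for t in pages
                if a in links.get(t, ())
            )
            new[a] = (1 - d) + d * incoming
        pr = new
    return pr

# Hypothetical link graph: page -> pages it links to.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(links)
print(ranks)
```

In this graph C collects links from both A and B, so it ends up with the highest rank, and A benefits from being the only page C links to.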
Other Search Engines
• Personalization / context based
  • Individual, web-filtering
  • www.searchorbit.com
• Multimedia search
  • Image search
  • www.altavista.com (not really)
  • http://disney.ctr.columbia.edu/webseek/
Internal Search Engine
• When under 100 web pages
  • One can do it in real time
  • www.comp.polyu.edu.hk/~cstyng/hci.99/labs/search.htm
• For a small web site
  • Direct matching is sufficient
• Other web sites
  • Indexers are needed
Finding a search engine: A Check List
• What platforms do the search engine and spider run on? Is it portable?
• What programming languages is it written in? Is it internet/web enabled? Don't give me Fortran!
• Can the vendor customize the system at a reasonable cost and turn-around time?
• What about local technical support?
• Is it designed to search the internet, intranet, and your local disks?
• Can it handle different file formats, such as ASCII, HTML, WORD, PPT, etc.?
A Check List
• Does it support BIG5, GB, and UNICODE? And does it do so efficiently?
• Is it designed and optimized for the Web? Don't give me a relational engine!
• Is it an English search engine retrofitted with Chinese search?
• Can you control what the spider indexes and how frequently it indexes?
• Does it support full Boolean queries and relevance ranking?
• Can you search by dates or by categories?
A Check List
• Can you search files that are on a specific host or of a certain file type?
• Can you specify partial words (e.g., econom* and *port)?
• Can you expand and translate a query?
• What about speed? Don't forget to ask for insertion speed!
• What about scalability? Can it exploit multiple servers and CPUs?
A Check List
• Is the spider/crawler fault tolerant? Can it endure link or host failures?
• Can it be optimized according to user behaviors?
• Can it search across secured servers?
Some search engines in HK
• Chinese.yahoo.com
• www.goyoyo.com.hk
• www.hksrch.com.hk
• www.hkonly.com
• www.gowhere.com.hk