Search Engine Algorithms

Search Engine Algorithms: Now and the Future • Vincent Ng (cstyng@comp.polyu.edu.hk) • Department of Computing, Hong Kong Polytechnic University

What are search engines? Search engines are huge databases of web page files that have been assembled automatically by machine. A search engine is a program that searches documents for specified keywords and returns a list of the documents where the keywords were found.

Some History • WWWW, 1994: 110,000 pages • Yahoo, 1997: 2,000,000 pages • Search Engine Watch, 2000: 2 billion pages

Types of Search Engines • Individual • Individual search engines compile their own searchable databases on the web. • Meta • Meta-searchers do not compile databases. Instead, they search the databases of multiple sets of individual engines simultaneously.

Meta-search Engines • Do not crawl the web compiling their own searchable databases. • They search the databases of multiple sets of individual search engines simultaneously, from a single site and using the same interface. • Meta-searchers provide a quick way of finding out which engines are retrieving the best results for you in your search.
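The merging step of a meta-searcher can be sketched as follows. This is a minimal illustration, not how any particular meta-searcher works: the engine names and result lists are hypothetical stand-ins for real engine responses, and the scoring rule (sum of reciprocal ranks) is an assumption chosen for simplicity.

```python
from collections import defaultdict

def merge_results(results_by_engine):
    """Merge ranked URL lists from several engines into one ranked list.

    Assumed scoring rule: each engine contributes 1/rank for every URL
    it returns, so URLs ranked highly by several engines rise to the top.
    """
    scores = defaultdict(float)
    for urls in results_by_engine.values():
        for rank, url in enumerate(urls, start=1):
            scores[url] += 1.0 / rank
    # Sort by descending score, then by URL so ties have a stable order.
    return sorted(scores, key=lambda u: (-scores[u], u))

# Hypothetical result lists standing in for real engine responses.
results = {
    "engine_a": ["a.example/1", "b.example/2", "c.example/3"],
    "engine_b": ["b.example/2", "a.example/1"],
}
merged = merge_results(results)
```

Comparing each engine's own list with the merged list is one way to see which engine is retrieving the best results for a given query.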

Subject Directories • Unlike search engines, subject directories are created and maintained by human editors, not electronic spiders or robots. • The editors review and select sites for inclusion in their directories on the basis of previously determined selection criteria. • Directories tend to be smaller than search engine databases, typically indexing only the home page or top level pages of a site.

Search Logic

Developing a search engine • No database, real time search • Use a database (e.g. MSSQL, Oracle) • Build indices of key words • Simple matching • Use a database in a server or multiple servers (server farms) • Develop search indices based on key words or meta-information • Develop a search structure (indexer, spider, algorithm)
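The "build indices of key words" option can be sketched as a small inverted index. This is a minimal in-memory illustration under assumed names (`build_index`, `search`, and the sample pages are hypothetical); a real deployment would keep the index in a database or on a server farm as the slide describes.

```python
import re
from collections import defaultdict

def build_index(pages):
    """Map each keyword to the set of page ids that contain it."""
    index = defaultdict(set)
    for page_id, text in pages.items():
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add(page_id)
    return index

def search(index, query):
    """Return ids of pages containing every query keyword (simple AND match)."""
    words = re.findall(r"[a-z0-9]+", query.lower())
    if not words:
        return set()
    hits = set(index.get(words[0], set()))
    for word in words[1:]:
        hits &= index.get(word, set())
    return hits

# Hypothetical page contents used only for illustration.
pages = {
    "p1": "Search engines index web pages",
    "p2": "A spider crawls the web",
    "p3": "Subject directories are edited by people",
}
index = build_index(pages)
```

A query then becomes a few set lookups instead of a scan over every page, which is the main reason to build the index ahead of time.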

Searching on the Web [Figure: client, query screen, search engine, DB, indexer, spider, web]

Three Algorithms • A document is represented by • Occurrences of a keyword • Hyperlink structures • Different ranking algorithms • Boolean Spread Activation • Most-Cited • TFxIDF

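Boolean Spread Activation • Based on the occurrence or absence of keywords in a document • $R_{i,q} = \sum_{j=1}^{M} C_{i,j}$, where $M$ is the number of query keywords and $C_{i,j}$ is 1 if keyword $Q_j$ occurs in page $P_i$ and 0 otherwise • A better approach • $R_{i,q} = \sum_{j=1}^{M} I_{i,j}$, where $I_{i,j}$ extends $C_{i,j}$ for page $P_i$ with a link factor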
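Both Boolean scores can be sketched in a few lines. The slide does not give the values of the link factor, so the weights `c1` and `c2` below are assumptions for illustration: a keyword found in the page itself counts `c1`, and a keyword found only in a linked page counts `c2`. The page and link data are hypothetical.

```python
def boolean_score(pages, page_id, query):
    """Count the query keywords that occur in the page (sum of C[i][j])."""
    words = pages[page_id]
    return sum(1 for term in query if term in words)

def spread_score(pages, links, page_id, query, c1=2, c2=1):
    """Link-aware variant: c1 and c2 are assumed weights, not from the slides."""
    score = 0
    for term in query:
        if term in pages[page_id]:
            score += c1  # keyword occurs in the page itself
        elif any(term in pages[k] for k in links.get(page_id, ())):
            score += c2  # keyword occurs only in a linked page
    return score

# Hypothetical pages (as keyword sets) and links between them.
pages = {"p1": {"search", "engine"}, "p2": {"ranking"}, "p3": {"spider"}}
links = {"p1": {"p2"}, "p2": {"p1"}}
query = ["search", "ranking"]
```

With this data the plain Boolean score cannot separate p1 from p2 for the query, and the link-aware score credits each of them for the keyword held by its neighbour.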

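Most-Cited • Takes advantage of information about hyperlinks between web pages • $R_{i,q} = \sum_{k=1, k \ne i}^{N} \left( Li_{i,k} \sum_{j=1}^{M} C_{k,j} \right)$, where $N$ is the number of pages and $Li_{i,k}$ is the link indicator between pages $P_i$ and $P_k$ (0 when there is no link)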
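A minimal sketch of the Most-Cited score follows. It assumes the link indicator is 1 when page k links to page i (an incoming link) and 0 otherwise; the slide does not state the direction, so that reading is an assumption. The pages and links are hypothetical.

```python
def most_cited_score(pages, links_to, page_id, query):
    """Sum, over every other page k that links to page_id, the number of
    query keywords found in page k."""
    score = 0
    for k, words in pages.items():
        if k == page_id:
            continue
        # Assumed link indicator: 1 if page k links to page_id, else 0.
        li = 1 if page_id in links_to.get(k, ()) else 0
        score += li * sum(1 for term in query if term in words)
    return score

# Hypothetical pages (as keyword sets); links_to[k] is the set of pages k links to.
pages = {"p1": {"search", "engine"}, "p2": {"search"}, "p3": {"cooking"}}
links_to = {"p1": {"p3"}, "p2": {"p1"}, "p3": {"p1"}}
query = ["search", "engine"]
```

Note that a page is scored by the keywords of the pages citing it, not by its own text, so p3 can outrank p1 for this query even though p3 contains neither keyword.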

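TFxIDF • Based on the vector space model • $R_{i,q} = \sum_{Q_j \in q} \left( 0.5 + 0.5 \, TF_{i,j} / TF_{i,\max} \right) IDF_j$, where $TF_{i,j}$ is the term frequency of $Q_j$ in $P_i$ and $TF_{i,\max}$ is the maximum term frequency of a keyword in $P_i$ • $R_{i,q} = R_{i,q} / \text{normalization factor}$ • $IDF_j = \log \left( N / \sum_{i=1}^{N} C_{i,j} \right)$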
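The TFxIDF score can be sketched as below. Two points are assumptions, since the slide leaves them open: the normalization factor is taken to be the length of the page's weight vector, and query terms that do not occur in the page contribute nothing. The documents are hypothetical.

```python
import math
from collections import Counter

def idf(term, docs):
    """IDF = log(N / number of documents that contain the term)."""
    df = sum(1 for words in docs.values() if term in words)
    return math.log(len(docs) / df) if df else 0.0

def tfidf_score(doc_id, query, docs):
    """Score a document against a query with augmented TF times IDF."""
    counts = Counter(docs[doc_id])
    max_tf = max(counts.values())

    def weight(term):
        return (0.5 + 0.5 * counts[term] / max_tf) * idf(term, docs)

    # Assumption: terms absent from the document are skipped.
    raw = sum(weight(t) for t in query if counts[t] > 0)
    # Assumption: normalize by the length of the document's weight vector.
    norm = math.sqrt(sum(weight(t) ** 2 for t in counts))
    return raw / norm if norm else 0.0

# Hypothetical documents as word lists.
docs = {
    "d1": ["search", "engine", "search"],
    "d2": ["engine", "spider"],
    "d3": ["cooking", "recipes"],
}
score_d1 = tfidf_score("d1", ["search"], docs)
score_d2 = tfidf_score("d2", ["search"], docs)
```

Because "search" appears in only one of the three documents, it has a higher IDF than "engine", so it dominates the weight vector of d1.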

An Excellent Search Engine - Google

Result of Search

More about Google • Much more accurate than most other search engines • But run the same search on Yahoo (looking for web pages) and, surprise, you will get the same results • That is because Yahoo is powered by Google!

Google Internal

Google Internal • Makes use of the link structure of the Web to calculate a quality ranking of each web page • PageRank • Utilizes links to improve search results

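PageRank • It can be thought of as a model of user behaviour • The probability that a random surfer visits a page • $PR(A) = (1-d) + d \left( PR(T_1)/C(T_1) + \dots + PR(T_n)/C(T_n) \right)$, where $T_1, \dots, T_n$ are the pages that link to $A$, $C(T)$ is the number of links going out of page $T$, and $d$ is a damping factor between 0 and 1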
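The PageRank formula can be evaluated by simple iteration, as in the sketch below. This illustrates the formula on a toy graph and is not a description of Google's production system. The three-page graph is hypothetical, the default `d=0.85` is a commonly quoted value rather than something given on the slide, and every page is assumed to have at least one outgoing link.

```python
def pagerank(links, d=0.85, iterations=50):
    """Iterate PR(A) = (1 - d) + d * sum(PR(T) / C(T)) over pages T linking to A.

    links maps each page to the list of pages it links to.
    """
    pages = set(links) | {t for targets in links.values() for t in targets}
    pr = {p: 1.0 for p in pages}
    for _ in range(iterations):
        new_pr = {}
        for page in pages:
            incoming = sum(
                pr[src] / len(targets)
                for src, targets in links.items()
                if page in targets
            )
            new_pr[page] = (1 - d) + d * incoming
        pr = new_pr
    return pr

# Hypothetical three-page web: A links to B and C, B links to C, C links to A.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(links)
```

In this form of the formula the ranks sum to the number of pages rather than to 1, so an average page has a rank of 1. Page C collects links from both A and B and ends up with the highest rank.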

Other Search Engines • Personalization/ Context based • Individual, web-filtering • www.searchorbit.com • Multimedia search • Image search • www.altavista.com (not really) • http://disney.ctr.columbia.edu/webseek/

Internal Search Engine • When under 100 web pages • One can do it real time • www.comp.polyu.edu.hk/~cstyng/hci.99/labs/search.htm • For a small web site • Direct matching is sufficient • Other web sites • Indexers are needed
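For a small site, direct matching can be as simple as scanning every page at query time. The sketch below assumes the pages are already loaded into memory as text; the page paths and contents are hypothetical.

```python
def direct_search(pages, keyword):
    """Return the pages whose text contains the keyword (case-insensitive).

    No index is built: every page is scanned on every query, which is
    only reasonable for a small number of pages.
    """
    needle = keyword.lower()
    return [url for url, text in pages.items() if needle in text.lower()]

# Hypothetical pages of a small web site.
site = {
    "/index.htm": "Welcome to the HCI course home page",
    "/labs/search.htm": "Lab on building a search page",
    "/notes.htm": "Lecture notes and reading list",
}
hits = direct_search(site, "Search")
```

The cost of each query grows with the total size of the site, which is why larger sites need an indexer.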

Finding a search engine: A Check List • What platforms do the search engine and spider run on? Is it portable? • What programming languages is it written in? Is it internet/web enabled? Don't give me Fortran! • Can the vendor customize the system at a reasonable cost and turn-around time? • What about local technical support? • Is it designed to search the internet, intranet, and your local disks? • Can it handle different file formats, such as ASCII, HTML, WORD, PPT, etc.?

A Check List • Does it support BIG5, GB, and UNICODE? And does it do so efficiently? • Is it designed and optimized for the Web? Don't give me a relational engine! • Is it an English search engine retrofitted with Chinese search? • Can you control what the spider indexes and how frequently it indexes? • Does it support full Boolean queries and relevance ranking? • Can you search by dates or by categories?

A Check List • Can you search files that are on a specific host or of a certain file type? • Can you specify partial words (e.g., econom* and *port)? • Can you expand and translate a query? • What about speed? Don't forget to ask for insertion speed! • What about scalability? Can it exploit multiple servers and CPUs?
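Partial-word queries such as `econom*` and `*port` can be sketched with shell-style pattern matching from the Python standard library. The vocabulary below is hypothetical, and the function name `match_partial` is an illustrative choice. Note that `fnmatchcase` also treats `?` and `[...]` as special characters, which a real query parser would need to decide how to handle.

```python
from fnmatch import fnmatchcase

def match_partial(pattern, vocabulary):
    """Return the indexed words matching a pattern where * stands for any
    run of characters (matching is case-insensitive)."""
    p = pattern.lower()
    return sorted(w for w in vocabulary if fnmatchcase(w.lower(), p))

# Hypothetical vocabulary of indexed words.
vocabulary = ["economy", "economics", "export", "airport", "portal", "support"]
prefix_hits = match_partial("econom*", vocabulary)
suffix_hits = match_partial("*port", vocabulary)
```

A leading wildcard such as `*port` cannot use a sorted word list as a shortcut, so engines that support it typically need an extra structure such as an index of reversed words; that is one reason the checklist asks about it explicitly.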

A Check List • Is the spider/crawler fault tolerant? Can it endure link or host failures? • Can it be optimized according to user behaviors? • Can it search across secured servers?

Some search engines in HK • Chinese.yahoo.com • www.goyoyo.com.hk • www.hksrch.com.hk • www.hkonly.com • www.gowhere.com.hk

Your Input
