
Search Engine Algorithms

Search Engine Algorithms. Now and the Future. Vincent Ng cstyng@comp.polyu.edu.hk Department of Computing Hong Kong Polytechnic University. What are search engines?. Search engines are huge databases of web page files that have been assembled automatically by machine.


Presentation Transcript


  1. Search Engine Algorithms Now and the Future Vincent Ng cstyng@comp.polyu.edu.hk Department of Computing Hong Kong Polytechnic University

  2. What are search engines? Search engines are huge databases of web page files that have been assembled automatically by machine. A program that searches documents for specified keywords and returns a list of the documents where the keywords were found.

  3. Some History • WWWW (1994): 110,000 pages • Yahoo (1997): 2,000,000 pages • Search Engine Watch (2000): 2 billion pages

  4. Types of Search Engines • Individual • Individual search engines compile their own searchable databases on the web. • Meta • Meta-searchers do not compile databases. Instead, they search the databases of multiple sets of individual engines simultaneously

  5. Meta-search Engines • Do not crawl the web compiling their own searchable databases. • They search the databases of multiple sets of individual search engines simultaneously, from a single site and using the same interface. • Meta-searchers provide a quick way of finding out which engines are retrieving the best results for you in your search.

  6. Subject Directories • Unlike search engines, are created and maintained by human editors, not electronic spiders or robots. • The editors review and select sites for inclusion in their directories on the basis of previously determined selection criteria. • Directories tend to be smaller than search engine databases, typically indexing only the home page or top level pages of a site.

  7. Search Logic

  8. Search Logic

  9. Developing a search engine • No database, real-time search • Use a database (e.g. MSSQL, Oracle) • Build indices of keywords • Simple matching • Use a database on a server or on multiple servers (server farms) • Develop search indices based on keywords or meta-information • Develop a search structure: indexer, spider, algorithm
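The "build indices of keywords" and "simple matching" options above can be sketched as a small inverted index. This is only an illustrative sketch: the function names (`tokenize`, `build_index`, `search`) and the sample pages are hypothetical and do not come from the presentation, and a real engine would persist the index in a database rather than in memory.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric keywords."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(pages):
    """Map each keyword to the set of page ids that contain it."""
    index = defaultdict(set)
    for page_id, text in pages.items():
        for word in tokenize(text):
            index[word].add(page_id)
    return index


def search(index, query):
    """Return the ids of pages that contain every keyword in the query."""
    words = tokenize(query)
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result


# Hypothetical sample pages, for illustration only.
pages = {
    "a.html": "Search engines index web pages",
    "b.html": "A spider crawls the web",
    "c.html": "Directories are edited by humans",
}
index = build_index(pages)
print(sorted(search(index, "web pages")))
```

The index is built once by the indexer, so each query only intersects a few sets instead of rescanning every page.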

  10. Searching on the Web [Diagram: client, query screen, search engine, DB, indexer, spider, web]

  11. Three Algorithms • A document is represented by • Occurrences of a keyword • Hyperlink structures • Different ranking algorithms • Boolean Spread Activation • Most-Cited • TFxIDF

  12. Boolean Spread Activation • Based on the occurrence or absence of keywords in a document • $R_{i,q} = \sum_{j=1}^{M} C_{i,j}$ • A better approach • $R_{i,q} = \sum_{j=1}^{M} I_{i,j}$ (diagram labels on $I_{i,j}$: $P_i$, $C_{i,j}$, link factor)
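The first formula on this slide can be sketched in a few lines. The sketch assumes that C i,j is 1 when query keyword j occurs in page i and 0 otherwise, which matches the "occurrence or absence" wording but is not spelled out on the slide. The link-aware variant with I i,j is not implemented, because its definition is not recoverable from the transcript. The function names are hypothetical.

```python
def occurs(keyword, page_text):
    """C[i][j]: 1 if the keyword occurs as a word in the page, else 0 (assumed)."""
    return 1 if keyword.lower() in page_text.lower().split() else 0


def boolean_score(query_keywords, page_text):
    """R[i][q]: the number of query keywords that occur in the page."""
    return sum(occurs(keyword, page_text) for keyword in query_keywords)


# Hypothetical page text, for illustration only.
print(boolean_score(["search", "engine"], "A search engine ranks pages"))
```

Repeated occurrences of a keyword do not raise the score, so a page that mentions "search" ten times ranks no higher than one that mentions it once.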

  13. Most-Cited • Takes advantage of information about hyperlinks between web pages • $R_{i,q} = \sum_{k=1,\,k \neq i}^{M} \left( Li_{i,k} \sum_{j=1}^{M} C_{k,j} \right)$ (diagram label on $Li_{i,k}$: no link)
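A rough sketch of the Most-Cited score follows. It assumes that Li i,k is 1 when page k links to page i and 0 when there is no link (the link direction is an assumption, since the slide does not state it), and that C k,j is 1 when query keyword j occurs in page k. The function name and the sample data are hypothetical.

```python
def most_cited_scores(pages, links, query_keywords):
    """Score each page by the query keywords found in the pages that link to it.

    pages: dict of page id -> text
    links: set of (source, target) pairs; assumed to mean source links to target
    """
    keywords = [k.lower() for k in query_keywords]
    # Inner sum of the formula: how many query keywords occur in each page k.
    keyword_hits = {
        page_id: sum(1 for k in keywords if k in text.lower().split())
        for page_id, text in pages.items()
    }
    scores = {}
    for i in pages:
        # Outer sum: only pages k != i that link to page i contribute.
        scores[i] = sum(
            keyword_hits[k] for k in pages if k != i and (k, i) in links
        )
    return scores


# Hypothetical pages and links, for illustration only.
pages = {"p1": "search engine", "p2": "search spider", "p3": "cooking"}
links = {("p1", "p3"), ("p2", "p3"), ("p3", "p1")}
print(most_cited_scores(pages, links, ["search", "engine"]))
```

Under these assumptions a page can score well without containing any query keyword itself, as long as relevant pages cite it.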

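  14. TFxIDF • Based on the vector space model • $R_{i,q} = \sum_{\text{term } j \text{ in query}} \left( 0.5 + 0.5 \, \frac{\text{term freq of } Q_j \text{ in } P_i}{\text{max term freq of a keyword in } P_i} \right)$ • $R_{i,q} = R_{i,q} / \text{normalization factor}$ • $IDF_j = \log \left( N / \sum_{i=1}^{N} C_{i,j} \right)$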
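The quantities on this slide can be combined into a small scoring sketch. Three assumptions are made here that the transcript does not confirm: each term's weight is multiplied by its IDF (as the name TFxIDF suggests), query terms that do not occur in a page contribute nothing, and the final normalization step is left out. The function name and sample pages are hypothetical.

```python
import math


def tf_idf_scores(pages, query_keywords):
    """Score each page with an augmented term frequency times IDF (assumed form)."""
    docs = {page_id: text.lower().split() for page_id, text in pages.items()}
    terms = [k.lower() for k in query_keywords]
    n = len(docs)

    # IDF_j = log(N / number of pages containing term j); 0 if no page has it.
    idf = {}
    for term in terms:
        doc_freq = sum(1 for words in docs.values() if term in words)
        idf[term] = math.log(n / doc_freq) if doc_freq else 0.0

    scores = {}
    for page_id, words in docs.items():
        # Highest frequency of any single keyword in this page.
        max_tf = max((words.count(w) for w in set(words)), default=0)
        score = 0.0
        for term in terms:
            tf = words.count(term)
            if tf == 0:
                continue  # assumption: absent terms add nothing
            score += (0.5 + 0.5 * tf / max_tf) * idf[term]
        scores[page_id] = score
    return scores


# Hypothetical pages, for illustration only.
pages = {"p1": "search engine search", "p2": "engine", "p3": "cooking recipes"}
print(tf_idf_scores(pages, ["search"]))
```

A term that appears in every page gets an IDF of zero, so common words stop influencing the ranking.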

  15. An Excellent Search Engine - Google

  16. Result of Search

  17. More about Google • Much more accurate than most other search engines • But run the same search on Yahoo (look for web pages) and, surprise, you will get the same results • Because Yahoo is powered by Google!

  18. Google Internal

  19. Google Internal • Makes use of the link structure of the Web to calculate a quality ranking of each web page • PageRank • Utilizes links to improve search results

  20. PageRank • It can be thought of as a model of user behaviour • The probability that a random surfer visits a page • $PR(A) = (1-d) + d \left( \frac{PR(T_1)}{C(T_1)} + \cdots + \frac{PR(T_n)}{C(T_n)} \right)$
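The PageRank formula on this slide can be evaluated by simple iteration, as in the sketch below. It reads T1 … Tn as the pages that link to A and C(T) as the number of outgoing links on page T, which is the usual interpretation but is not stated on the slide. The damping factor d = 0.85, the iteration count, and the sample link graph are assumptions chosen for illustration.

```python
def pagerank(links, d=0.85, iterations=50):
    """Iterate PR(A) = (1 - d) + d * sum(PR(T) / C(T)) over pages T linking to A.

    links: dict of page -> list of pages it links to
    d: damping factor (0.85 is an assumed, commonly used value)
    """
    pages = set(links) | {t for targets in links.values() for t in targets}
    pr = {page: 1.0 for page in pages}
    for _ in range(iterations):
        new_pr = {}
        for page in pages:
            # Each linking page shares its rank equally among its outgoing links.
            incoming = sum(
                pr[source] / len(targets)
                for source, targets in links.items()
                if page in targets
            )
            new_pr[page] = (1 - d) + d * incoming
        pr = new_pr
    return pr


# Hypothetical three-page web, for illustration only.
graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(graph)
for page in sorted(ranks):
    print(page, round(ranks[page], 3))
```

In this graph page C collects rank from both A and B, so it ends up ranked above B, which only receives half of A's rank.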

  21. Other Search Engines • Personalization/ Context based • Individual, web-filtering • www.searchorbit.com • Multimedia search • Image search • www.altavista.com (not really) • http://disney.ctr.columbia.edu/webseek/

  22. Internal Search Engine • When under 100 web pages • One can do it real time • www.comp.polyu.edu.hk/~cstyng/hci.99/labs/search.htm • For a small web site • Direct matching is sufficient • Other web sites • Indexers are needed
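For a site of under 100 pages, the real-time direct matching described above might look like the sketch below: every file is scanned at query time and no index is kept. The function name `realtime_search` and the choice of `.htm`/`.html` files under a local directory are assumptions made for illustration; the search page referenced on the slide may work differently.

```python
from pathlib import Path


def realtime_search(root, keyword):
    """Scan every .htm/.html file under root at query time; no index is built."""
    keyword = keyword.lower()
    matches = []
    for path in sorted(Path(root).rglob("*.htm*")):
        if not path.is_file():
            continue
        text = path.read_text(encoding="utf-8", errors="ignore")
        if keyword in text.lower():
            matches.append(path.name)
    return matches
```

The cost of each query grows with the total size of the site, which is why larger sites need an indexer instead.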

  23. Finding a search engine: A Check List • What platforms do the search engine and spider run on? Is it portable? • What programming languages is it written in? Is it internet/web enabled? Don't give me Fortran! • Can the vendor customize the system at a reasonable cost and turn-around time? • What about local technical support? • Is it designed to search the internet, intranet, and your local disks? • Can it handle different file formats, such as ASCII, HTML, WORD, PPT, etc.?

  24. A Check List • Does it support BIG5, GB, and UNICODE? And does it do so efficiently? • Is it designed and optimized for the Web? Don't give me a relational engine! • Is it an English search engine retrofitted with Chinese search? • Can you control what the spider indexes and how frequently it indexes? • Does it support full Boolean queries and relevance ranking? • Can you search by dates or by categories?

  25. A Check List • Can you search files that are on a specific host or of a certain file type? • Can you specify partial words (e.g., econom* and *port)? • Can you expand and translate a query? • What about speed? Don't forget to ask for insertion speed! • What about scalability? Can it exploit multiple servers and CPUs?
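The partial-word queries mentioned above (econom* and *port) can be illustrated with shell-style wildcard matching against an indexed vocabulary. This is a minimal sketch using Python's standard `fnmatch` module; the function name `match_partial` and the sample vocabulary are hypothetical, and a production engine would more likely use a prefix or suffix index than a linear scan.

```python
from fnmatch import fnmatchcase


def match_partial(pattern, vocabulary):
    """Return the vocabulary words that match a wildcard pattern such as econom*."""
    pattern = pattern.lower()
    return sorted(w for w in vocabulary if fnmatchcase(w.lower(), pattern))


# Hypothetical vocabulary, for illustration only.
vocabulary = ["economy", "economics", "export", "airport", "portal"]
print(match_partial("econom*", vocabulary))
print(match_partial("*port", vocabulary))
```

Note that `fnmatchcase` also treats `?` and `[...]` as wildcards, so user input containing those characters would need escaping in a real query parser.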

  26. A Check List • Is the spider/crawler fault tolerant? Can it endure link or host failures? • Can it be optimized according to user behaviors? • Can it search across secured servers?

  27. Some search engines in HK • Chinese.yahoo.com • www.goyoyo.com.hk • www.hksrch.com.hk • www.hkonly.com • www.gowhere.com.hk

  28. Your Input
