
Search Engine 101

Search Engine 101. Qu, Miao Nov. 2003. Agenda. Definition and Types Architecture Robot Overview How Search Engine Works? Problems of Current Search Engines An example: Google The Future of Search Engine Search Engine vs. Directory Reference. What Is Search Engine?.


Presentation Transcript


  1. Search Engine 101 Qu, Miao Nov. 2003

  2. Agenda • Definition and Types • Architecture • Robot Overview • How Search Engines Work • Problems of Current Search Engines • An Example: Google • The Future of Search Engines • Search Engine vs. Directory • Reference

  3. What Is a Search Engine? • Search engines are tools that use computer programs called spiders or robots to gather information automatically. They build searchable databases and return the entries that match the user's query. Source: “Authority Guide to Evaluating Information on the Internet”, Alison Cooke, 1999

  4. The Types of Search Engines • Individual search engines compile their own searchable databases of the web. Example: Google. • Meta search engines do not compile databases. Instead, they search the databases of several individual engines simultaneously. Examples: Metacrawler, Vivisimo. Source: http://www.sc.edu/beaufort/library/lesson1.html
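As a rough illustration of the meta search idea, the sketch below sends one query to several engines and merges the ranked lists. The two engines are hypothetical stand-in functions with made-up URLs, and the reciprocal-rank merge is an assumption chosen for simplicity; the slides do not say how Metacrawler or Vivisimo actually combine results.

```python
from typing import Callable, Dict, List

# Hypothetical stand-in for an individual engine: query in, ranked URLs out.
Engine = Callable[[str], List[str]]


def engine_a(query: str) -> List[str]:
    # Made-up results, for illustration only.
    return ["http://a.example/1", "http://shared.example/x"]


def engine_b(query: str) -> List[str]:
    # Made-up results, for illustration only.
    return ["http://shared.example/x", "http://b.example/2"]


def meta_search(query: str, engines: List[Engine]) -> List[str]:
    """Query every engine and merge the ranked lists.

    Each URL earns 1/rank from every engine that returns it, so a URL
    that several engines rank highly rises to the top.
    """
    scores: Dict[str, float] = {}
    for engine in engines:
        for rank, url in enumerate(engine(query), start=1):
            scores[url] = scores.get(url, 0.0) + 1.0 / rank
    # Highest score first; ties broken alphabetically for a stable order.
    return sorted(scores, key=lambda u: (-scores[u], u))


results = meta_search("search engines", [engine_a, engine_b])
```

The meta engine keeps no index of its own; everything it returns comes from the engines it calls.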

  5. Web Search Engine Layers (figure) Source: description of the FAST search engine, by Knut Risvik, http://www.infonortics.com/searchengines/sh00/risvik_files/frame.htm

  6. Standard Web Search Engine Architecture (diagram) • Crawl the web • Check for duplicates, store the documents (DocIds) • Create an inverted index • Search engine servers take the user query, consult the inverted index, and show results to the user Source: http://www.sims.berkeley.edu/academics/courses/is202/f02/Lectures/Lecture22_2002_11_14_tbd.ppt
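The pipeline on this slide can be sketched in a few lines. This is a minimal toy, not how any production engine is built: duplicates are detected by exact text match, DocIds are sequential integers, and a query returns only documents containing every query term. All function names are illustrative.

```python
import re
from collections import defaultdict
from typing import Dict, List, Set


def tokenize(text: str) -> List[str]:
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def assign_doc_ids(pages: List[str]) -> Dict[int, str]:
    """Drop exact-duplicate pages and give each remaining page a DocId."""
    docs: Dict[int, str] = {}
    seen: Set[str] = set()
    for text in pages:
        if text not in seen:
            seen.add(text)
            docs[len(docs)] = text
    return docs


def build_inverted_index(docs: Dict[int, str]) -> Dict[str, Set[int]]:
    """Map every term to the set of DocIds whose text contains it."""
    index: Dict[str, Set[int]] = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def search(index: Dict[str, Set[int]], query: str) -> List[int]:
    """Return the DocIds that contain all query terms, in ascending order."""
    terms = tokenize(query)
    if not terms:
        return []
    postings = [index.get(term, set()) for term in terms]
    return sorted(set.intersection(*postings))


crawled = [
    "Search engines crawl the web",
    "An inverted index maps terms to documents",
    "Search engines crawl the web",  # exact duplicate, dropped
    "Robots crawl the web and feed the index",
]
docs = assign_doc_ids(crawled)
index = build_inverted_index(docs)
```

The index is built once, ahead of time, so answering a query is a set intersection rather than a scan over every stored page.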

  7. Anatomy of Search Engine • Working procedures: • Crawling the web (Robot) • Establish the database (Robot) • Query (searcher) • Search the database (Search Engine Software) • Ranking (Search Engine Software) • Interface with client • Components: • Robot • Index: the catalog, a database of what the spider finds • Search Engine Software: the program that sifts through the millions of pages recorded in the index to find matches to a search and rank them in order of what it believes is most relevant. Source: http://searchenginewatch.com/webmasters/article.php/2168031

  8. Robot Overview • A robot is an essential ingredient of all current web search tools. • A robot is a program that automatically traverses the Web's hypertext structure by retrieving a document and recursively retrieving all documents that are referenced. It can be written in Perl, Java, C, C++, or other languages (e.g. Tcl/Tk). • Also known as: spiders, wanderers, worms, crawlers, gatherers, intelligent agents • Robots can serve other purposes: • Measuring the size and scope of the web; • Maintaining a database of web pages by checking old links for updates and relocation; • Mirroring sites; • Email address harvesting; • Etc. (Source: Susan Maze, David Moxley and Donna J. Smith, “Authoritative Guide to Web Search Engines”, 1997, P13. http://www.robotstxt.org/wc/faq.html#what)

  9. Robot Overview (cont.) • Interested in getting source code? • http://webharvest.sourceforge.net/ng/ (Harvest, Perl) • http://www.lub.lu.se/combine/ (Combine, Perl) • http://www.acme.com/java/software/Acme.Spider.html (Acme.Spider, Java)

  10. Establishing Database • “It is important to remember that when you are using a search engine, you are NOT searching the entire web as it exists at this moment. You are actually searching a portion of the web, captured in a fixed index created at an earlier date.” Source: http://www.sc.edu/beaufort/library/lesson1.html

  11. How a Robot Searches Web Pages • A robot does not itself travel across the web. It uses HTTP to request documents from servers. • “In general they start from a historical list of URLs, especially of documents with many links elsewhere, such as server lists, "What's New" pages, and the most popular sites on the Web.” • Strategy: • Depth-first: create a relatively comprehensive database on a few subjects; • Breadth-first: create databases touching more lightly on a wider variety of documents. (Source: Susan Maze, David Moxley and Donna J. Smith, “Authoritative Guide to Web Search Engines”, 1997, P13. http://www.robotstxt.org/wc/faq.html#what)
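The two strategies differ only in which pending URL the robot takes next. The sketch below shows this on a small in-memory link graph; the `LINKS` table and its paths are invented for illustration, and no HTTP requests are made. A real robot would fetch each page over HTTP and extract its links instead of looking them up in a dictionary.

```python
from collections import deque
from typing import Dict, List

# Hypothetical site: each path maps to the links found on that page.
LINKS: Dict[str, List[str]] = {
    "/": ["/a", "/b"],
    "/a": ["/a1", "/a2"],
    "/b": ["/b1"],
    "/a1": [],
    "/a2": [],
    "/b1": ["/"],  # a link back to the start; must not be crawled twice
}


def crawl(start: str, links: Dict[str, List[str]],
          breadth_first: bool, limit: int = 100) -> List[str]:
    """Return the order in which pages are visited.

    Breadth-first takes the oldest pending URL (a queue);
    depth-first takes the newest pending URL (a stack).
    """
    frontier = deque([start])
    seen = {start}
    order: List[str] = []
    while frontier and len(order) < limit:
        url = frontier.popleft() if breadth_first else frontier.pop()
        order.append(url)
        outgoing = links.get(url, [])
        # In depth-first mode push links in reverse so that the first
        # link on the page is the next one popped.
        for nxt in (outgoing if breadth_first else reversed(outgoing)):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append(nxt)
    return order


bfs_order = crawl("/", LINKS, breadth_first=True)
dfs_order = crawl("/", LINKS, breadth_first=False)
```

Breadth-first visits both top-level sections before any of their sub-pages, while depth-first finishes everything under `/a` before it touches `/b`, which matches the wide-and-shallow versus narrow-and-deep contrast on the slide.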

  12. If I Don’t Want to Be Indexed by a Robot? • robots.txt • A plain-text document that robots check before crawling a site; • It implements the Robots Exclusion Protocol, which allows the web site administrator to define what parts of the site are off-limits to specific robot user agent names. • Robots META tag • Sample entries: • <META name="ROBOTS" content="NOINDEX"> • <META name="ROBOTS" content="NOFOLLOW"> • Many, but not all, search engine robots will recognize this tag and follow the rules for each page. Source: http://www.searchtools.com/robots/
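Both mechanisms can be checked with Python's standard library. In the sketch below, the robots.txt content, the bot names (`ExampleBot`, `BadBot`) and the URLs are made up for illustration; `urllib.robotparser` and `html.parser` are real standard-library modules. It shows how a well-behaved robot might decide what it may fetch and index.

```python
from html.parser import HTMLParser
from typing import Set
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: everyone is kept out of /private/,
# and one named robot is excluded from the whole site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/

User-agent: BadBot
Disallow: /
"""

rules = RobotFileParser()
rules.parse(ROBOTS_TXT.splitlines())


class RobotsMetaParser(HTMLParser):
    """Collect the directives of every <META name="ROBOTS"> tag."""

    def __init__(self) -> None:
        super().__init__()
        self.directives: Set[str] = set()

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag and attribute names for us.
        if tag != "meta":
            return
        attr = dict(attrs)
        if (attr.get("name") or "").lower() != "robots":
            return
        for directive in (attr.get("content") or "").split(","):
            if directive.strip():
                self.directives.add(directive.strip().upper())


def meta_directives(html: str) -> Set[str]:
    """Return the set of robots directives declared in a page."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives


page = '<html><head><META name="ROBOTS" content="NOINDEX, NOFOLLOW"></head></html>'
may_fetch_home = rules.can_fetch("ExampleBot", "http://example.com/index.html")
may_fetch_private = rules.can_fetch("ExampleBot", "http://example.com/private/report.html")
```

Neither mechanism is enforced by the server. As the slide notes, compliance is up to each robot.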

  13. Ranking I • Once a search engine has used your search terms to gather "hits" from its database, it lists or "ranks" the resulting sites in order of its own estimation of their relevance. • In most cases, the ranking rule is relevance prediction. • Currently, search engines predict relevance based on two sets of factors: • those based on a site's content; • those external to the site. Source: http://www.searchengines.com/searchBasics1.html

  14. Ranking II • Factors based on a web site's content • Word frequency (How often search terms occur in a page in relationship to other text) • Location of search terms in the document (Are they in the title? Are they near the top of the page?) • Relational clustering (How many pages in the site contain the search terms?) • The site's design (Does it use frames? How fast does it load?) http://www.searchengines.com/searchBasics1.html
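The first two content factors (word frequency and term location) can be combined into a toy score. The weights, the 50-word "near the top" window and the sample pages below are all assumptions made for illustration; the source does not give any engine's actual formula.

```python
import re
from typing import List


def tokenize(text: str) -> List[str]:
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def content_score(query: str, title: str, body: str,
                  title_bonus: float = 2.0,
                  early_bonus: float = 1.0,
                  early_window: int = 50) -> float:
    """Toy on-page relevance score (all weights are illustrative).

    For each query term:
      * add its frequency relative to the page length,
      * add title_bonus if it appears in the title,
      * add early_bonus if it appears in the first early_window words.
    """
    terms = set(tokenize(query))
    body_tokens = tokenize(body)
    title_tokens = set(tokenize(title))
    if not terms or not body_tokens:
        return 0.0
    score = 0.0
    for term in terms:
        score += body_tokens.count(term) / len(body_tokens)
        if term in title_tokens:
            score += title_bonus
        if term in body_tokens[:early_window]:
            score += early_bonus
    return score


on_topic = content_score("search", "Search engine basics",
                         "A search engine ranks pages.")
off_topic = content_score("search", "Cooking",
                          "This recipe mentions a search once.")
```

Here the on-topic page scores higher mainly because the term appears in its title. Relative frequency keeps a long page from winning on raw word count alone.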

  15. Ranking III • Factors external to the site • Link popularity -- Sites with more links pointing to them are prioritized • Click popularity -- Sites visited more often are prioritized • "Sector" popularity -- Sites visited by certain demographic or social groups are prioritized (Note: This system requires user-provided information) • Business alliances among services -- Results from a partner search service are ranked higher • Pay-for-placement rankings -- Site owners pay for high rankings http://www.searchengines.com/searchBasics1.html
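Link popularity, the first external factor, amounts to counting inbound links. The sketch below does this on a made-up link table; the site names are hypothetical. It ignores self-links and counts each linking site once, which is a simplifying assumption rather than a documented rule of any engine.

```python
from collections import Counter
from typing import Dict, List

# Hypothetical link table: each site maps to the sites it links to.
SITE_LINKS: Dict[str, List[str]] = {
    "a.example": ["c.example", "b.example"],
    "b.example": ["c.example"],
    "c.example": ["a.example"],
    "d.example": ["c.example", "c.example"],  # repeated link counts once
}


def link_popularity(links: Dict[str, List[str]]) -> Counter:
    """Count, for every site, how many distinct other sites link to it."""
    counts: Counter = Counter()
    for source, targets in links.items():
        counts.setdefault(source, 0)  # sites with no inbound links still appear
        for target in set(targets):
            if target != source:
                counts[target] += 1
    return counts


popularity = link_popularity(SITE_LINKS)
# Most-linked first; ties broken alphabetically for a stable order.
ranking = sorted(popularity, key=lambda site: (-popularity[site], site))
```

A plain count treats every inbound link as equal. PageRank, covered later in the deck, weights links by the importance of the page they come from.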

  16. How Do Search Engines Differ from One Another? • The robot • The database: How is the database cleaned up and filtered? The frequency with which sites are spidered affects the database's freshness. • The formula (algorithms): Different search engines employ different search retrieval formulas, or algorithms, to provide relevant content in response to a user's query. • Features and functionality: The various search engines have different bells and whistles to appeal to searchers' different experience levels or individual tastes. • The look: Engines' graphical user interfaces vary, as does the format in which they present their results. Source: http://www.searchengines.com/searchDiffer1.html

  17. What Are the Problems of Current Search Engines? • The biggest problem facing users of web search engines today is the quality of the results they get back. • While the results are often amusing and expand users’ horizons, they are often frustrating and consume precious time. Source: “The Anatomy of a Large-Scale Hypertextual Web Search Engine”, Sergey Brin and Lawrence Page

  18. Challenges in Search Engines • Search Engine Spam; • Quality of Content; • Quality Evaluation; • Web Conventions; • Avoid Duplicate Search (Host); • Vaguely Structured Data. Source: “Challenges in Web Search Engines”, ACM SIGIR Forum, Volume 36, Issue 2, Fall 2002, Monika R. Henzinger, Rajeev Motwani, Craig Silverstein

  19. An Example: Google http://www.bu.edu/mfeldman/Google/

  20. Some Features: • What's not indexed: registration pages, text in graphics and multimedia files (use Alt tags), XML, Java applets, comment tags, Acrobat files, spammers. • Content and location: keywords should be close to each other; content should include keywords in text or links. • HTML title: seems to be a factor. • Meta tags: not used. • Link popularity: very important, especially links from relevant pages.

  21. Architecture (figure) Source: “The Anatomy of a Large-Scale Hypertextual Web Search Engine”, Sergey Brin and Lawrence Page

  22. Main Technology Applied in Google: • PageRank™: a system for ranking web pages • Anchor text Source: “The Anatomy of a Large-Scale Hypertextual Web Search Engine”, Sergey Brin and Lawrence Page
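The slide only names PageRank, so the following is a simplified sketch of the idea from the cited Brin and Page paper, not Google's implementation. It uses the common normalized form in which ranks sum to 1, with the damping factor of 0.85 the paper suggests. The even spreading of rank from pages with no outgoing links and the sample graph are assumptions made for illustration.

```python
from typing import Dict, List


def pagerank(links: Dict[str, List[str]],
             damping: float = 0.85,
             iterations: int = 50) -> Dict[str, float]:
    """Approximate PageRank by repeated redistribution of rank.

    Each round, a page passes damping * rank to the pages it links to,
    split evenly, and every page receives a base share (1 - damping) / n.
    """
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Assumption: a page with no outgoing links spreads its
                # rank evenly over all pages so that no rank is lost.
                share = damping * rank[page] / n
                for other in pages:
                    new_rank[other] += share
        rank = new_rank
    return rank


# Hypothetical four-page web: "c" is linked from three other pages.
GRAPH = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(GRAPH)
best = max(ranks, key=ranks.get)
```

Unlike a plain inbound-link count, a link here is worth more when it comes from a page that itself ranks highly: page "a" has only one inbound link, but it comes from the top-ranked page "c", so "a" ends up ranked above "b".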

  23. The Future of Search Engines • Theme search engines: best match between the page content and the evaluation of its page. • More straightforward answers. • More customization.

  24. Directories vs. Search Engines • Directories: • Hand-selected sites • Search over the contents of the descriptions of the pages • Organized in advance into categories • Search Engines: • All pages in all sites • Search over the contents of the pages themselves • Organized after the query by relevance rankings or other scores Source: http://www.sims.berkeley.edu/academics/courses/is202/f02/Lectures/Lecture22_2002_11_14_tbd.ppt

  25. Reference • Alison Cooke, “Authority Guide to Evaluating Information on the Internet”, 1999. • http://www.sc.edu/beaufort/library/lesson1.html • Danny Sullivan, “How Search Engines Work”, 2002, http://searchenginewatch.com/webmasters/article.php/2168031 • Knut Risvik, description of the FAST search engine, http://www.infonortics.com/searchengines/sh00/risvik_files/frame.htm • Berkeley University, 2002, http://www.sims.berkeley.edu/academics/courses/is202/f02/Lectures/Lecture22_2002_11_14_tbd.ppt • http://www.robotstxt.org/wc/faq.html#what • Susan Maze, David Moxley and Donna J. Smith, “Authoritative Guide to Web Search Engines”, 1997, P13. • http://www.searchtools.com/robots/ • http://www.searchengines.com/searchBasics1.html • http://www.searchengines.com/textkeywords.htm

  26. Reference (cont.) • http://www.searchengines.com/urlkeywords.html • http://www.searchengines.com/ranking_factors.html • http://www.searchengines.com/searchDiffer1.html • www.google.com • Danny Sullivan, “Major Search Engines and Directories”, 2003, http://searchenginewatch.com/links/article.php/2156221 • www.alltheweb.com • http://www.searchengines.com/partnerships.html • Sergey Brin and Lawrence Page, “The Anatomy of a Large-Scale Hypertextual Web Search Engine”, http://google.stanford.edu/~backrub/google.html • Monika R. Henzinger, Rajeev Motwani, Craig Silverstein, “Challenges in Web Search Engines”, SIGIR Forum, Fall 2002, Vol. 36, No. 2. • Robin Nobles, “The Future Of Search Engine Optimizing”, 2003, http://www.searchengineworkshops.com/articles/se-optimization-future.html • Gary H. Anthes, “The Future of the Search Engine”, 2002, http://www.computerworld.com/databasetopics/data/story/0,10801,70037,00.html
