Internet Search Engines Julia Vuong Aaron Kurtzhals
Introduction • What is an Internet Search Engine • Brief History • Importance • What an Internet Search Engine does. • How it works
Definition • An internet search engine is an automated system used to find information on the world wide web.
Brief History • 1994 – An early search engine, World Wide Web Worm, indexed about 110,000 web pages. • 1997 – Search engines indexed millions of web pages. • Today – Google indexes over 4 billion web pages.
Why are internet search engines important? • The internet contains large amounts of information. • But this information is not always easy to find. • Where should one look for the information? • DNS (Domain Name System) names are not very forgiving.
What does an Internet Search Engine do? • Crawls the web • Indexes web pages • Responds to searches
Spiders take a web page’s content and create key search words that enable online users to find the pages they’re looking for.
Robots (spiders, crawlers, bots) • A robot is a program that downloads web pages. • Similar to a browser, but it generates machine-readable information rather than a human-readable display. • Its purpose is to create an index of the internet.
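The robot described above can be sketched in a few lines of Python. This is only an illustrative sketch, not how any particular search engine works: `PageParser`, `parse_page`, `crawl`, and the `fetch` callable are hypothetical names, and a small in-memory dictionary (`FAKE_WEB`) stands in for real downloads so the example runs without network access.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class PageParser(HTMLParser):
    """Collects the words and outgoing links of one HTML page."""

    def __init__(self):
        super().__init__()
        self.words = []
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Remember the target of every <a href="..."> for later crawling.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

    def handle_data(self, data):
        # Text between tags becomes lowercase words.
        self.words.extend(data.lower().split())


def parse_page(html):
    parser = PageParser()
    parser.feed(html)
    parser.close()
    return parser.words, parser.links


def crawl(start_url, fetch, max_pages=100):
    """Breadth-first crawl. `fetch(url)` returns HTML text, or None on failure."""
    seen = {start_url}
    queue = deque([start_url])
    pages = {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        words, links = parse_page(html)
        pages[url] = words
        for link in links:
            target = urljoin(url, link)  # resolve relative links
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return pages


# Assumption: a tiny fake web instead of real HTTP requests.
FAKE_WEB = {
    "http://example.test/": '<title>Home</title><a href="/about">About us</a>',
    "http://example.test/about": '<h1>About</h1><a href="/">Back home</a>',
}
pages = crawl("http://example.test/", FAKE_WEB.get)
```

A real robot would replace `FAKE_WEB.get` with a function that downloads pages, and would also need politeness rules (for example, honoring robots.txt), which this sketch leaves out.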
Indexing the Internet • A search engine creates a database of words on the internet. • Information about each instance of the word is also stored.
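One common way to build such a word database is an inverted index. The sketch below assumes each page has already been reduced to a list of words; `build_index` and the sample page names are hypothetical. For each word it stores every (page, position) pair where the word occurs, which is one possible form of the "information about each instance".

```python
from collections import defaultdict


def build_index(pages):
    """Map each word to a list of (page, position) occurrences."""
    index = defaultdict(list)
    for url, words in pages.items():
        for position, word in enumerate(words):
            index[word].append((url, position))
    return index


# Assumption: two tiny pages, already split into lowercase words.
sample_pages = {
    "a.html": ["web", "search", "engine"],
    "b.html": ["search", "the", "web"],
}
index = build_index(sample_pages)
```

Storing positions, not just page names, is what later makes proximity-based ranking possible.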
Executing Searches • Find webpages that contain the desired words. • Webpages that contain the desired words are ranked and displayed to the user.
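The first step, finding pages that contain all of the desired words, can be sketched as a set intersection over an inverted index. The `search` function and the `INDEX` data are hypothetical; the result is deliberately unranked, since ranking is a separate step.

```python
def search(index, query):
    """Return the set of pages that contain every word in the query."""
    words = query.lower().split()
    if not words:
        return set()
    result = None
    for word in words:
        pages_with_word = {url for url, _position in index.get(word, [])}
        # Keep only pages that also matched all previous words.
        result = pages_with_word if result is None else result & pages_with_word
    return result


# Assumption: a small hand-written index of (page, position) entries.
INDEX = {
    "web": [("a.html", 0), ("b.html", 2)],
    "search": [("a.html", 1), ("b.html", 0)],
    "engine": [("a.html", 2)],
}
matches = search(INDEX, "search engine")
```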
Problems with searching the Internet • The size and scope of the internet make it difficult to search. • A search engine is not very good at understanding context. • Humans are only able to view a small number of search results.
Ranking Search Results • An internet search can generate thousands of results. • A person is only able to read the first few results. • How does the search engine decide the order of the results?
Ranking Algorithms • No ranking • Paid positioning • Content-based ranking • Pagerank
No ranking • Simple • Requires less storage • Fast • Less helpful to humans • Performance advantages are minimal.
Paid Positioning • Premise: if a website's owner is willing to pay for placement, the site must be important • Similar but not identical to paid inclusion • Can create a backlash • Does not address the issue of ranking websites in general
Content-based Ranking • Attempt to determine context • Relative proximity of words • How many times a word appears on a page. • Usage in HTML tags
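Two of these signals, word frequency and relative proximity, can be combined into a toy score. The formula below (occurrence count plus the inverse of the smallest gap between consecutive query terms) is an arbitrary assumption chosen for illustration, not a formula from any real search engine; `content_score` and `min_gap` are hypothetical names.

```python
def min_gap(positions_a, positions_b):
    """Smallest distance between any occurrence of two different words."""
    return min(abs(i - j) for i in positions_a for j in positions_b)


def content_score(words, query_terms):
    """Toy score: term frequency plus a bonus when query terms sit close together."""
    terms = list(dict.fromkeys(query_terms))  # drop repeated query terms
    positions = {t: [i for i, w in enumerate(words) if w == t] for t in terms}
    if any(not p for p in positions.values()):
        return 0.0  # a page missing any query term does not match
    frequency = sum(len(p) for p in positions.values())
    proximity = 0.0
    for first, second in zip(terms, terms[1:]):
        proximity += 1.0 / min_gap(positions[first], positions[second])
    return frequency + proximity


page_1 = "search engine ranks web pages".split()
page_2 = "search the web with an engine".split()
score_1 = content_score(page_1, ["search", "engine"])
score_2 = content_score(page_2, ["search", "engine"])
```

Both pages contain each query word once, but `page_1` scores higher because the two words are adjacent.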
Words inside HTML tags • Title, Header • Links (anchors), both for the page the link is on, and the page the link points to • Allows search engines to find files not accessible to a crawler • Makes “Google-bombing” possible
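Crediting link text to the page the link points to can be sketched as follows. This is a minimal illustration with hypothetical names (`AnchorTextParser`, the example URL); it shows how a file the crawler cannot parse, such as a PDF, can still pick up search words from the links that point to it.

```python
from collections import defaultdict
from html.parser import HTMLParser
from urllib.parse import urljoin


class AnchorTextParser(HTMLParser):
    """Collects the words inside each <a> tag, keyed by the link target."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.current_target = None
        self.anchor_words = defaultdict(list)

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            self.current_target = urljoin(self.base_url, href) if href else None

    def handle_endtag(self, tag):
        if tag == "a":
            self.current_target = None

    def handle_data(self, data):
        # Only text that appears inside a link is credited to the target.
        if self.current_target is not None:
            self.anchor_words[self.current_target].extend(data.lower().split())


html = '<p>See the <a href="report.pdf">annual budget report</a> for details.</p>'
parser = AnchorTextParser("http://example.test/docs/")
parser.feed(html)
parser.close()
anchor_words = dict(parser.anchor_words)
```

Because the words come from other people's pages, many sites linking to one target with the same phrase can push that target up for the phrase, which is the mechanism behind “Google-bombing”.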
Words inside HTML tags • Meta tags • Not intended to be rendered by browsers • Supposed to say “what is this page about”, so they would seem ideal for search engines • The use of meta tags to “fool” search engines is a serious drawback.
Pagerank • Algorithm to determine the “importance” of a website. • Developed by Google co-founders Sergey Brin and Lawrence Page • Based on hypertext links
Pagerank • Essentially, the Pagerank of a webpage A is calculated from the Pagerank of all the webpages that link to A. • The more outgoing links a webpage has, the less it contributes to the Pagerank of each of its link targets.
Pagerank Algorithm • Google’s current implementation of Pagerank is secret and probably different from the original. • For more information, see “The Anatomy of a Large-Scale Hypertextual Web Search Engine” by Brin and Page and other resources.
Pagerank Algorithm • We assume page A has pages T1...Tn which point to it (i.e., are citations). The parameter d is a damping factor which can be set between 0 and 1. We usually set d to 0.85 ... Also C(A) is defined as the number of links going out of page A. The PageRank of a page A is given as follows: • PR(A) = (1-d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))
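The formula above can be evaluated by repeated substitution: start every page at the same value and reapply the formula until the values settle. The sketch below is one simple way to do that, not Google's implementation; the function name `pagerank`, the starting value of 1.0, the fixed iteration count, and the three-page link graph are all assumptions made for illustration.

```python
def pagerank(links, d=0.85, iterations=100):
    """Iterate PR(A) = (1-d) + d * sum(PR(T)/C(T)) over all pages T linking to A.

    `links` maps each page to the set of pages it links to.
    """
    pages = set(links) | {t for targets in links.values() for t in targets}
    pr = {page: 1.0 for page in pages}  # assumed starting value
    for _ in range(iterations):
        new_pr = {}
        for page in pages:
            # Each linking page T passes on PR(T) split over its C(T) links.
            incoming = sum(
                pr[source] / len(targets)
                for source, targets in links.items()
                if page in targets
            )
            new_pr[page] = (1 - d) + d * incoming
        pr = new_pr
    return pr


# Assumption: a tiny web where A links to B and C, B links to C, C links to A.
links = {"A": {"B", "C"}, "B": {"C"}, "C": {"A"}}
ranks = pagerank(links)
```

In this small graph C ends up with the highest value, because it receives all of B's contribution plus half of A's, while B receives only half of A's.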
Google • Google uses a combination of Content and Pagerank to rank its search results. • Google is widely regarded as the best search engine. • Demo http://www.google.com/
The Future • The internet will continue to expand. • Research to improve the effectiveness of search engines continues. • Google plans to implement searchable web-based email. • Google’s lunar facility was an April Fool’s joke.
Conclusion • The internet contains billions of webpages. • Search engines allow people to use the internet more effectively. • Tasks performed by an internet search engine • Crawls the web • Indexes web pages • Responds to searches
Conclusion • People can only view a small number of webpages. • The effectiveness of a search engine depends greatly on how it ranks results.
Questions • Any questions or comments? • If you have questions later, ask Google.
References • The Anatomy of a Large-Scale Hypertextual Web Search Engine, by Sergey Brin and Lawrence Page. http://www7.scu.edu.au/programme/fullpapers/1921/com1921.htm • Search Engine Research Papers, by James Thornton. http://jamesthornton.com/search-engine-research/ • When Experts Agree: Using Non-Affiliated Experts to Rank Popular Topics (2000), by Krishna Bharat and George A. Mihaila (Google). www.cs.toronto.edu/~georgem/BM01.html... • Exploiting the Block Structure of the Web for Computing PageRank (2003), by Sepandar Kamvar, Taher Haveliwala, Chris Manning, and Gene Golub (Stanford University). www.stanford.edu/~sdkamvar/papers/blockrank.pdf • Writing a Web Crawler in the Java Programming Language (January 1998), by Thom Blum, Doug Keislar, Jim Wheaton, and Erling Wold of Muscle Fish, LLC. http://java.sun.com/developer/technicalArticles/ThirdParty/WebCrawler/index.html • How Internet Search Engines Work, by Curt Franklin. http://computer.howstuffworks.com/search-engine.htm • Checklist for Search Robot Crawling and Indexing (2003), by Avi Rappoport, Search Tools Consulting. http://www.searchtools.com/robots/robot-checklist.html • WebBase: A repository of Web pages (2000), by Jun Hirai, Sriram Raghavan, Andreas Paepcke, and Hector Garcia-Molina (Stanford University). dbpubs.stanford.edu/pub/showDoc.Fulltext?lang=en&doc=1999-26&format=pdf&compressi…