Internet Search Engines

Internet Search Engines Julia Vuong Aaron Kurtzhals

Introduction • What is an Internet Search Engine • Brief History • Importance • What an Internet Search Engine does. • How it works

Definition • An internet search engine is an automated system used to find information on the world wide web.

Brief History • 1994 – An early search engine, World Wide Web Worm, indexed about 110,000 web pages. • 1997 – Search engines indexed millions of web pages. • Today – Google indexes over 4 billion web pages.

Why are internet search engines important? • The internet contains large amounts of information. • But, this information is not always easy to find. • Where to look for the information? • DNS (Domain Name System) names are not very forgiving.

What does an Internet Search Engine do? • Crawls the web • Indexes web pages • Responds to searches

Spiders take a web page’s content and create key search words that enable online users to find the pages they’re looking for.

Robots (spiders, crawlers, bots) • A program that downloads web pages. • Similar to a browser, but generates machine readable information rather than a human readable display. • Purpose is to create an index of the internet.
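The robot described above can be sketched in a few lines of Python. This is only an illustrative sketch, not how any particular search engine does it: the names PageParser, parse_page and crawl are hypothetical, and the fetch argument is an assumed callable that returns a page's HTML (or None), so the example runs without network access.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class PageParser(HTMLParser):
    """Collects the visible words and outgoing links of one HTML page."""

    def __init__(self):
        super().__init__()
        self.words = []
        self.links = []
        self._skip = 0  # nesting depth inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip > 0:
            self._skip -= 1

    def handle_data(self, data):
        # Ignore script and style bodies; keep everything else as words.
        if self._skip == 0:
            self.words.extend(data.lower().split())


def parse_page(html):
    """Return (words, links) for one page, the machine-readable form."""
    parser = PageParser()
    parser.feed(html)
    parser.close()
    return parser.words, parser.links


def crawl(start_url, fetch, max_pages=100):
    """Breadth-first crawl. fetch(url) is an assumed helper returning HTML or None."""
    seen = {start_url}
    queue = deque([start_url])
    pages = {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        words, links = parse_page(html)
        pages[url] = words
        for href in links:
            target = urljoin(url, href)  # resolve relative links
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return pages
```

A real robot would also need to respect robots.txt, limit its request rate, and handle malformed pages; those concerns are left out here.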

Indexing the Internet • A search engine creates a database of words on the internet. • Information about each instance of the word is also stored.
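One common way to store "words plus information about each instance" is an inverted index. The sketch below assumes the stored information is just the word's position on each page; build_index and the sample URLs are hypothetical names chosen for illustration.

```python
from collections import defaultdict


def build_index(pages):
    """Build an inverted index.

    pages: dict mapping url -> page text.
    Returns a dict mapping word -> {url: [positions of the word on that page]}.
    """
    index = defaultdict(dict)
    for url, text in pages.items():
        for position, word in enumerate(text.lower().split()):
            index[word].setdefault(url, []).append(position)
    return dict(index)


if __name__ == "__main__":
    sample = {"page1": "the cat sat", "page2": "the dog"}
    for word, postings in sorted(build_index(sample).items()):
        print(word, postings)
```

Storing positions rather than a simple yes/no per page is what later makes frequency and proximity ranking possible.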

Executing Searches • Find webpages that contain the desired words. • Webpages that contain the desired words are ranked and displayed to the user.
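The two steps above (find matching pages, then rank them) might look roughly like this. The sketch assumes an index of the form word -> {url: [positions]} and, purely as a placeholder, ranks pages by how often the query words occur; the function name search is hypothetical.

```python
def search(index, query):
    """Return the URLs containing every query word, best match first.

    index: dict mapping word -> {url: [positions]}.
    Ranking by total occurrence count is an assumption made for this sketch.
    """
    terms = query.lower().split()
    if not terms:
        return []
    postings = [index.get(term, {}) for term in terms]

    # Keep only pages that contain all of the desired words.
    matches = set(postings[0])
    for posting in postings[1:]:
        matches &= set(posting)

    def score(url):
        return sum(len(posting[url]) for posting in postings)

    # Highest score first; ties are broken alphabetically by URL.
    return sorted(matches, key=lambda url: (-score(url), url))
```

For example, with an index in which "web" appears twice on p1 and once on p2, and "search" appears once on p1, p2 and p3, the query "web search" returns p1 before p2 and omits p3.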

Problems with searching the Internet • The size and scope of the internet make it difficult to search. • A search engine is not very good at understanding context. • Humans are only able to view a small number of search results.

Ranking Search Results • An internet search can generate thousands of results. • A person is only able to read the first few results. • How does the search engine decide the order of the results?

Ranking Algorithms • No ranking • Paid positioning • Content-based ranking • Pagerank

No ranking • Simple • Requires less storage • Fast • Less helpful to humans • Performance advantages are minimal.

Paid Positioning • Premise: if a website’s owner is willing to pay, the page must be important • Similar but not identical to paid inclusion • Can create a backlash • Does not address the issue of ranking websites in general

Content-based Ranking • Attempt to determine context • Relative proximity of words • How many times a word appears on a page. • Usage in HTML tags
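The three signals listed above can be combined into a single score. The sketch below is one possible combination, not a known production formula: content_score, min_gap and the title_weight value of 3.0 are all assumptions, and the page title stands in for "usage in HTML tags".

```python
def min_gap(positions_a, positions_b):
    """Smallest distance between any occurrence of two different words."""
    return min(abs(a - b) for a in positions_a for b in positions_b)


def content_score(title, body, terms, title_weight=3.0):
    """Score one page for a query using frequency, tag usage and proximity."""
    title_words = title.lower().split()
    body_words = body.lower().split()
    terms = [term.lower() for term in terms]
    positions = {
        term: [i for i, word in enumerate(body_words) if word == term]
        for term in terms
    }

    score = 0.0
    for term in terms:
        score += len(positions[term])                    # how many times it appears
        score += title_weight * title_words.count(term)  # usage in an HTML tag

    # Proximity: query words that appear close together earn a bonus.
    for first, second in zip(terms, terms[1:]):
        if first != second and positions[first] and positions[second]:
            score += 1.0 / min_gap(positions[first], positions[second])
    return score
```

A page titled "Search Engines" whose body contains "search engine" as adjacent words scores higher for the query ["search", "engine"] than a page where the two words are far apart and absent from the title.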

Words inside HTML tags • Title, Header • Links (anchors), both for the page the link is on, and the page the link points to • Allows search engines to find files not accessible to a crawler • Makes “Google-bombing” possible
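Crediting link text to the page the link points to can be sketched as follows. This simplified version credits only the link target (the slide notes that the linking page is credited too), and index_anchor_text and the sample file names are hypothetical.

```python
from collections import defaultdict


def index_anchor_text(links):
    """Credit the words of each link's text to the link target.

    links: iterable of (source_url, target_url, anchor_text) tuples.
    Returns a dict mapping word -> {target_url: number of links using that word}.
    """
    index = defaultdict(lambda: defaultdict(int))
    for _source, target, text in links:
        for word in text.lower().split():
            index[word][target] += 1
    return {word: dict(targets) for word, targets in index.items()}
```

Because the target never has to be downloaded, an image or other file the crawler cannot read still becomes searchable. The same property is what makes "Google-bombing" possible: many pages linking to one target with the same phrase raise that target's count for the phrase.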

Words inside HTML tags • Meta tags • Not intended to be rendered by browsers • Supposed to be “what is this page about”, so it would seem ideal for search engines • The use of meta tags to “fool” search engines is a serious drawback.

Pagerank • Algorithm to determine the “importance” of a website. • Developed by Google co-founders Sergey Brin and Lawrence Page • Based on hypertext links

Pagerank • Essentially, the Pagerank of a webpage A is calculated from the Pagerank of all webpages that link to A. • The more outgoing links a webpage has, the less each link contributes to the Pagerank of its target.

Pagerank Algorithm • Google’s current implementation of Pagerank is secret and probably different from the original. • For more information, see “The Anatomy of a Large-Scale Hypertextual Web Search Engine” by Brin and Page and other resources.

Pagerank Algorithm • We assume page A has pages T1...Tn which point to it (i.e., are citations). The parameter d is a damping factor which can be set between 0 and 1. We usually set d to 0.85 ... Also C(A) is defined as the number of links going out of page A. The PageRank of a page A is given as follows: • PR(A) = (1-d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))
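The formula above defines each page's PageRank in terms of other pages' PageRank, so it is usually evaluated by repeated substitution. The sketch below is a minimal illustration of that idea, not Google's implementation: the function name pagerank, the starting value of 1.0 and the fixed iteration count are assumptions, and pages with no outgoing links simply pass nothing on.

```python
def pagerank(links, d=0.85, iterations=50):
    """Iteratively evaluate PR(A) = (1-d) + d * sum(PR(T)/C(T)).

    links: dict mapping page -> list of pages it links to.
    Returns a dict mapping page -> PageRank.
    """
    pages = set(links)
    for targets in links.values():
        pages.update(targets)

    # incoming[p] lists the pages T1...Tn that point to p.
    incoming = {page: [] for page in pages}
    for source, targets in links.items():
        for target in set(targets):
            incoming[target].append(source)

    # C(p): the number of distinct links going out of p.
    out_count = {page: len(set(links.get(page, []))) for page in pages}

    pr = {page: 1.0 for page in pages}  # assumed starting value
    for _ in range(iterations):
        pr = {
            page: (1 - d) + d * sum(pr[t] / out_count[t] for t in incoming[page])
            for page in pages
        }
    return pr


if __name__ == "__main__":
    example = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
    for page, rank in sorted(pagerank(example).items()):
        print(page, round(rank, 3))
```

In the three-page example, C ends up with the highest rank because both A and B link to it, and B ends up lowest because it receives only half of A's contribution.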

Google • Google uses a combination of content-based ranking and Pagerank to rank its search results. • Google is widely regarded as the best search engine. • Demo http://www.google.com/

The Future • The internet will continue to expand. • Research to improve the effectiveness of search engines continues. • Google plans to implement searchable web-based email. • Google’s lunar facility was an April Fool’s joke.

Conclusion • The internet contains billions of webpages. • Search engines allow people to use the internet more effectively. • Tasks performed by an internet search engine • Crawls the web • Indexes web pages • Responds to searches

Conclusion • People can only view a small number of webpages. • The effectiveness of a search engine depends greatly on how it ranks results.

Questions • Any questions or comments? • If you have questions later, ask Google.

References
• The Anatomy of a Large-Scale Hypertextual Web Search Engine, by Sergey Brin and Lawrence Page. http://www7.scu.edu.au/programme/fullpapers/1921/com1921.htm
• Search Engine Research Papers, by James Thornton. http://jamesthornton.com/search-engine-research/
• When Experts Agree: Using Non-Affiliated Experts to Rank Popular Topics (2000), by Krishna Bharat and George A. Mihaila (Google). www.cs.toronto.edu/~georgem/BM01.html...
• Exploiting the Block Structure of the Web for Computing PageRank (2003), by Sepandar Kamvar, Taher Haveliwala, Chris Manning, and Gene Golub (Stanford University). www.stanford.edu/~sdkamvar/papers/blockrank.pdf
• Writing a Web Crawler in the Java Programming Language (January 1998), by Thom Blum, Doug Keislar, Jim Wheaton, and Erling Wold of Muscle Fish, LLC. http://java.sun.com/developer/technicalArticles/ThirdParty/WebCrawler/index.html
• How Internet Search Engines Work, by Curt Franklin. http://computer.howstuffworks.com/search-engine.htm
• Checklist for Search Robot Crawling and Indexing (2003), by Avi Rappoport, Search Tools Consulting. http://www.searchtools.com/robots/robot-checklist.html
• WebBase: A repository of Web pages (2000), by Jun Hirai, Sriram Raghavan, Andreas Paepcke, and Hector Garcia-Molina (Stanford University). dbpubs.stanford.edu/pub/showDoc.Fulltext?lang=en&doc=1999-26&format=pdf&compressi…
