How Search Engines Work? Ziv Bar-Yossef Department of Electrical Engineering Technion
What is the Internet? • A global network of computers connected to each other • Computers “talk” to each other using standard protocols • TCP/IP
What is the World-Wide Web (WWW)? • Collection of pages available via the Internet • Internet users can view pages with web browsers • WWW is only one application of the Internet • Other applications: email, messengers, VoIP, newsgroups, FTP
Web Pages • Various formats • pdf, word, excel, images, mp3, video, text • Most popular format: HTML • HTML pages point to each other using hyperlinks • Users “surf the web” by clicking hyperlinks
What are Search Engines? • Users have “information needs” • Where can I find solutions to my math homework problem? • Where can I find mp3s of Miri Messika’s latest album? • What is the weather in Eilat during Hanukkah? • Which famous Sharons are there besides our prime minister? • Search engines enable us to find web pages that match our information needs
Search Engines • “Information need”: What other Sharons are famous, except for our prime minister? • User query: sharon -ariel • The search engine matches the query against web pages • Ranked list of matching pages: Sharon Creech; Sharon Stone; Sharon, Massachusetts
How Search Engines (don’t) Work? • Common misconception: when a user submits a query, the search engine scans all web pages to find the relevant matches • [Diagram: the query sharon -ariel goes to the search engine, which scans the Web directly and returns a ranked list of matching pages: Sharon Creech; Sharon Stone; Sharon, Massachusetts]
How Search Engines Work? • What do you do when you look for a term in an encyclopedia? • Use the index! • [Diagram: the query sharon -ariel goes to the search engine, which consults its index of web pages and returns a ranked list of matching pages: Sharon Creech; Sharon Stone; Sharon, Massachusetts]
Search Engine Architecture • Crawler • Index • Query Processor • Ranking Algorithm
Web Crawler (a.k.a. Spider) • Fetches web pages and stores them in a local repository • Tries to get as many web pages as possible • Follows hyperlinks to learn about new pages • Refetches pages that change frequently
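The crawling loop described above can be sketched as a breadth-first traversal. This is a minimal illustration, not how any particular search engine crawls: `TOY_WEB`, the `*.example` URLs, and the `fetch_links` callback are hypothetical stand-ins for real HTTP fetching and link extraction, and refetching of changed pages is left out.

```python
from collections import deque

# Hypothetical in-memory "web": each URL maps to the URLs it links to.
TOY_WEB = {
    "a.example": ["b.example", "c.example"],
    "b.example": ["c.example"],
    "c.example": ["a.example", "d.example"],
    "d.example": [],
}

def crawl(seed, fetch_links, max_pages=100):
    """Breadth-first crawl: fetch a page, store it, queue links not seen before."""
    seen = {seed}
    frontier = deque([seed])
    repository = []  # pages in the order they were fetched
    while frontier and len(repository) < max_pages:
        url = frontier.popleft()
        repository.append(url)
        for link in fetch_links(url):
            if link not in seen:  # follow hyperlinks to learn about new pages
                seen.add(link)
                frontier.append(link)
    return repository

crawled = crawl("a.example", lambda url: TOY_WEB.get(url, []))
print(crawled)  # ['a.example', 'b.example', 'c.example', 'd.example']
```

The `seen` set is what keeps the crawler from looping forever on pages that link back to each other.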
The Index
Example pages (the number after each word is its position in the page):
• www.cnn.com: Ariel(1) Sharon(2), the(3) prime(4) minister(5) of(6) Israel(7) founded(8) a(9) new(10) political(11) party(12).
• www.hollywood.com: Sharon(1) Stone(2) dressed(3) a(4) new(5) Jean(6) Paul(7) Gaultier(8) gown(9) at(10) the(11) Oscars(12) after(13) party(14).
Index (term: list of (page, position) entries):
ariel: (cnn.com,1)
dress: (hollywood.com,3)
found: (cnn.com,8)
gaultier: (hollywood.com,8)
gown: (hollywood.com,9)
israel: (cnn.com,7)
jean: (hollywood.com,6)
minister: (cnn.com,5)
new: (cnn.com,10), (hollywood.com,5)
oscar: (hollywood.com,12)
party: (cnn.com,12), (hollywood.com,14)
paul: (hollywood.com,7)
political: (cnn.com,11)
prime: (cnn.com,4)
sharon: (cnn.com,2), (hollywood.com,1)
stone: (hollywood.com,2)
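An index of this kind can be built with a few lines of code. The sketch below uses the two example pages from the slide and makes two simplifying assumptions: the stop-word list is guessed from the words missing in the slide's index, and no stemming is done, so it stores "founded" and "dressed" where the slide shows "found" and "dress".

```python
import re
from collections import defaultdict

DOCS = {
    "cnn.com": "Ariel Sharon, the prime minister of Israel founded a new political party.",
    "hollywood.com": "Sharon Stone dressed a new Jean Paul Gaultier gown at the Oscars after party.",
}

# Assumption: words skipped by the index, inferred from the slide's example.
STOP_WORDS = {"the", "of", "a", "at", "after"}

def build_index(docs):
    """Map each term to its posting list of (page, word position) pairs."""
    index = defaultdict(list)
    for url, text in docs.items():
        words = re.findall(r"[a-z]+", text.lower())
        # Positions count every word, including stop words, starting from 1.
        for position, word in enumerate(words, start=1):
            if word not in STOP_WORDS:
                index[word].append((url, position))
    return index

index = build_index(DOCS)
print(index["sharon"])  # [('cnn.com', 2), ('hollywood.com', 1)]
```

Looking up a term is then a single dictionary access, which is why answering a query does not require scanning the pages themselves.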
Index by “Anchor Text” • Anchor text: what’s written inside a link • Example: Ariel Sharon, the prime minister… • Usually succinctly describes what’s written in the linked page • Under which terms is a page listed in the index? • Terms that appear in the page • Terms that appear in the anchor text of links to the page
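Indexing by anchor text means the terms of a link are credited to the page the link points to, not to the page that contains it. A minimal sketch, in which the link triples and the `*.example` source pages are hypothetical:

```python
from collections import defaultdict

# Hypothetical links: (source page, anchor text, target page).
LINKS = [
    ("news.example", "Ariel Sharon", "cnn.com"),
    ("fans.example", "Sharon Stone photos", "hollywood.com"),
]

def index_anchor_text(links):
    """List each target page under the terms of the anchor text pointing to it."""
    index = defaultdict(set)
    for _source, anchor, target in links:
        for term in anchor.lower().split():
            index[term].add(target)
    return index

anchor_index = index_anchor_text(LINKS)
print(sorted(anchor_index["sharon"]))  # ['cnn.com', 'hollywood.com']
print(sorted(anchor_index["photos"]))  # ['hollywood.com']
```

Here hollywood.com becomes findable under "photos" even if that word never appears in the page itself.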
Query Processor • Gets a user query • Fetches the relevant posting lists from the index • Extracts the matching pages from the lists • Example: Query = “sharon -ariel” • L1 = posting list of sharon: (cnn.com,2), (hollywood.com,1) • L2 = posting list of ariel: (cnn.com,1) • Return all pages in L1 that do not occur in L2: hollywood.com
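The example query can be traced in code. This is a sketch under simple assumptions: the query syntax is reduced to plain terms (all required) and terms prefixed with "-" (excluded), and `pages` and `process_query` are illustrative names, not part of any real engine.

```python
# Posting lists for the two terms in the example query.
INDEX = {
    "sharon": [("cnn.com", 2), ("hollywood.com", 1)],
    "ariel": [("cnn.com", 1)],
}

def pages(term, index):
    """Set of pages that appear in the posting list of a term."""
    return {url for url, _position in index.get(term, [])}

def process_query(query, index):
    """Return pages containing every plain term and none of the '-' terms."""
    include, exclude = [], []
    for token in query.lower().split():
        if token.startswith("-"):
            exclude.append(token[1:])
        else:
            include.append(token)
    if not include:
        return set()
    matches = set.intersection(*(pages(term, index) for term in include))
    for term in exclude:
        matches -= pages(term, index)  # drop pages that occur in an excluded list
    return matches

print(process_query("sharon -ariel", INDEX))  # {'hollywood.com'}
```

cnn.com is in both posting lists, so it is removed; only hollywood.com mentions sharon without ariel.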
Ranking Algorithm • Many queries have many matching pages • 472 million matches for “London” in Google • Cannot return all of them to the user • User needs the most relevant results anyway • Need to order results by relevance • Most relevant results are at the top • Ranking algorithm: a method of ordering matches • The “heart” of a search engine • The reason why Google is the most preferred search engine today
Google’s PageRank • Ranking as an election • Candidates: all web pages • Voters: all web pages • p votes for q if p has a hyperlink to q • Favorites(p) = all the pages p votes for • Fans(p) = all the pages that vote for p • PageRank(p) = 1 if p has no fans; otherwise PageRank(p) = sum, over all q in Fans(p), of PageRank(q) / |Favorites(q)|
Google’s PageRank • Underlying principles: • A page is “important” if it has important fans • A page splits its “importance” evenly among its favorite pages • [Figure: example link graph with page scores 1, 1, 1, 1, 1.5, 2.5, and 4]
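The two principles can be turned into a small computation. The sketch below follows the simplified rule from these slides (a page with no fans scores 1; any other page receives an equal share of each fan's score) and recomputes all scores repeatedly until they settle. The four-page graph is hypothetical and is not the graph from the slide's figure. Note the assumptions: on graphs with link cycles this simplified rule need not settle, and the published PageRank algorithm adds a damping factor that is omitted here.

```python
def simple_pagerank(favorites, iterations=50):
    """Simplified PageRank: favorites maps each page to the pages it links to.

    Every linked page must also appear as a key in favorites.
    """
    # Invert the link graph: fans[p] lists the pages that link to p.
    fans = {page: [] for page in favorites}
    for page, targets in favorites.items():
        for target in targets:
            fans[target].append(page)

    rank = {page: 1.0 for page in favorites}
    for _ in range(iterations):
        rank = {
            page: (
                # Each fan q splits its score evenly among its favorites.
                sum(rank[q] / len(favorites[q]) for q in fans[page])
                if fans[page]
                else 1.0  # a page with no fans scores 1
            )
            for page in favorites
        }
    return rank

# Hypothetical graph: a -> c, b -> c, b -> d, c -> d.
GRAPH = {"a": ["c"], "b": ["c", "d"], "c": ["d"], "d": []}
ranks = simple_pagerank(GRAPH)
print(ranks)  # {'a': 1.0, 'b': 1.0, 'c': 1.5, 'd': 2.0}
```

Page c gets 1 from a plus half of b's score, and d gets the other half of b's score plus all of c's 1.5.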
Google’s PageRank • Ranking algorithm: • Find pages that match the given query • Order them by their PageRank • Return top 10 matches
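The three steps above can be sketched as a single function. The page names and scores are hypothetical, and ties are broken alphabetically only to make the output deterministic; the slides do not say how ties are handled.

```python
def rank_results(matches, pagerank, top_k=10):
    """Order matching pages by descending PageRank and keep the top_k."""
    ordered = sorted(matches, key=lambda page: (-pagerank.get(page, 0.0), page))
    return ordered[:top_k]

# Hypothetical precomputed scores and query matches.
PAGERANK = {"p1.example": 1.0, "p2.example": 4.0, "p3.example": 2.5}
matches = {"p1.example", "p2.example", "p3.example"}

top = rank_results(matches, PAGERANK, top_k=2)
print(top)  # ['p2.example', 'p3.example']
```

Because the scores depend only on the link structure, they can be computed ahead of time; at query time only the sort is needed.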
Conclusions • Search engines use an index to answer user queries • Ranking is the most important component • Spam is a problem