Slide 1:CS5286 Algorithms And Techniques for Web Search
Objective: Provide a practical introduction to algorithms and techniques for information retrieval over the Internet.
Slide 2:Contact
Lecturer: Professor DENG, Xiaotie. Room Y6321, Ext 8632, Email: csdeng
TA: SUN Wei. Room CYC2207, Ext 8030, Email: sunwei@cs
Slide 3:Assessment
Coursework: 50%
- Quizzes: 20% (two quizzes, each worth 10% of the final mark).
- Group project: 27% (2-3 people per group).
- Participation: 3%, one point each for the Discussion Forum, tutorials, and classes.
Examination: 50%
- One 1.5-hour examination.
- At least 30% of the examination marks are required to pass.
Slide 4:Reference Books
- Modern Information Retrieval, by Ricardo Baeza-Yates and Berthier Ribeiro-Neto, Addison Wesley, 1999.
- Guide to Search Engines, by Wes Sonnenreich and Tim Macinta, Wiley Computer Publishing, 1998.
Slide 5:Students Will Acquire The Following
Web access:
- Automated access to existing search engines
- The use of spiders/robots for web searching
- Collection of visitor information for one’s own web site
Web mining:
- Ranking techniques for web sites on specific topics
- Automated abstract generation
- User profiles
Information retrieval:
- Basic models
- Major query operations
- Indexing and searching
- New research topics
Slide 6:Some Helpful Web Sites
- A history of search engines: http://www.wiley.com/legacy/compbooks/sonnenreich/webdev/history.html
- Java and the class URL (look under the java.net package): http://java.sun.com/j2se/1.3/docs/api/index.html
- Free search engines written in Java: http://www.freewarejava.com/applets/search.shtml
- Robots: http://www.robotstxt.org/wc/robots.html
Slide 7:Tentative Lecture Plan
- The Internet and Web
- Collection of Information over the Web
- Quiz 1
- Models of Information Retrieval
- Query techniques
- Quiz 2
- Start of Project
- Text Operations
- Indexing and Searching Techniques
Slide 8:Tentative Tutorial Session Plan
Purpose: to provide hands-on learning experience.
Materials to be covered:
- Review of Java and linking to the Internet
- Functionality of a spider/robot
- Access to major search engines
- A simple search engine in Java
In addition, the tutorial sessions will include:
- Submission and discussion of the project proposal and plan
- Project presentation
Slide 9:Plan For The Group Project
Two or three people per group.
It is best to choose a project that applies one of the following available tools to some application problem:
- Spider/robot
- Major search engines
- The simple search engine in Java
Some examples of possible projects:
- Build a network map of co-authorship relations.
- Build “relationship” networks by Internet information retrieval.
- Design a method to test which search engine covers more web pages.
Start your project as early as possible.
Slide 10:Pre-Requisites
- You know how to program in Java, or you are capable of learning Java programming in one week or so.
- DROP the course if you cannot.
- We will hold a quick quiz on Java to determine whether the course is suitable for you.
Slide 11:Lecture 1: Introduction
Slide 12:A Simple Search Engine Architecture
[Diagram components: Web, User, Spider, Indexer, Query Interface, Query Engine, Database]
Slide 13: Major issues
- Spider, and communication between the computer and the Internet
- Data/document model for information retrieval
- Query protocol design
- User profile techniques
- Interactive information retrieval technique design
Slide 14: Spiders
Automatically retrieve web pages:
- Start with a URL.
- Retrieve the associated web page.
- Find all URLs on the web page.
- Recursively retrieve URLs that have not yet been searched.
Algorithmic issues:
- How to choose the next URL?
- Avoid overloaded sub-networks.
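The retrieval loop on this slide can be sketched in a few lines of Java. This is only an illustration under stated assumptions: the class name `CrawlOrderSketch` and the in-memory `web` map (URL to outgoing links) are hypothetical stand-ins for real HTTP fetching, and the slide's "recursive" retrieval is written here as an iterative breadth-first loop over a queue, which is one possible answer to "how to choose the next URL", not the only one.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: the "web" is an in-memory map from a URL to the
// URLs found on that page, so no network traffic is generated.
final class CrawlOrderSketch {

    // Returns the URLs in the order they were visited, at most maxPages of them.
    static List<String> crawl(String startUrl, Map<String, List<String>> web, int maxPages) {
        Deque<String> frontier = new ArrayDeque<>(); // URLs waiting to be retrieved
        Set<String> seen = new HashSet<>();          // URLs already queued or visited
        List<String> visited = new ArrayList<>();

        frontier.add(startUrl);
        seen.add(startUrl);

        while (!frontier.isEmpty() && visited.size() < maxPages) {
            String url = frontier.poll();            // FIFO order gives breadth-first search
            visited.add(url);                        // stands in for "retrieve the page"
            for (String link : web.getOrDefault(url, List.of())) {
                if (seen.add(link)) {                // add() is false if the URL was seen before
                    frontier.add(link);
                }
            }
        }
        return visited;
    }
}
```

Replacing the FIFO queue with a priority queue would change the "next URL" policy without touching the rest of the loop; the `maxPages` limit is a simple guard against unbounded crawling.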
Slide 15:Indexer
Selects terms to index for a document.
- May use co-operation from web page authors, through META tags that indicate specific terms to index:
  <META name="keywords" content="information retrieval">
Algorithmic issues:
- How to choose terms, phrases, or other entities to index so as to respond to user queries accurately and efficiently.
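One common way to organise the selected terms is an inverted index, which maps each term to the documents that contain it. The sketch below is a minimal illustration, not the course's indexer: the class name `InvertedIndexSketch`, the integer document ids, and the "every alphanumeric word is a term" rule are all assumptions made for the example.

```java
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical sketch of an inverted index: term -> ids of documents containing it.
final class InvertedIndexSketch {
    private final Map<String, Set<Integer>> postings = new TreeMap<>();

    // Indexes every lower-cased alphanumeric word of the text (a naive term choice).
    void addDocument(int docId, String text) {
        for (String term : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!term.isEmpty()) {
                postings.computeIfAbsent(term, t -> new TreeSet<>()).add(docId);
            }
        }
    }

    // Returns the ids of documents containing the term, or an empty set.
    Set<Integer> lookup(String term) {
        return postings.getOrDefault(term.toLowerCase(Locale.ROOT), Set.of());
    }
}
```

A real indexer would usually drop very common words and might weight terms taken from META keywords differently; both are left out here to keep the structure visible.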
Slide 16:Database
Trade-off of hardware/speed efficiency.
Algorithmic issues:
- Efficiency in space
- Redundancy as a trade-off for speed in query response
Cost efficiency:
- How many computers to use?
- How to distribute load efficiently?
Slide 17:Query Engine
Returns the most relevant documents for queries.
Algorithmic issues:
- Document model
- Relevance analysis
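To make "relevance analysis" concrete, here is a deliberately naive sketch that scores a document by how often the query terms occur in it and returns matching documents in descending score order. The class name `TermCountRanker`, the string document ids, and the scoring rule itself are assumptions for illustration; the document models covered later in the course are more refined than this.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch: relevance = number of occurrences of query terms in the document.
final class TermCountRanker {

    static int score(String query, String document) {
        List<String> docTerms = tokens(document);
        int score = 0;
        for (String q : tokens(query)) {
            for (String d : docTerms) {
                if (d.equals(q)) {
                    score++;
                }
            }
        }
        return score;
    }

    // Returns ids of documents with a non-zero score, best first; ties broken by id.
    static List<String> rank(String query, Map<String, String> docs) {
        List<String> ids = new ArrayList<>(docs.keySet());
        ids.removeIf(id -> score(query, docs.get(id)) == 0);
        ids.sort((a, b) -> {
            int byScore = Integer.compare(score(query, docs.get(b)), score(query, docs.get(a)));
            return byScore != 0 ? byScore : a.compareTo(b);
        });
        return ids;
    }

    private static List<String> tokens(String text) {
        List<String> out = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!t.isEmpty()) {
                out.add(t);
            }
        }
        return out;
    }
}
```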
Slide 18:Query Interface
- Analyse user profiles.
- Generate user-specific query results.
Algorithmic issues:
- Design of efficient and user-friendly query protocols
Slide 19: Interesting Problems
- Finding the needle in the haystack: search for specific information on the Internet.
- User-specific ranking of documents on the web: how to collect and apply user information to provide better service.
- Trust analysis of information on the web: avoid providing false information.
- Trustworthiness analysis of virtual identities over the Internet: http://www.firstgov.gov/Citizen/Topics/Internet_Fraud.shtml
Slide 20:Some Facts about the Internet
Slide 21:Statistics About Internet
- Internet domain growth: http://www.isc.org/index.pl?/ops/ds/
- How the Internet Domain Survey is conducted: http://www.isc.org/ds/faq.html
Slide 22:Internet Growth Charts
http://www.cyveillance.com/web/us/newsroom/releases/2000/2000-07-10.htm
Slide 23:Internet Provides Varieties of Information
- Text documents
- Multimedia files
- Interactive information services
- Internet group membership services
- Databases
- Frauds: Trojan horses and phishing tricks
Slide 24:Major Features of Information Retrieval on the Internet
- Large amount of information
- Rapid information update
- Dynamic hyperlink structure
- Variety of data formats, languages, and quality
Slide 25:Some Difficulties for Internet Information Retrieval Systems
- Diversified user base (from laymen to computer nerds): could we develop an evolving system that adapts to the user?
- Language ambiguity: this becomes an especially important issue because of the variety of data on the Internet. How do we collect and apply user profiling techniques to resolve it?
Slide 26:Search Engines Today
Slide 27:Evolving Search Engines
Tools for finding information on the Web
- Problem: “hidden” databases, e.g. New York Times
Directory: a hand-constructed hierarchy of topics (e.g. Yahoo)
Search engine: a machine-constructed index (usually by keyword)
Interactive searching: http://www.learnthenet.com/english/html/78tutorial.htm
Specialized searching, e.g. Google Scholar: http://www.scholar.google.com/
Guide to finding search engines: http://www.searchenginecolossus.com/
New trends in search engines: http://www.searchengineshowdown.com/
Slide 28:Coverage of Search Engine
Number of web pages covered:
- Self-reported by each search engine.
- May include pages known only through links, without the page itself being analyzed.
Page depth:
- The maximum amount of information indexed for an individual web page.
- http://blog.searchenginewatch.com/blog/041111-084221
Slide 29:Search Engine Sizes (Apr. 6, 2001)
SOURCE: SEARCHENGINEWATCH.COM
Legend: AV=AltaVista, EX=Excite, FAST=FAST, GG=Google, Go=Go (Infoseek), INK=Inktomi, NL=Northern Light, WT=WebTop.com
Estimated total web pages: ~2 billion
Shaded data for GG and INK includes pages indexed but not visited.
Searches/day (millions), as printed on the chart: 100, 12, 50, 47, 50, 5
Slide 30:Search Engine Sizes (Dec 11, 2001)
SOURCE: http://searchenginewatch.com/reports/sizes.html
Legend: AV=AltaVista, EX=Excite, FAST=FAST, GG=Google, Go=Go (Infoseek), INK=Inktomi, NL=Northern Light, WT=WebTop.com
Slide 31:Search Engine Size Trends
SOURCE: http://searchenginewatch.com/reports/article.php/2156481#trend
Slide 32:Search Engines Disjointness
SOURCE: SEARCHENGINESHOWDOWN
Slide 33:Search Engines Uniqueness
SOURCE: http://www.searchengineshowdown.com/stats/overlap.shtml
Slide 34:Time Spent Per Visitor (minutes) by Search Engine, April 1999
SOURCE: http://www.nielsen-netratings.com/
Legend: AV=AltaVista, EX=Excite, Go/IS=Go/Infoseek, GT=GoTo, HB=Hotbot, LS=LookSmart, LY=Lycos, MSN=MSN, NS=Netscape, WC=Webcrawler, YH=Yahoo
Slide 35:Time Spent Per Visitor (minutes) by Search Engine, June 2002
SOURCE: http://searchenginewatch.com/reports/netratings.html
Legend: MSN=MSN, YH=Yahoo, GG=Google, AOL=AOL, AJ=Ask Jeeves, IS=InfoSpace, OVR=Overture (GoTo), AV=AltaVista, NS=Netscape, LS=LookSmart, LY=Lycos, DP=Dogpile
Slide 36:Total Hours Spent (millions) by Search Engine, June 2002
SOURCE: http://searchenginewatch.com/reports/netratings.html
Legend: MSN=MSN, YH=Yahoo, GG=Google, AOL=AOL, AJ=Ask Jeeves, IS=InfoSpace, OVR=Overture (GoTo), AV=AltaVista, NS=Netscape, LS=LookSmart, LY=Lycos, DP=Dogpile
Slide 37:Audience Reach by Search Engine, July 2001
SOURCE: http://wreportus.mediametrix.com/clientCenter.html
Legend: AJ=Ask Jeeves, AV=AltaVista, DH=Direct Hit, DP=Dogpile, EX=Excite, GG=Google, GO=Go/Infoseek, G2N=GoTo, HB=Hotbot, iWN=iWon, LS=LookSmart, LY=Lycos, MC=Metacrawler, MM=Mamma, MSN=MSN, NL=Northern Light, NS=Netscape, WC=Webcrawler, YH=Yahoo
Audience reach = % of active surfers visiting during the month. Totals exceed 100% because of overlap.
Slide 38:Audience Reach by Search Engine, Mar. 2002
SOURCE: http://searchenginewatch.com/reports/mediametrix.html
Legend: MSN=MSN, YH=Yahoo, GG=Google, AOL=AOL, AJ=Ask Jeeves, LS=LookSmart, ISP=InfoSpace, NS=Netscape, OVR=Overture (GoTo)
Audience reach = % of active surfers visiting during the month. Totals exceed 100% because of overlap.
Slide 39:Start With Spider
Slide 40:Spider Architecture
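[Diagram components: Web Space, spiders (several url_spider instances), Shared URL pool, Database Interface, Database]
[Diagram arrows: Http Request and Http Response between the spiders and the Web Space; "Get a URL" and "Add a new URL" between the spiders and the Shared URL pool]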
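The shared URL pool in this architecture has to be safe for several spiders to use at once. A minimal sketch, assuming the pool only needs the two operations named in the diagram, is shown below; the class name `SharedUrlPool` and its method names are hypothetical, and the concurrent collections from `java.util.concurrent` are one possible way to provide the thread safety.

```java
import java.util.Queue;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentLinkedQueue;

// Hypothetical sketch of a URL pool shared by several spider threads.
final class SharedUrlPool {
    private final Queue<String> pending = new ConcurrentLinkedQueue<>();  // URLs not yet fetched
    private final Set<String> known = ConcurrentHashMap.newKeySet();      // every URL ever added

    // "Add a new URL": queues the URL only the first time it is seen.
    boolean add(String url) {
        if (known.add(url)) {
            pending.add(url);
            return true;
        }
        return false;
    }

    // "Get a URL": returns the next pending URL, or null if none is waiting.
    String next() {
        return pending.poll();
    }
}
```

Each spider thread would loop on `next()`, fetch the page, and call `add()` for every link it finds; because duplicates are rejected in `add()`, two spiders should not end up fetching the same URL.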
Slide 41:Communication
- How a web browser communicates with the computer
- How a browser communicates with the Internet
- How data travels through the Internet
- How a web browser communicates with a web server
Slide 42:Web Browser
A primary tool for gathering information from the Internet. Examples:
- Netscape Navigator (now succeeded by Firefox)
- Microsoft’s Internet Explorer
Slide 43:Web Server
- It connects the computer that hosts the pages to the Internet.
- It serves Web pages to browsers.
- It usually listens on TCP port 80 (the default port for HTTP).
Slide 44:Uniform Resource Locator(URL)
- The address of a web page on the net.
- The web server waits at this address for requests from browsers.
- A web browser uses the URL to travel to the address and request the desired web page from the web server.
- If the web server returns the page, the browser displays it to the user.
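The parts of a URL that a browser needs (protocol, server, port, page path) can be pulled apart with the standard `java.net.URI` class. The sketch below is illustrative only: the class name `UrlPartsSketch` and the output format are made up for the example, and the fallback of 443 for https and 80 otherwise is a simplifying assumption for URLs that do not state a port.

```java
import java.net.URI;

// Hypothetical sketch: split a URL into the pieces a browser or spider needs.
final class UrlPartsSketch {

    // Returns "scheme | host | port | path" for an absolute http(s) URL.
    static String describe(String url) {
        URI uri = URI.create(url);
        int port = uri.getPort();                  // -1 when the URL does not state a port
        if (port == -1) {
            port = "https".equals(uri.getScheme()) ? 443 : 80; // assumed defaults
        }
        return uri.getScheme() + " | " + uri.getHost() + " | " + port + " | " + uri.getPath();
    }
}
```

For example, `describe("http://www.robotstxt.org/wc/robots.html")` separates the server name `www.robotstxt.org` from the page path `/wc/robots.html`.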
Slide 45:TCP/IP for Internet Connection
- IP stands for Internet Protocol.
- TCP stands for Transmission Control Protocol.
- TCP is layered on top of IP.
- The resulting communication system is TCP/IP.
Slide 46:The IP layer
- The inter-network layer.
- Data are broken down into packets (of limited, not necessarily equal, size) and sent to their destinations.
- IP address (IPv4): consists of four 8-bit numbers, for example 144.214.37.200.
- Routers use the IP address to send packets to their destinations.
- Packets of the same stream of data may travel along different routes.
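Because an IPv4 address is four 8-bit numbers, it fits in 32 bits. The following sketch shows the arithmetic; the class name `Ipv4Sketch` is hypothetical, and a `long` is used only because Java has no unsigned 32-bit integer type.

```java
// Hypothetical sketch: convert a dotted IPv4 address to its 32-bit value.
final class Ipv4Sketch {

    static long toLong(String dotted) {
        String[] parts = dotted.split("\\.");
        if (parts.length != 4) {
            throw new IllegalArgumentException("expected four numbers: " + dotted);
        }
        long value = 0;
        for (String part : parts) {
            int octet = Integer.parseInt(part);
            if (octet < 0 || octet > 255) {        // each number must fit in 8 bits
                throw new IllegalArgumentException("not an 8-bit number: " + part);
            }
            value = (value << 8) | octet;          // shift earlier octets left, append this one
        }
        return value;
    }
}
```

For the slide's example, `toLong("144.214.37.200")` equals `(144L << 24) | (214L << 16) | (37L << 8) | 200L`.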
Slide 47:The TCP layer
- A service provider protocol.
- Provides a logical connection between the sender and the receiver of data over the unreliable network.
- Its data-integrity support functions and mechanisms are the basis for application services such as FTP, Telnet, etc.
Slide 48:TCP/IP Port Number
- One for each specific application-layer service.
- Used between two host computers to identify which application program is to receive the incoming traffic.
- Ports 0-1023 are pre-assigned and are called well-known ports.
- If you want to assign a port number to your own application, use a number above 1023.
Slide 49:Browser/Server Interaction
1. You type a URL (or click on one).
2. Your browser opens a connection with the web server at the URL.
3. Your browser tells the web server which page you want.
4. The web server sends back a response giving information about the page, then sends back the page itself.
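In HTTP terms, "telling the server which page you want" is a short text request, and the "information about the page" starts with a status line. The sketch below builds such a request and reads the status code from a status line without opening a connection; the class name `HttpExchangeSketch` and its methods are hypothetical, and HTTP/1.1 with a `Connection: close` header is an assumption made to keep the example simple.

```java
// Hypothetical sketch of the text exchanged between a browser and a web server.
final class HttpExchangeSketch {

    // Builds a minimal HTTP/1.1 GET request for the given host and path.
    static String buildGetRequest(String host, String path) {
        return "GET " + path + " HTTP/1.1\r\n"
             + "Host: " + host + "\r\n"
             + "Connection: close\r\n"
             + "\r\n";                             // blank line ends the request headers
    }

    // Extracts the numeric status code from a status line such as "HTTP/1.1 200 OK".
    static int statusCode(String statusLine) {
        String[] parts = statusLine.split(" ", 3);
        if (parts.length < 2) {
            throw new IllegalArgumentException("not a status line: " + statusLine);
        }
        return Integer.parseInt(parts[1]);
    }
}
```

Writing the string returned by `buildGetRequest` to a `java.net.Socket` connected to the server's port 80 and reading the reply would carry out the exchange for real; that step is left out so the example runs without network access.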
Slide 50:The Spider
- Does the same automatically (without clicking on a link or typing a URL).
- It is an automated program that searches the web:
  - Reads a web page
  - Stores/indexes the relevant information on the page
  - Follows all the links on the page (and repeats the above for each link)
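"Follow all the links on the page" first requires finding them. The sketch below pulls `href` values out of anchor tags with a regular expression; the class name `LinkExtractorSketch` is hypothetical, and a regex is a rough approximation that only handles quoted attribute values, where a production spider would more likely use an HTML parser.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: collect the href values of <a> tags in a page.
final class LinkExtractorSketch {
    // Matches <a ... href="value"> or <a ... href='value'>, ignoring case.
    private static final Pattern HREF = Pattern.compile(
            "<a\\s[^>]*?href\\s*=\\s*[\"']([^\"']+)[\"']", Pattern.CASE_INSENSITIVE);

    static List<String> links(String html) {
        List<String> out = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            out.add(m.group(1));                   // group 1 is the quoted href value
        }
        return out;
    }
}
```

The extracted values may be relative (such as `/b.html`), so a spider would still need to resolve them against the page's own URL before fetching.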
Slide 51:Caution About Using A Spider
- A poorly written spider may generate an unexpected amount of traffic load.
- Be responsible for your actions.
- Use a well-tested spider instead of writing your own.
- Test it locally before running it over the Internet.
- Follow the standard guidelines: www.robotstxt.org/wc/guidelines.html
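One concrete part of behaving responsibly is honouring a site's robots.txt file. The sketch below is a simplified reading of that file, not a complete implementation of the robots exclusion rules: it only collects `Disallow` prefixes that apply to all robots (`User-agent: *`) and ignores `Allow` lines and wildcards. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical, simplified robots.txt check for "User-agent: *" groups only.
final class RobotsTxtSketch {

    // Returns the Disallow path prefixes that apply to every robot.
    static List<String> disallowedForAll(String robotsTxt) {
        List<String> rules = new ArrayList<>();
        boolean applies = false;        // true while inside a group that covers "*"
        boolean lastWasAgent = false;   // consecutive User-agent lines share one group
        for (String raw : robotsTxt.split("\\r?\\n")) {
            String line = raw.replaceFirst("#.*", "").trim();   // drop comments
            int colon = line.indexOf(':');
            if (colon < 0) {
                continue;
            }
            String field = line.substring(0, colon).trim().toLowerCase(Locale.ROOT);
            String value = line.substring(colon + 1).trim();
            if (field.equals("user-agent")) {
                boolean star = value.equals("*");
                applies = lastWasAgent ? (applies || star) : star;
                lastWasAgent = true;
            } else {
                lastWasAgent = false;
                if (field.equals("disallow") && applies && !value.isEmpty()) {
                    rules.add(value);
                }
            }
        }
        return rules;
    }

    // A path is allowed unless it starts with one of the disallowed prefixes.
    static boolean allowed(String path, List<String> disallowed) {
        for (String prefix : disallowed) {
            if (path.startsWith(prefix)) {
                return false;
            }
        }
        return true;
    }
}
```

A spider would fetch `/robots.txt` once per host, keep the resulting list, and call `allowed` before requesting each page; adding a pause between requests to the same host is another simple way to limit load.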
Slide 52:Tutorials
- Start with a review of Java
- Then how to connect to the Internet
- Use of a spider
- Major functionality of a search engine
In addition, certain tasks will be assigned so that you gain first-hand experience.
Slide 53:Today’s Tutorial
- A typical Java program
- A typical Java program that takes a URL as input and returns the content of the web page
- Some further questions will be left as exercises.
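A program of the second kind might look roughly like the sketch below. It is not the tutorial's actual program: the class name `FetchPage` is hypothetical, the page is assumed to be UTF-8 text, and error handling is reduced to letting `IOException` propagate.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URI;
import java.net.URL;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch: read the content behind a URL and return it as text.
final class FetchPage {

    static String fetch(URL url) throws IOException {
        StringBuilder page = new StringBuilder();
        // openStream() connects to the URL and returns the response body as a stream.
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(url.openStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = in.readLine()) != null) {
                page.append(line).append('\n');
            }
        }
        return page.toString();
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java FetchPage <url>");
            return;
        }
        System.out.print(fetch(URI.create(args[0]).toURL()));
    }
}
```

Because `java.net.URL` also understands `file:` addresses, the same method can be tried on a local HTML file first, which fits the advice to test locally before running over the Internet.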
Slide 54:Next Week’s Tutorial
- Introduction to Java network programming
- Introduction to HTTP
- The Java URL class for establishing an HTTP connection