Searching the Web Representation and Management of Data on the Internet
Goal • To better understand Web search engines: • Fundamental concepts • Main challenges • Design issues • Implementation techniques and algorithms
What does it do? • Processes user queries • Finds pages with related information • Returns a list of resources • Is it really that simple? • Is creating a search engine much more difficult than ex1 + ex2?
Motivation • The web is • Used by millions • Contains lots of information • Link based • Incoherent • Changes rapidly • Distributed • Traditional information retrieval was built with the exact opposite in mind
The Web’s Characteristics • Size • Over a billion pages available (Google is a misspelling of googol = 10^100) • 5-10K per page => tens of terabytes • Size doubles every 2 years • Change • 23% of pages change daily • About half of the pages no longer exist after 10 days • Bowtie structure
Bowtie Structure • Core: strongly connected component (28%) • Pages that reach the core (22%) • Pages reachable from the core (22%)
Search Engine Components • User Interface • Crawler • Indexer • Ranker
HTML Forms • Search engines usually use an HTML form. How are forms defined?
HTML Behind the Form • Defines an HTML form that: • uses the HTTP method GET (you could use POST instead) • will send form info to http://search.dbi.com/search <form method="get" action="http://search.dbi.com/search"> Search For: <input type="text" name="query"> <input type="submit" value="Search"> <input type="reset" value="Clear"> </form>
HTML Behind the Form • Defines a text box • name="query" defines the parameter "query", which will get the value of this text box when the data is submitted <form method="get" action="http://search.dbi.com/search"> Search For: <input type="text" name="query"> <input type="submit" value="Search"> <input type="reset" value="Clear"> </form>
HTML Behind the Form • The submit button, labeled "Search" • When this button is pressed, an HTTP request of the following form will be generated: • GET http://search.dbi.com/search?query=encode(text_box) HTTP/1.1 • If there were additional parameters defined, they would be added to the URL with the & sign separating parameters <form method="get" action="http://search.dbi.com/search"> Search For: <input type="text" name="query"> <input type="submit" value="Search"> <input type="reset" value="Clear"> </form>
Example • Typing "bananas apples" in the text box and pressing Search generates the request URL: http://search.dbi.com/search?query=bananas+apples
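A minimal sketch of how this URL could be produced in code, assuming Java's standard URLEncoder (which uses form encoding, where a space becomes '+'); the class name is illustrative.

import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

public class QueryUrlExample {
    public static void main(String[] args) {
        String action = "http://search.dbi.com/search";   // form action from the slide
        String text = "bananas apples";                    // what the user typed
        // Form encoding turns the space into '+'
        String encoded = URLEncoder.encode(text, StandardCharsets.UTF_8);
        System.out.println(action + "?query=" + encoded);
        // prints: http://search.dbi.com/search?query=bananas+apples
    }
}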
Post Versus Get • Suppose we had the line <form method="post" action="http://search.dbi.com/search"> • Then, pressing submit would cause a POST HTTP request to be sent • The values of the parameters would be sent in the body of the request, instead of as part of the URL
HTML Behind the Form • The reset button, labeled "Clear" • Clears the form <form method="get" action="http://search.dbi.com/search"> Search For: <input type="text" name="query"> <input type="submit" value="Search"> <input type="reset" value="Clear"> </form>
Basic Crawler (Spider) • A crawler finds web pages to download into a search engine cache • It maintains a queue of pages and works in a loop: removeBestPage( ), findLinksInPage( ), insertIntoQueue( ) (see the sketch below)
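A minimal sketch of that loop, assuming a priority queue ordered by an importance score; the downloading, link-extraction, and scoring methods are placeholders, not the course's implementation.

import java.util.*;

public class BasicCrawler {
    // Queue of pages still to be downloaded, ordered by estimated importance
    private final PriorityQueue<ScoredUrl> queue = new PriorityQueue<>();
    private final Set<String> seen = new HashSet<>();

    record ScoredUrl(String url, double score) implements Comparable<ScoredUrl> {
        public int compareTo(ScoredUrl o) { return Double.compare(o.score, score); }  // highest score first
    }

    public void crawl(String seed) {
        insertIntoQueue(seed, 1.0);
        while (!queue.isEmpty()) {
            String url = removeBestPage();
            String page = download(url);                 // fetch the page into the cache
            for (String link : findLinksInPage(page))
                insertIntoQueue(link, estimateImportance(link));
        }
    }

    private String removeBestPage() { return queue.poll().url(); }

    private void insertIntoQueue(String url, double score) {
        if (seen.add(url)) queue.add(new ScoredUrl(url, score));  // never enqueue a URL twice
    }

    // Placeholders: a real crawler would fetch over HTTP and parse the HTML here
    private String download(String url) { return ""; }
    private List<String> findLinksInPage(String page) { return List.of(); }
    private double estimateImportance(String url) { return 0.0; }
}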
Choosing Pages to Download • Q: Which pages should be downloaded? • A: It is usually not possible to download all pages because of space limitations. Try to get the most important pages • Q: When is a page important? • A: Use a metric: by interest, by popularity, by location, or a combination of these
Interest Driven • Suppose that there is a query Q that contains the words we are interested in • Define the importance of a page P by its textual similarity to Q • Example: TF-IDF(P, Q) = Σ_{w ∈ Q} TF(P, w) / DF(w) • Problem: We must decide whether a page is important while crawling. However, we don't know DF until the crawl is complete • Solution: Use an estimate. This is what you are using in Ex2!
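A rough sketch of how such an estimate might look: document frequencies are accumulated over the pages crawled so far and used in place of the true DF. All class and method names here are illustrative.

import java.util.*;

public class InterestMetric {
    // Document-frequency counts accumulated over the pages crawled so far (the DF estimate)
    private final Map<String, Integer> dfSoFar = new HashMap<>();

    // Called for every downloaded page, so the estimate improves as the crawl proceeds
    public void addPage(Set<String> wordsInPage) {
        for (String w : wordsInPage) dfSoFar.merge(w, 1, Integer::sum);
    }

    // TF-IDF(P, Q) = sum over words w in Q of TF(P, w) / DF(w), with DF replaced by the estimate
    public double importance(Map<String, Integer> termCountsOfP, Collection<String> query) {
        double score = 0.0;
        for (String w : query) {
            int tf = termCountsOfP.getOrDefault(w, 0);
            int df = dfSoFar.getOrDefault(w, 1);   // avoid dividing by zero for unseen words
            score += (double) tf / df;
        }
        return score;
    }
}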
Popularity Driven • The importance of a page P is proportional to the number of pages with a link to P • This is also called the number of back links of P • As before, need to estimate this amount • There is a more sophisticated metric, called PageRank (will be taught later in the course)
Location Driven • The importance of P is a function of its URL • Example: • Words appearing in the URL (e.g., .com) • Number of "/" characters in the URL • Easily evaluated, and requires no data from previous crawls (see the sketch below) • Note: We can also use a combination of all three metrics
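A possible location-driven scorer using the two signals listed above; the weights are arbitrary and purely for illustration.

public class LocationMetric {
    // Importance from the URL alone: reward ".com" hosts, penalize deep paths.
    // The weights below are illustrative, not from the course material.
    public static double importance(String url) {
        double score = 0.0;
        if (url.contains(".com")) score += 1.0;
        long slashes = url.chars().filter(c -> c == '/').count();
        score -= 0.1 * Math.max(0, slashes - 2);   // ignore the "//" after "http:"
        return score;
    }

    public static void main(String[] args) {
        System.out.println(importance("http://www.cnn.com/world/news/today.html"));
    }
}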
Refreshing Web Pages • Pages that have been downloaded must be refreshed periodically • Q: Which pages should be refreshed? • Q: How often should we refresh a page? In Ex2, you never refresh pages
Freshness Metric • A cached page is fresh if it is identical to the version on the web • Suppose that S is a set of pages (i.e., a cache) Freshness(S) = (number of fresh pages in S) / (number of pages in S)
Age Metric • The age of a page is the number of days since it was refreshed • Suppose that S is a set of pages (i.e., a cache) Age(S) = Average age of pages in S
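The two metrics above written out as a small sketch; the CachedPage record (a freshness flag plus a last-refresh time) is an assumed representation of a cache entry.

import java.time.Duration;
import java.time.Instant;
import java.util.List;

public class CacheMetrics {
    // Assumed representation of a cached page: is it identical to the live copy,
    // and when was it last refreshed?
    record CachedPage(boolean fresh, Instant lastRefreshed) {}

    // Freshness(S) = (number of fresh pages in S) / (number of pages in S)
    static double freshness(List<CachedPage> cache) {
        long freshCount = cache.stream().filter(CachedPage::fresh).count();
        return (double) freshCount / cache.size();
    }

    // Age(S) = average age (in days) of the pages in S
    static double age(List<CachedPage> cache, Instant now) {
        return cache.stream()
                .mapToDouble(p -> Duration.between(p.lastRefreshed(), now).toHours() / 24.0)
                .average()
                .orElse(0.0);
    }
}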
Refresh Goal • Goal: Minimize the age of the cache and maximize the freshness of the cache • Crawlers can refresh only a certain number of pages in a period of time • The page download resource can be allocated in many ways • We need a refresh strategy
Refresh Strategies • Uniform Refresh: The crawler revisits all pages with the same frequency, regardless of how often they change • Proportional Refresh: The crawler revisits a page with frequency proportional to the page’s change rate (i.e., if it changes more often, we visit it more often) Which do you think is better?
Trick Question • Two-page database • e1 changes daily • e2 changes once a week • Can visit one page per week • How should we visit pages? • e1 e2 e1 e2 e1 e2 e1 e2 ... [uniform] • e1 e1 e1 e1 e1 e1 e1 e2 e1 e1 ... [proportional] • e1 e1 e1 e1 e1 e1 ... • e2 e2 e2 e2 e2 e2 ... • ?
Proportional Often Not Good! • Visiting the fast-changing e1 gains 1/2 day of freshness • Visiting the slow-changing e2 gains 1/2 week of freshness • Visiting e2 is a better deal!
Another Example • The collection contains 2 pages: e1 changes 9 times a day, e2 changes once a day • Simplified change model: • The day is split into 9 equal intervals: e1 changes once in each interval, and e2 changes once during the day • We don't know when the pages change within the intervals • The crawler can download one page a day • Our goal is to maximize the freshness
Which Page Do We Refresh? • Suppose we refresh e2 at midday • If e2 changes in the first half of the day, it remains fresh for the rest (half) of the day • 50% chance of a 0.5-day freshness increase • 50% chance of no increase • Expected freshness increase: 0.25 day
Which Page Do We Refresh? • Suppose we refresh e1 at midday • If e1 changes in the first half of the interval, and we refresh at midday (which is the middle of that interval), it remains fresh for the remaining half of the interval = 1/18 of a day • 50% chance of a 1/18-day freshness increase • 50% chance of no increase • Expected freshness increase: 1/36 day (see the calculation below)
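The arithmetic of the last two slides as a tiny calculation: the expected freshness gained by refreshing a page exactly in the middle of one of its change intervals.

public class RefreshGain {
    // With probability 1/2 the change already happened by midinterval, and then the
    // refreshed page stays fresh for the remaining half of the interval.
    static double expectedGainInDays(double changesPerDay) {
        double intervalDays = 1.0 / changesPerDay;
        return 0.5 * (intervalDays / 2.0);
    }

    public static void main(String[] args) {
        System.out.println(expectedGainInDays(1.0));   // e2 (changes once a day): 0.25 day
        System.out.println(expectedGainInDays(9.0));   // e1 (changes 9 times a day): 1/36 of a day
    }
}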
Not Every Page is Equal! • Suppose that e1 is accessed twice as often as e2 • Then, it is twice as important to us that e1 is fresh than it is that e2 is fresh
Politeness Issues • When a crawler crawls a site, it uses the site's resources: • the web server needs to find the file in its file system • the web server needs to send the file over the network • If a crawler asks for many of the pages, and at a high speed, it may • crash the site's web server, or • be banned from the site • Solution: Ask for pages "slowly" (see the sketch below)
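One possible way to ask for pages "slowly": enforce a minimum delay between requests to the same host. This is a single-threaded sketch and the 2-second value is arbitrary; real crawlers tune the delay per site.

import java.util.HashMap;
import java.util.Map;

public class PoliteFetcher {
    private static final long DELAY_MS = 2000;        // arbitrary per-host delay
    private final Map<String, Long> lastRequest = new HashMap<>();

    // Wait until at least DELAY_MS have passed since the previous request to this host
    void waitForTurn(String host) throws InterruptedException {
        long now = System.currentTimeMillis();
        long earliest = lastRequest.getOrDefault(host, 0L) + DELAY_MS;
        if (earliest > now) Thread.sleep(earliest - now);
        lastRequest.put(host, System.currentTimeMillis());
    }
}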
Politeness Issues (cont) • A site may identify pages that it doesn't want to be crawled • A polite crawler will not crawl these pages (although nothing stops a crawler from being impolite) • The site puts a file called robots.txt in its main directory to identify pages that should not be crawled (e.g., http://www.cnn.com/robots.txt)
robots.txt • Use the header User-Agent to identify programs whose access should be restricted • Use the header Disallow to identify pages that should be restricted • Example (an illustrative file is shown below)
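An illustrative robots.txt (not the actual cnn.com file): the first record restricts all crawlers from two directories, the second bans one specific crawler from the entire site.

# illustrative robots.txt, not a real site's file
User-Agent: *
Disallow: /cgi-bin/
Disallow: /private/

User-Agent: BadCrawler
Disallow: /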
Other Issues • Suppose that a search engine uses several crawlers at the same time (in parallel) • How can we make sure that they are not doing the same work?
Storage Challenges • Scalability: Should be able to store huge amounts of data (data spans disks or computers) • Dual Access Mode: Random access (find specific pages) and Streaming access (find large subsets of pages) • Large Batch Updates: Reclaim old space, avoid access/update conflicts • Obsolete Pages: Remove pages no longer on the web (how do we find these pages?)
Update Strategies • Updates are generated by the crawler • Several characteristics • Time in which the crawl occurs and the repository receives information • Whether the crawl’s information replaces the entire database or modifies parts of it
Batch Crawler vs. Steady Crawler • Batch mode • Periodically executed • Allocated a certain amount of time • Steady mode • Run all the time • Always send results back to the repository
Partial vs. Complete Crawls • A batch mode crawler can either do • a complete crawl every run, and replace the entire cache, or • a partial crawl, and replace only a subset of the cache • The repository can implement • In-place update: replaces the data in the cache, thus refreshing pages quickly • Shadowing: creates a new index with the updates and later replaces the previous one, thus avoiding refresh-access conflicts
Partial vs. Complete Crawls • Shadowing resolves the conflicts between updates and query reads • A batch-mode crawler fits well with shadowing • A steady crawler fits well with in-place updates
Types of Indices • Content index: allows us to easily find pages with certain words • Links index: allows us to easily find links between pages • Utility index: allows us to easily find pages in a certain domain, of a certain type, etc. • Q: What do we need these for?
Is the Content Index From Ex1 Good? • In Ex1, most of you had a table with one row per (word, page) pair • We want to quickly find pages with a specific word • Is this a good way of storing a content index?
Is the Content Index From Ex1 Good? NO • If a word appears in a thousand documents, then the word will be in a thousand rows. Why waste the space? • If a word appears in a thousand documents, we will have to access a thousand rows in order to find the documents • Does not easily support queries that require multiple words
Inverted Keyword Index • A hashtable with words as keys and lists of matching documents as values • Lists are sorted by urlId • Example: evil: (1, 5, 11, 17), saddam: (3, 5, 11, 17), war: (3, 5, 17, 28), butterfly: (22, 4)
Query: “evil saddam war” • Algorithm: keep a pointer into each list and always advance the pointer(s) with the lowest urlId • evil: (1, 5, 11, 17) • saddam: (3, 5, 11, 17) • war: (3, 5, 17, 28) • Answers: 5, 17
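A sketch of this merge in code, using the posting lists from the example; the class and method names are illustrative.

import java.util.ArrayList;
import java.util.List;

public class Intersect {
    // Intersect sorted posting lists by always advancing the pointer(s) with the lowest urlId
    static List<Integer> intersect(List<int[]> postings) {
        List<Integer> answers = new ArrayList<>();
        int[] pos = new int[postings.size()];
        outer:
        while (true) {
            int min = Integer.MAX_VALUE, max = Integer.MIN_VALUE;
            for (int i = 0; i < postings.size(); i++) {
                if (pos[i] >= postings.get(i).length) break outer;  // one list is exhausted
                int id = postings.get(i)[pos[i]];
                min = Math.min(min, id);
                max = Math.max(max, id);
            }
            if (min == max) {                      // same urlId in every list: a match
                answers.add(min);
                for (int i = 0; i < pos.length; i++) pos[i]++;
            } else {                               // advance every pointer sitting at the minimum
                for (int i = 0; i < postings.size(); i++)
                    if (postings.get(i)[pos[i]] == min) pos[i]++;
            }
        }
        return answers;
    }

    public static void main(String[] args) {
        int[] evil   = {1, 5, 11, 17};
        int[] saddam = {3, 5, 11, 17};
        int[] war    = {3, 5, 17, 28};
        System.out.println(intersect(List.of(evil, saddam, war)));  // [5, 17]
    }
}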
Challenges • Index build must be: • fast • economical • Incremental indexing must be supported • Tradeoff when using compression: memory is saved, but time is lost compressing and decompressing
How do we distribute the indices between files? • Local inverted file • Each file indexes a disjoint (e.g., random) subset of the pages • A query is broadcast to all files • The result is the merge of the per-file answers • Global inverted file • Each file is responsible for a subset of the terms in the collection • A query is sent only to the appropriate files (see the sketch below)
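A rough sketch contrasting the two schemes; the Partition interface is an assumed abstraction of a single index file, not an API from the course.

import java.util.*;

public class DistributedIndex {
    // Assumed abstraction of one index file (partition)
    interface Partition {
        List<Integer> postingList(String term);   // sorted urlIds for a term (possibly empty)
        boolean ownsTerm(String term);            // global scheme: is this term assigned here?
    }

    // Local inverted file: each partition indexes a disjoint subset of the pages,
    // so the query is broadcast to every partition and the answers are merged.
    static Set<Integer> queryLocal(List<Partition> partitions, List<String> query) {
        Set<Integer> merged = new TreeSet<>();
        for (Partition p : partitions) merged.addAll(answerOn(p, query));
        return merged;
    }

    // Global inverted file: each partition owns a subset of the terms,
    // so each query word is sent only to the partition responsible for it.
    static Map<String, List<Integer>> queryGlobal(List<Partition> partitions, List<String> query) {
        Map<String, List<Integer>> lists = new HashMap<>();
        for (String term : query)
            for (Partition p : partitions)
                if (p.ownsTerm(term)) lists.put(term, p.postingList(term));
        return lists;   // these lists are then intersected as shown earlier
    }

    // On a single partition, intersect the local posting lists (pages containing all query words)
    private static Set<Integer> answerOn(Partition p, List<String> query) {
        Set<Integer> result = new TreeSet<>(p.postingList(query.get(0)));
        for (String term : query.subList(1, query.size())) result.retainAll(p.postingList(term));
        return result;
    }
}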