Searching the Web
The Web
• Why is it important:
  • “Free” ubiquitous information resource
  • Broad coverage of topics and perspectives
  • Becoming the dominant information collection
  • Growth and jobs
• Web access methods:
  • Search (e.g. Google)
  • Directories (e.g. Yahoo!)
  • Other …
Web Characteristics
• Distributed data
  • 80 million web sites (hostnames responding) in April 2006
  • 40 million active web sites (don’t redirect, …)
• High volatility
  • Servers come and go …
• Large volume
  • One study found 11.5 billion pages in January 2005 (at that time Google indexed 8 billion pages)
• “Dark Web” – content not indexed, not crawlable
  • Estimated to be 4,200 to 7,500 terabytes in 2000 (when there were ~2.5 billion indexable pages)
Web Characteristics
• Unstructured data
  • Lots of duplicated content (30% estimate)
  • Semantic duplication much higher
• Quality of data
  • No required editorial process
  • Many typos and misspellings (impacts IR)
• Heterogeneous data
  • Different media
  • Different languages
• These characteristics are not going to change
Web Content Types (chart omitted) • Source: How Much Information? 2003
Search Engine Architecture (diagram omitted)
• “Lots and lots of computers”
• Components: Users, Interface, Query Engine, Index, Indexer, Crawler, Web
• Later slides repeat the diagram annotated with textbook chapters (Evaluation – Chapter 3; also Chapters 2, 4, & 5; 6 & 7; 8; and 10)
Hubs and Authorities
• Hubs
  • Have lots of links to other pages
• Authorities
  • Have lots of links that point to them
• Can use feedback to rank hubs and authorities (see the sketch below)
  • Better hubs have links to good authorities
  • Better authorities have links from good hubs
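A minimal sketch of this mutual-reinforcement idea (the core of the HITS algorithm), assuming a tiny hand-made link graph; the page names, iteration count, and normalization are illustrative choices, not taken from the slides.

```python
def hits(graph, iterations=20):
    """graph: dict mapping each page to the set of pages it links to."""
    pages = set(graph) | {q for targets in graph.values() for q in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # A page's authority score sums the hub scores of pages linking to it.
        auth = {p: sum(hub[q] for q in pages if p in graph.get(q, ())) for p in pages}
        # A page's hub score sums the authority scores of the pages it links to.
        hub = {p: sum(auth[q] for q in graph.get(p, ())) for p in pages}
        # Normalize so the scores stay bounded across iterations.
        a_norm = sum(v * v for v in auth.values()) ** 0.5 or 1.0
        h_norm = sum(v * v for v in hub.values()) ** 0.5 or 1.0
        auth = {p: v / a_norm for p, v in auth.items()}
        hub = {p: v / h_norm for p, v in hub.items()}
    return hub, auth

# Hypothetical link graph: "portal" and "blog" act as hubs, "siteA"/"siteB" as authorities.
links = {"portal": {"siteA", "siteB"}, "blog": {"siteA"}, "siteA": set(), "siteB": set()}
hub_scores, auth_scores = hits(links)
```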
Crawling the Web
• Creating a Web Crawler (Web Spider)
• Simplest technique (sketched below):
  • Start with a set of URLs
  • Extract URLs pointed to in the original set
  • Continue using either breadth-first or depth-first traversal
• Works for one crawler but hard to coordinate for many crawlers
  • Partition the Web by domain name, IP address, or other technique
  • Each crawler has its own set but shares a to-do list
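A runnable sketch of that simplest technique, assuming a breadth-first traversal, standard-library HTTP only, and arbitrary page-limit and timeout values; the seed URLs are whatever the caller supplies.

```python
import urllib.request
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkExtractor(HTMLParser):
    """Collect the href targets of <a> tags."""
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(seeds, max_pages=50):
    frontier = deque(seeds)          # breadth-first: FIFO queue of URLs to visit
    visited = set()
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        if url in visited:
            continue
        visited.add(url)
        try:
            html = urllib.request.urlopen(url, timeout=5).read().decode("utf-8", "ignore")
        except Exception:
            continue                 # broken link, timeout, or non-HTTP URL: skip it
        parser = LinkExtractor()
        parser.feed(html)
        for link in parser.links:
            frontier.append(urljoin(url, link))   # resolve relative links against the page
    return visited
```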
Crawling the Web
• Need to recrawl
  • Indexed content is always out of date
  • Sites come and go, and sometimes disappear for periods of time only to reappear
• Order of URLs traversed makes a difference
  • Breadth-first matches the hierarchic organization of content
  • Depth-first gets to deeper content faster
  • Proceeding to “better” pages first can also help (e.g. good hubs and good authorities)
Server and Author Control of Crawling
• Avoid crawling sites that do not want to be crawled
  • Legal issue
• Robot exclusion protocol (server-level control)
  • A file that indicates which portions of a web site should not be visited by crawlers
  • http://.../robots.txt
• Robots META tag (author-level control)
  • Used to indicate whether a file (page) should be indexed or analyzed for links
  • Few crawlers implement this
  • <meta name="robots" content="noindex, nofollow">
• http://www.robotstxt.org/wc/exclusion.html
Example robots.txt Files
• TAMU Library
  User-agent: *
  Disallow: /portal/site/chinaarchive/template.REGISTER/
  Disallow: /portal/site/Library/template.REGISTER/
  …
• Google
  User-agent: *
  Allow: /searchhistory/
  Disallow: /search
  Disallow: /groups
  Disallow: /images
  Disallow: /catalogs
  Disallow: /catalogues
  …
• New York Times
  User-agent: *
  Disallow: /pages/college/
  …
  Allow: /pages/
  Allow: /2003/
  …
  User-agent: Mediapartners-Google*
  Disallow:
• CSDL
  User-agent: *
  Disallow: /FLORA/arch_priv/
  Disallow: /FLORA/private/
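As a quick illustration of how a crawler can honour such files, here is a minimal sketch using Python's standard-library robots.txt parser; the expected answers assume the New York Times excerpt above, and the live file may of course differ.

```python
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url("http://www.nytimes.com/robots.txt")
rp.read()   # fetch and parse the file

# can_fetch() answers: may this user-agent visit this path?
print(rp.can_fetch("*", "http://www.nytimes.com/pages/college/"))  # False, per the excerpt above
print(rp.can_fetch("*", "http://www.nytimes.com/2003/"))           # True, per the excerpt above
```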
Crawling Goals
• Crawling technique may depend on the goal
• Types of crawling goals:
  • Create a large, broad index
  • Create a focused topic- or domain-specific index
    • Target topic-relevant sites
    • Index preset terms
  • Create a subset of content to model characteristics of (part of) the Web
    • Need to survey appropriately
    • Cannot use simple depth-first or breadth-first
  • Create an up-to-date index
    • Use estimated change frequencies (see the sketch below)
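One common way to act on estimated change frequencies is a priority queue keyed by each page's next due time, so frequently changing pages are revisited sooner. A minimal sketch; the URLs and intervals are made-up assumptions.

```python
import heapq
import time

def schedule(pages):
    """pages: list of (url, estimated_change_interval_in_seconds) pairs."""
    now = time.time()
    heap = [(now + interval, interval, url) for url, interval in pages]
    heapq.heapify(heap)
    return heap

def next_due(heap):
    """Pop the page whose recrawl is due soonest and reschedule it."""
    due, interval, url = heapq.heappop(heap)
    heapq.heappush(heap, (due + interval, interval, url))
    return url, due

# A page thought to change hourly is revisited far more often than one
# thought to change roughly monthly (both intervals are illustrative).
frontier = schedule([("http://news.example/", 3600), ("http://docs.example/", 30 * 86400)])
url, due = next_due(frontier)   # the hourly-changing page comes up first
```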
Crawling Challenges
• Identifying and keeping track of links
  • Which to visit
  • Which have been visited
• Issues (see the normalization sketch below)
  • Relative vs. absolute link descriptions
  • Alternate server names
  • Dynamically generated pages
  • Server-side scripting
  • Links buried in scripts
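Several of these issues (relative links, alternate spellings of the same host) come down to normalizing URLs before checking the visited set. A minimal sketch that only canonicalizes the scheme, host case, common default ports, and fragments; the example URL is hypothetical.

```python
from urllib.parse import urljoin, urlsplit, urlunsplit

def normalize(base, href):
    """Resolve a link against its page and put it in a canonical form."""
    absolute = urljoin(base, href)                  # relative -> absolute
    parts = urlsplit(absolute)
    host = (parts.hostname or "").lower()           # host names are case-insensitive
    if parts.port and parts.port not in (80, 443):  # drop the common default ports (a simplification)
        host = f"{host}:{parts.port}"
    path = parts.path or "/"
    return urlunsplit((parts.scheme.lower(), host, path, parts.query, ""))  # drop the fragment

assert normalize("http://Example.COM/a/b.html", "../c.html#sec2") == "http://example.com/c.html"
```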
Crawling Architecture
• Crawler components (a sketch follows):
  • Worker threads – attempt to retrieve data for a URL
  • DNS resolver – resolves domain names into IP addresses
  • Protocol modules – download content using the appropriate protocol
  • Link extractor – finds and normalizes URLs
  • URL filter – determines which URLs to add to the to-do list
  • URL to-do agent – keeps the list of URLs to visit
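To make the division of labor concrete, here is a single-threaded sketch that mirrors those components; the class and method names are my own assumptions, not the design of any particular crawler.

```python
import re
import socket
import urllib.request
from urllib.parse import urljoin, urlsplit

class Crawler:
    def __init__(self, seeds):
        self.todo = list(seeds)            # URL to-do agent: URLs still to visit
        self.seen = set(seeds)

    def resolve_dns(self, url):            # DNS resolver
        return socket.gethostbyname(urlsplit(url).hostname)

    def download(self, url):               # protocol module (HTTP/HTTPS only here)
        return urllib.request.urlopen(url, timeout=5).read().decode("utf-8", "ignore")

    def extract_links(self, base, html):   # link extractor (crude href scan)
        return [urljoin(base, href) for href in re.findall(r'href="([^"]+)"', html)]

    def keep(self, url):                   # URL filter
        return urlsplit(url).scheme in ("http", "https") and url not in self.seen

    def work(self, max_pages=100):         # body of a worker thread (single-threaded here)
        while self.todo and len(self.seen) <= max_pages:
            url = self.todo.pop(0)
            try:
                html = self.download(url)
            except Exception:
                continue                   # broken link or timeout: move on
            for link in self.extract_links(url, html):
                if self.keep(link):
                    self.seen.add(link)
                    self.todo.append(link)
```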
Crawling Issues
• Avoid overloading servers (see the politeness sketch below)
  • A brute-force approach can become a denial-of-service attack
  • Weak politeness guarantee: only one thread allowed to contact a server
  • Stronger politeness guarantee: maintain queues for each server that put URLs into the to-do list based on priority and load factors
• Broken links, time-outs
  • How many times to try?
  • How long to wait?
• How to recognize crawler traps? (server-side programs that generate “infinite” links)
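A minimal sketch of a per-server politeness queue of the kind described above, assuming a fixed delay between requests to the same host (the 2-second value is an arbitrary choice).

```python
import time
from collections import defaultdict, deque
from urllib.parse import urlsplit

class PoliteFrontier:
    def __init__(self, delay=2.0):
        self.delay = delay                 # minimum seconds between hits to one host
        self.queues = defaultdict(deque)   # one FIFO queue of URLs per host
        self.next_ok = {}                  # earliest time each host may be contacted again

    def add(self, url):
        self.queues[urlsplit(url).hostname].append(url)

    def pop(self):
        """Return a URL from some host that may be contacted now, or None."""
        now = time.time()
        for host, queue in self.queues.items():
            if queue and self.next_ok.get(host, 0.0) <= now:
                self.next_ok[host] = now + self.delay
                return queue.popleft()
        return None
```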
Web Tasks
• Precision is the key
  • Goal: the first 10-100 results should satisfy the user
  • Requires ranking that matches the user’s need
• Recall is not important
  • Completeness of the index is not important
  • Comprehensive crawling is not important
Browsing
• Web directories
  • Human-organized taxonomies of Web sites
  • Cover a small portion (less than 1%) of Web pages
  • Remember that recall (completeness) is not important
• Directories point to logical web sites rather than pages
• Directory search returns both categories and sites
• People generally browse rather than search once they identify categories of interest
Metasearch
• Search a number of search engines
• Advantages
  • Do not need to build their own crawler and index
  • Cover more of the Web than any of their component search engines
• Difficulties
  • Need to translate the query into each engine’s query language
  • Need to merge results into a meaningful ranking
Metasearch II
• Merging results (see the sketch below)
  • Voting scheme based on the component search engines’ rankings
    • No model of component ranking schemes needed
  • Model-based merging
    • Needs an understanding of relative rankings, potentially by query type
• Why they are not used for the Web
  • Bias towards coverage (e.g. recall), which is not important for most Web queries
  • Merging results is largely ad hoc, so individual search engines tend to do better
• Big application: the Dark Web
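A minimal sketch of a voting-style merge (a Borda-like count: a document earns more points the higher each component engine ranks it); the engine result lists here are made up.

```python
from collections import defaultdict

def merge(rankings, depth=10):
    """rankings: one ranked list of URLs per component search engine."""
    scores = defaultdict(int)
    for ranking in rankings:
        for position, url in enumerate(ranking[:depth]):
            scores[url] += depth - position   # rank 1 earns `depth` points, rank 2 one less, ...
    return sorted(scores, key=scores.get, reverse=True)

engine_a = ["u1", "u2", "u3"]
engine_b = ["u2", "u4", "u1"]
print(merge([engine_a, engine_b]))   # u2 and u1, ranked well by both engines, float to the top
```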
Using Structure in Search
• Languages to search content and structure
  • Query languages over labeled graphs
  • PHIQL: used in the Microplis and PHIDIAS hypertext systems
  • Web-oriented: W3QL, WebSQL, WebLog, WQL
Using Structure in Search
• Other uses of structure in search
  • Relevant pages tend to have neighbors that are also relevant
  • Search approaches that collect (and filter) the neighbors of returned pages
Web Query Characteristics
• Few terms and operators
  • Average 2.35 terms per query
  • 25% of queries have a single term
  • Average 0.41 operators per query
• Queries get repeated
  • Average 3.97 instances of each query
  • This is very uneven (e.g. “Britney Spears” vs. “Frank Shipman”)
• Query sessions are short
  • Average 2.02 queries per session
  • Average of 1.39 pages of results examined
• Data from a 1998 study
  • How different today?