Search Engines & Question Answering Giuseppe Attardi Università di Pisa
Topics • Web Search • Search engines • Architecture • Crawling: parallel/distributed, focused • Link analysis (Google PageRank) • Scaling
Top Online Activities Source: Jupiter Communications, 2000
Pew Study (US users, July 2002) • Total Internet users = 111 M • Do a search on any given day = 33 M • Have used the Internet to search = 85% http://www.pewinternet.org/reports/toc.asp?Report=64
Search on the Web • Corpus: the publicly accessible Web: static + dynamic • Goal: retrieve high-quality results relevant to the user's need (not docs!) • Need • Informational – want to learn about something (~40%) • Navigational – want to go to that page (~25%) • Transactional – want to do something (web-mediated) (~35%) • Access a service • Downloads • Shop • Gray areas • Find a good hub • Exploratory search "see what's there" • Example queries: "Low hemoglobin", "United Airlines", "Tampere weather", "Mars surface images", "Nikon CoolPix", "Car rental Finland"
Results • Static pages (documents) • text, mp3, images, video, ... • Dynamic pages = generated on request • database access • "the invisible web" • proprietary content, etc.
Terminology • URL = Uniform Resource Locator • Example: http://www.cism.it/cism/hotels_2001.htm • Access method: http • Host name: www.cism.it • Page name: /cism/hotels_2001.htm
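A minimal sketch of this decomposition using Python's standard urllib.parse module (the example URL is the one above):

```python
from urllib.parse import urlparse

# Split the example URL into the components named above.
parts = urlparse("http://www.cism.it/cism/hotels_2001.htm")

print(parts.scheme)   # access method: 'http'
print(parts.netloc)   # host name:     'www.cism.it'
print(parts.path)     # page name:     '/cism/hotels_2001.htm'
```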
Scale • Immense amount of content • 2-10B static pages, doubling every 8-12 months • Lexicon Size: 10s-100s of millions of words • Authors galore (1 in 4 hosts run a web server) http://www.netcraft.com/Survey
Diversity • Languages/Encodings • Hundreds (thousands?) of languages; W3C encodings: 55 (Jul 01) [W3C01] • Home pages (1997): English 82%, next 15 languages: 13% [Babe97] • Google (mid 2001): English 53%, JGCFSKRIP: 30% • Document & query topic • Popular query topics (from 1 million Google queries, Apr 2000): top categories: Arts 14.6%, Computers 13.8%, Regional 10.3%, Society 8.7%, Adult 8%, Recreation 7.3%, Business 7.2%; top subcategories: Arts: Music 6.1%, Regional: North America 5.3%, Adult: Image Galleries 4.4%, Computers: Software 3.4%, Computers: Internet 3.2%, Business: Industries 2.3%, Regional: Europe 1.8%
Rate of change • [Cho00]: 720K pages from 270 popular sites, sampled daily from Feb 17 to Jun 14, 1999 • [chart of the observed change rates omitted] • Mathematically, what does this seem to be?
Web idiosyncrasies • Distributed authorship • Millions of people creating pages with their own style, grammar, vocabulary, opinions, facts, falsehoods … • Not all have the purest motives in providing high-quality information: commercial motives drive the "spamming" of 100s of millions of pages • The open web is largely a marketing tool • IBM's home page does not contain the word "computer"
Other characteristics • Significant duplication • Syntactic: 30%-40% (near) duplicates [Brod97, Shiv99b] • Semantic: ??? • High linkage • ~8 links/page on average • Complex graph topology • Not a small world; bow-tie structure [Brod00] • More on these corpus characteristics later • how do we measure them?
Web search users • Ill-defined queries • Short (AV 2001: 2.54 terms avg, 80% < 3 words) • Imprecise terms • Sub-optimal syntax (80% of queries without operators) • Low effort • Wide variance in • Needs • Expectations • Knowledge • Bandwidth • Specific behavior • 85% look over one result screen only (mostly above the fold) • 78% of queries are not modified (one query/session) • Follow links – "the scent of information" ...
Evolution of search engines • First generation -- uses only "on page" text data: word frequency, language (1995-1997: AltaVista, Excite, Lycos, etc.) • Second generation -- uses off-page, web-specific data: link (or connectivity) analysis, click-through data (what results people click on), anchor text (how people refer to this page) (from 1998; made popular by Google, but used by everyone now) • Third generation -- answers "the need behind the query": semantic analysis (what is this about?), focus on the user need rather than on the query, context determination, helping the user, integration of search and text analysis (still experimental)
Third generation search engine: answering "the need behind the query" • Query language determination • Different ranking (e.g. if the query is in Japanese, do not return English pages) • Hard & soft matches • Personalities (triggered on names) • Cities (travel info, maps) • Medical info (triggered on names and/or results) • Stock quotes, news (triggered on stock symbol) • Company info, … • Integration of Search and Text Analysis
Answering “the need behind the query”Context determination • Context determination • spatial (user location/target location) • query stream (previous queries) • personal (user profile) • explicit (vertical search, family friendly) • implicit (use AltaVista from AltaVista France) • Context use • Result restriction • Ranking modulation
The spatial context - geo-search • Two aspects • Geo-coding • encode geographic coordinates to make search effective • Geo-parsing • the process of identifying geographic context. • Geo-coding • Geometrical hierarchy (squares) • Natural hierarchy (country, state, county, city, zip-codes, etc) • Geo-parsing • Pages (infer from phone nos, zip, etc). About 10% feasible. • Queries (use dictionary of place names) • Users • From IP data • Mobile phones • In its infancy, many issues (display size, privacy, etc)
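A minimal sketch of the geo-parsing step, assuming a small hand-made dictionary of place names (the gazetteer entries and coordinates below are illustrative only):

```python
# Hypothetical gazetteer: place name -> (country code, latitude, longitude).
GAZETTEER = {
    "pisa":     ("IT", 43.72, 10.40),
    "helsinki": ("FI", 60.17, 24.94),
    "tampere":  ("FI", 61.50, 23.76),
}

def geo_parse(query: str):
    """Return (place, info) for the first known place name in the query, else None."""
    for token in query.lower().split():
        if token in GAZETTEER:
            return token, GAZETTEER[token]
    return None

print(geo_parse("tampere weather"))   # ('tampere', ('FI', 61.5, 23.76))
print(geo_parse("low hemoglobin"))    # None
```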
Helping the user • UI • spell checking • query refinement • query suggestion • context transfer …
Search Engine Architecture • [Architecture diagram: Crawlers and Crawl Control feed the Web Page Repository / Document Store; the Indexer builds the Text and Structure indexes; Link Analysis produces further indexes; the Query Engine, with Ranking and Snippet Extraction, turns Queries into Results]
Terms • Crawler • Crawler control • Indexes – text, structure, utility • Page repository • Indexer • Collection analysis module • Query engine • Ranking module
Repository “Hidden Treasures”
Storage • The page repository is a scalable storage system for web pages • Allows the Crawler to store pages • Allows the Indexer and Collection Analysis to retrieve them • Similar to other data storage systems – DBs or file systems • Does not have to provide some of the other systems' features: transactions, logging, directories
Storage Issues • Scalability and seamless load distribution • Dual access modes • Random access (used by the query engine for cached pages) • Streaming access (used by the Indexer and Collection Analysis) • Large bulk update – reclaim old space, avoid access/update conflicts • Obsolete pages - remove pages no longer on the web
Designing a Distributed Web Repository • Repository designed to work over a cluster of interconnected nodes • Page distribution across nodes • Physical organization within a node • Update strategy
Page Distribution • How to choose a node to store a page • Uniform distribution – any page can be sent to any node • Hash distribution policy – hash page ID space into node ID space
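A minimal sketch of the hash distribution policy, mapping page IDs (URLs here) into a node ID space with a stable hash; the node count is an arbitrary example:

```python
import hashlib

NUM_NODES = 16  # illustrative cluster size

def node_for_page(url: str) -> int:
    """Hash the page ID space (URLs) into the node ID space [0, NUM_NODES)."""
    digest = hashlib.md5(url.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % NUM_NODES

# The same URL always maps to the same node, so lookups need no directory.
print(node_for_page("http://www.unipi.it/"))
print(node_for_page("http://www.cism.it/cism/hotels_2001.htm"))
```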
Organization Within a Node • Several operations required • Add / remove a page • High speed streaming • Random page access • Hashed organization • Treat each disk as a hash bucket • Assign according to a page’s ID • Log organization • Treat the disk as one file, and add the page at the end • Support random access using a B-tree • Hybrid • Hash map a page to an extent and use log structure within an extent.
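A rough sketch of the log organization described above: pages are appended to a single file, and random access goes through an in-memory offset table standing in for the B-tree (the file name and record format are assumptions for illustration):

```python
import struct

class LogRepository:
    """Append-only page store with an offset index for random access."""

    def __init__(self, path="pages.log"):
        self.path = path
        self.offsets = {}               # page ID -> byte offset (stand-in for a B-tree)
        open(self.path, "ab").close()   # make sure the log file exists

    def add(self, page_id: str, content: bytes):
        # High-speed streaming case: every page is simply appended at the end.
        with open(self.path, "ab") as log:
            self.offsets[page_id] = log.tell()
            log.write(struct.pack(">I", len(content)) + content)  # length-prefixed record

    def get(self, page_id: str) -> bytes:
        # Random page access via the offset table.
        with open(self.path, "rb") as log:
            log.seek(self.offsets[page_id])
            (length,) = struct.unpack(">I", log.read(4))
            return log.read(length)

repo = LogRepository()
repo.add("http://www.unipi.it/", b"<html>...</html>")
print(repo.get("http://www.unipi.it/"))
```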
Update Strategies • Updates are generated by the crawler • Several characteristics • Time in which the crawl occurs and the repository receives information • Whether the crawl’s information replaces the entire database or modifies parts of it
Batch vs. Steady • Batch mode • Periodically executed • Allocated a certain amount of time • Steady mode • Run all the time • Always send results back to the repository
Partial vs. Complete Crawls • A batch mode crawler can • Do a complete crawl every run, and replace entire collection • Recrawl only a specific subset, and apply updates to the existing collection – partial crawl • The repository can implement • In place update • Quickly refresh pages • Shadowing, update as another stage • Avoid refresh-access conflicts
Partial vs. Complete Crawls • Shadowing resolves the conflicts between updates and query-time reads • Batch-mode crawling works well with shadowing • A steady crawler works better with in-place updates
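A minimal sketch of the shadowing idea for a batch crawl: the new collection is written into a shadow directory and swapped in with a single rename once the crawl completes, so queries never read a half-updated repository (the directory and link names are assumptions, and the trick relies on POSIX rename semantics):

```python
import os

def publish_shadow(shadow_dir="repository.new", link_name="repository"):
    """Point the `repository` symlink at the freshly crawled shadow collection.

    The switch is one rename of a symlink, so readers see either the old
    collection or the new one, never a mixture of the two.
    """
    tmp_link = link_name + ".tmp"
    if os.path.lexists(tmp_link):
        os.remove(tmp_link)
    os.symlink(shadow_dir, tmp_link)   # new symlink under a temporary name
    os.replace(tmp_link, link_name)    # atomic on POSIX; replaces the old symlink

# Usage (after the batch crawler has filled "repository.new"):
# publish_shadow("repository.new", "repository")
```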
The Indexer Module • Creates two indexes: • Text (content) index: uses "traditional" indexing methods such as inverted indexing • Structure (links) index: a directed graph of pages and links; sometimes an inverted graph is also created
The Link Analysis Module • Uses the two basic indexes created by the Indexer module to assemble "utility indexes", e.g. a site index
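A toy sketch of the structure (links) index mentioned above: the web graph as adjacency lists of outgoing links, plus the inverted graph of incoming links (the pages A, B, C are made up):

```python
from collections import defaultdict

# Directed web graph: page -> pages it links to (illustrative data).
out_links = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
}

# Inverted graph: page -> pages that link to it.
in_links = defaultdict(list)
for src, targets in out_links.items():
    for dst in targets:
        in_links[dst].append(src)

print(dict(in_links))   # {'B': ['A'], 'C': ['A', 'B'], 'A': ['C']}
```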
Inverted Index • A set of inverted lists, one per index term (word) • Inverted list of a term: a sorted list of the locations in which the term appears • Posting: a pair (w, l) where w is a word and l is one of its locations • Lexicon: holds all the index's terms, with statistics about each term (not about its postings)
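A minimal sketch of these structures, using document IDs as the locations (the two-document collection is illustrative):

```python
from collections import defaultdict

docs = {                       # illustrative collection: doc ID -> text
    1: "web search engines index the web",
    2: "question answering on the web",
}

inverted = defaultdict(list)   # term -> sorted list of locations (its inverted list)
lexicon = {}                   # term -> statistics about the term

for doc_id in sorted(docs):                # processing docs in ID order keeps lists sorted
    for word in docs[doc_id].split():
        if doc_id not in inverted[word]:   # record each (word, location) posting once
            inverted[word].append(doc_id)

for term, postings in inverted.items():
    lexicon[term] = {"df": len(postings)}  # e.g. document frequency

print(inverted["web"])   # [1, 2]
print(lexicon["web"])    # {'df': 2}
```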
Challenges • Index build must be: • Fast • Economical (unlike traditional index building) • Incremental indexing must be supported • Storage: compression vs. speed
Index Partitioning Distributed text indexing can be done by: • Local inverted file (IFL) • Each node holds a disjoint, random subset of the pages • A query is broadcast to all nodes • The result is the merge of the per-node answers • Global inverted file (IFG) • Each node is responsible only for a subset of the terms in the collection • A query is sent only to the appropriate nodes
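A sketch contrasting how a query would be routed under the two schemes (node IDs and the term-to-node assignment are illustrative):

```python
import hashlib

NODES = [0, 1, 2, 3]   # illustrative index nodes

def route_local(query_terms):
    """IFL: every node indexes its own pages, so the query is broadcast to all
    nodes and the per-node answers are merged afterwards."""
    return NODES

def route_global(query_terms):
    """IFG: each term lives on exactly one node, so the query is sent only to
    the nodes owning its terms."""
    def owner(term):
        h = int(hashlib.md5(term.encode("utf-8")).hexdigest(), 16)
        return NODES[h % len(NODES)]
    return sorted({owner(t) for t in query_terms})

print(route_local(["web", "search"]))    # [0, 1, 2, 3]
print(route_global(["web", "search"]))   # a subset, e.g. [1, 3]
```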
Indexing, Conclusion • Web page indexing is complicated by its scale (millions of pages, hundreds of gigabytes) • Challenges: incremental indexing and personalization
Scaling • Google (Nov 2002): • Number of pages: 3 billion • Refresh interval: 1 month (≈1200 pages/sec) • Queries/day: 150 million ≈ 1700 q/s • Avg page size: 10 KB • Avg query size: 40 B • Avg result size: 5 KB • Avg links/page: 8
Size of Dataset • Total raw HTML data size: 3 G pages x 10 KB = 30 TB • Inverted index ≈ corpus = 30 TB • With 3:1 compression, the 60 TB total fits in ≈20 TB on disk
Single copy of index • Index: (10 TB) / (100 GB per disk) = 100 disks • Documents: (10 TB) / (100 GB per disk) = 100 disks
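The same back-of-the-envelope arithmetic, spelled out with the rough figures from the last two slides:

```python
pages = 3e9                  # 3 billion pages
page_size = 10e3             # ~10 KB average page, in bytes

corpus = pages * page_size               # raw HTML: ~30 TB
index = corpus                           # inverted index roughly equal to the corpus
on_disk = (corpus + index) / 3           # 3:1 compression -> ~20 TB on disk
index_disks = (on_disk / 2) / 100e9      # ~10 TB of index over 100 GB disks

print(corpus / 1e12, on_disk / 1e12, index_disks)   # 30.0 (TB)  20.0 (TB)  100.0 (disks)
```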
Query Load • 1700 queries/sec • Rule of thumb: 20 q/s per CPU • 85 clusters needed to answer queries • Cluster: 100 machines • Total = 85 x 100 = 8500 machines • Document servers • Snippet search: 1000 snippets/sec per server • (1700 x 10 / 1000) x 100 = 1700 machines
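And the cluster-sizing arithmetic from this slide as a back-of-the-envelope calculation (the figures are the ones quoted above):

```python
qps = 1700                   # queries per second
qps_per_cpu = 20             # rule of thumb
cluster_size = 100           # machines holding one full copy of the index

clusters = qps / qps_per_cpu                 # 85 replicas of the index cluster
index_machines = clusters * cluster_size     # 8500 machines serving the text index

snippets_per_query = 10
snippets_per_server = 1000                   # snippets per second per document server
doc_machines = (qps * snippets_per_query / snippets_per_server) * cluster_size

print(clusters, index_machines, doc_machines)   # 85.0  8500.0  1700.0
```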
Limits • Redirector: 4000 req/sec • Bandwidth: 1100 req/sec • Server: 22 q/s each • Cluster: 50 nodes = 1100 q/s = 95 million q/day
Scaling the Index • [Diagram: queries arrive at a hardware-based load balancer and are dispatched to Google Web Servers, which consult index servers, document servers, an ad server and a spell checker]
Pooled Shard Architecture • [Diagram: web server (1 Gb/s and 100 Mb/s links), index load balancer, index server network (index servers 1 … K), intermediate load balancers 1 … N, and pools of shard servers (S1 … SN), one pool per shard from shard 1 to shard N]