Demystifying Web Graph Spidering, Indexing, and Ranking Technologies

Crash Course Web Graph Spidering Indexing Ranking Antonio Gulli University of Pisa

A technology with a large audience • Web Search is used by more than 400M people/day • There are more than 8 billion pages • Google IPO is estimated at $10-15 billion • A LOT of Computer Science • Information Retrieval • Algorithms & Data Structures • Numerical Analysis • Parallel & Distributed Computation • … and many others

AGENDA • Web Graph • Google Overview • Overview of Spidering Technology … hey, how can I get that page? • Overview of Indexing Technology … hey, how can I remember that page? • Overview of Ranking Technology … hey, how can I order those pages?

A Picture of the Web Graph [BRODER, www9]

A Picture of the Web Graph [Raghavan, www9]

A Picture of the Web Graph: Berkeley, Stanford [Haveliwala, www12]

A Picture of the Web Graph (to be replaced with the final version) [DelCorso, Gulli, Romani .. Work in Progress]

The Web’s Characteristics • Size • Over a billion pages available • 5-10K per page => tens of terabytes • Size doubles every 2 years • Change • 23% of pages change daily • Half-life of about 10 days • Poisson model for changes • Bowtie structure

Search Engine Structure (diagram) • Components shown: Web, Crawlers, Crawl Control, Page Repository, Indexer, Collection Analysis, Indexes (Text, Structure, Utility), Query Engine, Ranking, Queries, Results

Google: Scale • Number of pages indexed: 3B in November 2002 • Index refresh interval: Once per month ~ 1200 pages/sec • Number of queries per day: 200M in April 2003 ~ 2000 queries/sec • Runs on commodity Intel-Linux boxes [Cho, 02]

Google: Other Statistics • Average page size: 10KB • Average query size: 40B • Average result size: 5KB • Average number of links per page: 10 • Total raw HTML data size 3G x 10KB = 30 TB! • Inverted index roughly the same size as raw corpus: 30 TB for index itself • With appropriate compression, 3:1 • 20 TB data residing on disk (and in memory!!!)

Google: Data Size and Crawling • Efficient crawl is very important • 1 page/sec → 1200 machines just for crawling • Parallelization through thread/event queue necessary • Complex crawling algorithm -- No, No! • Well-optimized crawler • ~ 100 pages/sec (10 ms/page) • ~ 12 machines for crawling • Bandwidth consumption • 1200 x 10KB x 8bit ~ 100Mbps • One dedicated OC3 line (155Mbps) for crawling ~ $400,000 per year
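The sizing arithmetic on this slide can be checked with a few lines of Python. This is only a back-of-the-envelope sketch: the 30-day month is an assumption, and the names `pages_per_sec`, `machines_needed`, and `bandwidth_mbps` are illustrative.

```python
# Back-of-the-envelope crawl sizing, using the figures quoted on the slides.
PAGES = 3_000_000_000              # pages indexed (November 2002)
REFRESH_SECONDS = 30 * 24 * 3600   # assumption: a 30-day refresh interval
PAGE_BYTES = 10_000                # average page size, 10 KB

# Pages that must be fetched per second to refresh the index once a month.
pages_per_sec = PAGES / REFRESH_SECONDS


def machines_needed(rate_per_machine: float, target_rate: float = 1200) -> float:
    """Machines required when each one fetches rate_per_machine pages/sec."""
    return target_rate / rate_per_machine


# Bandwidth for 1200 pages/sec of 10 KB each, in megabits per second.
bandwidth_mbps = 1200 * PAGE_BYTES * 8 / 1e6

print(round(pages_per_sec))    # 1157, rounded up to ~1200 on the slide
print(machines_needed(1))      # 1200.0 machines at 1 page/sec each
print(machines_needed(100))    # 12.0 machines at 100 pages/sec each
print(bandwidth_mbps)          # 96.0, i.e. roughly 100 Mbps
```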

Google: Data Size, Query Processing • Index size: 10TB → 100 disks • Typically less than 5 disks per machine • Potentially 20-machine cluster to answer a query • If one machine goes down, the cluster goes down • Two-tier index structure can be helpful • Tier 1: Popular (high PageRank) page index • Tier 2: Less popular page index • Most queries can be answered by tier-1 cluster (with fewer machines)

Google: Implication of Query Load • 2000 queries / sec • Rule of thumb: 1 query / sec per CPU • Depends on number of disks, memory size, etc. • ~ 2000 machines just to answer queries • 5KB / answer page • 2000 x 5KB x 8bit ~ 80 Mbps • Half dedicated OC3 line (155Mbps) ~ $300,000

Google: Query Load and Replication • Index replication necessary to handle the query load • Assuming 1TB tier-1 index, 100Mbit/sec transfer rate • 8 bits x 1TB / 100Mbit/sec = 80,000 sec • About one day to refresh to a new index • Of course, need to verify the transferred data before using it…

Google: Hardware • Cluster of 50,000 Intel-Linux machines • Assuming 99.9% uptime (8 hours of downtime per year) • 50 machines are always down • Nightmare for system administrators • Assuming 3-year hardware replacement • Set up, replace and dump 50 machines every day • Heterogeneity is unavoidable

ROADMAP • What we have seen so far • Web Graph • Search Engine Architecture • Google Overview • Next ? • We will focus on SPIDERING

Crawling web pages • What pages to download • When to refresh • Minimize load on web sites • How to parallelize the process


Crawler “cycle of life”

Downloaders:
while (<there are URLs assigned by the crawler managers>) {
    <take the URLs from the assignment queue>
    <download from the network the pages p_i associated with the URLs>
    <send the p_i to the page repository>
}

Link Extractor:
while (<there are pages whose links have not been extracted>) {
    <take a page p from the page repository>
    <extract the links contained in a href tags>
    <extract the links contained in JavaScript>
    <extract …>
    <extract the links contained in framesets>
    <insert the extracted links into the priority queue, each with a priority that depends on the chosen policy, 1) subject to the filters applied and 2) after applying the normalization operations>
    <mark p as a page whose links have been extracted>
}

Crawler Manager:
<extract a bunch of URLs from the priority queue, in order>
while (<there are URLs assigned by the crawler manager>) {
    <extract the URLs and assign them to S>
    foreach u ∈ S {
        if ((u ∉ “Already Seen Pages” || (u ∈ “Already Seen Pages” && <the page on the web server is more recent>)) && <u is a URL allowed by the site's robots.txt>) {
            <resolve u via DNS>
            <send u to the downloaders' queue>
        }
    }
}
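The three roles above can be sketched as a single-process loop in Python. This is a toy illustration, not the crawler described on the slide: the `WEB` dictionary stands in for the network, the `priority` function and the example URLs are hypothetical, and robots.txt checks, DNS resolution, and refresh of already seen pages are left out.

```python
import heapq

# Hypothetical toy "web": URL -> outgoing links (stands in for the network).
WEB = {
    "http://a.example/": ["http://b.example/", "http://c.example/#top"],
    "http://b.example/": ["http://a.example/", "http://d.example/"],
    "http://c.example/": ["http://d.example/"],
    "http://d.example/": [],
}


def normalize(url: str) -> str:
    """Minimal normalization: lowercase and drop any #fragment."""
    return url.lower().split("#", 1)[0]


def crawl(seed: str, priority=len) -> list[str]:
    """Return URLs in download order; lower priority value is served first."""
    seed = normalize(seed)
    queue = [(priority(seed), seed)]   # the priority queue
    seen = {seed}                      # the "Already Seen Pages" set
    repository = []                    # page repository, in download order
    while queue:
        _, url = heapq.heappop(queue)  # crawler manager: pick the next URL
        links = WEB.get(url, [])       # downloader: "fetch" the page
        repository.append(url)
        for link in map(normalize, links):   # link extractor
            if link not in seen:
                seen.add(link)
                heapq.heappush(queue, (priority(link), link))
    return repository


order = crawl("http://a.example/")
print(order)
```

Swapping the `priority` function is how a different selection policy would be plugged into this sketch.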

Architecture of Incremental Crawler [Gulli, 98] (diagram) • The legend distinguishes data structures from software modules • SPIDERS: Parallel Crawler Managers, Parallel Downloaders, Parallel Link Extractors, Parsers, DNS Resolvers, DNS Cache, Robots.txt Cache, Already Seen Pages, Priority Queue, Distributed Page Repository • INDEXERS: Indexer, Page Analysis • The spiders connect to the INTERNET

Page selection • Crawler method for choosing page to download • Given a page P, define how “good” that page is. • Several metric types: • Interest driven • Popularity driven • BFS, DFS, Random • Combined

Interest Driven • Define a driving query Q • Find textual similarity between P and Q • Define a word vocabulary W1…Wn • Define a vector for P and Q: • Vp, Vq = <W1,…,Wn> • Wi = 0 if Wi does not appear in the document • Wi = Inverse document frequency otherwise • IDF(Wi) = 1 / number of appearances in the entire collection • Importance: IS(P) = Vp · Vq (cosine similarity) • Finding IDF requires going over the entire web • Estimate IDF from the pages already visited, to calculate IS’
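A minimal Python sketch of the estimated importance IS’ follows, under the definitions on this slide: IDF is one over the number of occurrences in the pages visited so far, and the score is the cosine between the page and query vectors. The function names and the sample pages are made up for illustration.

```python
import math
from collections import Counter


def estimate_idf(visited_pages: list[list[str]]) -> dict[str, float]:
    """IDF'(w) = 1 / (occurrences of w in the pages visited so far)."""
    counts = Counter(w for page in visited_pages for w in page)
    return {w: 1.0 / c for w, c in counts.items()}


def vector(words: list[str], idf: dict[str, float]) -> dict[str, float]:
    """Component is IDF(w) if w appears in the text; missing keys mean 0."""
    return {w: idf[w] for w in set(words) if w in idf}


def importance(page: list[str], query: list[str], idf: dict[str, float]) -> float:
    """IS'(P): cosine of the angle between the page and query vectors."""
    vp, vq = vector(page, idf), vector(query, idf)
    dot = sum(vp[w] * vq[w] for w in vp.keys() & vq.keys())
    norm_p = math.sqrt(sum(x * x for x in vp.values()))
    norm_q = math.sqrt(sum(x * x for x in vq.values()))
    return dot / (norm_p * norm_q) if norm_p and norm_q else 0.0


# Illustrative data: three already visited pages and a driving query.
visited = [["web", "crawler", "search"], ["web", "graph"], ["cooking", "pasta"]]
idf = estimate_idf(visited)
query = ["web", "crawler"]
scores = [importance(page, query, idf) for page in visited]
print(scores)
```

The first page shares both query terms and scores highest; the third shares none and scores zero.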

Popularity Driven • How popular a page is: • Backlink count • IB(P) – the number of pages containing a link to P • Estimate from previous crawls: IB’(P) • More sophisticated metric, called PageRank
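The estimated backlink count IB’(P) can be sketched in Python over whatever link data a previous crawl produced. The `crawled` dictionary is invented for the example, and ignoring self-links is an assumption rather than something the slide specifies.

```python
from collections import Counter


def backlink_counts(links: dict[str, list[str]]) -> Counter:
    """IB'(P): number of distinct crawled pages that link to P."""
    counts = Counter()
    for source, targets in links.items():
        for target in set(targets):   # count each linking page only once
            if target != source:      # assumption: self-links do not count
                counts[target] += 1
    return counts


# Illustrative crawl result: page -> links found on that page.
crawled = {"p1": ["p3", "p3", "p2"], "p2": ["p3"], "p3": ["p1"]}
ib = backlink_counts(crawled)
print(ib.most_common())
```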

BFS • “…breadth-first search order discovers the highest quality pages during the early stages of the crawl” • 328 million URLs in the testbed [Najork 01]
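For reference, breadth-first ordering over a link graph looks like this in Python. The graph is a hypothetical four-page site; the function only shows the visiting order, not actual downloading.

```python
from collections import deque


def bfs_order(graph: dict[str, list[str]], seed: str) -> list[str]:
    """Visit pages level by level, starting from the seed."""
    order, seen, frontier = [], {seed}, deque([seed])
    while frontier:
        page = frontier.popleft()     # oldest discovered page first (FIFO)
        order.append(page)
        for link in graph.get(page, []):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return order


# Illustrative site structure.
site = {"home": ["news", "about"], "news": ["story"], "about": [], "story": []}
visit_order = bfs_order(site, "home")
print(visit_order)
```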

WebBase Results [Cho 01]

Refresh Strategy • Crawlers can refresh only a certain number of pages in a period of time. • The page download resource can be allocated in many ways • The proportional refresh policy allocates the resource proportionally to the pages’ change rate.
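A small Python sketch of the proportional policy: a fixed download budget is split in proportion to each page's change rate. The page names, rates, and budget are made-up values.

```python
def proportional_allocation(change_rates: dict[str, float], budget: float) -> dict[str, float]:
    """Split the download budget in proportion to each page's change rate."""
    total = sum(change_rates.values())
    if total == 0:
        return {page: 0.0 for page in change_rates}
    return {page: budget * rate / total for page, rate in change_rates.items()}


# Illustrative change rates (changes per day) and a budget of 20 downloads/day.
rates = {"news": 6.0, "blog": 3.0, "archive": 1.0}
alloc = proportional_allocation(rates, budget=20)
print(alloc)
```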

Focused Crawling • Focused Crawler: selectively seeks out pages that are relevant to a pre-defined set of topics. • Topics specified by using exemplary documents (not keywords) • Crawl most relevant links • Ignore irrelevant parts. • Leads to significant savings in hardware and network resources.

Focused Crawling • Bayes' theorem estimates the conditional probability that event Hi occurs given event E: • Pr[document is relevant | term t is present] • Pr[document is irrelevant | term t is present] • Pr[term t is present | the doc is relevant] • Pr[term t is present | the doc is irrelevant]
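The four probabilities above are tied together by Bayes' theorem. A minimal Python sketch for a single term follows; the numeric inputs are invented, and a real focused crawler would combine many terms.

```python
def posterior_relevant(p_t_given_rel: float, p_t_given_irr: float, prior_rel: float) -> float:
    """Pr[relevant | term t present] computed with Bayes' theorem."""
    numerator = p_t_given_rel * prior_rel
    evidence = numerator + p_t_given_irr * (1.0 - prior_rel)
    return numerator / evidence if evidence else 0.0


# Illustrative numbers: t appears in 80% of relevant and 10% of irrelevant
# documents, and 20% of all documents are relevant.
p_rel = posterior_relevant(0.8, 0.1, 0.2)
p_irr = 1.0 - p_rel   # Pr[irrelevant | term t present]
print(p_rel, p_irr)
```

With these inputs, seeing the term raises the probability of relevance from 0.2 to about 0.67.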

Parallel Crawlers • The Web is too big to be crawled by a single crawler; work should be divided • Independent assignment • Each crawler starts with its own set of URLs • Follows links without consulting other crawlers • Reduces communication overhead • Some overlap is unavoidable

Parallel Crawlers • Dynamic assignment • Central coordinator divides the web into partitions • Crawlers crawl their assigned partition • Links to other URLs are given to the central coordinator • Static assignment • The web is partitioned and divided among the crawlers • Each crawler only crawls its part of the web
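One common way to realize static assignment is to hash the host name of each URL, so that every crawler can decide locally who owns a URL. The slide does not prescribe this scheme; the sketch below is one possible choice, and `assign_crawler` is a hypothetical helper.

```python
import hashlib
from urllib.parse import urlsplit


def assign_crawler(url: str, num_crawlers: int) -> int:
    """Map a URL to a crawler index by hashing its host name."""
    host = urlsplit(url).hostname or ""
    # A stable hash is used because Python's built-in hash() of strings
    # differs between processes.
    digest = hashlib.md5(host.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_crawlers


# Illustrative URLs; pages on the same host go to the same crawler.
urls = ["http://example.org/a", "http://example.org/b", "http://example.com/"]
assignment = {u: assign_crawler(u, 4) for u in urls}
print(assignment)
```

Partitioning by host rather than by full URL also keeps the politeness rules for one site inside a single crawler.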

URL-Seen Problem • Need to check if file has been parsed or downloaded before - after 20 million pages, we have “seen” over 100 million URLs - each URL is 50 to 75 bytes on average • Options: compress URLs in main memory, or use disk - Bloom Filter (Archive) - disk access with caching (Mercator, Altavista)
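A Bloom filter answers "have we seen this URL?" in a fixed amount of memory, at the cost of occasional false positives (a new URL wrongly reported as seen) and never a false negative. The Python sketch below is a minimal version; the bit-array size, the number of hash functions, and the use of SHA-256 are illustrative assumptions, not the Internet Archive's parameters.

```python
import hashlib


class BloomFilter:
    """Fixed-size set membership with false positives but no false negatives."""

    def __init__(self, num_bits: int = 1 << 20, num_hashes: int = 5):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, url: str):
        # Derive num_hashes bit positions by salting one strong hash.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{url}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, url: str) -> None:
        for pos in self._positions(url):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, url: str) -> bool:
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(url))


seen = BloomFilter()
seen.add("http://example.org/a")
print("http://example.org/a" in seen)   # True: added URLs are always found
```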

An example crawler: Polybot • crawl of 120 million pages over 19 days • 161 million HTTP requests • 16 million robots.txt requests • 138 million successful non-robots requests • 17 million HTTP errors (401, 403, 404 etc) • 121 million pages retrieved • slow during day, fast at night • peak about 300 pages/s over T3 • many downtimes due to attacks, crashes, revisions • http://cis.poly.edu/polybot/ [Suel 02]

Web Page and Virtual Document: indexing what has not been collected • Suppose we have not reached page P, “whitehouse.org”, but have already reached and indexed a set of pages {P1, …, Pr} that point to P • Suppose that, from the link with which each Pi, 1 ≤ i ≤ r, points to P, we extract a window of text: • …George Bush, President of U.S. lives at <a href=http://www.whitehouse.org> WhiteHouse</a> • … George Washington was at <a href=http://whitehouse.org> WhiteHouse</a> • The extracted terms (in the figure: bush, Washington, White House) form a virtual document for P • Example: Madonna on Google
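The idea can be sketched in Python: for every link found in an already indexed page, collect the anchor text plus a few preceding words and file them under the link target. The regular expression, the window size of five words, and the "www." stripping are simplifying assumptions made for this example only.

```python
import re
from collections import defaultdict

# Matches <a href=URL>text</a>, with or without quotes around the URL.
ANCHOR = re.compile(r'<a\s+href=["\']?([^"\'\s>]+)["\']?\s*>(.*?)</a>', re.I | re.S)


def virtual_documents(pages: list[str], window: int = 5) -> dict[str, list[str]]:
    """Map each link target to the words found around links pointing to it."""
    docs = defaultdict(list)
    for html in pages:
        for match in ANCHOR.finditer(html):
            # Assumption: treat www.host and host as the same target.
            target = match.group(1).lower().replace("://www.", "://")
            text_before = re.sub(r"<[^>]+>", " ", html[:match.start()])
            words = text_before.split()[-window:] + match.group(2).split()
            docs[target].extend(w.lower() for w in words)
    return dict(docs)


pages = [
    "George Bush, President of U.S. lives at "
    "<a href=http://www.whitehouse.org> WhiteHouse</a>",
    "George Washington was at <a href=http://whitehouse.org> WhiteHouse</a>",
]
docs = virtual_documents(pages)
print(docs["http://whitehouse.org"])
```

The resulting word list can be indexed like any other document, even though the target page itself was never downloaded.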

Examples: Open Source • Nutch, also used by Overture • http://www.nutch.org • Heritrix, used by Archive.org • http://archive-crawler.sourceforge.net/index.html

What we have seen so far • Web Graph • Search Engine Architecture • Google Overview • Spidering • Next ? • We will focus on INDEX DATA STRUCTURE

The Indexer Module • Creates two indexes: • Text (content) index: uses “traditional” indexing methods like inverted indexing. • Structure (links) index: uses a directed graph of pages and links. Sometimes also creates an inverted graph.

Text Inverted Index • A set of inverted lists, one per index term (word). • Inverted list of a term: a sorted list of locations in which the term appeared. • Posting: a pair (w,l) where w is a word and l is one of its locations. • Lexicon: holds all of the index's terms, with statistics about each term (not the postings)
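These definitions can be made concrete with a small Python sketch that builds inverted lists and a lexicon from a handful of documents. Representing a location as a (document id, word position) pair and keeping two counts per term in the lexicon are assumptions for the example.

```python
from collections import defaultdict


def build_inverted_index(docs: dict[str, str]):
    """Return (inverted lists, lexicon) for a small in-memory collection."""
    inverted = defaultdict(list)
    for doc_id in sorted(docs):   # sorted ids keep each list sorted
        for position, word in enumerate(docs[doc_id].lower().split()):
            inverted[word].append((doc_id, position))   # one posting (w, l)
    # Lexicon: per-term statistics only, not the postings themselves.
    lexicon = {
        word: {"postings": len(plist), "documents": len({d for d, _ in plist})}
        for word, plist in inverted.items()
    }
    return dict(inverted), lexicon


# Illustrative two-document collection.
collection = {"D1": "Stanford web search", "D2": "web graph of the web"}
inverted, lexicon = build_inverted_index(collection)
print(inverted["web"])
print(lexicon["web"])
```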

Text Inverted Index (TF may be normalized by document size)
Lexicon (Word, IDF): Stanford 1/3530 • UCLA 1/9860 • MIT 1/937
Postings list (Document, TF): D1 2 • D14 30 • D376 8 • …
• Google sorts more than 100 B terms in its index.

Google 98: Text Inverted Index • Lexicon: fits in memory, in two different forms • Hit list: accounts for most of the space; each hit uses 2 bytes to save space • Forward index: barrels are sorted by wordID; inside a barrel, entries are sorted by docID • Inverted index: same content as the forward index, but sorted by wordID; each doc list is sorted by docID

Google 98: Text Inverted Index • Each docID is associated with a list of hits - these describe the verbal information in a page. • “Hand” optimized compact encoding • Plain Hits • word occurrences in the main page • relative font size, position(12 bits), capitalization • Fancy Hits • URL, title, anchor, META-tag • denoted by font size setting • plain’s 12 position bits used differently • 4 bits for the type of fancy hits
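To illustrate how a plain hit fits in 2 bytes, here is a Python sketch that packs a capitalization bit, a 3-bit relative font size, and a 12-bit position into one 16-bit integer. The field widths follow the 1998 Google paper as I understand it; the exact bit order chosen here is an assumption.

```python
def pack_plain_hit(capitalized: bool, font_size: int, position: int) -> int:
    """Pack a plain hit into 16 bits: 1 cap bit, 3 font bits, 12 position bits."""
    if not 0 <= font_size < 7:
        # The value 7 is left free so it can flag a fancy hit.
        raise ValueError("font size must be in 0..6 for a plain hit")
    position = min(position, 4095)   # clamp to what 12 bits can hold
    return (int(capitalized) << 15) | (font_size << 12) | position


def unpack_plain_hit(hit: int) -> tuple[bool, int, int]:
    """Inverse of pack_plain_hit."""
    return bool(hit >> 15), (hit >> 12) & 0b111, hit & 0xFFF


hit = pack_plain_hit(True, 3, 42)
print(hex(hit), unpack_plain_hit(hit))
```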

Google 98: Text Inverted Index • Minimise disk seek bottlenecks • Repository • contains full HTML for each crawled page • time favoured over space for the compression algorithm • Document Index • holds document ids for all crawled and uncrawled URLs • feeds uncrawled URLS to the URL Server • batch conversion of URLS into DOCIDs to minimize disk seeks

Text Index Partitioning • Distributed text indexing can be done by: • Local inverted file (IFL) • Each node contains a disjoint random set of pages. • Query is broadcast. • Result is the union of the nodes' answers. • Global inverted file (IFG) • Each node is responsible only for a subset of terms in the collection. • Query is sent only to the appropriate node • BETTER??

Link Index (Web Graph) (diagram) • For each page 0 … n the index stores: number of outbound links, number of inbound links, inbound pages • The example graph has nodes 0, 1, 2, 3, 4

Challenges • Index build must be : • Fast • Economic (unlike traditional index buildings) • Incremental Indexing must be supported • Storage : compression vs. speed

Indexing, Conclusion • Web page indexing is complicated due to its scale (millions of pages, hundreds of gigabytes). • Challenges: incremental indexing and personalization.

ROADMAP • What we have seen so far • Web Graph • Search Engine Architecture • Google Overview • Spidering • Index Data Structure • Next ? • We will focus on Ranking & Social Networks

Traditional Ranking Faults • Many pages containing a term may be of poor quality or not relevant. • TFIDF (Term frequency inverse document frequency) vector and cosine similarity • Insufficient self description vs. spamming. • Not using link analysis.

Traditional Ranking Faults • TF (Term frequency): number of times that a word occurs in a document • IDF (Inverse document frequency): inverse of the number of documents containing the word

Vector Space Model (diagram) • The query is a set of weighted features • Documents are feature vectors • Similarity metric: e.g., using the tf*idf formula • The search engine ranks by descending relevance
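A minimal Python sketch of vector-space ranking with tf*idf weights follows. It uses the slide's definition of IDF (one over the number of documents containing the word); practical systems usually apply a logarithm instead. The tiny collection and the function names are invented for the example.

```python
import math
from collections import Counter


def cosine(u: dict[str, float], v: dict[str, float]) -> float:
    """Cosine similarity between two sparse vectors."""
    dot = sum(u[w] * v[w] for w in u.keys() & v.keys())
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0


def rank(query: list[str], docs: dict[str, list[str]]) -> list[str]:
    """Return document ids by descending tf*idf cosine similarity to the query."""
    # Document frequency: number of documents containing each word.
    df = Counter(w for words in docs.values() for w in set(words))
    vectors = {d: {w: tf / df[w] for w, tf in Counter(words).items()}
               for d, words in docs.items()}
    q = {w: tf / df[w] for w, tf in Counter(query).items() if w in df}
    return sorted(docs, key=lambda d: cosine(q, vectors[d]), reverse=True)


# Illustrative collection.
docs = {
    "d1": ["web", "search", "engine"],
    "d2": ["web", "web", "graph"],
    "d3": ["pasta", "recipe"],
}
ranking = rank(["web", "graph"], docs)
print(ranking)
```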
