Crawling
Paolo Ferragina, Dipartimento di Informatica, Università di Pisa
Reading 20.1, 20.2 and 20.3
Spidering
• 24h, 7 days a week "walking" over a graph
• What about the graph?
  • BowTie structure
  • Directed graph G = (N, E)
  • N changes (insert, delete): >> 50 * 10^9 nodes
  • E changes (insert, delete): > 10 links per node
  • 10 * 50 * 10^9 = 500 * 10^9 1-entries in the adjacency matrix
Crawling Issues
• How to crawl?
  • Quality: "best" pages first
  • Efficiency: avoid duplication (or near-duplication)
  • Etiquette: robots.txt, server-load concerns (minimize load)
• How much to crawl? How much to index?
  • Coverage: how big is the Web? How much do we cover?
  • Relative coverage: how much do competitors have?
• How often to crawl?
  • Freshness: how much has changed?
• How to parallelize the process?
Page selection • Given a page P, define how “good” P is. • Several metrics: • BFS, DFS, Random • Popularity driven (PageRank, full vs partial) • Topic driven or focused crawling • Combined
Is this page a new one?
• Check if the page has been parsed or downloaded before
  • after 20 million pages, we have "seen" over 200 million URLs
  • each URL is at least 100 bytes on average
  • overall we have about 20 GB of URLs
• Options: compress URLs in main memory, or use disk
  • Bloom filter (Archive) (sketch below)
  • Disk access with caching (Mercator, AltaVista)
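As a sketch of the in-memory option, the Bloom filter below keeps a bit array of the URLs seen so far; the sizes, hash choices, and class name are illustrative, not the Archive's actual implementation.

import hashlib

class BloomFilter:
    """Minimal Bloom filter for 'already seen URL' checks (illustrative sizes)."""
    def __init__(self, num_bits=8_000_000, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, url):
        # Derive k bit positions from independent hashes of the URL
        for i in range(self.num_hashes):
            h = hashlib.sha1(f"{i}:{url}".encode()).hexdigest()
            yield int(h, 16) % self.num_bits

    def add(self, url):
        for pos in self._positions(url):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, url):
        # May return a false positive, never a false negative
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(url))

seen = BloomFilter()
if "http://www.di.unipi.it/" not in seen:
    seen.add("http://www.di.unipi.it/")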
Crawler "cycle of life"
Three components communicate through three repositories: the Page Repository (PR), the Priority Queue (PQ), and the Assigned Repository (AR).

Link Extractor:
while (<Page Repository is not empty>) {
  <take a page p (check if it is new)>
  <extract links contained in p within href>
  <extract links contained in javascript>
  <extract .....>
  <insert these links into the Priority Queue>
}

Crawler Manager:
while (<Priority Queue is not empty>) {
  <extract some URL u having the highest priority>
  foreach u extracted {
    if ( (u is not in "Already Seen Pages") ||
         (u is in "Already Seen Pages" && <u's version on the Web is more recent>) ) {
      <resolve u wrt DNS>
      <send u to the Assigned Repository>
    }
  }
}

Downloaders:
while (<Assigned Repository is not empty>) {
  <extract URL u>
  <download page(u)>
  <send page(u) to the Page Repository>
  <store page(u) in a proper archive, possibly compressed>
}
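A minimal single-threaded Python sketch of this cycle; priorities are just insertion order, and DNS resolution, politeness, refresh checks, and archiving are omitted.

import heapq
import re
import urllib.request
from urllib.parse import urljoin

def crawl(seeds, max_pages=100):
    """Sketch of the Link Extractor / Crawler Manager / Downloader cycle."""
    priority_queue = [(0, url) for url in seeds]   # (priority, URL); lower = higher priority
    heapq.heapify(priority_queue)
    already_seen = set(seeds)                      # "Already Seen Pages" check
    page_repository = []                           # downloaded pages

    while priority_queue and len(page_repository) < max_pages:
        # Crawler Manager: pick the URL with the highest priority
        _, url = heapq.heappop(priority_queue)
        try:
            # Downloader: fetch the page and store it in the Page Repository
            html = urllib.request.urlopen(url, timeout=5).read().decode("utf-8", "ignore")
        except Exception:
            continue
        page_repository.append((url, html))
        # Link Extractor: pull href links from the page and enqueue the new ones
        for link in re.findall(r'href="([^"]+)"', html):
            absolute = urljoin(url, link)
            if absolute not in already_seen:
                already_seen.add(absolute)
                heapq.heappush(priority_queue, (len(already_seen), absolute))
    return page_repository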
Parallel Crawlers
The Web is too big to be crawled by a single crawler; work should be divided, avoiding duplication.
• Dynamic assignment
  • A central coordinator dynamically assigns URLs to crawlers
  • Links are sent to the central coordinator (potential bottleneck)
• Static assignment
  • The Web is statically partitioned and assigned to crawlers
  • Each crawler only crawls its part of the Web
Two problems
Let D be the number of downloaders. hash(URL) maps a URL to {0, ..., D-1}. Downloader x fetches the URLs U s.t. hash(U) = x.
• Load balancing the #URLs assigned to downloaders:
  • Static schemes based on hosts may fail
    • www.geocities.com/….
    • www.di.unipi.it/
  • Dynamic "relocation" schemes may be complicated
• Managing fault tolerance:
  • What about the death of a downloader? D → D-1, new hash !!!
  • What about a new downloader? D → D+1, new hash !!!
A nice technique: Consistent Hashing
• Items and servers are mapped to the unit circle
• Item K is assigned to the first server N such that ID(N) ≥ ID(K)
• Each server gets replicated log S times
  • [monotone] adding a new server moves items only from one old server to the new one
  • [balance] the probability that an item goes to a given server is ≤ O(1)/S
  • [load] any server gets ≤ (I/S) log S items w.h.p.
  • [scale] you can replicate each server more times...
• What if a downloader goes down?
• What if a new downloader appears?
• A tool for: spidering, Web caches, P2P, router load balancing, distributed file systems
(See the sketch below.)
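A minimal sketch of consistent hashing for assigning URLs to downloaders; the hash function, the constant replica count, and the names are illustrative.

import bisect
import hashlib

def _hash(key: str) -> int:
    # Map any string to a point on the "unit circle" (here: 0 .. 2^32 - 1)
    return int(hashlib.sha1(key.encode()).hexdigest(), 16) % (1 << 32)

class ConsistentHashRing:
    def __init__(self, replicas=100):
        self.replicas = replicas          # each server replicated several times
        self.ring = []                    # sorted list of (point, server) pairs

    def add_server(self, server: str):
        for i in range(self.replicas):
            bisect.insort(self.ring, (_hash(f"{server}#{i}"), server))

    def remove_server(self, server: str):
        self.ring = [(p, s) for (p, s) in self.ring if s != server]

    def assign(self, url: str) -> str:
        # First server clockwise from the URL's point
        point = _hash(url)
        idx = bisect.bisect_left(self.ring, (point, ""))
        return self.ring[idx % len(self.ring)][1]

ring = ConsistentHashRing()
for d in ["downloader-0", "downloader-1", "downloader-2"]:
    ring.add_server(d)
print(ring.assign("http://www.di.unipi.it/"))
# Removing a downloader only reassigns the URLs that pointed to it
ring.remove_server("downloader-1")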
Examples: Open Source
• Nutch, also used by WikiSearch
  • http://www.nutch.org
• Heritrix, used by Archive.org
  • http://archive-crawler.sourceforge.net/index.html
• Consistent Hashing
  • Amazon's Dynamo
Ranking: Link-based Ranking (2nd generation)
Reading 21
Query-independent ordering
• First generation: using link counts as simple measures of popularity.
• Undirected popularity:
  • Each page gets a score equal to the number of its in-links plus the number of its out-links (e.g. 3+2=5).
• Directed popularity:
  • Score of a page = number of its in-links (e.g. 3).
Both are easy to SPAM.
Second generation: PageRank • Each link has its own importance!! • PageRank is • independent of the query • many interpretations…
Basic Intuition… What about nodes with no in/out links?
Google's PageRank: principal eigenvector with random jump
r = [ a P^T + (1 - a) e e^T ] × r
Entry-wise: r(i) = a * Σ_{j ∈ B(i)} r(j)/#out(j) + (1 - a)/N, where
• B(i): set of pages linking to i
• #out(j): number of outgoing links from j
• e: vector with components 1/sqrt(N)
• a: probability of following a link; (1 - a): probability of a random jump
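A minimal power-iteration sketch of this formula; the toy graph, a = 0.85, the iteration count, and the handling of dangling nodes are illustrative choices.

def pagerank(out_links, a=0.85, iterations=50):
    """Power iteration for r = [a P^T + (1-a) e e^T] r.
    out_links: dict mapping each node to the list of nodes it points to."""
    nodes = list(out_links)
    N = len(nodes)
    r = {v: 1.0 / N for v in nodes}                # start from the uniform vector
    for _ in range(iterations):
        new_r = {v: (1.0 - a) / N for v in nodes}  # random-jump contribution
        for j, targets in out_links.items():
            if targets:
                share = a * r[j] / len(targets)    # r(j) / #out(j), damped by a
                for i in targets:
                    new_r[i] += share
            else:
                # Dangling node: spread its mass uniformly (one common fix)
                for i in nodes:
                    new_r[i] += a * r[j] / N
        r = new_r
    return r

graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
print(pagerank(graph))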
Three different interpretations
• Graph (intuitive interpretation)
  • Co-citation
• Matrix (easy for computation)
  • Eigenvector computation or a linear-system solution
• Markov Chain (useful to prove convergence)
  • A sort of usage simulation: with probability a the surfer follows a link to a neighbor, with probability 1-a it jumps to any node
  • "In the steady state" each page has a long-term visit rate; use this as the page's score
PageRank: use in Search Engines
• Preprocessing:
  • Given the graph, build the matrix a P^T + (1 - a) e e^T
  • Compute its principal eigenvector r
  • r[i] is the PageRank of page i
  • We are interested in the relative order
• Query processing:
  • Retrieve the pages containing the query terms
  • Rank them by their PageRank
• The final order is query-independent
Calculating HITS • It is query-dependent • Produces two scores per page: • Authority score: a good authority page for a topic is pointed to by many good hubs for that topic. • Hub score: A good hub page for a topic points to many authoritative pages for that topic.
Authority and Hub scores
(Example graph on nodes 1..7: pages 2, 3, 4 point to page 1, and page 1 points to pages 5, 6, 7.)
h(1) = a(5) + a(6) + a(7)
a(1) = h(2) + h(3) + h(4)
HITS: Link Analysis Computation
Iteratively compute a = A^t h and h = A a, hence a = A^t A a and h = A A^t h, where
• a: vector of authority scores
• h: vector of hub scores
• A: adjacency matrix in which a_{i,j} = 1 if i → j
A^t A and A A^t are symmetric matrices.
Thus, h is an eigenvector of A A^t and a is an eigenvector of A^t A.
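A minimal numpy sketch of this iteration; the toy graph, the iteration count, and the normalization step are illustrative.

import numpy as np

def hits(A, iterations=50):
    """Iterate a = A^T h, h = A a with normalization; A[i, j] = 1 if i -> j."""
    n = A.shape[0]
    a = np.ones(n)
    h = np.ones(n)
    for _ in range(iterations):
        a = A.T @ h                 # authority: sum of hub scores of pages pointing to it
        h = A @ a                   # hub: sum of authority scores of pages it points to
        a /= np.linalg.norm(a)      # normalize to avoid overflow
        h /= np.linalg.norm(h)
    return a, h

# Page 0 points to pages 1 and 2; page 3 points to page 1
A = np.array([[0, 1, 1, 0],
              [0, 0, 0, 0],
              [0, 0, 0, 0],
              [0, 1, 0, 0]], dtype=float)
authority, hub = hits(A)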
Weighting links Weight more if the query occurs in the neighborhood of the link (e.g. anchor text).
Latent Semantic Indexing (mapping onto a smaller space of latent concepts)
Paolo Ferragina, Dipartimento di Informatica, Università di Pisa
Reading 18
Speeding up cosine computation
• What if we could take our vectors and "pack" them into fewer dimensions (say 50,000 → 100) while preserving distances?
  • Now: O(nm)
  • Then: O(km + kn), where k << n, m
• Two methods:
  • "Latent semantic indexing"
  • Random projection
A sketch
• LSI is data-dependent
  • Create a k-dim subspace by eliminating redundant axes
  • Pull together "related" axes, hopefully: car and automobile
  • What about polysemy?
• Random projection is data-independent
  • Choose a k-dim subspace that, with high probability, guarantees good stretching properties between pairs of points
Notions from linear algebra
• Matrix A, vector v
• Matrix transpose (A^t)
• Matrix product
• Rank
• Eigenvalue λ and eigenvector v: A v = λ v
Overview of LSI • Pre-process docs using a technique from linear algebra called Singular Value Decomposition • Create a new (smaller) vector space • Queries handled (faster) in this new space
Singular-Value Decomposition
• Recall the m × n matrix of terms × docs, A.
  • A has rank r ≤ m, n
• Define the term-term correlation matrix T = A A^t
  • T is a square, symmetric m × m matrix
  • Let P be the m × r matrix of eigenvectors of T
• Define the doc-doc correlation matrix D = A^t A
  • D is a square, symmetric n × n matrix
  • Let R be the n × r matrix of eigenvectors of D
A's decomposition
• There exist matrices P (for T, m × r) and R (for D, n × r) formed by orthonormal columns (unit dot-product)
• It turns out that A = P S R^t
  • where S is an r × r diagonal matrix whose entries are the singular values of A (the square roots of the eigenvalues of T = A A^t), in decreasing order
  • Dimensions: A (m × n) = P (m × r) · S (r × r) · R^t (r × n)
Dimensionality reduction
• For some k << r, zero out all but the k biggest singular values in S [the choice of k is crucial]
• Denote by S_k this new version of S, having rank k
• Typically k is about 100, while r (A's rank) is > 10,000
• A_k = P S_k R^t: the columns of P and rows of R^t beyond the k-th are useless due to the 0-columns/0-rows of S_k, so effectively A_k (m × n) = P_k (m × k) · S_k (k × k) · R_k^t (k × n)
Guarantee
• A_k is a pretty good approximation to A:
  • Relative distances are (approximately) preserved
  • Of all m × n matrices of rank k, A_k is the best approximation to A wrt the following measures:
    • min_{B, rank(B)=k} ||A - B||_2 = ||A - A_k||_2 = σ_{k+1}
    • min_{B, rank(B)=k} ||A - B||_F^2 = ||A - A_k||_F^2 = σ_{k+1}^2 + σ_{k+2}^2 + ... + σ_r^2
  • Frobenius norm: ||A||_F^2 = σ_1^2 + σ_2^2 + ... + σ_r^2
Reduction
R, P are formed by orthonormal eigenvectors of the matrices D, T.
• X_k = S_k R^t is the doc-matrix, k × n, hence reduced to k dimensions
• Take the doc-correlation matrix:
  • It is D = A^t A = (P S R^t)^t (P S R^t) = (S R^t)^t (S R^t)
  • Approximate S with S_k, thus getting A^t A ≈ X_k^t X_k (both are n × n matrices)
• We use X_k to define A's projection:
  • X_k = S_k R^t; substitute R^t = S^{-1} P^t A, so we get X_k = P_k^t A
  • In fact, S_k S^{-1} P^t = P_k^t, which is a k × m matrix
• This means that to reduce a doc/query vector it is enough to multiply it by P_k^t
• The cost of sim(q, d), for all d, is O(kn + km) instead of O(mn)
Which are the concepts?
• The c-th concept = the c-th row of P_k^t (which is k × m)
  • Denote it by P_k^t[c], whose size is m = #terms
  • P_k^t[c][i] = strength of association between the c-th concept and the i-th term
• Projected document: d'_j = P_k^t d_j
  • d'_j[c] = strength of concept c in d_j
• Projected query: q' = P_k^t q
  • q'[c] = strength of concept c in q
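A minimal numpy sketch of the whole LSI pipeline; the toy term-doc matrix and k = 2 are illustrative. It computes the SVD, keeps the top-k concepts, and projects documents and a query with P_k^t.

import numpy as np

# Toy term-doc matrix A (m terms x n docs); values are term counts
A = np.array([[2, 0, 1, 0],
              [1, 1, 0, 0],
              [0, 2, 0, 1],
              [0, 0, 1, 2]], dtype=float)

k = 2
P, s, Rt = np.linalg.svd(A, full_matrices=False)  # A = P S R^t
Pk = P[:, :k]                                     # m x k: top-k concept directions

docs_k = Pk.T @ A        # k x n: each column is a doc reduced to k concepts
q = np.array([1, 0, 0, 1], dtype=float)
q_k = Pk.T @ q           # query reduced to the same k concepts

# Cosine similarity in the reduced space: O(k) per doc instead of O(m)
sims = (docs_k.T @ q_k) / (np.linalg.norm(docs_k, axis=0) * np.linalg.norm(q_k) + 1e-12)
print(sims)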
Random Projections
Paolo Ferragina, Dipartimento di Informatica, Università di Pisa
Slides only!
An interesting math result (the Johnson-Lindenstrauss lemma)
For every set of n points in R^d and every ε > 0, there exists a map f into R^k, with k = O((1/ε)^2 · log n), such that for every pair of points u, v:
(1 - ε) ||u - v||^2 ≤ ||f(u) - f(v)||^2 ≤ (1 + ε) ||u - v||^2
• d is our previous m = #terms
• Setting v = 0 we also get a bound on f(u)'s stretching!!!
What about the cosine-distance?
Bounding f(u)'s and f(v)'s stretching and substituting the formula above shows that dot products, and hence cosine distances, are approximately preserved as well.
A practical-theoretical idea !!!
Map each vector u to f(u) = (1/√k) R u, where R is a k × d random matrix whose entries r_{i,j} are independent with E[r_{i,j}] = 0 and Var[r_{i,j}] = 1 (e.g. random ±1 entries, cheaper to generate and multiply than Gaussian ones).
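A minimal numpy sketch of such a projection; the ±1 entries and the values of d and k are illustrative.

import numpy as np

rng = np.random.default_rng(0)
d, k = 10_000, 400                      # original #terms and reduced dimension (illustrative)

# Random matrix with E[r_ij] = 0, Var[r_ij] = 1: entries are +1 or -1
R = rng.choice([-1.0, 1.0], size=(k, d))

def project(u):
    # Scale by 1/sqrt(k) so squared distances are preserved in expectation
    return (R @ u) / np.sqrt(k)

u = rng.random(d)
v = rng.random(d)
print(np.linalg.norm(u - v) ** 2, np.linalg.norm(project(u) - project(v)) ** 2)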
Finally...
• Random projections hide large constants
  • k ≅ (1/ε)^2 · log #points, so it may be large…
  • but they are simple and fast to compute
• LSI is intuitive and may scale to any k
  • optimal under various metrics
  • but costly to compute
Document duplication (exact or approximate)
Paolo Ferragina, Dipartimento di Informatica, Università di Pisa
Slides only!
Duplicate documents (Sec. 19.6)
• The web is full of duplicated content
  • A few cases of exact duplicates
  • Many cases of near-duplicates
  • E.g., the Last-Modified date is the only difference between two copies of a page
Natural Approaches
• Fingerprinting:
  • only works for exact matches
• Random sampling:
  • sample substrings (phrases, sentences, etc.)
  • hope: similar documents → similar samples
  • but even samples of the same document will differ
• Edit distance:
  • metric for approximate string matching
  • expensive, even for one pair of strings
  • impossible for 10^32 web documents
Exact-Duplicate Detection • Obvious techniques • Checksum – no worst-case collision probability guarantees • MD5 – cryptographically-secure string hashes • relatively slow • Karp-Rabin’s Scheme • Algebraic technique – arithmetic on primes • Efficient and other nice properties…
Karp-Rabin Fingerprints
• Consider an m-bit string A = a_1 a_2 … a_m
• Assume a_1 = 1 and fixed-length strings (wlog)
• Basic values:
  • Choose a prime p in the universe U, such that 2p fits in a few memory words (hence U ≈ 2^64)
  • Set h = d^{m-1} mod p (d is the base; d = 2 for bit strings)
• Fingerprints: f(A) = A mod p
• Nice property: if B = a_2 … a_m a_{m+1} (A shifted by one position), then
  f(B) = [ d (A - a_1 h) + a_{m+1} ] mod p
• Prob[false hit] = Prob[p divides (A - B)] = #div(A - B)/|U| ≈ log(A + B)/|U| = m/|U|
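A minimal rolling-fingerprint sketch of this scheme over bytes rather than bits; the base d = 256 and the chosen prime are illustrative.

# Karp-Rabin rolling fingerprint over a byte string (illustrative parameters)
P = (1 << 61) - 1        # a large prime, so arithmetic stays within a machine word
D = 256                  # base: one symbol per byte

def fingerprints(text: bytes, m: int):
    """Yield (position, fingerprint) of every length-m substring of text."""
    h = pow(D, m - 1, P)                 # h = d^(m-1) mod p
    f = 0
    for c in text[:m]:
        f = (f * D + c) % P              # f(A) = A mod p for the first window
    yield 0, f
    for i in range(m, len(text)):
        # Slide the window: f(B) = [d * (f(A) - a_1 * h) + a_{m+1}] mod p
        f = (D * (f - text[i - m] * h) + text[i]) % P
        yield i - m + 1, f

for pos, fp in fingerprints(b"abracadabra", 4):
    print(pos, fp)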
Near-Duplicate Detection
• Problem
  • Given a large collection of documents
  • Identify the near-duplicate documents
• Web search engines
  • Proliferation of near-duplicate documents
    • Legitimate: mirrors, local copies, updates, …
    • Malicious: spam, spider traps, dynamic URLs, …
    • Mistaken: spider errors
  • 30% of web pages are near-duplicates [1997]
Desiderata
• Storage: only small sketches of each document
• Computation: the fastest possible
• Stream processing:
  • once the sketch is computed, the source is unavailable
• Error guarantees:
  • at this problem scale, small biases have a large impact
  • need formal guarantees; heuristics will not do
Basic Idea [Broder 1997]
• Shingling
  • dissect each document into q-grams (shingles)
  • represent documents by their shingle sets
  • reduce the problem to set intersection [Jaccard]
  • two documents are near-duplicates if their shingle sets intersect enough
• We know how to cope with "Set Intersection"
  • fingerprints of shingles (for space efficiency)
  • min-hash to estimate intersection sizes (for time and space efficiency)
Documents → (shingling) → multisets of shingles → (fingerprinting) → sets of 64-bit fingerprints
• Fingerprints:
  • Use Karp-Rabin fingerprints over q-gram shingles (of 8q bits)
  • Fingerprint space [0, …, U-1]
  • In practice, use 64-bit fingerprints, i.e., U = 2^64
  • Prob[collision] ≈ (8q)/2^64 << 1
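A minimal shingling sketch; q = 8 and the use of a generic 64-bit hash in place of Karp-Rabin fingerprints are illustrative choices.

import hashlib

def shingle_fingerprints(text: str, q: int = 8) -> set[int]:
    """Map a document to the set of 64-bit fingerprints of its q-gram shingles.
    (Uses a generic 64-bit hash as a stand-in for Karp-Rabin fingerprints.)"""
    fps = set()
    for i in range(len(text) - q + 1):
        shingle = text[i:i + q].encode()
        digest = hashlib.blake2b(shingle, digest_size=8).digest()  # 64 bits
        fps.add(int.from_bytes(digest, "big"))
    return fps

SA = shingle_fingerprints("a rose is a rose is a rose")
SB = shingle_fingerprints("a rose is a rose is a flower")
jaccard = len(SA & SB) / len(SA | SB)
print(round(jaccard, 3))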
Similarity of Documents
Doc A → shingle set SA, Doc B → shingle set SB, over the universe U = [0 … N-1]
• Jaccard measure of similarity: sim(SA, SB) = |SA ∩ SB| / |SA ∪ SB|
• Claim: A & B are near-duplicates if sim(SA, SB) is high
Speeding up: sketch of a document (Sec. 19.6)
• Intersecting the shingle sets directly is too costly
• Create a "sketch vector" (of size ~200) for each document
• Documents that share ≥ t (say 80%) of their corresponding vector elements are deemed near-duplicates
Sketching by Min-Hashing
• Consider SA, SB ⊆ P
• Pick a random permutation π of P (such as π(x) = ax + b mod |P|)
• Define α = π^{-1}( min{π(SA)} ), β = π^{-1}( min{π(SB)} )
  • i.e. the minimal element of each set under the permutation π
• Lemma: Prob[α = β] = sim(SA, SB), the Jaccard similarity of SA and SB
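A minimal min-hash sketch built on the shingle fingerprints above; the 200 linear "permutations" modulo a prime are illustrative.

import random

PRIME = (1 << 61) - 1
NUM_PERMS = 200
random.seed(42)
# One random "permutation" pi(x) = (a*x + b) mod PRIME per sketch component
PERMS = [(random.randrange(1, PRIME), random.randrange(PRIME)) for _ in range(NUM_PERMS)]

def minhash_sketch(fingerprints: set[int]) -> list[int]:
    """Sketch vector: the minimum of pi(x) over the set, for each permutation pi."""
    return [min((a * x + b) % PRIME for x in fingerprints) for (a, b) in PERMS]

def estimated_jaccard(sk1: list[int], sk2: list[int]) -> float:
    # The fraction of components where the two sketches agree estimates sim(SA, SB)
    return sum(x == y for x, y in zip(sk1, sk2)) / len(sk1)

# Usage with the shingle fingerprint sets SA, SB from the previous sketch:
# print(estimated_jaccard(minhash_sketch(SA), minhash_sketch(SB)))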