Crawling
Paolo Ferragina, Dipartimento di Informatica, Università di Pisa
Reading 20.1, 20.2 and 20.3
Spidering
• 24h, 7 days a week "walking" over a graph
• What about the graph?
  • BowTie structure
  • Directed graph G = (N, E)
  • N changes (insert, delete): >> 50 * 10^9 nodes
  • E changes (insert, delete): > 10 links per node
  • 10 * 50 * 10^9 = 500 * 10^9 1-entries in the adjacency matrix
Crawling Issues
• How to crawl?
  • Quality: "best" pages first
  • Efficiency: avoid duplication (or near-duplication)
  • Etiquette: robots.txt, server-load concerns (minimize load)
• How much to crawl? How much to index?
  • Coverage: how big is the Web? How much do we cover?
  • Relative coverage: how much do competitors have?
• How often to crawl?
  • Freshness: how much has changed?
• How to parallelize the process?
Page selection • Given a page P, define how “good” P is. • Several metrics: • BFS, DFS, Random • Popularity driven (PageRank, full vs partial) • Topic driven or focused crawling • Combined
Is this page a new one?
• Check if the page has been parsed or downloaded before
  • after 20 million pages, we have "seen" over 200 million URLs
  • each URL is at least 100 bytes on average
  • overall we have about 20 GB of URLs
• Options: compress URLs in main memory, or use disk
  • Bloom filter (Archive) (sketch below)
  • Disk access with caching (Mercator, AltaVista)
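As a sketch of the in-memory option, the Bloom filter below keeps a bit array of the URLs seen so far; the sizes, hash choices, and class name are illustrative, not the Archive's actual implementation.

import hashlib

class BloomFilter:
    """Minimal Bloom filter for 'already seen URL' checks (illustrative sizes)."""
    def __init__(self, num_bits=8_000_000, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, url):
        # Derive k bit positions from independent hashes of the URL
        for i in range(self.num_hashes):
            h = hashlib.sha1(f"{i}:{url}".encode()).hexdigest()
            yield int(h, 16) % self.num_bits

    def add(self, url):
        for pos in self._positions(url):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, url):
        # May return a false positive, never a false negative
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(url))

seen = BloomFilter()
if "http://www.di.unipi.it/" not in seen:
    seen.add("http://www.di.unipi.it/")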
Crawler "cycle of life"
Three components communicate through three repositories: the Page Repository (PR), the Priority Queue (PQ), and the Assigned Repository (AR).

Link Extractor:
while (<Page Repository is not empty>) {
  <take a page p (check if it is new)>
  <extract links contained in p within href>
  <extract links contained in javascript>
  <extract .....>
  <insert these links into the Priority Queue>
}

Crawler Manager:
while (<Priority Queue is not empty>) {
  <extract some URL u having the highest priority>
  foreach u extracted {
    if ( (u is not in "Already Seen Pages") ||
         (u is in "Already Seen Pages" && <u's version on the Web is more recent>) ) {
      <resolve u wrt DNS>
      <send u to the Assigned Repository>
    }
  }
}

Downloaders:
while (<Assigned Repository is not empty>) {
  <extract URL u>
  <download page(u)>
  <send page(u) to the Page Repository>
  <store page(u) in a proper archive, possibly compressed>
}
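A minimal single-threaded Python sketch of this cycle; priorities are just insertion order, and DNS resolution, politeness, refresh checks, and archiving are omitted.

import heapq
import re
import urllib.request
from urllib.parse import urljoin

def crawl(seeds, max_pages=100):
    """Sketch of the Link Extractor / Crawler Manager / Downloader cycle."""
    priority_queue = [(0, url) for url in seeds]   # (priority, URL); lower = higher priority
    heapq.heapify(priority_queue)
    already_seen = set(seeds)                      # "Already Seen Pages" check
    page_repository = []                           # downloaded pages

    while priority_queue and len(page_repository) < max_pages:
        # Crawler Manager: pick the URL with the highest priority
        _, url = heapq.heappop(priority_queue)
        try:
            # Downloader: fetch the page and store it in the Page Repository
            html = urllib.request.urlopen(url, timeout=5).read().decode("utf-8", "ignore")
        except Exception:
            continue
        page_repository.append((url, html))
        # Link Extractor: pull href links from the page and enqueue the new ones
        for link in re.findall(r'href="([^"]+)"', html):
            absolute = urljoin(url, link)
            if absolute not in already_seen:
                already_seen.add(absolute)
                heapq.heappush(priority_queue, (len(already_seen), absolute))
    return page_repository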
Parallel Crawlers
The Web is too big to be crawled by a single crawler; work should be divided, avoiding duplication.
• Dynamic assignment
  • A central coordinator dynamically assigns URLs to crawlers
  • Links are sent to the central coordinator (potential bottleneck)
• Static assignment
  • The Web is statically partitioned and assigned to crawlers
  • Each crawler only crawls its part of the Web
Two problems
Let D be the number of downloaders. hash(URL) maps a URL to {0, ..., D-1}. Downloader x fetches the URLs U s.t. hash(U) = x.
• Load balancing the #URLs assigned to downloaders:
  • Static schemes based on hosts may fail
    • www.geocities.com/….
    • www.di.unipi.it/
  • Dynamic "relocation" schemes may be complicated
• Managing fault tolerance:
  • What about the death of a downloader? D → D-1, new hash !!!
  • What about a new downloader? D → D+1, new hash !!!
A nice technique: Consistent Hashing
• Items and servers are mapped to the unit circle
• Item K is assigned to the first server N such that ID(N) ≥ ID(K)
• Each server gets replicated log S times
  • [monotone] adding a new server moves items only from one old server to the new one
  • [balance] the probability that an item goes to a given server is ≤ O(1)/S
  • [load] any server gets ≤ (I/S) log S items w.h.p.
  • [scale] you can replicate each server more times...
• What if a downloader goes down?
• What if a new downloader appears?
• A tool for: spidering, Web caches, P2P, router load balancing, distributed file systems
(See the sketch below.)
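A minimal sketch of consistent hashing for assigning URLs to downloaders; the hash function, the constant replica count, and the names are illustrative.

import bisect
import hashlib

def _hash(key: str) -> int:
    # Map any string to a point on the "unit circle" (here: 0 .. 2^32 - 1)
    return int(hashlib.sha1(key.encode()).hexdigest(), 16) % (1 << 32)

class ConsistentHashRing:
    def __init__(self, replicas=100):
        self.replicas = replicas          # each server replicated several times
        self.ring = []                    # sorted list of (point, server) pairs

    def add_server(self, server: str):
        for i in range(self.replicas):
            bisect.insort(self.ring, (_hash(f"{server}#{i}"), server))

    def remove_server(self, server: str):
        self.ring = [(p, s) for (p, s) in self.ring if s != server]

    def assign(self, url: str) -> str:
        # First server clockwise from the URL's point
        point = _hash(url)
        idx = bisect.bisect_left(self.ring, (point, ""))
        return self.ring[idx % len(self.ring)][1]

ring = ConsistentHashRing()
for d in ["downloader-0", "downloader-1", "downloader-2"]:
    ring.add_server(d)
print(ring.assign("http://www.di.unipi.it/"))
# Removing a downloader only reassigns the URLs that pointed to it
ring.remove_server("downloader-1")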
Examples: Open Source
• Nutch, also used by WikiSearch
  • http://www.nutch.org
• Heritrix, used by Archive.org
  • http://archive-crawler.sourceforge.net/index.html
• Consistent Hashing
  • Amazon's Dynamo
Ranking: Link-based Ranking (2nd generation)
Reading 21
Query-independent ordering
• First generation: using link counts as simple measures of popularity.
• Undirected popularity:
  • Each page gets a score equal to the number of its in-links plus the number of its out-links (e.g. 3+2=5).
• Directed popularity:
  • Score of a page = number of its in-links (e.g. 3).
Both are easy to SPAM.
Second generation: PageRank • Each link has its own importance!! • PageRank is • independent of the query • many interpretations…
Basic Intuition… What about nodes with no in/out links?
Google's PageRank: principal eigenvector with random jump
r = [ a P^T + (1 - a) e e^T ] × r
Entry-wise: r(i) = a * Σ_{j ∈ B(i)} r(j)/#out(j) + (1 - a)/N, where
• B(i): set of pages linking to i
• #out(j): number of outgoing links from j
• e: vector with components 1/sqrt(N)
• a: probability of following a link; (1 - a): probability of a random jump
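A minimal power-iteration sketch of this formula; the toy graph, a = 0.85, the iteration count, and the handling of dangling nodes are illustrative choices.

def pagerank(out_links, a=0.85, iterations=50):
    """Power iteration for r = [a P^T + (1-a) e e^T] r.
    out_links: dict mapping each node to the list of nodes it points to."""
    nodes = list(out_links)
    N = len(nodes)
    r = {v: 1.0 / N for v in nodes}                # start from the uniform vector
    for _ in range(iterations):
        new_r = {v: (1.0 - a) / N for v in nodes}  # random-jump contribution
        for j, targets in out_links.items():
            if targets:
                share = a * r[j] / len(targets)    # r(j) / #out(j), damped by a
                for i in targets:
                    new_r[i] += share
            else:
                # Dangling node: spread its mass uniformly (one common fix)
                for i in nodes:
                    new_r[i] += a * r[j] / N
        r = new_r
    return r

graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
print(pagerank(graph))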
Three different interpretations
• Graph (intuitive interpretation)
  • Co-citation
• Matrix (easy for computation)
  • Eigenvector computation or a linear-system solution
• Markov Chain (useful to prove convergence)
  • A sort of usage simulation: with probability a the surfer follows a link to a neighbor, with probability 1-a it jumps to any node
  • "In the steady state" each page has a long-term visit rate; use this as the page's score
PageRank: use in Search Engines
• Preprocessing:
  • Given the graph, build the matrix a P^T + (1 - a) e e^T
  • Compute its principal eigenvector r
  • r[i] is the PageRank of page i
  • We are interested in the relative order
• Query processing:
  • Retrieve the pages containing the query terms
  • Rank them by their PageRank
• The final order is query-independent
Calculating HITS • It is query-dependent • Produces two scores per page: • Authority score: a good authority page for a topic is pointed to by many good hubs for that topic. • Hub score: A good hub page for a topic points to many authoritative pages for that topic.
Authority and Hub scores
(Example graph on nodes 1..7: pages 2, 3, 4 point to page 1, and page 1 points to pages 5, 6, 7.)
h(1) = a(5) + a(6) + a(7)
a(1) = h(2) + h(3) + h(4)
HITS: Link Analysis Computation
Iteratively compute a = A^t h and h = A a, hence a = A^t A a and h = A A^t h, where
• a: vector of authority scores
• h: vector of hub scores
• A: adjacency matrix in which a_{i,j} = 1 if i → j
A^t A and A A^t are symmetric matrices.
Thus, h is an eigenvector of A A^t and a is an eigenvector of A^t A.
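A minimal numpy sketch of this iteration; the toy graph, the iteration count, and the normalization step are illustrative.

import numpy as np

def hits(A, iterations=50):
    """Iterate a = A^T h, h = A a with normalization; A[i, j] = 1 if i -> j."""
    n = A.shape[0]
    a = np.ones(n)
    h = np.ones(n)
    for _ in range(iterations):
        a = A.T @ h                 # authority: sum of hub scores of pages pointing to it
        h = A @ a                   # hub: sum of authority scores of pages it points to
        a /= np.linalg.norm(a)      # normalize to avoid overflow
        h /= np.linalg.norm(h)
    return a, h

# Page 0 points to pages 1 and 2; page 3 points to page 1
A = np.array([[0, 1, 1, 0],
              [0, 0, 0, 0],
              [0, 0, 0, 0],
              [0, 1, 0, 0]], dtype=float)
authority, hub = hits(A)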
Weighting links Weight more if the query occurs in the neighborhood of the link (e.g. anchor text).
Latent Semantic Indexing (mapping onto a smaller space of latent concepts)
Paolo Ferragina, Dipartimento di Informatica, Università di Pisa
Reading 18
Speeding up cosine computation
• What if we could take our vectors and "pack" them into fewer dimensions (say 50,000 → 100) while preserving distances?
  • Now: O(nm)
  • Then: O(km + kn), where k << n, m
• Two methods:
  • "Latent semantic indexing"
  • Random projection
A sketch
• LSI is data-dependent
  • Create a k-dim subspace by eliminating redundant axes
  • Pull together "related" axes, hopefully: car and automobile
  • What about polysemy?
• Random projection is data-independent
  • Choose a k-dim subspace that, with high probability, guarantees good stretching properties between pairs of points
Notions from linear algebra
• Matrix A, vector v
• Matrix transpose (A^t)
• Matrix product
• Rank
• Eigenvalue λ and eigenvector v: A v = λ v
Overview of LSI • Pre-process docs using a technique from linear algebra called Singular Value Decomposition • Create a new (smaller) vector space • Queries handled (faster) in this new space
Singular-Value Decomposition
• Recall the m × n matrix of terms × docs, A.
  • A has rank r ≤ m, n
• Define the term-term correlation matrix T = A A^t
  • T is a square, symmetric m × m matrix
  • Let P be the m × r matrix of eigenvectors of T
• Define the doc-doc correlation matrix D = A^t A
  • D is a square, symmetric n × n matrix
  • Let R be the n × r matrix of eigenvectors of D
A's decomposition
• There exist matrices P (for T, m × r) and R (for D, n × r) formed by orthonormal columns (unit dot-product)
• It turns out that A = P S R^t
  • where S is an r × r diagonal matrix whose entries are the singular values of A (the square roots of the eigenvalues of T = A A^t), in decreasing order
  • Dimensions: A (m × n) = P (m × r) · S (r × r) · R^t (r × n)
Dimensionality reduction
• For some k << r, zero out all but the k biggest singular values in S [the choice of k is crucial]
• Denote by S_k this new version of S, having rank k
• Typically k is about 100, while r (A's rank) is > 10,000
• A_k = P S_k R^t: the columns of P and rows of R^t beyond the k-th are useless due to the 0-columns/0-rows of S_k, so effectively A_k (m × n) = P_k (m × k) · S_k (k × k) · R_k^t (k × n)
Guarantee
• A_k is a pretty good approximation to A:
  • Relative distances are (approximately) preserved
  • Of all m × n matrices of rank k, A_k is the best approximation to A wrt the following measures:
    • min_{B, rank(B)=k} ||A - B||_2 = ||A - A_k||_2 = σ_{k+1}
    • min_{B, rank(B)=k} ||A - B||_F^2 = ||A - A_k||_F^2 = σ_{k+1}^2 + σ_{k+2}^2 + ... + σ_r^2
  • Frobenius norm: ||A||_F^2 = σ_1^2 + σ_2^2 + ... + σ_r^2
Reduction
R, P are formed by orthonormal eigenvectors of the matrices D, T.
• X_k = S_k R^t is the doc-matrix, k × n, hence reduced to k dimensions
• Take the doc-correlation matrix:
  • It is D = A^t A = (P S R^t)^t (P S R^t) = (S R^t)^t (S R^t)
  • Approximate S with S_k, thus getting A^t A ≈ X_k^t X_k (both are n × n matrices)
• We use X_k to define A's projection:
  • X_k = S_k R^t; substitute R^t = S^{-1} P^t A, so we get X_k = P_k^t A
  • In fact, S_k S^{-1} P^t = P_k^t, which is a k × m matrix
• This means that to reduce a doc/query vector it is enough to multiply it by P_k^t
• The cost of sim(q, d), for all d, is O(kn + km) instead of O(mn)
Which are the concepts?
• The c-th concept = the c-th row of P_k^t (which is k × m)
  • Denote it by P_k^t[c], whose size is m = #terms
  • P_k^t[c][i] = strength of association between the c-th concept and the i-th term
• Projected document: d'_j = P_k^t d_j
  • d'_j[c] = strength of concept c in d_j
• Projected query: q' = P_k^t q
  • q'[c] = strength of concept c in q
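A minimal numpy sketch of the whole LSI pipeline; the toy term-doc matrix and k = 2 are illustrative. It computes the SVD, keeps the top-k concepts, and projects documents and a query with P_k^t.

import numpy as np

# Toy term-doc matrix A (m terms x n docs); values are term counts
A = np.array([[2, 0, 1, 0],
              [1, 1, 0, 0],
              [0, 2, 0, 1],
              [0, 0, 1, 2]], dtype=float)

k = 2
P, s, Rt = np.linalg.svd(A, full_matrices=False)  # A = P S R^t
Pk = P[:, :k]                                     # m x k: top-k concept directions

docs_k = Pk.T @ A        # k x n: each column is a doc reduced to k concepts
q = np.array([1, 0, 0, 1], dtype=float)
q_k = Pk.T @ q           # query reduced to the same k concepts

# Cosine similarity in the reduced space: O(k) per doc instead of O(m)
sims = (docs_k.T @ q_k) / (np.linalg.norm(docs_k, axis=0) * np.linalg.norm(q_k) + 1e-12)
print(sims)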
Random Projections
Paolo Ferragina, Dipartimento di Informatica, Università di Pisa
Slides only!
An interesting math result (the Johnson-Lindenstrauss lemma)
For every set of n points in R^d and every ε > 0, there exists a map f into R^k, with k = O((1/ε)^2 · log n), such that for every pair of points u, v:
(1 - ε) ||u - v||^2 ≤ ||f(u) - f(v)||^2 ≤ (1 + ε) ||u - v||^2
• d is our previous m = #terms
• Setting v = 0 we also get a bound on f(u)'s stretching!!!
What about the cosine-distance?
Bounding f(u)'s and f(v)'s stretching and substituting the formula above shows that dot products, and hence cosine distances, are approximately preserved as well.
A practical-theoretical idea !!!
Map each vector u to f(u) = (1/√k) R u, where R is a k × d random matrix whose entries r_{i,j} are independent with E[r_{i,j}] = 0 and Var[r_{i,j}] = 1 (e.g. random ±1 entries, cheaper to generate and multiply than Gaussian ones).
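A minimal numpy sketch of such a projection; the ±1 entries and the values of d and k are illustrative.

import numpy as np

rng = np.random.default_rng(0)
d, k = 10_000, 400                      # original #terms and reduced dimension (illustrative)

# Random matrix with E[r_ij] = 0, Var[r_ij] = 1: entries are +1 or -1
R = rng.choice([-1.0, 1.0], size=(k, d))

def project(u):
    # Scale by 1/sqrt(k) so squared distances are preserved in expectation
    return (R @ u) / np.sqrt(k)

u = rng.random(d)
v = rng.random(d)
print(np.linalg.norm(u - v) ** 2, np.linalg.norm(project(u) - project(v)) ** 2)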
Finally...
• Random projections hide large constants
  • k ≅ (1/ε)^2 · log #points, so it may be large…
  • but they are simple and fast to compute
• LSI is intuitive and may scale to any k
  • optimal under various metrics
  • but costly to compute
Document duplication (exact or approximate)
Paolo Ferragina, Dipartimento di Informatica, Università di Pisa
Slides only!
Duplicate documents (Sec. 19.6)
• The web is full of duplicated content
  • A few cases of exact duplicates
  • Many cases of near-duplicates
  • E.g., the Last-Modified date is the only difference between two copies of a page
Natural Approaches
• Fingerprinting:
  • only works for exact matches
• Random sampling:
  • sample substrings (phrases, sentences, etc.)
  • hope: similar documents → similar samples
  • but even samples of the same document will differ
• Edit distance:
  • metric for approximate string matching
  • expensive, even for one pair of strings
  • impossible for 10^32 web documents
Exact-Duplicate Detection • Obvious techniques • Checksum – no worst-case collision probability guarantees • MD5 – cryptographically-secure string hashes • relatively slow • Karp-Rabin’s Scheme • Algebraic technique – arithmetic on primes • Efficient and other nice properties…
Karp-Rabin Fingerprints
• Consider an m-bit string A = a_1 a_2 … a_m
• Assume a_1 = 1 and fixed-length strings (wlog)
• Basic values:
  • Choose a prime p in the universe U, such that 2p fits in a few memory words (hence U ≈ 2^64)
  • Set h = d^{m-1} mod p (d is the base; d = 2 for bit strings)
• Fingerprints: f(A) = A mod p
• Nice property: if B = a_2 … a_m a_{m+1} (A shifted by one position), then
  f(B) = [ d (A - a_1 h) + a_{m+1} ] mod p
• Prob[false hit] = Prob[p divides (A - B)] = #div(A - B)/|U| ≈ log(A + B)/|U| = m/|U|
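A minimal rolling-fingerprint sketch of this scheme over bytes rather than bits; the base d = 256 and the chosen prime are illustrative.

# Karp-Rabin rolling fingerprint over a byte string (illustrative parameters)
P = (1 << 61) - 1        # a large prime, so arithmetic stays within a machine word
D = 256                  # base: one symbol per byte

def fingerprints(text: bytes, m: int):
    """Yield (position, fingerprint) of every length-m substring of text."""
    h = pow(D, m - 1, P)                 # h = d^(m-1) mod p
    f = 0
    for c in text[:m]:
        f = (f * D + c) % P              # f(A) = A mod p for the first window
    yield 0, f
    for i in range(m, len(text)):
        # Slide the window: f(B) = [d * (f(A) - a_1 * h) + a_{m+1}] mod p
        f = (D * (f - text[i - m] * h) + text[i]) % P
        yield i - m + 1, f

for pos, fp in fingerprints(b"abracadabra", 4):
    print(pos, fp)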
Near-Duplicate Detection
• Problem
  • Given a large collection of documents
  • Identify the near-duplicate documents
• Web search engines
  • Proliferation of near-duplicate documents
    • Legitimate: mirrors, local copies, updates, …
    • Malicious: spam, spider traps, dynamic URLs, …
    • Mistaken: spider errors
  • 30% of web pages are near-duplicates [1997]
Desiderata
• Storage: only small sketches of each document
• Computation: the fastest possible
• Stream processing:
  • once the sketch is computed, the source is unavailable
• Error guarantees:
  • at this problem scale, small biases have a large impact
  • need formal guarantees; heuristics will not do
Basic Idea [Broder 1997]
• Shingling
  • dissect each document into q-grams (shingles)
  • represent documents by their shingle sets
  • reduce the problem to set intersection [Jaccard]
  • two documents are near-duplicates if their shingle sets intersect enough
• We know how to cope with "Set Intersection"
  • fingerprints of shingles (for space efficiency)
  • min-hash to estimate intersection sizes (for time and space efficiency)
Documents → (shingling) → multisets of shingles → (fingerprinting) → sets of 64-bit fingerprints
• Fingerprints:
  • Use Karp-Rabin fingerprints over q-gram shingles (of 8q bits)
  • Fingerprint space [0, …, U-1]
  • In practice, use 64-bit fingerprints, i.e., U = 2^64
  • Prob[collision] ≈ (8q)/2^64 << 1
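A minimal shingling sketch; q = 8 and the use of a generic 64-bit hash in place of Karp-Rabin fingerprints are illustrative choices.

import hashlib

def shingle_fingerprints(text: str, q: int = 8) -> set[int]:
    """Map a document to the set of 64-bit fingerprints of its q-gram shingles.
    (Uses a generic 64-bit hash as a stand-in for Karp-Rabin fingerprints.)"""
    fps = set()
    for i in range(len(text) - q + 1):
        shingle = text[i:i + q].encode()
        digest = hashlib.blake2b(shingle, digest_size=8).digest()  # 64 bits
        fps.add(int.from_bytes(digest, "big"))
    return fps

SA = shingle_fingerprints("a rose is a rose is a rose")
SB = shingle_fingerprints("a rose is a rose is a flower")
jaccard = len(SA & SB) / len(SA | SB)
print(round(jaccard, 3))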
Similarity of Documents
Doc A → shingle set SA, Doc B → shingle set SB, over the universe U = [0 … N-1]
• Jaccard measure of similarity: sim(SA, SB) = |SA ∩ SB| / |SA ∪ SB|
• Claim: A & B are near-duplicates if sim(SA, SB) is high
Speeding up: sketch of a document (Sec. 19.6)
• Intersecting the shingle sets directly is too costly
• Create a "sketch vector" (of size ~200) for each document
• Documents that share ≥ t (say 80%) of their corresponding vector elements are deemed near-duplicates
Sketching by Min-Hashing
• Consider SA, SB ⊆ P
• Pick a random permutation π of P (such as π(x) = ax + b mod |P|)
• Define α = π^{-1}( min{π(SA)} ), β = π^{-1}( min{π(SB)} )
  • i.e. the minimal element of each set under the permutation π
• Lemma: Prob[α = β] = sim(SA, SB), the Jaccard similarity of SA and SB
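A minimal min-hash sketch built on the shingle fingerprints above; the 200 linear "permutations" modulo a prime are illustrative.

import random

PRIME = (1 << 61) - 1
NUM_PERMS = 200
random.seed(42)
# One random "permutation" pi(x) = (a*x + b) mod PRIME per sketch component
PERMS = [(random.randrange(1, PRIME), random.randrange(PRIME)) for _ in range(NUM_PERMS)]

def minhash_sketch(fingerprints: set[int]) -> list[int]:
    """Sketch vector: the minimum of pi(x) over the set, for each permutation pi."""
    return [min((a * x + b) % PRIME for x in fingerprints) for (a, b) in PERMS]

def estimated_jaccard(sk1: list[int], sk2: list[int]) -> float:
    # The fraction of components where the two sketches agree estimates sim(SA, SB)
    return sum(x == y for x, y in zip(sk1, sk2)) / len(sk1)

# Usage with the shingle fingerprint sets SA, SB from the previous sketch:
# print(estimated_jaccard(minhash_sketch(SA), minhash_sketch(SB)))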