Finding Replicated Web Collections Junghoo Cho Narayanan Shivakumar Hector Garcia-Molina
Statistics (Preview) More than 48% of pages have copies!
Reasons for replication Actual replication • Simple copying or Mirroring Apparent replication • Aliases (multiple site names) • Symbolic links • Multiple mount points
Challenges • Subgraph isomorphism: NP-complete • Hundreds of millions of pages • Slight differences between copies
Outline • Definitions • Web graph, collection • Identical collection • Similar collection • Algorithm • Applications • Results
Web graph • Node: web page • Edge: link between pages • Node label: page content (excluding links)
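As a concrete (hypothetical) illustration of this graph model, here is a minimal Python sketch; the Page class and its field names are made up for exposition and are not from the paper.

```python
from dataclasses import dataclass, field

@dataclass
class Page:
    url: str                                      # node identifier
    content: str                                  # node label: page text with links excluded
    out_links: set = field(default_factory=set)   # edges: URLs this page links to

# A tiny web graph: a mapping from URL to Page.
web_graph = {
    "http://site-a.example/index.html": Page(
        "http://site-a.example/index.html", "welcome text",
        {"http://site-a.example/about.html"}),
    "http://site-a.example/about.html": Page(
        "http://site-a.example/about.html", "about text"),
}
```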
Identical web collection • Collection: induced subgraph of the web graph • Identical collections: related by a one-to-one page mapping that preserves page content and links (equi-size)
Collection similarity • Coincides with intuitively similar collections • Computable similarity measure
Collection similarity • Page content
Page content similarity • Fingerprint-based approach (chunking) • Shingles [Broder et al., 1997] • Sentence [Brin et al., 1995] • Word [Shivakumar et al., 1995] • Many interesting issues • Threshold value • Iceberg query
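A hedged sketch of the shingle-style chunking idea [Broder et al., 1997] in Python; the chunk size, hash, and threshold below are illustrative assumptions, not values from the talk.

```python
import hashlib

def shingles(text, k=4):
    """Hashed k-word chunks (shingles) of a page's text."""
    words = text.split()
    chunks = (" ".join(words[i:i + k]) for i in range(max(1, len(words) - k + 1)))
    return {hashlib.md5(c.encode()).hexdigest()[:16] for c in chunks}

def resemblance(a, b):
    """Fraction of shared shingles (Jaccard coefficient) of two pages."""
    sa, sb = shingles(a), shingles(b)
    return len(sa & sb) / len(sa | sb) if (sa | sb) else 0.0

# Pages are treated as copies if resemblance exceeds a chosen threshold;
# picking this threshold is one of the "interesting issues" above.
SIMILARITY_THRESHOLD = 0.8   # assumed value, not from the talk
```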
Collection similarity • Link structure
Collection similarity • Size
Collection similarity • Size vs. Cardinality
Essential property (diagram: pages a, a, a in collection Ra linking to pages b, b, b in collection Rb) • Ls: # of pages linked from • Ld: # of pages linked to • Replicated collections satisfy |Ra| = Ls = Ld = |Rb|
Algorithm • Based on the property we identified • Input: set of pages collected from web • Output: set of similar collections • Complexity: O(n log n)
Algorithm • Step 1: Similar page identification (iceberg query) • Input: web pages, Output: table of (Rid, Pid) pairs mapping each replica group to its pages, e.g.:
Rid  Pid
1    10375
1    38950
1    14545
2    1026
2    18633
25 million pages • Fingerprint computation: 44 hours • Replicated page computation: 10 hours
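Step 1 is sketched below under assumed data shapes: an inverted index over page fingerprints yields candidate replica pairs, in the spirit of the iceberg query mentioned above; grouping those pairs (e.g. by transitive closure) then gives the (Rid, Pid) table. The function name and the `min_shared` cut-off are illustrative, not from the paper.

```python
from collections import defaultdict
from itertools import combinations

def candidate_copy_pairs(page_fingerprints, min_shared=2):
    """page_fingerprints: {pid: set of fingerprint chunks}.
    Returns page pairs that share at least `min_shared` chunks (candidate replicas)."""
    # Inverted index: fingerprint -> pages containing it.
    by_fp = defaultdict(set)
    for pid, fps in page_fingerprints.items():
        for fp in fps:
            by_fp[fp].add(pid)

    # Iceberg-style pass: only fingerprints occurring in more than one page matter.
    shared = defaultdict(int)
    for pids in by_fp.values():
        if len(pids) < 2:
            continue
        for a, b in combinations(sorted(pids), 2):
            shared[(a, b)] += 1

    return [pair for pair, n in shared.items() if n >= min_shared]
```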
Algorithm • Step 2: Link structure check • Join the (Rid, Pid) table with itself (as R1 and R2, a copy of R1) through the page-level Link table:
Link (source Pid → destination Pid)
1 → 2
1 → 3
2 → 6
2 → 10
Group by (R1.Rid, R2.Rid): |Ra| = |R1|, Ls = Count(R1.Rid), Ld = Count(R2.Rid), |Rb| = |R2|
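The same group-by can be sketched in Python rather than SQL, assuming the (Rid, Pid) mapping and a page-level link list fit in memory; here Ls and Ld are counted as distinct pages, per the definitions on the essential-property slide, and the names are mine.

```python
from collections import defaultdict

def link_structure_stats(rid_of, links, group_size):
    """rid_of: {pid: rid} from Step 1; links: iterable of (src_pid, dst_pid) pairs;
    group_size: {rid: number of pages in that replica group}.
    Yields (|Ra|, Ls, Ld, |Rb|, Ra, Rb) for every pair of groups connected by links."""
    sources = defaultdict(set)       # (Ra, Rb) -> distinct source pages in Ra
    destinations = defaultdict(set)  # (Ra, Rb) -> distinct destination pages in Rb
    for src, dst in links:
        if src in rid_of and dst in rid_of:
            key = (rid_of[src], rid_of[dst])
            sources[key].add(src)
            destinations[key].add(dst)
    for ra, rb in sources:
        yield (group_size[ra], len(sources[(ra, rb)]),
               len(destinations[(ra, rb)]), group_size[rb], ra, rb)
```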
Algorithm • Step 3:
S = {}
For every (|Ra|, Ls, Ld, |Rb|) from Step 2:
    If |Ra| = Ls = Ld = |Rb| then S = S ∪ {<Ra, Rb>}
Union-Find(S)
• Step 2-3: 10 hours
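A minimal Union-Find sketch for Step 3, consuming tuples shaped like the Step 2 sketch above; the function and variable names are mine, not the paper's.

```python
from collections import defaultdict

def merge_into_collections(stats, all_rids):
    """stats: tuples (|Ra|, Ls, Ld, |Rb|, Ra, Rb), e.g. from the Step 2 sketch.
    Returns similar collections as sets of replica-group ids merged by Union-Find."""
    parent = {rid: rid for rid in all_rids}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    for size_a, ls, ld, size_b, ra, rb in stats:
        if size_a == ls == ld == size_b:    # the essential property
            parent[find(ra)] = find(rb)     # union: Ra and Rb join one collection

    grouped = defaultdict(set)
    for rid in all_rids:
        grouped[find(rid)].add(rid)
    return list(grouped.values())
```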
Experiment • 25 widely replicated collections (cardinality: 5-10 copies, size: 50-1000 pages) => Total number of pages : 35,000 + 15,000 random pages • Result: 180 collections • 149 “good” collections • 31 “problem” collections
Applications • Web crawling & archiving • Save network bandwidth • Save disk storage
Application (web crawling) • Before: 48% replicated pages • With our technique: 13% • Pipeline: initial crawl → crawled pages → offline copy detection → replication info → second crawl
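One plausible way the second crawl could use the replication info, sketched under assumed data shapes: keep a single canonical copy of each replicated collection and drop URLs from the other copies. The policy and names here are assumptions, not from the talk.

```python
def second_crawl_frontier(urls, replica_copy_of, canonical_copy_of):
    """urls: URLs discovered in the initial crawl.
    replica_copy_of: {url: id of the replicated-collection copy it belongs to}.
    canonical_copy_of: {copy id: id of the single copy chosen to re-crawl}.
    Keeps URLs that are unreplicated or belong to the chosen canonical copy."""
    return [u for u in urls
            if replica_copy_of.get(u) is None
            or canonical_copy_of[replica_copy_of[u]] == replica_copy_of[u]]
```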
Related work • Collection similarity • Altavista [Bharat et al., 1999] • Page similarity • COPS [Brin et al., 1995]: sentence • SCAM [Shivakumar et al., 1995]: word • Altavista [Broder et al., 1997]: shingle
Summary • Computable similarity measure • Efficient replication-detection algorithm • Application to real-world problems