
Finding Replicated Web Collections



Presentation Transcript


  1. Finding Replicated Web Collections. Junghoo Cho, Narayanan Shivakumar, Hector Garcia-Molina

  2. Replication is common!

  3. Statistics (Preview): More than 48% of pages have copies!

  4. Reasons for replication
     • Actual replication
       • Simple copying or mirroring
     • Apparent replication
       • Aliases (multiple site names)
       • Symbolic links
       • Multiple mount points

  5. Challenges
     • Subgraph isomorphism: NP-complete
     • Hundreds of millions of pages
     • Slight differences between copies

  6. Outline
     • Definitions
       • Web graph, collection
       • Identical collection
       • Similar collection
     • Algorithm
     • Applications
     • Results

  7. Web graph
     • Node: web page
     • Edge: link between pages
     • Node label: page content (excluding links)

  8. Identical web collection
     • Collection: induced subgraph
     • Identical collection: one-to-one (equi-size)
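The definitions on slides 7 and 8 can be made concrete with a small sketch. The class and function names below (`WebGraph`, `induced`, `is_identical`) are hypothetical and not taken from the slides. The check only verifies a candidate one-to-one mapping that is supplied by the caller; finding such a mapping in general is the hard subgraph-isomorphism problem mentioned on slide 5.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, Set, Tuple


@dataclass
class WebGraph:
    # Page id -> page content with links stripped (the node label).
    labels: Dict[str, str] = field(default_factory=dict)
    # Directed links as (source page id, target page id).
    edges: Set[Tuple[str, str]] = field(default_factory=set)

    def induced(self, pages: Iterable[str]) -> "WebGraph":
        """Collection: the subgraph induced by the given set of pages."""
        keep = set(pages)
        return WebGraph(
            labels={p: self.labels[p] for p in keep},
            edges={(s, t) for (s, t) in self.edges if s in keep and t in keep},
        )


def is_identical(a: WebGraph, b: WebGraph, mapping: Dict[str, str]) -> bool:
    """Verify that `mapping` is a one-to-one map from a's pages onto b's
    pages that preserves both page content and link structure."""
    if len(a.labels) != len(b.labels):
        return False  # identical collections must be equi-size
    if set(mapping) != set(a.labels) or set(mapping.values()) != set(b.labels):
        return False  # not a bijection between the two page sets
    if any(a.labels[p] != b.labels[mapping[p]] for p in a.labels):
        return False  # some page content differs
    # Every link in a must map to a link in b, and b has no extra links.
    return {(mapping[s], mapping[t]) for (s, t) in a.edges} == b.edges
```

Note that this sketch demands exact label equality, so it models identical collections only; similar collections, introduced on the following slides, relax that requirement.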

  9. Collection similarity
     • Coincides with intuitively similar collections
     • Computable similarity measure

  10. Collection similarity
      • Page content

  11. Page content similarity
      • Fingerprint-based approach (chunking)
        • Shingles [Broder et al., 1997]
        • Sentence [Brin et al., 1995]
        • Word [Shivakumar et al., 1995]
      • Many interesting issues
        • Threshold value
        • Iceberg query
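As an illustration of the fingerprint-based approach on slide 11, here is a minimal sketch of word-level shingling. The slides do not specify the chunk size, the hash function, or the similarity threshold, so `k=4`, SHA-256, and the set-overlap ratio below are assumptions chosen for illustration, and the function names are hypothetical.

```python
import hashlib


def shingle_fingerprints(text, k=4):
    """Return the set of fingerprints of all k-word shingles in `text`.

    k=4 and SHA-256 are illustrative assumptions, not values from the slides.
    """
    words = text.lower().split()
    if not words:
        return set()
    if len(words) < k:
        chunks = [" ".join(words)]  # a short page is a single chunk
    else:
        chunks = [" ".join(words[i:i + k]) for i in range(len(words) - k + 1)]
    return {hashlib.sha256(c.encode("utf-8")).hexdigest()[:16] for c in chunks}


def resemblance(text_a, text_b, k=4):
    """Fraction of shingle fingerprints shared by the two pages (0.0 to 1.0)."""
    fa = shingle_fingerprints(text_a, k)
    fb = shingle_fingerprints(text_b, k)
    if not fa and not fb:
        return 1.0  # two empty pages are trivially the same
    return len(fa & fb) / len(fa | fb)
```

Two pages would then be treated as copies when `resemblance` exceeds a chosen threshold, which is the "threshold value" issue the slide lists.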

  12. Collection similarity
      • Link structure

  13. Collection similarity
      • Size

  14. Collection similarity
      • Size vs. Cardinality

  15. Growth strategy

  16. Essential property: |Ra| = Ls = Ld = |Rb|
      • Ls: # of pages linked from
      • Ld: # of pages linked to
      [Figure: pages a in Ra and pages b in Rb]

  17. Essential property: |Ra|  Ls = Ld  |Rb|
      • Ls: # of pages linked from
      • Ld: # of pages linked to
      [Figure: pages a in Ra and pages b in Rb]

  18. Algorithm
      • Based on the property we identified
      • Input: set of pages collected from web
      • Output: set of similar collections
      • Complexity: O(n log n)

  19. Algorithm
      • Step 1: Similar page identification (iceberg query)
        • 25 million pages
        • Fingerprint computation: 44 hours
        • Replicated page computation: 10 hours
      [Figure: web pages, Step 1, resulting table]
        Rid  Pid
        1    10375
        1    38950
        1    14545
        2    1026
        2    18633
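Step 1 can be read as an iceberg query: aggregate over a very large input and keep only the few groups whose count passes a threshold. The sketch below is a naive in-memory version under stated assumptions: the input format (page id mapped to a set of chunk fingerprints), the function name `similar_page_pairs`, and the threshold semantics are hypothetical, and a run over 25 million pages would need techniques that avoid materializing every pair count.

```python
from collections import Counter, defaultdict
from itertools import combinations


def similar_page_pairs(page_fingerprints, threshold):
    """Return {(page_p, page_q): shared_count} for page pairs that share
    at least `threshold` fingerprints (the HAVING COUNT >= T of the query)."""
    # Inverted index: fingerprint -> pages that contain it.
    pages_by_fp = defaultdict(set)
    for pid, fps in page_fingerprints.items():
        for fp in fps:
            pages_by_fp[fp].add(pid)

    # Count shared fingerprints for every pair of pages that co-occur.
    shared = Counter()
    for pids in pages_by_fp.values():
        for p, q in combinations(sorted(pids), 2):
            shared[(p, q)] += 1

    # Keep only the "tip of the iceberg": pairs above the threshold.
    return {pair: n for pair, n in shared.items() if n >= threshold}
```

Grouping the surviving pairs into clusters would then yield a table like the (Rid, Pid) one on the slide, with one Rid per group of mutually similar pages.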

  20. Algorithm
      • Step 2: link structure check
      R1 (and R2, a copy of R1):
        Rid  Pid
        1    10375
        1    38950
        1    14545
        2    1026
      Link:
        Pid  Pid
        1    2
        1    3
        2    6
        2    10
      Group by (R1.Rid, R2.Rid):
        Ra = |R1|, Ls = Count(R1.Rid), Ld = Count(R2.Rid), Rb = |R2|
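The group-by in Step 2 can be sketched in plain Python as follows. This is an interpretation rather than the authors' implementation: it assumes each page belongs to exactly one cluster, and it reads Ls and Ld as counts of distinct linking and linked-to pages per cluster pair. The function name and the tuple layout are hypothetical.

```python
from collections import defaultdict


def link_structure_stats(cluster_rows, links):
    """cluster_rows: iterable of (rid, pid); links: iterable of (src_pid, dst_pid).

    Returns {(rid_a, rid_b): (|Ra|, Ls, Ld, |Rb|)} for every cluster pair
    connected by at least one link.
    """
    cluster_rows = list(cluster_rows)
    rid_of = {pid: rid for rid, pid in cluster_rows}  # assumes one cluster per page

    size = defaultdict(int)  # |R| for each cluster
    for rid, _pid in cluster_rows:
        size[rid] += 1

    sources = defaultdict(set)  # distinct pages in Ra that link into Rb
    targets = defaultdict(set)  # distinct pages in Rb that are linked from Ra
    for src, dst in links:
        if src in rid_of and dst in rid_of:
            key = (rid_of[src], rid_of[dst])
            sources[key].add(src)
            targets[key].add(dst)

    return {
        (ra, rb): (size[ra], len(sources[(ra, rb)]), len(targets[(ra, rb)]), size[rb])
        for (ra, rb) in sources
    }
```

For example, with clusters `[(1, "a1"), (1, "a2"), (2, "b1"), (2, "b2")]` and links `[("a1", "b1"), ("a2", "b2")]`, the pair `(1, 2)` gets the tuple `(2, 2, 2, 2)`, which satisfies the essential property.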

  21. Algorithm
      • Step 3:
          S = {}
          For every (|Ra|, Ls, Ld, |Rb|) in step 2
              If (|Ra| = Ls = Ld = |Rb|)
                  S = S ∪ {<Ra, Rb>}
          Union-Find(S)
      • Step 2-3: 10 hours
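Step 3 can be sketched with a small union-find (disjoint-set) structure. The input format below, a dict from cluster pair to the tuple (|Ra|, Ls, Ld, |Rb|), is an assumption carried over for illustration, and `merge_cluster_pairs` is a hypothetical name. The sketch keeps only pairs that satisfy the equality test and merges them transitively.

```python
def merge_cluster_pairs(stats):
    """stats: {(ra, rb): (size_a, ls, ld, size_b)}.

    Returns a sorted list of groups of cluster ids, where clusters are
    grouped together if they are connected through accepted pairs.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    def union(x, y):
        root_x, root_y = find(x), find(y)
        if root_x != root_y:
            parent[root_y] = root_x

    # S = set of pairs with |Ra| = Ls = Ld = |Rb|; union them as we go.
    for (ra, rb), (size_a, ls, ld, size_b) in stats.items():
        if size_a == ls == ld == size_b:
            union(ra, rb)

    groups = {}
    for x in list(parent):
        groups.setdefault(find(x), set()).add(x)
    return sorted(sorted(g) for g in groups.values())
```

Clusters that never appear in an accepted pair are left out of the result, so a pair such as `(4, 5)` with statistics `(3, 2, 2, 3)` contributes nothing.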

  22. Experiment
      • 25 widely replicated collections (cardinality: 5-10 copies, size: 50-1000 pages)
        => Total number of pages: 35,000 + 15,000 random pages
      • Result: 180 collections
        • 149 “good” collections
        • 31 “problem” collections

  23. Results

  24. Applications
      • Web crawling & archiving
        • Save network bandwidth
        • Save disk storage

  25. Application (web crawling)
      • Before experiment: 48%
      • With our technique: 13%
      [Figure labels: crawled pages, replication info, initial crawl, offline copy detection, second crawl]

  26. Applications (web search)

  27. Related work
      • Collection similarity
        • Altavista [Bharat et al., 1999]
      • Page similarity
        • COPS [Brin et al., 1995]: sentence
        • SCAM [Shivakumar et al., 1995]: word
        • Altavista [Broder et al., 1997]: shingle

  28. Summary
      • Computable similarity measure
      • Efficient replication-detection algorithm
      • Application to real-world problems
