Authors

Finding Replicated Web Collections (Junghoo Cho, Narayanan Shivakumar, Hector Garcia-Molina). A Comparison of Techniques to Find Mirrored Hosts on WWW (Krishna Bharat, Andrei Broder, Jeffrey Dean, Monika Henzinger).

Presentation Transcript


  1. Finding Replicated Web Collections (Junghoo Cho, Narayanan Shivakumar, Hector Garcia-Molina); A Comparison of Techniques to Find Mirrored Hosts on WWW (Krishna Bharat, Andrei Broder, Jeffrey Dean, Monika Henzinger)

  2. Authors

  3. Authors

  4. Authors

  5. Authors

  6. What are they talking about? Identifying replicated content
     • Cho et al.: a bottom-up approach
       • Using content-based analysis
       • Computing similarity measures
       • Improved crawling
       • Reducing clutter from search engine results
     • Bharat et al.: a top-down approach
       • Using page attributes: URL, IP address, connectivity

  7. Pros and cons – Top down
     • Needs only the URLs of pages, not the pages themselves
     • Mirrors can be discovered even when very few of their duplicate pages are simultaneously present in the collection

  8. Pros and cons – Bottom up
     • Might discover mirrors
       • even under renaming of paths
       • that are too small for the top-down approach
     • Pages that change between crawling intervals might create problems

  9. Finding replicated web collections (Junghoo Cho, Narayanan Shivakumar, Hector Garcia-Molina)

  10. man printf

  11. Why is identifying replicated content important?
     • The crawler's task becomes easier
     • Improved search engine results
       • Ranking
     • Improved archiving

  12. Why is replicated content identification difficult? Update frequency (figure: www.original.com, dup-1.com, dup-2.com)

  13. Why is replicated content identification difficult? Mirror partial coverage (figure: www.original.com, dup-1.com, dup-2.com)

  14. Why is replicated content identification difficult? Different formats (figure: www.original.com, dup-1.com, dup-2.com)

  15. Why is replicated content identification difficult? Partial crawls (figure: www.original.com, duplicate.com)

  16. Similarity of Collections – Web Graph

  17. Similarity of Collections – Web Graph

  18. Similarity of Collections – Collection

  19. Similarity of Collections – Collection Induced Subgraph
     • Collection size = 4
     • Assumption: locations of the hyperlinks in the pages are immaterial
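
A minimal sketch of the induced-subgraph idea on this slide, assuming the web graph is given as a list of (source, target) hyperlink pairs. The function name `induced_subgraph` and the sample page names are hypothetical. Storing edges in a set reflects the stated assumption that link locations within a page are immaterial, so repeated links collapse into one edge.

```python
def induced_subgraph(edges, collection):
    """Keep only the links whose source and target both lie in the collection."""
    nodes = set(collection)
    # A set of pairs ignores where (and how often) a link appears on a page.
    return {(src, dst) for src, dst in edges if src in nodes and dst in nodes}


# Hypothetical web graph; page "e" is outside the collection.
web_edges = [("a", "b"), ("b", "c"), ("c", "d"), ("d", "e"), ("a", "c"), ("a", "c")]
collection = {"a", "b", "c", "d"}  # collection size = 4

sub = induced_subgraph(web_edges, collection)
print(sorted(sub))
```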

  20. Similarity of Collections – Identical Collection (figure: www.original.com, dup-1.com)

  21. Similarity of Collections – Similar Collection
     • Close copies of each other (human view)
     • Automatic identification is needed over a large number of web pages
     • Option: textual overlap

  22. Similarity of Collections – Similar Collection (figure: bit strings labeled "32 bits")

  23. Similarity of Collections – Similar Collection (figure: two columns of bit-string fingerprints, both labeled "Text 2")

  24. Similarity of Collections – Similar Collection (figure: two identical bit strings)
     • X out of Y fingerprints match
     • If X > T (threshold), the two pages are similar
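
The threshold rule on this slide can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: CRC32 stands in for whatever 32-bit fingerprint function was actually used, the pages arrive already split into chunks, and all function names are hypothetical.

```python
import zlib


def fingerprints(chunks):
    """Map each text chunk to a 32-bit value (CRC32 is an assumed stand-in hash)."""
    return [zlib.crc32(chunk.encode("utf-8")) & 0xFFFFFFFF for chunk in chunks]


def matching_count(fps_a, fps_b):
    """X: how many fingerprints of page A also occur in page B."""
    in_b = set(fps_b)
    return sum(1 for fp in fps_a if fp in in_b)


def pages_similar(chunks_a, chunks_b, threshold):
    """The slide's rule: two pages are similar when X > T."""
    x = matching_count(fingerprints(chunks_a), fingerprints(chunks_b))
    return x > threshold


# Hypothetical pages, each already split into four chunks (Y = 4).
page_a = ["intro text", "install steps", "usage notes", "license"]
page_b = ["intro text", "install steps", "usage notes", "contact"]

x = matching_count(fingerprints(page_a), fingerprints(page_b))
print(x, pages_similar(page_a, page_b, threshold=2))
```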

  25. Similarity of Collections – Transitive Similarity: P ≈ P′ and P′ ≈ P″ ⇒ P ≈ P″
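
If page similarity is treated as transitive, pages fall into groups that can be computed with a union-find structure. The sketch below assumes the pairwise "similar" decisions are already available as a list of pairs. The name `similarity_classes` and the page labels are hypothetical, and union-find is a generic technique chosen for illustration rather than something the slides describe.

```python
def similarity_classes(pages, similar_pairs):
    """Group pages so that similarity is closed under transitivity."""
    parent = {p: p for p in pages}

    def find(p):
        # Follow parent pointers to the representative, halving the path as we go.
        while parent[p] != p:
            parent[p] = parent[parent[p]]
            p = parent[p]
        return p

    for a, b in similar_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    groups = {}
    for p in pages:
        groups.setdefault(find(p), set()).add(p)
    return list(groups.values())


# P ~ P' and P' ~ P'' are given; P ~ P'' follows from the grouping.
classes = similarity_classes(["P", "P'", "P''", "Q"], [("P", "P'"), ("P'", "P''")])
print(classes)
```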

  26. Similarity of link structure • One-to-one • Collection Sizes

  27. Similarity of link structure • Link Similarity • Break Points

  28. Clusters
     • Cluster = a set of equi-sized collections
     • Cluster cardinality = number of collections
     • Identical cluster: Ci ≡ Cj for all i, j
     • Similar cluster: Ci ≅ Cj for all i, j (pairwise similarity)

  29. Computing similar clusters (figure: cluster cardinality = 2, collection size = 5)

  30. Computing similar clusters (figure: cluster cardinality = 3, collection size = 3)

  31. Cluster growing algorithm – Identify trivial clusters

  32. Cluster growing algorithm – Growth strategy
     • Example (figure: Ri, Rj): si,j = 3, di,j = 3, |Ri| = 3, |Rj| = 3
     • Condition: si,j = di,j = |Ri| = |Rj|
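
A hedged sketch of the growth condition. The slide does not define si,j and di,j, so the code assumes that si,j counts the pages in Ri that link to at least one page in Rj, and di,j counts the pages in Rj that are linked from at least one page in Ri. That reading, the function names, and the sample pages are all assumptions made for illustration.

```python
def link_counts(links, r_i, r_j):
    """Return (s_ij, d_ij) under the assumed definitions described above."""
    crossing = [(src, dst) for src, dst in links if src in r_i and dst in r_j]
    s_ij = len({src for src, _ in crossing})  # pages in Ri with a link into Rj
    d_ij = len({dst for _, dst in crossing})  # pages in Rj linked from Ri
    return s_ij, d_ij


def can_merge(links, r_i, r_j):
    """Growth condition from the slide: s_ij = d_ij = |Ri| = |Rj|."""
    s_ij, d_ij = link_counts(links, r_i, r_j)
    return s_ij == d_ij == len(r_i) == len(r_j)


# Hypothetical clusters: three copies of page "a" and three copies of page "b".
r_i = {"site1/a", "site2/a", "site3/a"}
r_j = {"site1/b", "site2/b", "site3/b"}
links = [("site1/a", "site1/b"), ("site2/a", "site2/b"), ("site3/a", "site3/b")]

print(link_counts(links, r_i, r_j), can_merge(links, r_i, r_j))
```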

  33. Cluster growing algorithm

  34. Quality of similarity measure
     • Sample
       • Select 25 replicated collections (targets)
       • 5-10 mirrors from each target
       • 35000 pages from targets + 15000 random pages
     • Results
       • 180 non-trivial collections
       • 149 collections -> 25 clusters
       • 180 – 149 = 31 problem collections
       • Due to partial mirrors

  35. Quality of similarity measure – Partial Mirrors

  36. Quality of similarity measure – Extended Clusters
     • Change of growth strategy: si,j = |Ri| ≥ di,j = |Rj|
     • Change of results
       • 23 more clusters identified
       • Only 8 problem collections remain
       • Success rate of 172 out of 180
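
To see what the relaxed condition admits, the sketch below compares it with the strict equality si,j = di,j = |Ri| = |Rj| on precomputed counts. It reads the chained expression as "si,j equals |Ri|, di,j equals |Rj|, and |Ri| is at least |Rj|", which is an interpretation rather than something the slide spells out. The function names and the sample numbers are hypothetical.

```python
def strict_growth(s_ij, d_ij, size_i, size_j):
    """Strict condition: s_ij = d_ij = |Ri| = |Rj|."""
    return s_ij == d_ij == size_i == size_j


def extended_growth(s_ij, d_ij, size_i, size_j):
    """Relaxed condition, read as: s_ij = |Ri|, d_ij = |Rj|, and |Ri| >= |Rj|."""
    return s_ij == size_i and d_ij == size_j and size_i >= size_j


# Hypothetical partial-mirror case: three source pages point at only two targets.
partial = dict(s_ij=3, d_ij=2, size_i=3, size_j=2)
print(strict_growth(**partial), extended_growth(**partial))
```

In this made-up case the strict rule rejects the merge while the extended rule accepts it, which matches the slide's point that the changed strategy recovers collections affected by partial mirrors.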

  37. Improved crawling
     • Data set: 25 million web pages, from US domains
     • Chunking strategies: one fingerprint for
       • the entire document
       • every four lines (Threshold = 15)
       • every two lines of text (Threshold = 25)
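
The three chunking strategies can be illustrated with a small helper. This is a sketch under assumptions: blank lines are skipped, CRC32 stands in for the unspecified fingerprint function, and the names `chunk_lines` and `chunk_fingerprints` are hypothetical.

```python
import zlib


def chunk_lines(text, lines_per_chunk=None):
    """Split text into chunks of N non-blank lines; None means the whole document."""
    lines = [line for line in text.splitlines() if line.strip()]
    if lines_per_chunk is None:
        return ["\n".join(lines)]
    return [
        "\n".join(lines[i:i + lines_per_chunk])
        for i in range(0, len(lines), lines_per_chunk)
    ]


def chunk_fingerprints(text, lines_per_chunk=None):
    """One 32-bit fingerprint per chunk (CRC32 is an assumed stand-in)."""
    return [
        zlib.crc32(chunk.encode("utf-8")) & 0xFFFFFFFF
        for chunk in chunk_lines(text, lines_per_chunk)
    ]


# Hypothetical eight-line document.
doc = "\n".join(f"line {n}" for n in range(1, 9))

whole = chunk_fingerprints(doc)      # entire document
four = chunk_fingerprints(doc, 4)    # every four lines
two = chunk_fingerprints(doc, 2)     # every two lines
print(len(whole), len(four), len(two))
```

Smaller chunks yield more fingerprints per page, which is consistent with the slide pairing the two-line strategy with a higher threshold than the four-line one.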

  38. Improved crawling

  39. Improved result presentation
     • Problems
       • Multiple pages from the same collection
       • Links to several replicated contents
     • Solution
       • Suppressing and grouping results
       • “Replica” link and “Collection” link in results

  40. A Comparison of Techniques to Find Mirrored Hosts on the WWW (Krishna Bharat, Andrei Broder, Jeffrey Dean, Monika R. Henzinger)

  41. Concepts
     • A and B are mirrors if, for every document in A, there is a highly similar document in B with the same path
     • Only entire web sites are considered; partial mirrors are ignored

  42. Methodology
     • IP address based: identical or similar IP addresses
     • URL string based: term vector matching on URL strings
       • Host name matching
       • Full path matching
       • Prefix matching
       • Positional word bigram matching
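
As one illustration of term vector matching on URL strings, the sketch below compares host names by the cosine similarity of their term counts. It is a simplified assumption: host names are split on non-alphanumeric characters and terms are unweighted, whereas the paper's variants may tokenize and weight terms differently. The function names are hypothetical.

```python
import math
import re
from collections import Counter


def host_terms(host):
    """Split a host name into lowercase alphanumeric terms and count them."""
    return Counter(t for t in re.split(r"[^a-z0-9]+", host.lower()) if t)


def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Hypothetical host names.
sim_close = cosine(host_terms("www.original.com"), host_terms("mirror.original.com"))
sim_far = cosine(host_terms("www.original.com"), host_terms("dup-1.net"))
print(round(sim_close, 3), sim_far)
```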

  43. Methodology
     • URL string and connectivity based: URL string based + outlinks
     • Host connectivity based: two hosts are mirrors if they link to similar sets of hosts
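
The host connectivity idea can be sketched by scoring host pairs on the overlap of the hosts they link to. Jaccard similarity is used here as an assumed, simple overlap measure and is not necessarily the measure used in the paper. The function names, host names, and the 0.5 cutoff are hypothetical.

```python
def jaccard(a, b):
    """Overlap of two sets: |A ∩ B| / |A ∪ B|."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def connectivity_candidates(outlinks, min_similarity):
    """Return host pairs whose out-linked host sets overlap enough, best first."""
    hosts = sorted(outlinks)
    pairs = []
    for i, h1 in enumerate(hosts):
        for h2 in hosts[i + 1:]:
            sim = jaccard(outlinks[h1], outlinks[h2])
            if sim >= min_similarity:
                pairs.append((h1, h2, sim))
    return sorted(pairs, key=lambda pair: -pair[2])


# Hypothetical map from each host to the set of hosts it links to.
outlinks = {
    "www.original.com": {"a.org", "b.org", "c.org"},
    "dup-1.com": {"a.org", "b.org", "c.org"},
    "other.com": {"a.org", "x.org"},
}

candidates = connectivity_candidates(outlinks, min_similarity=0.5)
print(candidates)
```

Comparing all pairs is quadratic in the number of hosts, so a sketch like this only suits small inputs.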

  44. Terminology
     • Precision at rank K = (correct host pairs within K) / (total host pairs within K)
     • Recall at rank K = (correct host pairs within K) / (total host pairs within K, from all algos)
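
A small sketch of the two measures. The recall denominator on the slide is terse, so the code takes it as an explicit argument, `pooled_total`, assumed here to be the number of host pairs pooled from all algorithms. That reading, the function names, and the sample pairs are assumptions.

```python
def correct_within_k(ranked_pairs, correct_pairs, k):
    """Number of correct host pairs among the first K ranked pairs."""
    return sum(1 for pair in ranked_pairs[:k] if pair in correct_pairs)


def precision_at_k(ranked_pairs, correct_pairs, k):
    """Correct pairs within K divided by the total pairs within K."""
    top = ranked_pairs[:k]
    return correct_within_k(ranked_pairs, correct_pairs, k) / len(top) if top else 0.0


def recall_at_k(ranked_pairs, correct_pairs, k, pooled_total):
    """Correct pairs within K divided by the pooled total (assumed denominator)."""
    if pooled_total == 0:
        return 0.0
    return correct_within_k(ranked_pairs, correct_pairs, k) / pooled_total


# Hypothetical ranked output of one algorithm and a hand-checked set of mirrors.
ranked = [("a", "b"), ("c", "d"), ("e", "f"), ("g", "h")]
correct = {("a", "b"), ("e", "f")}

p2 = precision_at_k(ranked, correct, 2)
r2 = recall_at_k(ranked, correct, 2, pooled_total=4)
print(p2, r2)
```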

  45. Results

  46. Results

  47. Results
