Finding Replicated Web Collections (Junghoo Cho, Narayanan Shivakumar, Hector Garcia-Molina)
A Comparison of Techniques to Find Mirrored Hosts on the WWW (Krishna Bharat, Andrei Broder, Jeffrey Dean, Monika Henzinger)
What are they talking about? Identifying replicated content
• Cho et al., a bottom-up approach
  • Using content-based analysis
  • Computing similarity measures
  • Improved crawling
  • Reducing clutter from search engine results
• Bharat et al., a top-down approach
  • Using page attributes: URL, IP address, connectivity
Pros and cons – Top down
• Needs only the URLs of pages, not the pages themselves
• Mirrors can be discovered even when very few of their duplicate pages are simultaneously present in the collection
Pros and cons – Bottom up
• Might discover mirrors even under renaming of paths
• Can find collections that are too small for the top-down approach
• Pages that change between different crawling intervals might create problems
Finding Replicated Web Collections (Junghoo Cho, Narayanan Shivakumar, Hector Garcia-Molina)
Why is identifying replicated content important?
• The crawler's task becomes easier
• Improved search engine results and ranking
• Improved archiving
Why is replicated content identification difficult?
• Update frequency (figure: www.original.com and its mirrors dup-1.com, dup-2.com may be updated at different times)
• Mirror partial coverage (figure: dup-1.com and dup-2.com each cover only part of www.original.com)
• Different formats (figure: www.original.com, dup-1.com, dup-2.com serve the same content in different formats)
• Partial crawls (figure: only parts of www.original.com and duplicate.com are crawled)
Similarity of Collections – Collection-Induced Subgraph
• (Figure: an example collection, collection size = 4)
• Assumption: the locations of the hyperlinks within the pages are immaterial
Similarity of Collections – Identical Collection (figure: dup-1.com as an identical copy of www.original.com)
Similarity of Collections – Similar Collection
• Human view: pages that are close copies of each other
• Need automatic identification over large numbers of web pages
• Option: textual overlap
Similarity of Collections – Similar Collection (textual overlap via fingerprints)
• Each chunk of a page's text is hashed to a 32-bit fingerprint (figure: example fingerprints)
• The fingerprint sets of two texts are compared (figure)
• If X out of Y fingerprints match and X > T (a threshold), the two pages are considered similar
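A minimal sketch of this fingerprint test, assuming line-based chunks and CRC32 as the 32-bit hash; the function names, chunking choice, and hash are illustrative, not the paper's exact scheme.

```python
import zlib

def fingerprints(text, chunk_lines=4):
    """Hash every chunk of `chunk_lines` lines of text to a 32-bit fingerprint.
    Illustrative only: the exact chunking and hash function are assumptions."""
    lines = text.splitlines() or [""]
    chunks = ["\n".join(lines[i:i + chunk_lines])
              for i in range(0, len(lines), chunk_lines)]
    return {zlib.crc32(chunk.encode("utf-8")) for chunk in chunks}

def pages_similar(text_a, text_b, chunk_lines=4, threshold=15):
    """Two pages count as similar if more than `threshold` of their 32-bit
    fingerprints match (X out of Y matches, X > T)."""
    matches = len(fingerprints(text_a, chunk_lines) &
                  fingerprints(text_b, chunk_lines))
    return matches > threshold
```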
Similarity of Collections – Transitive Similarity (figure: if P is similar to P′ and P′ is similar to P′′, then P and P′′ are also treated as similar)
Similarity of link structure • One-to-one page mapping • Equal collection sizes
Similarity of link structure • Link Similarity • Break Points
Clusters
• Cluster = a set of equi-sized collections
• Cluster cardinality = the number of collections in the cluster
• Identical cluster: Ci ≡ Cj for all i, j
• Similar cluster: Ci ≈ Cj for all i, j (pairwise similarity)
Computing similar clusters (figures: an example with cluster cardinality = 2 and collection size = 5, and an example with cluster cardinality = 3 and collection size = 3)
Cluster growing algorithm
• Identify trivial clusters
Cluster growing algorithm – Growth strategy
• (Figure: collections Ri and Rj with si,j = 3, di,j = 3, |Ri| = 3, |Rj| = 3)
• Merge Ri and Rj when si,j = di,j = |Ri| = |Rj|
Quality of similarity measure
• Sample
  • Select 25 replicated collections as targets
  • 5-10 mirrors from each target
  • 35,000 pages from the targets + 15,000 random pages
• Results
  • 180 non-trivial collections
  • 149 collections grouped into the 25 clusters
  • 180 – 149 = 31 problem collections, due to partial mirrors
Quality of similarity measure – Partial mirrors (figure: an example of a partial mirror)
Quality of similarity measure – Extended clusters
• Changed growth strategy: si,j = |Ri| ≥ di,j = |Rj|
• Change of results: 23 more clusters identified
• Only 8 problem collections remain
• Success rate of 172 out of 180
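A minimal sketch of the two growth strategies from this and the previous growth-strategy slide. The way s and d are counted here is an illustrative reading of the slides (the paper's exact definitions of si,j and di,j may differ), and `similar` can be any page-level test such as the fingerprint sketch above.

```python
def merge_collections(Ri, Rj, similar, extended=False):
    """Decide whether collections Ri and Rj belong in the same cluster.

    Ri, Rj   : lists of page texts
    similar  : page-level similarity test (e.g. pages_similar from the sketch above)
    extended : use the extended growth strategy that tolerates partial mirrors

    Illustrative bookkeeping:
      s = number of similar page pairs between Ri and Rj
      d = number of distinct pages of Rj matched by some page of Ri
    """
    pairs = [(i, j) for i, p in enumerate(Ri)
                    for j, q in enumerate(Rj) if similar(p, q)]
    s = len(pairs)
    d = len({j for _, j in pairs})

    if extended:
        # Extended strategy: s = |Ri| >= d = |Rj| (Rj may be a partial mirror)
        return s == len(Ri) >= d == len(Rj)
    # Basic strategy: s = d = |Ri| = |Rj| (a perfect one-to-one match)
    return s == d == len(Ri) == len(Rj)
```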
Improved crawling
• Data set: 25 million web pages from domains within the US
• Chunking strategies, computing fingerprints for:
  • the entire document
  • every four lines of text (threshold = 15)
  • every two lines of text (threshold = 25)
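Using the illustrative fingerprint helpers sketched earlier (assumed names `fingerprints` / `pages_similar`, not the authors' code), the three chunking strategies could be exercised like this:

```python
page_a = "..."  # text of one crawled page (placeholder)
page_b = "..."  # text of another crawled page (placeholder)

# Entire document: one chunk per page, so any match means the pages are
# (near-)identical as a whole.
whole_doc = pages_similar(page_a, page_b, chunk_lines=10**9, threshold=0)

# Every four lines of text, similarity threshold T = 15.
four_lines = pages_similar(page_a, page_b, chunk_lines=4, threshold=15)

# Every two lines of text, similarity threshold T = 25.
two_lines = pages_similar(page_a, page_b, chunk_lines=2, threshold=25)
```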
Improved result presentation
• Problems
  • Multiple pages from the same collection appear in the results
  • Links to several copies of replicated content
• Solution
  • Suppress and group replicated results
  • Add a "Replica" link and a "Collection" link to the results
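A rough sketch of the suppress-and-group idea; `collection_of` (a lookup from URL to replicated-collection id) and the output shape are assumptions for illustration, not the paper's interface.

```python
def group_results(ranked_urls, collection_of):
    """Show one representative hit per replicated collection and suppress the
    rest; the suppressed replicas stay reachable behind a 'Collection' link."""
    representatives = []   # (url, collection id) shown directly in the results
    replicas = {}          # collection id -> suppressed duplicate URLs
    for url in ranked_urls:
        cid = collection_of(url)
        if cid not in replicas:
            replicas[cid] = []
            representatives.append((url, cid))
        else:
            replicas[cid].append(url)
    return representatives, replicas
```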
A Comparison of Techniques to Find Mirrored Hosts on the WWW (Krishna Bharat, Andrei Broder, Jeffrey Dean, Monika R. Henzinger)
Concepts
• Hosts A and B are mirrors if, for every document on A, there is a highly similar document on B with the same path
• Only entire web sites are considered; partial mirrors are ignored
Methodology
• IP address based
  • Identical or similar IP addresses
• URL string based
  • Term vector matching on URL strings
  • Host name matching
  • Full path matching
  • Prefix matching
  • Positional word bigram matching
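As an illustration of one URL-string feature, the sketch below tokenizes a hostname and forms positional word bigrams; the tokenization, feature shape, and the overlap score at the end are assumptions, not necessarily the paper's exact formulation.

```python
import re

def positional_word_bigrams(hostname):
    """Split a hostname into words and emit (position, word, next word) features;
    hosts sharing many such bigrams become candidate mirror pairs."""
    words = [w for w in re.split(r"[.\-_]", hostname.lower()) if w]
    return {(i, words[i], words[i + 1]) for i in range(len(words) - 1)}

# Hypothetical example hosts:
a = positional_word_bigrams("www.research.example.com")
b = positional_word_bigrams("www2.research.example.com")
overlap = len(a & b) / len(a | b)   # simple overlap score over the shared bigrams
```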
Methodology
• URL string and connectivity based
  • URL string based + outlinks
• Host connectivity based
  • Two hosts are mirrors if they link to similar sets of hosts
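A minimal sketch of the connectivity idea, assuming each host is summarized by the set of hosts it links to; the Jaccard measure and the example hosts are illustrative assumptions.

```python
def connectivity_similarity(outlinks_a, outlinks_b):
    """Overlap between the sets of hosts two hosts link to; a high score marks
    the pair as candidate mirrors (illustrative Jaccard measure)."""
    a, b = set(outlinks_a), set(outlinks_b)
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)

# Hypothetical example: two hosts that link to mostly the same set of hosts.
score = connectivity_similarity({"w3.org", "acm.org", "ietf.org"},
                                {"w3.org", "acm.org", "ietf.org", "iso.org"})
# score == 0.75, a strong hint that the two hosts may mirror each other
```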
Terminology
• Precision at rank K = (correct host pairs within the top K) / (total host pairs within the top K)
• Recall at rank K = (correct host pairs within the top K) / (correct host pairs within the top K, combined over all algorithms)
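The two measures can be computed as in the sketch below; the function names and the argument carrying the union of correct pairs found by all algorithms are illustrative assumptions.

```python
def precision_at_k(ranked_pairs, correct_pairs, k):
    """Fraction of the top-K reported host pairs that are correct mirror pairs."""
    top_k = ranked_pairs[:k]
    return sum(pair in correct_pairs for pair in top_k) / k

def recall_at_k(ranked_pairs, correct_pairs, correct_from_all_algos_at_k, k):
    """Correct pairs found within rank K, relative to the correct pairs found
    within K by all algorithms combined (the union)."""
    found = {pair for pair in ranked_pairs[:k] if pair in correct_pairs}
    return len(found) / max(len(correct_from_all_algos_at_k), 1)
```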