Finding Replicated Web Collections (Junghoo Cho, Narayanan Shivakumar, Hector Garcia-Molina)
A Comparison of Techniques to Find Mirrored Hosts on the WWW (Krishna Bharat, Andrei Broder, Jeffrey Dean, Monika Henzinger)
What are they talking about?
Identifying replicated content
• Cho et al., a bottom-up approach
• Uses content-based analysis
• Computes similarity measures
• Improved crawling
• Reduced clutter in search engine results
• Bharat et al., a top-down approach
• Uses page attributes: URL, IP address, connectivity
Pros and cons – Top down
• Needs only the URLs of pages, not the pages themselves
• Mirrors can be discovered even when very few of their duplicate pages are simultaneously present in the collection
Pros and cons – Bottom up
• Might discover mirrors even under renaming of paths
• Can find replicated collections too small for the top-down approach
• Pages that change between crawling intervals might create problems
Finding Replicated Web Collections
Junghoo Cho, Narayanan Shivakumar, Hector Garcia-Molina
Why Is Identifying Replicated Content Important?
• The crawler's task becomes easier
• Improved search engine results
• Ranking
• Improved archiving
Why Is Replicated Content Identification Difficult?
• Update frequency: [Figure: www.original.com and its mirrors dup-1.com and dup-2.com are updated at different, unknown times.]
• Mirror partial coverage: [Figure: dup-1.com and dup-2.com each mirror only part of www.original.com.]
• Different formats: [Figure: dup-1.com and dup-2.com serve the content of www.original.com in different formats.]
• Partial crawls: [Figure: the crawler visits only parts of www.original.com and duplicate.com.]
Similarity of Collections – Collection Induced Subgraph
• Assumption: the location of the hyperlinks within the pages is immaterial
[Figure: a collection of size 4 and the subgraph its pages and inter-page links induce; a minimal sketch follows.]
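A minimal sketch of the induced-subgraph idea in Python (the function and data are illustrative, not from the paper): the subgraph keeps only the links whose source and destination pages both belong to the collection.

```python
def induced_subgraph(collection, links):
    """Keep only links whose endpoints both lie inside the collection."""
    pages = set(collection)
    return [(src, dst) for (src, dst) in links if src in pages and dst in pages]

# Example: the link a -> x leaves the collection, so it is dropped.
links = [("a", "b"), ("a", "x"), ("b", "c")]
print(induced_subgraph({"a", "b", "c"}, links))  # [('a', 'b'), ('b', 'c')]
```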
Similarity of Collections – Identical Collection
[Figure: www.original.com and dup-1.com contain identical pages with identical induced link structure.]
Similarity of Collections – Similar Collection
• "Similar" means close copies of each other in the human view
• Identification must be automatic, over large numbers of web pages
• One option: textual overlap
Similarity of Collections – Similar Collection
• Each text chunk c is hashed to a 32-bit fingerprint
• Two pages are compared by matching their lists of fingerprints
• If X out of Y fingerprints match and X > T (threshold), the two pages are similar (see the sketch below)
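A hedged sketch of the matching step in Python. CRC32 stands in for the paper's 32-bit fingerprinting function, the chunking (every four lines, T = 15, as in the experiments later) is one of several strategies evaluated, and the function names are mine.

```python
import zlib

def fingerprints(text, chunk_lines=4):
    # Split the page into chunk_lines-line chunks and hash each chunk
    # to a 32-bit value (CRC32 is a stand-in fingerprint function).
    lines = text.splitlines()
    chunks = [" ".join(lines[i:i + chunk_lines])
              for i in range(0, len(lines), chunk_lines)]
    return {zlib.crc32(chunk.encode("utf-8")) for chunk in chunks}

def pages_similar(page_a, page_b, threshold=15):
    # X out of Y fingerprints match; the pages are similar if X > T.
    return len(fingerprints(page_a) & fingerprints(page_b)) > threshold
```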
Similarity of Collections – Transitive Similarity
• Similarity is treated as transitive: if P ≈ P′ and P′ ≈ P″, then P ≈ P″ (a union-find sketch follows)
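The transitive closure of a pairwise similarity relation is exactly what a union-find structure computes; a small illustrative sketch, not the paper's implementation:

```python
class UnionFind:
    def __init__(self):
        self.parent = {}

    def find(self, x):
        self.parent.setdefault(x, x)
        if self.parent[x] != x:
            self.parent[x] = self.find(self.parent[x])  # path compression
        return self.parent[x]

    def union(self, a, b):
        self.parent[self.find(a)] = self.find(b)

uf = UnionFind()
for a, b in [("P", "P1"), ("P1", "P2")]:  # observed pairwise similarities
    uf.union(a, b)
print(uf.find("P") == uf.find("P2"))  # True: P ~ P'' by transitivity
```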
Similarity of Link Structure
• One-to-one correspondence between the pages of equi-sized collections
• Link similarity
• Break points
Clusters
• Cluster = a set of equi-sized collections
• Cluster cardinality = the number of collections in the cluster
• Identical cluster: Ci ≡ Cj for all i, j
• Similar cluster: Ci ≈ Cj for all i, j (pairwise similarity)
Computing Similar Clusters
[Figures: an example cluster of cardinality 2 with collection size 5, and one of cardinality 3 with collection size 3.]
Cluster Growing Algorithm
• First step: identify trivial clusters
Cluster Growing Algorithm – Growth Strategy
• Merge clusters Ri and Rj when si,j = di,j = |Ri| = |Rj|
[Figure: clusters Ri and Rj with si,j = di,j = |Ri| = |Rj| = 3; a sketch of the merge test follows.]
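A hedged sketch of the merge test. The slide gives only the condition si,j = di,j = |Ri| = |Rj|; the interpretation below (si,j counts similarity links between the two clusters, di,j the distinct collections in Rj those links reach) is an assumption, not the paper's exact definition.

```python
def should_merge(Ri, Rj, similarity_links):
    """Ri, Rj: sets of collection ids; similarity_links: (a, b) pairs."""
    between = [(a, b) for (a, b) in similarity_links if a in Ri and b in Rj]
    s = len(between)                    # si,j: links between the clusters
    d = len({b for (_, b) in between})  # di,j: distinct targets reached
    return s == d == len(Ri) == len(Rj)
```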
Quality of Similarity Measure
• Sample
• Select 25 replicated collections as targets
• 5–10 mirrors of each target
• 35,000 pages from the targets + 15,000 random pages
• Results
• 180 non-trivial collections
• 149 collections grouped into 25 clusters
• 180 − 149 = 31 problem collections, due to partial mirrors
Quality of Similarity Measure – Extended Clusters
• Changed growth strategy: merge when si,j = |Ri| ≥ di,j = |Rj|
• Changed results
• 23 more of the problem collections were clustered
• Only 8 problem collections remain
• Success rate of 172 out of 180
Improved Crawling
• Data set: 25 million web pages, from domains within the US
• Chunking strategies, computing fingerprints for:
• the entire document
• every four lines of text (threshold T = 15)
• every two lines of text (threshold T = 25)
Improved Result Presentation
• Problems
• Multiple pages from the same collection appear in the results
• Links point to several replicas of the same content
• Solution
• Suppress and group results
• Add a "Replica" link and a "Collection" link to the results
A Comparison of Techniques to Find Mirrored Hosts on the WWW
Krishna Bharat, Andrei Broder, Jeffrey Dean, Monika R. Henzinger
Concepts
• A and B are mirrors if, for every document in A, there is a highly similar document in B with the same path
• Only entire web sites are considered
• Partial mirrors are ignored
Methodology
• IP address based
• Identical or similar IP addresses
• URL string based
• Term vector matching on URL strings
• Host name matching
• Full path matching
• Prefix matching
• Positional word bigram matching (see the sketch below)
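To illustrate the last feature, a small sketch of positional word bigrams over URL paths (the tokenization details here are assumptions; the paper's exact term weighting is not shown on the slide):

```python
import re

def positional_bigrams(url_path):
    # Break the path into word terms, then emit (position, w1, w2) triples.
    words = [w for w in re.split(r"[^A-Za-z0-9]+", url_path.lower()) if w]
    return {(i, words[i], words[i + 1]) for i in range(len(words) - 1)}

a = positional_bigrams("/docs/java/tutorial/index.html")
b = positional_bigrams("/docs/java/tutorial/intro.html")
print(a & b)  # shared positional bigrams suggest mirrored paths
```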
Methodology (continued)
• URL string and connectivity based
• URL string matching combined with outlinks
• Host connectivity based
• Two hosts are mirrors if they link to similar sets of hosts (see the sketch below)
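A minimal illustration of "link to similar sets of hosts", using plain Jaccard overlap of the outlinked hosts (the paper's actual scoring may weight hosts differently; this is only a stand-in):

```python
def host_overlap(outlinked_hosts_a, outlinked_hosts_b):
    """Jaccard similarity of the host sets two candidate mirrors link to."""
    a, b = set(outlinked_hosts_a), set(outlinked_hosts_b)
    return len(a & b) / len(a | b) if a | b else 0.0

print(host_overlap({"cnn.com", "w3.org", "mit.edu"},
                   {"cnn.com", "w3.org", "acm.org"}))  # 0.5
```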
Terminology
Precision at rank K = (correct host pairs within K) / (total host pairs within K)
Recall at rank K = (correct host pairs within K) / (total correct host pairs found by all algorithms combined), i.e. relative recall
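These two measures are straightforward to compute once each algorithm's ranked list of candidate host pairs and the pooled set of verified mirror pairs are available; a sketch with assumed data shapes:

```python
def precision_at_k(ranked_pairs, correct_pairs, k):
    # Fraction of the top-K reported host pairs that are true mirrors.
    top = ranked_pairs[:k]
    return sum(pair in correct_pairs for pair in top) / len(top)

def relative_recall_at_k(ranked_pairs, correct_from_all_algos, k):
    # Fraction of all correct pairs (pooled across algorithms) that
    # this algorithm finds within its top K.
    found = set(ranked_pairs[:k]) & correct_from_all_algos
    return len(found) / len(correct_from_all_algos)
```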
Conclusions from Results
• IP4 and prefix matching were the best single algorithms
• But they are limited in recall
• Best approach: a combination of all methods
Discussion
• In which situations can link-based and content-based analysis each be used for duplicate detection?
• What methods can improve content-based analysis?
• How can the two methods be merged, and what improvements would result?