Authors

Finding Replicated Web Collections (Junghoo Cho, Narayanan Shivakumar, Hector Garcia-Molina)
A Comparison of Techniques to Find Mirrored Hosts on the WWW (Krishna Bharat, Andrei Broder, Jeffrey Dean, Monika Henzinger)

What are they talking about?
• Identifying replicated content
• Cho et al.: a bottom-up approach
  • Using content-based analysis
  • Computing similarity measures
  • Improved crawling
  • Reducing clutter from search engine results
• Bharat et al.: a top-down approach
  • Using page attributes: URL, IP address, connectivity

Pros and cons – Top down
• Needs only the URLs of pages, not the pages themselves
• Mirrors can be discovered even when very few of their duplicate pages are simultaneously present in the collection

Pros and cons – Bottom up
• Might discover mirrors
  • even under renaming of paths
  • that are too small for the top-down approach
• Pages that change between crawling intervals might create problems

Finding Replicated Web Collections
Junghoo Cho, Narayanan Shivakumar, Hector Garcia-Molina

man printf

Why Is Identifying Replicated Content Important?
• Crawler’s task becomes easier
• Improved search engine results
• Ranking
• Improved archiving

Why Is Replicated Content Identification Difficult? – Update Frequency
[Figure: www.original.com, dup-1.com, dup-2.com]

Why Is Replicated Content Identification Difficult? – Mirror Partial Coverage
[Figure: www.original.com, dup-1.com, dup-2.com]

Why Is Replicated Content Identification Difficult? – Different Formats
[Figure: www.original.com, dup-1.com, dup-2.com]

Why Is Replicated Content Identification Difficult? – Partial Crawls
[Figure: www.original.com, duplicate.com]

Similarity of Collections – Web Graph

Similarity of Collections – Collection

Similarity of Collections – Collection Induced Subgraph
• Collection size = 4
• Assumption: locations of the hyperlinks in the pages are immaterial
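
The collection induced subgraph can be illustrated with a minimal Python sketch. It assumes the web graph is given as a dictionary from each page to the pages it links to; the name `induced_subgraph` and the sample page names are hypothetical. Links are stored as sets, which matches the assumption that the location of a hyperlink inside a page is immaterial.

```python
def induced_subgraph(web_graph, collection):
    """Keep only the pages in `collection` and the links among them."""
    members = set(collection)
    return {
        page: {target for target in web_graph.get(page, ()) if target in members}
        for page in members
    }


# Hypothetical web graph: page -> pages it links to.
web_graph = {
    "a": ["b", "x"],
    "b": ["c"],
    "c": [],
    "d": ["a"],
    "x": ["a"],
}

# A collection of size 4; the link a -> x is dropped because x is outside it.
subgraph = induced_subgraph(web_graph, ["a", "b", "c", "d"])
```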

Similarity of Collections – Identical Collection
[Figure: www.original.com, dup-1.com]

Similarity of Collections – Similar Collection
• Close copies of each other – human view
• Automatic identification over a large number of web pages
• Option: textual overlap

Similarity of Collections – Similar Collection
[Figure: bit-string fingerprints, 32 bits each]

Similarity of Collections – Similar Collection
[Figure: two texts, each shown as a list of bit-string fingerprints]

Similarity of Collections – Similar Collection
• X out of Y fingerprints match
• If X > T (threshold), the two pages are similar
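
A minimal sketch of the threshold test, assuming pages are split into chunks of a fixed number of lines and each chunk is hashed to a 32-bit value. The slides do not name the hash function; `zlib.crc32` is used here only as a stand-in, and the function names are hypothetical.

```python
import zlib


def chunk_fingerprints(text, lines_per_chunk=2):
    """Return the set of 32-bit fingerprints of consecutive line chunks."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    chunks = [
        "\n".join(lines[i:i + lines_per_chunk])
        for i in range(0, len(lines), lines_per_chunk)
    ]
    # crc32 is an assumed stand-in for whatever fingerprint function is used.
    return {zlib.crc32(chunk.encode("utf-8")) for chunk in chunks}


def pages_similar(text_a, text_b, threshold, lines_per_chunk=2):
    """Two pages are similar if more than `threshold` fingerprints match."""
    shared = chunk_fingerprints(text_a, lines_per_chunk) & chunk_fingerprints(
        text_b, lines_per_chunk
    )
    return len(shared) > threshold


page_a = "line 1\nline 2\nline 3\nline 4\nline 5\nline 6"
page_b = "line 1\nline 2\nline 3\nline 4\nchanged 5\nchanged 6"
```

With two-line chunks, `page_a` and `page_b` share two of their three chunks, so they count as similar for T = 1 but not for T = 2.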

Similarity of Collections – Transitive Similarity
• P ≈ P` and P` ≈ P`` ⇒ P ≈ P``
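
One way to apply transitive similarity is to group pages with a union-find structure, so that any chain of pairwise-similar pages ends up in one group. This is only a sketch under that assumption; the slides do not say how the grouping is computed, and `similarity_groups` is a hypothetical name.

```python
def similarity_groups(pages, similar_pairs):
    """Group pages so that similarity is treated as transitive."""
    parent = {page: page for page in pages}

    def find(page):
        # Follow parent pointers to the group representative (with path halving).
        while parent[page] != page:
            parent[page] = parent[parent[page]]
            page = parent[page]
        return page

    for a, b in similar_pairs:
        parent[find(a)] = find(b)

    groups = {}
    for page in pages:
        groups.setdefault(find(page), set()).add(page)
    return list(groups.values())


# P ~ P1 and P1 ~ P2 were measured directly; P ~ P2 follows by transitivity.
groups = similarity_groups(["P", "P1", "P2", "Q"], [("P", "P1"), ("P1", "P2")])
```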

Similarity of link structure
• One-to-one
• Collection sizes

Similarity of link structure
• Link similarity
• Break points

Clusters
• Cluster = equi-sized collections
• Cluster cardinality = number of collections
• Identical cluster: Ci ≡ Cj, ∀ i, j
• Similar cluster: Ci ≈ Cj, ∀ i, j (pairwise similarity)

Computing similar clusters Cluster Cardinality = 2 Collection Size = 5

Computing similar clusters Cluster Cardinality = 3 Collection Size = 3

Cluster growing algorithm
• Identify trivial clusters

Cluster growing algorithm – Growth Strategy
[Figure: Ri and Rj with si,j = 3, di,j = 3, |Ri| = 3, |Rj| = 3]
• Condition: si,j = di,j = |Ri| = |Rj|
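
The growth condition can be sketched in Python. The slide gives only the values, so the definitions below are an assumed reading: si,j is taken to be the number of pages in Ri that link to at least one page in Rj, and di,j the number of pages in Rj that are linked from at least one page in Ri. The function names and page names are hypothetical.

```python
def link_counts(r_i, r_j, links):
    """Return (s_ij, d_ij) for page sets r_i, r_j and (source, target) links.

    Assumed definitions, not taken from the slide:
      s_ij = pages in r_i linking to at least one page in r_j
      d_ij = pages in r_j linked from at least one page in r_i
    """
    crossing = [(src, dst) for src, dst in links if src in r_i and dst in r_j]
    sources = {src for src, _ in crossing}
    targets = {dst for _, dst in crossing}
    return len(sources), len(targets)


def can_merge(r_i, r_j, links):
    """Strict growth condition: s_ij = d_ij = |Ri| = |Rj|."""
    s_ij, d_ij = link_counts(r_i, r_j, links)
    return s_ij == d_ij == len(r_i) == len(r_j)


r_i = {"a1", "a2", "a3"}
r_j = {"b1", "b2", "b3"}
links = {("a1", "b1"), ("a2", "b2"), ("a3", "b3")}
```

Here every page in `r_i` links to a distinct page in `r_j`, so all four quantities equal 3 and the two clusters may be merged.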

Cluster growing algorithm

Quality of similarity measure
• Sample
  • Select 25 replicated collections (targets)
  • 5-10 mirrors from each target
  • 35000 pages from targets + 15000 random pages
• Results
  • 180 non-trivial collections
  • 149 collections -> 25 clusters
  • 180 – 149 = 31 problem collections
  • Due to partial mirrors

Quality of similarity measure – Partial Mirrors

Quality of similarity measure – Extended Clusters
• Change of growth strategy: si,j = |Ri| ≥ di,j = |Rj|
• Change of results
  • 23 more clusters identified
  • Only 8 problem collections
  • Success rate of 172 out of 180

Improved crawling
• Data set
  • 25 million web pages, from domains within the US
• Chunking strategies: fingerprint for
  • the entire document
  • every four lines (Threshold = 15)
  • every two lines of text (Threshold = 25)

Improved crawling

Improved result presentation
• Problems
  • Multiple pages from the same collection
  • Links to several replicated contents
• Solution
  • Suppressing and grouping results
  • “Replica” link and “Collection” link in results

A Comparison of Techniques to Find Mirrored Hosts on the WWW
Krishna Bharat, Andrei Broder, Jeffrey Dean, Monika R. Henzinger

Concepts
• A and B are mirrors if
  • for every document in A
  • there is a highly similar document in B
  • with the same path
• Only entire web sites are considered
• Partial mirrors are ignored

Methodology
• IP address based
  • Identical or similar IP addresses
• URL string based
  • Term vector matching on URL strings
  • Host name matching
  • Full path matching
  • Prefix matching
  • Positional word bigram matching
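
A rough sketch of positional word bigram matching on URL strings. It assumes URLs are lower-cased and split into terms at non-alphanumeric characters, that a positional bigram is the triple (position, term, next term), and that two bigram vectors are compared by unweighted cosine similarity. These choices and the function names are assumptions for illustration; the slides do not specify term weighting.

```python
import re
from collections import Counter
from math import sqrt


def terms(url_string):
    """Split a URL string into lower-case alphanumeric terms."""
    return [t for t in re.split(r"[^a-z0-9]+", url_string.lower()) if t]


def positional_bigrams(url_string):
    """Count (position, term, next_term) triples of a URL string."""
    ts = terms(url_string)
    return Counter((i, ts[i], ts[i + 1]) for i in range(len(ts) - 1))


def cosine(a, b):
    """Unweighted cosine similarity between two term-count vectors."""
    dot = sum(a[key] * b[key] for key in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


# Hypothetical URL strings that differ only in the last path component.
score = cosine(
    positional_bigrams("www.example.com/docs/api"),
    positional_bigrams("www.example.com/docs/faq"),
)
```

Three of the four positional bigrams agree in this example, which gives a score of 0.75.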

Methodology
• URL string and connectivity based
  • URL string based + outlinks
• Host connectivity based
  • Two hosts are mirrors if they link to similar sets of hosts
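
The host connectivity idea can be sketched by comparing the sets of hosts that two candidate mirrors link to. The Jaccard coefficient is used here as an assumed similarity measure, and the host names are hypothetical; the slides do not state which measure is used.

```python
def outlink_overlap(hosts_a, hosts_b):
    """Jaccard overlap between the sets of hosts two hosts link to."""
    a, b = set(hosts_a), set(hosts_b)
    if not a and not b:
        return 0.0  # no outlinks on either side: no evidence of mirroring
    return len(a & b) / len(a | b)


# Hypothetical outlink host sets for two candidate mirrors.
overlap = outlink_overlap(
    ["x.example", "y.example", "z.example"],
    ["x.example", "y.example", "w.example"],
)
```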

Terminology
• Precision at rank K = (correct host pairs within K) / (total host pairs within K)
• Recall at rank K = (correct host pairs within K) / (total host pairs within K, from all algos)
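
A small sketch of the two measures. Precision follows the definition directly. For recall, the denominator is read here as the number of distinct correct host pairs found within rank K by any of the algorithms; that reading is an assumption, as are the function names and the sample data. Host pairs are assumed to be stored in a canonical order so that the same pair compares equal across rankings.

```python
def precision_at_k(ranked_pairs, correct_pairs, k):
    """Fraction of the top-k host pairs that are correct mirrors."""
    top = ranked_pairs[:k]
    if not top:
        return 0.0
    return sum(1 for pair in top if pair in correct_pairs) / len(top)


def recall_at_k(ranked_pairs, all_rankings, correct_pairs, k):
    """Correct pairs in this top-k, relative to correct pairs in any top-k."""
    pooled = set()
    for ranking in all_rankings:
        pooled.update(pair for pair in ranking[:k] if pair in correct_pairs)
    if not pooled:
        return 0.0
    found = {pair for pair in ranked_pairs[:k] if pair in correct_pairs}
    return len(found) / len(pooled)


# Hypothetical rankings from two algorithms and a hypothetical ground truth.
algo_1 = [("a", "b"), ("c", "d"), ("e", "f")]
algo_2 = [("g", "h"), ("a", "b"), ("i", "j")]
correct = {("a", "b"), ("g", "h"), ("e", "f")}
```

At K = 2, `algo_1` has one correct pair in its top two, while the two algorithms together find two distinct correct pairs, so both its precision and its recall are 0.5.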

Results
