Finding replicated web collections
Junghoo Cho, Narayanan Shivakumar, Hector Garcia-Molina
Stanford University
Presented by: William Quach
CSCI 572, University of Southern California
Outline
• Replication on the web
• Importance of de-duplication in today’s Internet
• Similarity
• Identifying similar collections
• Growing similar collections
• How is this useful?
• Contributions of the paper
• Pros/Cons of the paper
• Related work
Replication on the web
• Some reasons for duplication
  • Reliability
  • Performance: caching, load balancing
  • Archival
• Anarchy on the web makes duplicating easy but finding duplicates hard.
  • The same page can appear under different URLs: protocol, host, domain, etc. [2] (see the sketch below)
• Many aspects of mirrored sites prevent us from identifying replication by finding exact matches
  • Freshness, coverage, formats, partial crawls
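A minimal sketch of the URL-variation point above, in Python. The `canonicalize` helper is hypothetical (not from the paper) and only collapses a few common variations (protocol, a leading "www.", a default index page, a trailing slash) to show why exact URL matching misses replicas.

```python
from urllib.parse import urlsplit

def canonicalize(url: str) -> str:
    """Hypothetical normalizer: collapses a few common URL variations
    (protocol, 'www.' prefix, default index page, trailing slash) so that
    trivially different URLs of the same page compare equal."""
    parts = urlsplit(url.lower())
    host = parts.netloc[4:] if parts.netloc.startswith("www.") else parts.netloc
    path = parts.path
    for default in ("/index.html", "/index.htm"):
        if path.endswith(default):
            path = path[: -len(default)]
    return host + (path.rstrip("/") or "/")

# All of these variants collapse to the same canonical form.
urls = ["http://www.myhomepage.com",
        "https://myhomepage.com/",
        "http://myhomepage.com/index.html"]
print({canonicalize(u) for u in urls})   # {'myhomepage.com/'}
```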
Importance of de-duplication in today’s Internet
• The Internet grows at an extremely fast pace [1].
• Crawling becomes more and more difficult if done by brute force.
• Intelligent algorithms can achieve similar results in less time, using less memory.
• We need these more intelligent algorithms to fully utilize the ever-growing web of information [1].
Similarity
• Similarity of pages
• Similarity of link structure
• Similarity of collections
Similarity of pages
• Various metrics for determining page similarity based on…
  • Information retrieval
  • Data mining
• Intuition: textual overlap
  • Counting chunks of text that overlap.
  • Requires a threshold based on empirical data
Similarity of pages
• The paper uses the textual overlap metric:
  • Convert the page into text
  • Divide the text into obvious chunks (e.g. sentences)
  • Hash each chunk to determine the chunks’ "fingerprints"
  • Two pages are similar if they share more than some threshold of identical chunks (see the sketch below)
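A minimal sketch of the chunk-hashing idea above, with several assumptions that are not taken from the paper: chunks are approximated by splitting on sentence-ending punctuation, fingerprints are MD5 hashes of the chunks, and the 0.5 overlap threshold is a placeholder for a value that would be tuned empirically.

```python
import hashlib
import re

def fingerprints(text: str) -> set[str]:
    """Split a page's text into sentence-like chunks and hash each chunk."""
    chunks = [c.strip().lower() for c in re.split(r"[.!?]\s+", text) if c.strip()]
    return {hashlib.md5(c.encode("utf-8")).hexdigest() for c in chunks}

def pages_similar(text_a: str, text_b: str, threshold: float = 0.5) -> bool:
    """Two pages are 'similar' if the fraction of shared chunk fingerprints
    reaches the (empirically tuned) threshold."""
    fa, fb = fingerprints(text_a), fingerprints(text_b)
    if not fa or not fb:
        return False
    overlap = len(fa & fb) / min(len(fa), len(fb))
    return overlap >= threshold

a = "Stanford maintains a web crawler. It stores pages in a repository."
b = "Stanford maintains a web crawler. The repository is refreshed weekly."
print(pages_similar(a, b))   # True: one of the two chunks is shared
```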
Similarity of link structure
• Each page must have at least one matching incoming link, unless it has no incoming links at all
  • For each page p in C1, let P1(p) be the set of pages in C1 that link to p
  • For the corresponding similar page p’ in C2, let P2(p’) be the set of pages in C2 that link to p’
  • Then there must be pages p1 ∈ P1(p) and p2 ∈ P2(p’) such that p2 is the page corresponding to p1, unless P1(p) and P2(p’) are both empty (see the sketch below)
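A minimal sketch of the incoming-link condition, assuming each collection is represented as a dict from page id to the set of pages that link to it within the collection, and that `mapping` (the one-to-one pairing of similar pages between C1 and C2) is already given; none of these names come from the paper.

```python
def links_similar(in_links_1: dict, in_links_2: dict, mapping: dict) -> bool:
    """Check the 'at least one matching incoming link' condition.

    in_links_1[p]  -- pages of C1 that link to page p (within C1)
    in_links_2[q]  -- pages of C2 that link to page q (within C2)
    mapping[p]     -- the similar page of C2 that p is paired with
    """
    for p, p_prime in mapping.items():
        sources_1 = in_links_1.get(p, set())
        sources_2 = in_links_2.get(p_prime, set())
        if not sources_1 and not sources_2:
            continue                     # no incoming links on either side: OK
        # Need some p1 -> p in C1 whose counterpart mapping[p1] -> p' in C2.
        if not any(mapping.get(p1) in sources_2 for p1 in sources_1):
            return False
    return True

# Two tiny 2-page "collections" with identical link structure.
c1 = {"a": set(), "b": {"a"}}            # a -> b
c2 = {"x": set(), "y": {"x"}}            # x -> y
print(links_similar(c1, c2, {"a": "x", "b": "y"}))   # True
```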
Similarity of collections
• Collections are similar if they have similar pages and a similar link structure (a combined check is sketched below)
• To control complexity, the method in the paper only considers:
  • Equi-sized collections
  • One-to-one mappings of similar pages
• Terminology:
  • Collection: a group of linked pages (e.g. a website)
  • Cluster: a group of collections
  • Similar cluster: a group of similar collections
• Computing the optimal set of similar clusters is too expensive
  • Instead, start with trivial clusters and "grow" them
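Combining the page-level and link-level checks gives a collection-level test like the sketch below. It reuses the `pages_similar` and `links_similar` helpers sketched earlier and assumes the one-to-one `mapping` is already known; in the paper the mapping emerges from growing trivial clusters rather than being supplied up front.

```python
def collections_similar(pages_1: dict, pages_2: dict,
                        in_links_1: dict, in_links_2: dict,
                        mapping: dict) -> bool:
    """Two equi-sized collections are similar when every page is similar to
    its mapped counterpart and the link-structure condition holds.

    pages_1[p] / pages_2[q] -- page text keyed by page id
    """
    if len(pages_1) != len(pages_2) or len(mapping) != len(pages_1):
        return False                      # only equi-sized, fully mapped collections
    if not all(pages_similar(pages_1[p], pages_2[q]) for p, q in mapping.items()):
        return False
    return links_similar(in_links_1, in_links_2, mapping)
```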
Growing Clusters
• Trivial clusters: similar clusters whose collections are single pages, i.e. essentially clusters of similar pages.
• Two clusters are merged if merging them yields a similar cluster with larger collections.
• Continue until no merge can produce a similar cluster (a sketch of this greedy growth follows).
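A minimal sketch of the greedy growing step, under simplifying assumptions that are not the paper's exact formulation: a cluster is a list of equi-sized collections (each collection a tuple of page ids), the two clusters being merged are assumed to hold the same number of collections and are paired up in order, and `forms_similar_cluster` stands in for the page- and link-similarity test that decides whether a candidate merge is still a similar cluster (the toy predicate below only checks that the merged collections stay equi-sized).

```python
from itertools import combinations

def grow_clusters(trivial_clusters, forms_similar_cluster):
    """Greedily merge clusters while some merge still yields a similar cluster.

    trivial_clusters       -- list of clusters; each cluster is a list of
                              collections, each collection a tuple of page ids
    forms_similar_cluster  -- predicate deciding whether a candidate cluster
                              (with concatenated collections) is still similar
    """
    clusters = [list(c) for c in trivial_clusters]
    merged = True
    while merged:
        merged = False
        for i, j in combinations(range(len(clusters)), 2):
            # Candidate: pair up the collections of the two clusters and concatenate.
            candidate = [a + b for a, b in zip(clusters[i], clusters[j])]
            if forms_similar_cluster(candidate):
                clusters[i] = candidate
                del clusters[j]
                merged = True
                break
    return clusters

# Toy run: two trivial clusters of "similar" single pages; the toy predicate
# accepts any merge whose collections remain equi-sized.
trivial = [[("a",), ("x",)], [("b",), ("y",)]]
print(grow_clusters(trivial, lambda cand: len({len(c) for c in cand}) == 1))
# [[('a', 'b'), ('x', 'y')]]
```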
How is this useful?
• Improving crawling
  • If the crawler knows which collections are similar, it can avoid crawling the same information again. Experimental results in the paper show a 48% drop in the number of similar pages crawled.
• Improving querying
  • Filter search results to "roll up" similar pages so that more distinct pages are visible to the user on the first results page (see the sketch below).
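A minimal sketch of the "roll-up" idea for query results, assuming a hypothetical `page_to_cluster` map produced offline by the replica-detection step; only the highest-ranked page from each similar cluster is kept.

```python
def roll_up(ranked_results, page_to_cluster):
    """Keep the first (highest-ranked) result from each similar cluster,
    so near-identical replicas do not crowd the result page."""
    seen_clusters = set()
    distinct = []
    for url in ranked_results:
        cluster = page_to_cluster.get(url, url)   # unclustered pages stand alone
        if cluster not in seen_clusters:
            seen_clusters.add(cluster)
            distinct.append(url)
    return distinct

results = ["mirror1.example.com/faq", "mirror2.example.com/faq", "example.org/other"]
clusters = {"mirror1.example.com/faq": 7, "mirror2.example.com/faq": 7}
print(roll_up(results, clusters))   # ['mirror1.example.com/faq', 'example.org/other']
```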
Contributions
• Clearly defined the problem and provided a basic solution.
  • Helps people understand the problem.
• Proposed a new algorithm to identify similar collections.
• Provided experimental results on the benefits of identifying similar collections for improving crawling and querying.
  • Demonstrates that this is a worthwhile problem to solve.
• Clearly stated the trade-offs and assumptions of the algorithm, setting the stage for future work.
Pros
• Thoroughly defined the problem.
• Presented a concise and effective algorithm to address it.
• Clearly stated the trade-offs made, so the algorithm can be improved in future work.
• The simplifications are mainly there to control complexity and keep the solution comprehensible.
  • The de-simplification of the algorithm is left to future work.
Cons
• Similar collections must be equi-sized.
• Similar collections must have one-to-one mappings of all pages.
  • High probability of break points: collections can become highly fragmented (chunked).
• The thresholding required to determine page similarity may be a very tedious task.
Related Work
• "Detecting Near-Duplicates for Web Crawling" (2007) [5]
  • Takes a lower-level, in-depth approach to determining page similarity.
  • Hashing algorithms
  • A good supplement to this paper
• "Do Not Crawl in the DUST: Different URLs with Similar Text" (2009) [6]
  • Takes a different approach: identifies URLs that point to the same or similar content.
    • e.g. www.myhomepage.com and www.myhomepage.com/index.html
  • Does not look at page content
  • Focuses on the "low-hanging fruit"
Questions?
References
• [1] C. Mattmann. Characterizing the Web. CSCI 572 course lecture at USC, May 20, 2010.
• [2] C. Mattmann. Deduplication. CSCI 572 course lecture at USC, June 1, 2010.
• [3] M. Perkowitz and O. Etzioni. Adaptive web sites: Automatically synthesizing web pages. In Fifteenth National Conference on Artificial Intelligence, 1998.
• [4] G. Salton. Introduction to Modern Information Retrieval. McGraw-Hill, New York, 1983.
• [5] G. S. Manku, A. Jain, and A. Das Sarma. Detecting near-duplicates for web crawling. In Proceedings of the 16th International Conference on World Wide Web (WWW '07), Banff, Alberta, Canada, May 8-12, 2007. ACM, New York, NY, 141-150.
• [6] Z. Bar-Yossef, I. Keidar, and U. Schonfeld. Do not crawl in the DUST: Different URLs with similar text. ACM Transactions on the Web 3(1), Jan. 2009, 1-31.