Finding replicated web collections
Junghoo Cho, Narayanan Shivakumar, Hector Garcia-Molina
Stanford University
Presented by: William Quach
CSCI 572, University of Southern California
Outline
• Replication on the web
• Importance of de-duplication in today’s Internet
• Similarity
• Identifying similar collections
• Growing similar collections
• How is this useful?
• Contributions of the paper
• Pros/Cons of the paper
• Related work
Replication on the web
• Some reasons for duplication
  • Reliability
  • Performance: caching, load balancing
  • Archival
• Anarchy on the web makes duplicating easy but finding duplicates hard.
  • The same page can appear under different URLs: protocol, host, domain, etc. [2] (see the sketch below)
• Many aspects of mirrored sites prevent us from identifying replication by finding exact matches
  • Freshness, coverage, formats, partial crawls
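A minimal sketch of the URL-variation point above, in Python. The `canonicalize` helper is hypothetical (not from the paper) and only collapses a few common variations (protocol, a leading "www.", a default index page, a trailing slash) to show why exact URL matching misses replicas.

```python
from urllib.parse import urlsplit

def canonicalize(url: str) -> str:
    """Hypothetical normalizer: collapses a few common URL variations
    (protocol, 'www.' prefix, default index page, trailing slash) so that
    trivially different URLs of the same page compare equal."""
    parts = urlsplit(url.lower())
    host = parts.netloc[4:] if parts.netloc.startswith("www.") else parts.netloc
    path = parts.path
    for default in ("/index.html", "/index.htm"):
        if path.endswith(default):
            path = path[: -len(default)]
    return host + (path.rstrip("/") or "/")

# All of these variants collapse to the same canonical form.
urls = ["http://www.myhomepage.com",
        "https://myhomepage.com/",
        "http://myhomepage.com/index.html"]
print({canonicalize(u) for u in urls})   # {'myhomepage.com/'}
```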
Importance of de-duplication in today’s Internet
• The Internet grows at an extremely fast pace [1].
• Crawling becomes more and more difficult if done by brute force.
• Intelligent algorithms can achieve similar results in less time, using less memory.
• We need these more intelligent algorithms to fully utilize the ever-growing web of information [1].
Similarity
• Similarity of pages
• Similarity of link structure
• Similarity of collections
Similarity of pages
• Various metrics for determining page similarity based on…
  • Information retrieval
  • Data mining
• Intuition: textual overlap
  • Counting chunks of text that overlap.
  • Requires a threshold based on empirical data
Similarity of pages
• The paper uses the textual overlap metric:
  • Convert the page into text
  • Divide the text into obvious chunks (e.g. sentences)
  • Hash each chunk to determine the chunks’ "fingerprints"
  • Two pages are similar if they share more than some threshold of identical chunks (see the sketch below)
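A minimal sketch of the chunk-hashing idea above, with several assumptions that are not taken from the paper: chunks are approximated by splitting on sentence-ending punctuation, fingerprints are MD5 hashes of the chunks, and the 0.5 overlap threshold is a placeholder for a value that would be tuned empirically.

```python
import hashlib
import re

def fingerprints(text: str) -> set[str]:
    """Split a page's text into sentence-like chunks and hash each chunk."""
    chunks = [c.strip().lower() for c in re.split(r"[.!?]\s+", text) if c.strip()]
    return {hashlib.md5(c.encode("utf-8")).hexdigest() for c in chunks}

def pages_similar(text_a: str, text_b: str, threshold: float = 0.5) -> bool:
    """Two pages are 'similar' if the fraction of shared chunk fingerprints
    reaches the (empirically tuned) threshold."""
    fa, fb = fingerprints(text_a), fingerprints(text_b)
    if not fa or not fb:
        return False
    overlap = len(fa & fb) / min(len(fa), len(fb))
    return overlap >= threshold

a = "Stanford maintains a web crawler. It stores pages in a repository."
b = "Stanford maintains a web crawler. The repository is refreshed weekly."
print(pages_similar(a, b))   # True: one of the two chunks is shared
```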
Similarity of link structure
• Each page must have at least one matching incoming link, unless it has no incoming links at all
  • For each page p in C1, let P1(p) be the set of pages in C1 that link to p
  • For the corresponding similar page p’ in C2, let P2(p’) be the set of pages in C2 that link to p’
  • Then there must be pages p1 ∈ P1(p) and p2 ∈ P2(p’) such that p2 is the page corresponding to p1, unless P1(p) and P2(p’) are both empty (see the sketch below)
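A minimal sketch of the incoming-link condition, assuming each collection is represented as a dict from page id to the set of pages that link to it within the collection, and that `mapping` (the one-to-one pairing of similar pages between C1 and C2) is already given; none of these names come from the paper.

```python
def links_similar(in_links_1: dict, in_links_2: dict, mapping: dict) -> bool:
    """Check the 'at least one matching incoming link' condition.

    in_links_1[p]  -- pages of C1 that link to page p (within C1)
    in_links_2[q]  -- pages of C2 that link to page q (within C2)
    mapping[p]     -- the similar page of C2 that p is paired with
    """
    for p, p_prime in mapping.items():
        sources_1 = in_links_1.get(p, set())
        sources_2 = in_links_2.get(p_prime, set())
        if not sources_1 and not sources_2:
            continue                     # no incoming links on either side: OK
        # Need some p1 -> p in C1 whose counterpart mapping[p1] -> p' in C2.
        if not any(mapping.get(p1) in sources_2 for p1 in sources_1):
            return False
    return True

# Two tiny 2-page "collections" with identical link structure.
c1 = {"a": set(), "b": {"a"}}            # a -> b
c2 = {"x": set(), "y": {"x"}}            # x -> y
print(links_similar(c1, c2, {"a": "x", "b": "y"}))   # True
```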
Similarity of collections
• Collections are similar if they have similar pages and a similar link structure (a combined check is sketched below)
• To control complexity, the method in the paper only considers:
  • Equi-sized collections
  • One-to-one mappings of similar pages
• Terminology:
  • Collection: a group of linked pages (e.g. a website)
  • Cluster: a group of collections
  • Similar cluster: a group of similar collections
• Computing the optimal set of similar clusters is too expensive
  • Instead, start with trivial clusters and "grow" them
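Combining the page-level and link-level checks gives a collection-level test like the sketch below. It reuses the `pages_similar` and `links_similar` helpers sketched earlier and assumes the one-to-one `mapping` is already known; in the paper the mapping emerges from growing trivial clusters rather than being supplied up front.

```python
def collections_similar(pages_1: dict, pages_2: dict,
                        in_links_1: dict, in_links_2: dict,
                        mapping: dict) -> bool:
    """Two equi-sized collections are similar when every page is similar to
    its mapped counterpart and the link-structure condition holds.

    pages_1[p] / pages_2[q] -- page text keyed by page id
    """
    if len(pages_1) != len(pages_2) or len(mapping) != len(pages_1):
        return False                      # only equi-sized, fully mapped collections
    if not all(pages_similar(pages_1[p], pages_2[q]) for p, q in mapping.items()):
        return False
    return links_similar(in_links_1, in_links_2, mapping)
```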
Growing Clusters
• Trivial clusters: similar clusters whose collections are single pages, i.e. essentially clusters of similar pages.
• Two clusters are merged if merging them yields a similar cluster with larger collections.
• Continue until no merge can produce a similar cluster (a sketch of this greedy growth follows).
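A minimal sketch of the greedy growing step, under simplifying assumptions that are not the paper's exact formulation: a cluster is a list of equi-sized collections (each collection a tuple of page ids), the two clusters being merged are assumed to hold the same number of collections and are paired up in order, and `forms_similar_cluster` stands in for the page- and link-similarity test that decides whether a candidate merge is still a similar cluster (the toy predicate below only checks that the merged collections stay equi-sized).

```python
from itertools import combinations

def grow_clusters(trivial_clusters, forms_similar_cluster):
    """Greedily merge clusters while some merge still yields a similar cluster.

    trivial_clusters       -- list of clusters; each cluster is a list of
                              collections, each collection a tuple of page ids
    forms_similar_cluster  -- predicate deciding whether a candidate cluster
                              (with concatenated collections) is still similar
    """
    clusters = [list(c) for c in trivial_clusters]
    merged = True
    while merged:
        merged = False
        for i, j in combinations(range(len(clusters)), 2):
            # Candidate: pair up the collections of the two clusters and concatenate.
            candidate = [a + b for a, b in zip(clusters[i], clusters[j])]
            if forms_similar_cluster(candidate):
                clusters[i] = candidate
                del clusters[j]
                merged = True
                break
    return clusters

# Toy run: two trivial clusters of "similar" single pages; the toy predicate
# accepts any merge whose collections remain equi-sized.
trivial = [[("a",), ("x",)], [("b",), ("y",)]]
print(grow_clusters(trivial, lambda cand: len({len(c) for c in cand}) == 1))
# [[('a', 'b'), ('x', 'y')]]
```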
How is this useful?
• Improving crawling
  • If the crawler knows which collections are similar, it can avoid crawling the same information again. Experimental results in the paper show a 48% drop in the number of similar pages crawled.
• Improving querying
  • Filter search results to "roll up" similar pages so that more distinct pages are visible to the user on the first results page (see the sketch below).
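A minimal sketch of the "roll-up" idea for query results, assuming a hypothetical `page_to_cluster` map produced offline by the replica-detection step; only the highest-ranked page from each similar cluster is kept.

```python
def roll_up(ranked_results, page_to_cluster):
    """Keep the first (highest-ranked) result from each similar cluster,
    so near-identical replicas do not crowd the result page."""
    seen_clusters = set()
    distinct = []
    for url in ranked_results:
        cluster = page_to_cluster.get(url, url)   # unclustered pages stand alone
        if cluster not in seen_clusters:
            seen_clusters.add(cluster)
            distinct.append(url)
    return distinct

results = ["mirror1.example.com/faq", "mirror2.example.com/faq", "example.org/other"]
clusters = {"mirror1.example.com/faq": 7, "mirror2.example.com/faq": 7}
print(roll_up(results, clusters))   # ['mirror1.example.com/faq', 'example.org/other']
```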
Contributions
• Clearly defined the problem and provided a basic solution.
  • Helps people understand the problem.
• Proposed a new algorithm to identify similar collections.
• Provided experimental results on the benefits of identifying similar collections for improving crawling and querying.
  • Demonstrates that this is a worthwhile problem to solve.
• Clearly stated the trade-offs and assumptions of the algorithm, setting the stage for future work.
Pros
• Thoroughly defined the problem.
• Presented a concise and effective algorithm to address it.
• Clearly stated the trade-offs made, so the algorithm can be improved in future work.
• The simplifications are mainly there to control complexity and keep the solution comprehensible.
  • The de-simplification of the algorithm is left to future work.
Cons
• Similar collections must be equi-sized.
• Similar collections must have one-to-one mappings of all pages.
  • High probability of break points: collections can become highly fragmented (chunked).
• The thresholding required to determine page similarity may be a very tedious task.
Related Work
• "Detecting Near-Duplicates for Web Crawling" (2007) [5]
  • Takes a lower-level, in-depth approach to determining page similarity.
  • Hashing algorithms
  • A good supplement to this paper
• "Do Not Crawl in the DUST: Different URLs with Similar Text" (2009) [6]
  • Takes a different approach: identifies URLs that point to the same or similar content.
    • e.g. www.myhomepage.com and www.myhomepage.com/index.html
  • Does not look at page content
  • Focuses on the "low-hanging fruit"
Questions?
References
• [1] C. Mattmann. Characterizing the Web. CSCI 572 course lecture at USC, May 20, 2010.
• [2] C. Mattmann. Deduplication. CSCI 572 course lecture at USC, June 1, 2010.
• [3] M. Perkowitz and O. Etzioni. Adaptive web sites: Automatically synthesizing web pages. In Fifteenth National Conference on Artificial Intelligence, 1998.
• [4] G. Salton. Introduction to Modern Information Retrieval. McGraw-Hill, New York, 1983.
• [5] G. S. Manku, A. Jain, and A. Das Sarma. Detecting near-duplicates for web crawling. In Proceedings of the 16th International Conference on World Wide Web (WWW '07), Banff, Alberta, Canada, May 8-12, 2007. ACM, New York, NY, 141-150.
• [6] Z. Bar-Yossef, I. Keidar, and U. Schonfeld. Do not crawl in the DUST: Different URLs with similar text. ACM Transactions on the Web 3(1), Jan. 2009, 1-31.