Near-Duplicate Detection for eRulemaking Hui Yang, Jamie Callan Language Technologies Institute School of Computer Science Carnegie Mellon University {huiyang, callan}@cs.cmu.edu
Presentation Outline • Introduction • Problem Definition • System Architecture • Feature-based Document Retrieval • Similarity-based Clustering • Evaluation and Experimental Results • Related Work • Conclusion and Demo
Introduction - I • U.S. regulatory agencies are required to solicit, consider, and respond to public comments before issuing final regulations. • Popular regulations attract hundreds of thousands of comments from the general public. • In the late 1990s, the USDA's national organic standard required manually sorting over 250,000 public comments. • In 2004, the EPA's proposed "Mercury rule" (USEPA-OAR-2002-0056) attracted over 530,000 email messages. • Very labor-intensive
Introduction - II • Things are getting worse • Many online form letters are available • Written by special interest groups • Modifying an electronic form letter is extremely easy • Special interest groups • Build electronic advocacy groups when there is a disconnect between broad public opinion and legislative action. • Provide information and tools to help each individual have the greatest possible impact once a group is assembled. • MoveOn.org, http://www.moveon.org • GetActive, http://www.getactive.org
Introduction - III • Public comments are near-duplicates if they were created from the same form letter. • Near-duplicates increase the likelihood of overlooking substantive information that an individual adds to a form letter. • Goals: • Recognize near-duplicates and organize them • Find the information an individual added • Find the unique comments • Our research focuses on recognizing and organizing near-duplicates through text mining and clustering, while handling large amounts of data
Problem Definition - I • What is a near-duplicate? • Pugh: "two documents are considered near duplicates if they have more than r features in common." • Conrad et al.: two documents are near-duplicates if they share more than 80% of the terminology defined by human experts. • Our definition is based on the ways near-duplicates are created
Problem Definition - II • Sources of Public Comments • from scratch (unique comments) • based on a form letter (exact or near-duplicates) • A Category-based Definition • Block edit • Key block • Minor change • Minor change + block edit • Exact
Feature-based Document Retrieval - I • Goal: retrieve a set of duplicate candidates for each seed document • Avoids working on the entire dataset • Steps: • Each seed document is broken into chunks • Select the most informative feature from each chunk: a text span around the term t*, the term with the minimal document frequency in the chunk (see the sketch below)
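A minimal sketch of how such a chunk feature could be selected, assuming a precomputed document-frequency table; the `doc_freq` lookup and the span width are illustrative, not the system's actual parameters:

```python
def chunk_feature(chunk_terms, doc_freq, span=5):
    """Pick a short text span around the rarest term in a chunk.

    chunk_terms: the chunk's terms, in order
    doc_freq:    dict mapping term -> document frequency in the collection
    span:        number of terms kept around the minimum-df term t*
    """
    # t* = term with the minimal document frequency in the chunk
    idx = min(range(len(chunk_terms)),
              key=lambda i: doc_freq.get(chunk_terms[i], 1))
    start = max(0, idx - span // 2)
    return " ".join(chunk_terms[start:start + span])
```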
Feature-based Document Retrieval - II • Metadata extracted via information extraction: • email senders • receivers • signatures • docket IDs • delivered dates • email relayers
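A rough illustration of this kind of metadata extraction; the regular expressions below are hypothetical stand-ins, not the extraction rules actually used in the system:

```python
import re

# Hypothetical patterns for illustration only; the actual system relies on
# information-extraction rules tuned to the EPA email collection.
PATTERNS = {
    "docket_id": re.compile(r"[A-Z]+-OAR-\d{4}-\d{4}"),
    "sender":    re.compile(r"^From:\s*(.+)$", re.MULTILINE),
    "date":      re.compile(r"^Date:\s*(.+)$", re.MULTILINE),
}

def extract_metadata(email_text):
    """Return whichever metadata fields the patterns can find in an email."""
    meta = {}
    for field, pattern in PATTERNS.items():
        match = pattern.search(email_text)
        if match:
            # Use the capturing group if the pattern has one, else the full match.
            meta[field] = match.group(1) if match.groups() else match.group(0)
    return meta
```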
Feature-based Document Retrieval - III • Query Formulation #AND ( docketoar.20020056 router.moveon #OR(“standards proposed by” “will harm thousands” “unborn children for” “coal plants should” “other cleaner alternative” “by 90 by” “with national standards” “available pollution control”) )
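A sketch of how such a structured query string could be assembled from the extracted metadata tokens and the chunk features; the helper and its argument names are illustrative:

```python
def build_query(metadata_tokens, chunk_features):
    """Combine required metadata tokens with an #OR of quoted chunk features
    into a structured query string like the one above."""
    phrases = " ".join('"%s"' % f for f in chunk_features)
    return "#AND( %s #OR(%s) )" % (" ".join(metadata_tokens), phrases)

# Example, using tokens from the slide:
query = build_query(
    ["docketoar.20020056", "router.moveon"],
    ["standards proposed by", "will harm thousands"],
)
```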
Similarity-based Clustering - I • Document dissimilarity based on Kullback-Leibler (KL) divergence • KL divergence, a distributional similarity measure, quantifies how well one document's unigram distribution is described by another's • Clustering algorithm • Soft, non-hierarchical clustering (partitioning) • Single-pass clustering with carefully selected seed documents each time • Close to K-means • No need to define K beforehand
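A minimal sketch of a smoothed KL divergence between two documents' unigram distributions; the epsilon smoothing is an illustrative choice, not necessarily the one used in the system:

```python
import math
from collections import Counter

def kl_divergence(doc_p, doc_q, vocab, eps=1e-6):
    """D(P || Q) = sum over t of P(t) * log(P(t) / Q(t)) for unigram models.

    doc_p, doc_q: token lists; vocab: set of terms to score over.
    A small epsilon avoids zero probabilities (illustrative smoothing only).
    """
    cp, cq = Counter(doc_p), Counter(doc_q)
    n_p, n_q = sum(cp.values()), sum(cq.values())
    div = 0.0
    for t in vocab:
        p = (cp[t] + eps) / (n_p + eps * len(vocab))
        q = (cq[t] + eps) / (n_q + eps * len(vocab))
        div += p * math.log(p / q)
    return div
```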
Similarity-based Clustering - II • [Diagram: duplicate candidates being grouped into Dup Set 1 and Dup Set 2]
Adaptive Thresholding • The cut-off threshold should differ from cluster to cluster • Documents in a cluster are sorted by their document-centroid similarity scores • The sorted scores are sampled at 10-document intervals • If the score drops by more than 5% of the initial cut-off threshold within an interval, a new cut-off threshold is set at the beginning of that interval (see the sketch below)
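A rough sketch of this thresholding rule, under the assumption that "greater than 5%" refers to the score drop across an interval exceeding 5% of the initial threshold:

```python
def adaptive_threshold(sorted_scores, initial_threshold, interval=10, frac=0.05):
    """Scan document-centroid scores (sorted in descending order) in fixed
    intervals; move the cut-off to the start of the first interval in which
    the score drops by more than `frac` of the initial threshold."""
    threshold = initial_threshold
    for start in range(0, len(sorted_scores) - interval, interval):
        drop = sorted_scores[start] - sorted_scores[start + interval]
        if drop > frac * initial_threshold:
            threshold = sorted_scores[start]
            break
    return threshold
```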
Does Feature-based Document Retrieval Help? • It is fairly efficient for large clusters • Cuts the number of documents that need to be clustered from the size of the entire dataset down to a manageable number • 536,975 -> 10,995 documents • Works poorly for small clusters (especially those containing only a single unique document) • Feature-based retrieval is disabled after most of the big clusters have been found • Assume that most of the remaining unclustered documents are unique • Only similarity-based clustering is applied to them
Evaluation Methodology - I • Clustering performance is difficult to evaluate • Lack of manpower to produce ground truth for a large dataset • Two subsets of 1,000 email messages each were selected randomly from the Mercury dataset • Assessors: two graduate research assistants • Manually organized the documents into clusters of documents that they felt were near-duplicates • Manually checked one of the experimental clustering results pair by pair (comparing each document-centroid pair)
Evaluation Methodology - II • Class j vs. cluster i ($n_{ij}$ = number of class-j documents in cluster i) • F-measure: $p_{ij} = n_{ij}/n_i$, $r_{ij} = n_{ij}/n_j$, $F_{ij} = \frac{2 p_{ij} r_{ij}}{p_{ij} + r_{ij}}$, $F_j = \max_i \{F_{ij}\}$, $F = \sum_j \frac{n_j}{n} F_j$ • Purity: $\rho_i = \max_j \{p_{ij}\}$, $\rho = \sum_i \frac{n_i}{n} \rho_i$ • Pairwise measure: Fowlkes and Mallows index • Kappa: $\kappa = \frac{p(A) - p(E)}{1 - p(E)}$, $p(A) = (a+d)/m$, $p(E) = \frac{(a+b)(a+c) + (c+d)(b+d)}{m^2}$, where a, b, c, d count pairwise agreements and disagreements and m is the total number of pairs
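A small sketch of the set-based measures above, given a contingency matrix n[i][j] recording how many documents of class j fall in cluster i:

```python
def f_measure_and_purity(n):
    """n[i][j] = number of documents of class j assigned to cluster i."""
    cluster_sizes = [sum(row) for row in n]            # n_i
    class_sizes = [sum(col) for col in zip(*n)]        # n_j
    total = sum(cluster_sizes)

    # F: for each class, the best F over all clusters, weighted by class size.
    f = 0.0
    for j, nj in enumerate(class_sizes):
        best = 0.0
        for i in range(len(n)):
            if n[i][j] == 0:
                continue
            p = n[i][j] / cluster_sizes[i]              # precision p_ij
            r = n[i][j] / nj                            # recall r_ij
            best = max(best, 2 * p * r / (p + r))
        f += (nj / total) * best

    # Purity: dominant-class fraction per cluster, weighted by cluster size.
    purity = sum(max(row) / total for row in n if sum(row))
    return f, purity
```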
Conclusion • Large-Volume Working Set • Duplicate Definition and Automatic Evaluation • Feature-based Duplicate Candidate Retrieval • Similarity-based Clustering • Improved Efficiency
Related Work - I • Duplicate detection in other domains: • databases [Bilenko and Mooney 2003] • to find records referring to the same entity but possibly in different representations • electronic publishing [Brin et al. 1995] • to detect plagiarism or to identify different versions of the same document • web search [Chowdhury et al. 2002] [Pugh 2004] • more efficient web crawling • more effective ranking of search results • easier archiving of web documents
Related Work - II • Fingerprinting • Build a compact description of a document, then do pairwise comparison of document fingerprints • Shingling [Broder et al.] (an illustrative sketch follows) • Represents a document as a series of simple numeric encodings for each n-term window • Retains every m-th shingle to produce a document sketch • Super shingles • Selective fingerprinting [Heintze] • Selects a subset of the substrings to generate fingerprints • Statistical approach [Chowdhury et al.] • Uses n high-idf terms • Improved accuracy over shingling • Efficient: one-fifth of the time of shingling • Fingerprint reliability in a dynamic environment [Conrad et al.] • Considers the time factor on the Web
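An illustrative sketch of the shingling idea described above; the window size, the retention parameter, and the use of MD5 hashing are illustrative choices, not Broder et al.'s exact method:

```python
import hashlib

def shingle_sketch(tokens, n=4, m=25):
    """Encode each n-term window as a number (a shingle) and keep roughly
    every m-th shingle (here: by hash value) as the document's sketch."""
    shingles = set()
    for i in range(len(tokens) - n + 1):
        window = " ".join(tokens[i:i + n])
        shingles.add(int(hashlib.md5(window.encode()).hexdigest(), 16))
    return {s for s in shingles if s % m == 0}

def resemblance(sketch_a, sketch_b):
    """Jaccard overlap between two document sketches."""
    union = sketch_a | sketch_b
    return len(sketch_a & sketch_b) / len(union) if union else 1.0
```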
References • M. Bilenko and R. Mooney. Adaptive duplicate detection using learnable string similarity measures. In Proceedings of the 9th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (SIGKDD-2003), Washington D.C., August 2003. • S. Brin, J. Davis, and H. Garcia-Molina. Copy detection mechanisms for digital documents. In Proceedings of the Special Interest Group on Management of Data (SIGMOD 1995), pages 398–409. ACM Press, May 1995. • A. Z. Broder, S. C. Glassman, M. S. Manasse, and G. Zweig. Syntactic clustering of the web. In Proceedings of WWW6 '97, pages 391–404. Elsevier Science, April 1997. • J. Callan. E-Rulemaking testbed. http://hartford.lti.cs.cmu.edu/eRulemaking/Data/. 2004. • A. Chowdhury, O. Frieder, D. Grossman, and M. McCabe. Collection statistics for fast duplicate document detection. ACM Transactions on Information Systems (TOIS), Volume 20, Issue 2, 2002. • J. Cohen. A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20:37–46, 1960. • J. Conrad, X. S. Guo, and C. P. Schriber. Online duplicate document detection: Signature reliability in a dynamic retrieval environment. In Proceedings of CIKM '03, pages 443–452. ACM Press, Nov. 2003. • N. Heintze. Scalable document fingerprinting. In Proceedings of the Second USENIX Electronic Commerce Workshop, pages 191–200, Nov. 1996. • W. Pugh. US Patent 6,658,423. http://www.cs.umd.edu/~pugh/google/Duplicates.pdf. 2003.
Demo • http://hartford.lti.cs.cmu.edu/eRulemaking/Data/USEPA-OAR-2002-0056/