This study explores near-duplicate detection in the context of eRulemaking, using instance-level constrained clustering and incorporating document attributes, content structure, and pairwise relationships to improve accuracy and efficiency.
Near-Duplicate Detection by Instance-level Constrained Clustering Hui Yang, Jamie Callan Language Technologies Institute School of Computer Science Carnegie Mellon University
Introduction: Near-Duplicate Detection
• Goal: identify and organize "nearly identical" documents
• The notion of "similarity" differs from that used in other fields
• Databases: almost-identical documents
  • Fingerprint-based approaches
  • Tolerate only small changes to the text
  • Sensitive to text positions
• Information Retrieval: relevant documents
  • Bag-of-words approaches that measure vocabulary overlap
  • Focus on semantic similarity, whereas near-duplicate detection targets syntactic (surface-text) similarity
  • Cannot identify near-duplicates that share only a small amount of text
Near-Duplicate Detection in eRulemaking
• U.S. regulatory agencies receive a large volume of public comments every day
• By law, they must read each of them
• Many comments are "Form Letters": comments generated from form letters provided by online special-interest groups
  • http://www.moveon.org
  • http://www.getactive.com
• Automating duplicate detection saves substantial human effort
Editing Styles
• Block Added: one or more paragraphs (<200 words) added to a document
• Block Deleted: one or more paragraphs (<200 words) removed from a document
• Key Block: contains at least one paragraph from a document
• Minor Change: a few words altered within a paragraph (<5% or 15-word change in a paragraph)
• Minor Change & Block Edit: a combination of minor change and block edit
• Block Reordering: the same set of paragraphs, reordered
• Repeated: the entire document repeated several times within another document
• Bag-of-words Similar: >80% word overlap (and not in any category above)
• Exact: 100% word overlap
Need a More Flexible Framework
• Use additional knowledge from the document collection
• Instance-level constrained clustering: a semi-supervised clustering approach that incorporates additional knowledge
  • Document attributes
  • Content structure
  • Pairwise relationships
Instance-level Constrained Clustering
• Instance-level constraints
  • Pairwise, and easy to generate
  • Do not provide class labels
  • A weaker condition than semi-supervised classification
• Types of constraints: must-links, cannot-links, family-links
Must-links
• Two instances must be in the same cluster
• Created when
  • there is complete containment of the reference copy (key block), or
  • word overlap > 95% (minor change)
Cannot-links
• Two instances cannot be in the same cluster
• Created when two documents cite different docket identification numbers (i.e., people submitted comments to the wrong docket)
Family-links
• Two instances are likely to be in the same cluster
• Created when two documents have
  • the same email relayer,
  • similar file sizes, or
  • the same footer block
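The three link types amount to a small set of pairwise rules. A minimal Python sketch of that rule set follows; the field names (tokens, docket_id, relayer, size, footer), the containment test, and the file-size tolerance are illustrative assumptions, not the authors' actual code.

```python
# A minimal sketch of the constraint-generation rules on the preceding slides.
# Document fields and thresholds here are illustrative assumptions.

def word_overlap(a, b):
    """Fraction of the smaller document's vocabulary shared with the other.
    'a' and 'b' are sets of word tokens."""
    if not a or not b:
        return 0.0
    return len(a & b) / min(len(a), len(b))

def pairwise_constraint(d1, d2, reference_tokens):
    """Return 'must', 'cannot', 'family', or None for one document pair."""
    # Must-link: both documents fully contain the reference copy (key block),
    # or their word overlap exceeds 95% (minor change).
    if reference_tokens <= d1["tokens"] and reference_tokens <= d2["tokens"]:
        return "must"
    if word_overlap(d1["tokens"], d2["tokens"]) > 0.95:
        return "must"
    # Cannot-link: the comments cite different docket identification numbers.
    if d1["docket_id"] != d2["docket_id"]:
        return "cannot"
    # Family-link: weaker metadata evidence of a shared origin.
    if (d1["relayer"] == d2["relayer"]
            or abs(d1["size"] - d2["size"]) < 0.05 * max(d1["size"], d2["size"])
            or d1["footer"] == d2["footer"]):
        return "family"
    return None
```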
Must-links Group the Correct Documents (diagram)
Cannot-links Push Away the Wrong Documents (diagram)
Family-links Attract the Similar Documents (diagram)
Constraint Transitive Closure
• An initial set of constraints is created for pairs of documents
• A transitive closure is then taken over the constraints (=m, =c, and =f denote must-link, cannot-link, and family-link respectively):
  • Must-link closure: da =m db, db =m dc ⇒ da =m dc
  • Cannot-link closure: da =c db, db =m dc ⇒ da =c dc
  • Family-link closure:
    da =f db, db =m dc ⇒ da =f dc
    da =f db, db =c dc ⇒ da =c dc
    da =f db, db =f dc ⇒ da =f dc
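These closure rules can be computed with two rounds of union-find: must-links form equivalence classes, family-links lift to (and merge) those classes, and cannot-links are then recorded between whole family components. The sketch below assumes constraints are stored as (doc_a, doc_b, kind) triples; the paper does not specify its data structures, so this is an illustration rather than the authors' implementation.

```python
# Sketch of the constraint transitive closure using union-find.

class UnionFind:
    def __init__(self, items):
        self.parent = {x: x for x in items}

    def find(self, x):
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path compression
            x = self.parent[x]
        return x

    def union(self, x, y):
        self.parent[self.find(x)] = self.find(y)

def close_constraints(docs, constraints):
    # Rule 1: must-links are transitive -> merge documents into must-components.
    must = UnionFind(docs)
    for a, b, kind in constraints:
        if kind == "must":
            must.union(a, b)

    # Rules 3 and 5: family-links lift over must-links and are themselves
    # transitive -> merge must-components into family-components.
    family = UnionFind({must.find(d) for d in docs})
    for a, b, kind in constraints:
        if kind == "family":
            family.union(must.find(a), must.find(b))

    # Rules 2 and 4: cannot-links lift over must- and family-links, so they
    # hold between entire family-components.
    cannot = set()
    for a, b, kind in constraints:
        if kind == "cannot":
            ra = family.find(must.find(a))
            rb = family.find(must.find(b))
            cannot.add(frozenset((ra, rb)))
    return must, family, cannot
```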
Constraint Transitive Closure • Example:
Document space with initial links (diagram; F = form letter; legend: must-link, cannot-link, family-link)
Document space after link propagation (diagram; F = form letter; legend: must-link, cannot-link, family-link)
Incorporating the Constraints
• When forming clusters:
  • If two documents have a must-link, they are put into the same group even if their text similarity is low.
  • If two documents have a cannot-link, they are never put into the same group even if their text similarity is high.
  • If two documents have a family-link, their text-similarity score is increased, raising their chance of ending up in the same group.
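One way to realize these three behaviors is to adjust pairwise text-similarity scores before clustering, as in the sketch below. The ±infinity sentinels and the additive family boost are assumptions for illustration; the paper's actual weighting may differ. In an agglomerative setting, the cannot-link check would typically be applied between clusters so that any merge joining a forbidden pair is blocked.

```python
# Sketch: adjust a pairwise similarity score according to the constraint type.
# 'pair' is a frozenset of two document ids; 'must', 'cannot', and 'family'
# are sets of such frozensets (e.g., from the closure step above).

import math

FAMILY_BOOST = 0.2  # assumed additive bonus for family-linked pairs

def constrained_similarity(sim, pair, must, cannot, family):
    if pair in must:
        return math.inf                       # force the pair into one cluster
    if pair in cannot:
        return -math.inf                      # forbid the pair from one cluster
    if pair in family:
        return min(1.0, sim + FAMILY_BOOST)   # nudge the pair together
    return sim
```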
Redundancy-based Reference Copy Detection
• Apply a hash function to the document string (all words in the document concatenated together)
  • NIST's secure hash algorithm SHA-1
  • Each document receives a single hash value
• Sort the <document id, hash value> tuples by hash value, so identical hash values are adjacent
• A linear scan over the sorted list finds groups with the same hash value, i.e., exact duplicates
• The reference copy of each exact-duplicate group larger than 5 documents is the copy with the earliest timestamp
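A compact sketch of this step is below. It groups documents by their SHA-1 digest with a dictionary rather than the sort-and-scan pass described on the slide, which finds the same exact-duplicate groups; the document field names and whitespace normalization are assumptions.

```python
# Sketch: find reference copies as the earliest-submitted document in each
# exact-duplicate group larger than a minimum size.

import hashlib
from collections import defaultdict

def reference_copies(documents, min_group_size=5):
    groups = defaultdict(list)
    for doc in documents:
        # Concatenate the document's words and hash them with SHA-1.
        text = " ".join(doc["text"].split())
        digest = hashlib.sha1(text.encode("utf-8")).hexdigest()
        groups[digest].append(doc)

    refs = []
    for dups in groups.values():
        # Documents with the same digest are exact duplicates; a large group
        # contributes its earliest submission as the reference copy.
        if len(dups) > min_group_size:
            refs.append(min(dups, key=lambda d: d["timestamp"]))
    return refs
```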
Evaluation
• Assessors (from the coding lab at the University of Pittsburgh) manually organized documents into near-duplicate clusters
• Human-human agreement is compared to human-computer agreement
Experimental Results
• Comparison with human-human intercoder agreement
• Metric: AC1, a modified version of Kappa
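If AC1 here refers to Gwet's first-order agreement coefficient (an assumption consistent with "a modified version of Kappa"), one common form for two raters and two categories is:

```latex
% Assumed form of Gwet's AC1 for two raters and two categories, where P_a is
% the observed agreement and \pi is the mean of the two raters' marginal
% probabilities for the positive category.
\[
  \mathrm{AC1} \;=\; \frac{P_a - P_e}{1 - P_e},
  \qquad
  P_e \;=\; 2\,\pi\,(1-\pi),
  \qquad
  \pi \;=\; \frac{p_{1} + p_{2}}{2}.
\]
```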
Experimental Results
• Comparison with other duplicate detection algorithms
• Metric: F1
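F1 is the standard harmonic mean of precision P and recall R:

```latex
\[
  F_1 \;=\; \frac{2\,P\,R}{P + R}.
\]
```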
Impact of Instance-level Constraints
• Number of constraints vs. F1 (plots)
Conclusion
• Near-duplicate detection on large public-comment datasets is practical
• Instance-level constrained clustering (semi-supervised clustering)
  • Efficient
  • Gives greater control over the clustering
  • Encourages the use of other forms of evidence
  • Easily applied to other datasets
Thank You! Questions?