
Near-Duplicate Detection by Instance-level Constrained Clustering

This study explores near-duplicate detection in the context of eRulemaking, using instance-level constrained clustering and incorporating document attributes, content structure, and pairwise relationships to improve accuracy and efficiency.

arthurh


Presentation Transcript


  1. Near-Duplicate Detection by Instance-level Constrained Clustering Hui Yang, Jamie Callan Language Technologies Institute School of Computer Science Carnegie Mellon University

  2. Introduction: Near-Duplicate Detection • Goal: identify and organize “nearly identical” documents • Uses a different definition of “similarity” from other fields • Database: almost-identical documents • Fingerprint-based approaches • Allow only small changes to the text • Sensitive to text positions • Information Retrieval: relevant documents • Bag-of-words approaches • Measure vocabulary overlap • Focus on semantic similarity, whereas near-duplicate detection focuses on syntactic (surface-text) similarity • Cannot identify near-duplicates that share only a small amount of text

  3. Near-Duplicate Detection in eRulemaking • U.S. regulatory agencies receive and process a large number of public comments every day • By law, they need to read each of them • Many of them are “Form Letters” • Commenters generate them from form letters provided by online special interest groups • http://www.moveon.org • http://www.getactive.com • Need to automate the duplicate detection process and save human effort

  4. Editing Styles • Block Added: Add one or more paragraphs (<200 words) to a document; • Block Deleted: Remove one or more paragraphs (<200 words) from a document; • Key Block: Contains at least one paragraph from a document; • Minor Change: A few words altered within a paragraph (<5% or 15 words changed in a paragraph); • Minor Change & Block Edit: A combination of minor change and block edit; • Block Reordering: Reorder the same set of paragraphs; • Repeated: Repeat the entire document several times in another document; • Bag-of-words similar: >80% word overlap (not in above categories); and • Exact: 100% word overlap.

  5. “Key Block” Problem

  6. Need More Flexible Framework • Need to use additional knowledge from the document collection • Instance-level Constrained Clustering • A semi-supervised clustering approach to incorporate additional knowledge • Document attributes • Content structure • Pair-wise relationships

  7. Instance-level Constrained Clustering • Instance-level Constraints • Pair-wise • Easy to generate • Cannot generate class labels • Weaker condition than semi-supervised classification • Types of Constraints • Must-links, cannot-links, family-links

  8. Must-links • Two instances must be in the same cluster • Created when • complete containment of the reference copy (key block), • word overlap > 95% (minor change).
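The two must-link conditions on this slide can be sketched as follows. This is a minimal illustration, not the authors' implementation: the slides do not define "word overlap", so Jaccard overlap of the two word sets is assumed here, and the names `tokens`, `word_overlap`, and `must_links` are hypothetical. Each document is compared only against known reference copies.

```python
import re

def tokens(text):
    """Lowercased word tokens; punctuation and whitespace are ignored."""
    return re.findall(r"\w+", text.lower())

def word_overlap(a, b):
    """Assumed definition: Jaccard overlap of the two vocabularies."""
    sa, sb = set(tokens(a)), set(tokens(b))
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)

def must_links(docs, reference_ids, overlap_threshold=0.95):
    """docs maps doc id -> text; returns a set of frozenset({ref, doc}) pairs."""
    # Pad with spaces so containment only matches on whole-word boundaries.
    normalized = {d: " " + " ".join(tokens(t)) + " " for d, t in docs.items()}
    links = set()
    for ref in reference_ids:
        for doc_id in docs:
            if doc_id == ref:
                continue
            # Key block: the document contains the whole reference copy.
            contains_ref = normalized[ref] in normalized[doc_id]
            # Minor change: word overlap above 95%.
            minor_change = word_overlap(docs[ref], docs[doc_id]) > overlap_threshold
            if contains_ref or minor_change:
                links.add(frozenset((ref, doc_id)))
    return links
```

For example, a comment that wraps the form letter in a greeting and a signature gets a must-link through containment even though its Jaccard overlap with the reference copy is below the threshold.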

  9. Cannot-links • Two instances cannot be in the same cluster • Created when two documents • cite different docket identification numbers • People submit comments to the wrong place

  10. Family-links • Two instances are likely to be in the same cluster • Created when two documents have • the same email relayer, • similar file sizes, or • the same footer block.
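The cannot-link rule from the previous slide and the family-link rules on this one both rely on document attributes rather than text. A rough sketch follows, assuming a hypothetical `Comment` record and a 5% tolerance for "similar file sizes"; the slides give neither a data model nor a size threshold.

```python
from dataclasses import dataclass
from itertools import combinations

@dataclass
class Comment:
    # Hypothetical record; field names are illustrative only.
    doc_id: str
    docket_id: str
    relayer: str
    size_bytes: int
    footer: str

def metadata_links(comments, size_tolerance=0.05):
    """Return (cannot_links, family_links) as sets of frozenset id pairs."""
    cannot, family = set(), set()
    for a, b in combinations(comments, 2):
        pair = frozenset((a.doc_id, b.doc_id))
        # Different docket numbers: the two comments cannot be duplicates.
        if a.docket_id != b.docket_id:
            cannot.add(pair)
            continue
        same_relayer = bool(a.relayer) and a.relayer == b.relayer
        # Assumed meaning of "similar file sizes": within 5% of the larger file.
        similar_size = abs(a.size_bytes - b.size_bytes) <= size_tolerance * max(
            a.size_bytes, b.size_bytes
        )
        same_footer = bool(a.footer) and a.footer == b.footer
        if same_relayer or similar_size or same_footer:
            family.add(pair)
    return cannot, family
```

Empty relayer or footer fields are deliberately not treated as matches, so missing metadata does not create family-links.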

  11. Must-links Group the Corrects [Figure: instances marked +]

  12. Cannot-links Push Away Wrongs [Figure: instances marked + and -]

  13. Family-links Attract the Similars [Figure: instances marked +]

  14. Constraint Transitive Closure • An initial set of constraints is created for pairs of documents • The transitive closure is then taken over the constraints • Must-link transitive closure: da =m db, db =m dc => da =m dc • Cannot-link transitive closure: da =c db, db =m dc => da =c dc • Family-link transitive closure: da =f db, db =m dc => da =f dc; da =f db, db =c dc => da =c dc; da =f db, db =f dc => da =f dc • (=m, =c and =f indicate must-link, cannot-link and family-link respectively.)
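The five propagation rules above can be applied by repeating them until no new link appears. The sketch below is one way to do that under stated assumptions: links are treated as symmetric, a derived must- or cannot-link may replace a family-link but never another must- or cannot-link, and conflicting hard constraints are not resolved. The function name `close_constraints` is hypothetical, and the cubic loop is for illustration only.

```python
from itertools import permutations

# Composition table taken from the slide: (link a-b, link b-c) -> link a-c.
COMPOSE = {
    ("m", "m"): "m",
    ("c", "m"): "c",
    ("f", "m"): "f",
    ("f", "c"): "c",
    ("f", "f"): "f",
}

def compose(x, y):
    """Look up a rule in either order, since links are symmetric."""
    return COMPOSE.get((x, y)) or COMPOSE.get((y, x))

def close_constraints(links):
    """links maps frozenset({a, b}) -> 'm', 'c' or 'f'; returns the closure."""
    closed = dict(links)
    docs = sorted({d for pair in closed for d in pair})
    changed = True
    while changed:
        changed = False
        for a, b, c in permutations(docs, 3):
            ab = closed.get(frozenset((a, b)))
            bc = closed.get(frozenset((b, c)))
            if ab is None or bc is None:
                continue
            new = compose(ab, bc)
            if new is None:
                continue  # e.g. two cannot-links imply nothing
            ac = frozenset((a, c))
            old = closed.get(ac)
            # Assumption: hard links (m, c) may overwrite family-links only.
            if old is None or (old == "f" and new != "f"):
                closed[ac] = new
                changed = True
    return closed
```

Starting from a =m b, b =m c, c =c d and e =f a, the closure adds a =m c, a =c d, b =c d, e =f b, e =f c and e =c d.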

  15. Constraint Transitive Closure • Example:

  16. Document-Space With Initial Links [Figure: form letters (F) and other documents connected by cannot-links, must-links and family-links]

  17. Document-Space After Link Propagation [Figure: form letters (F) and other documents connected by cannot-links, must-links and family-links]

  18. Incorporating the Constraints • When forming clusters, • if two documents have a must-link, they must be put into same group, even if their text similarity is low • if two documents have a cannot-link, they cannot be put into same group, even if their text similarity is high • if two documents have a family-link, increase their text similarity score, so that their chance of being in the same group increases.
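The three rules above can be folded into a similarity function and a simple grouping loop. The slides do not name the clustering algorithm, so the greedy single-pass assignment below is only an assumed stand-in, and the threshold of 0.8, the family boost of 0.2, and the names `constrained_similarity` and `cluster` are hypothetical.

```python
def constrained_similarity(text_sim, link, family_boost=0.2):
    """Adjust a text similarity in [0, 1] according to the pair's link type."""
    if link == "m":
        return 1.0  # must-link: treat as maximally similar
    if link == "c":
        return 0.0  # cannot-link: treat as maximally dissimilar
    if link == "f":
        return min(1.0, text_sim + family_boost)  # family-link: boost
    return text_sim

def cluster(doc_ids, text_sim, links, threshold=0.8, family_boost=0.2):
    """Greedy sketch: put each document in the best admissible cluster.

    text_sim is a callable (a, b) -> float; links maps frozenset pairs to
    'm', 'c' or 'f' and is assumed to be transitively closed already.
    """
    clusters = []
    for d in doc_ids:
        best, best_sim = None, threshold
        for members in clusters:
            # A cannot-link with any member rules the cluster out entirely.
            if any(links.get(frozenset((d, m))) == "c" for m in members):
                continue
            s = max(
                constrained_similarity(
                    text_sim(d, m), links.get(frozenset((d, m))), family_boost
                )
                for m in members
            )
            if s >= best_sim:
                best, best_sim = members, s
        if best is None:
            clusters.append([d])
        else:
            best.append(d)
    return clusters
```

With this scheme a must-linked pair with text similarity 0.3 still ends up together, while a cannot-linked pair with text similarity 0.9 is kept apart.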

  19. Redundancy-based Reference Copy Detection • Apply a hash function to the document string (all words in a document concatenated together) • NIST’s Secure Hash Algorithm: SHA-1 • Each document gets a hash value; identical document strings get identical values • Sort the <document id, hash-value> tuples by hash value • Identical hash values end up adjacent • Linearly scan the sorted list • Identical hash values indicate exact duplicates • The reference copy is the document with the earliest timestamp in an exact-duplicate group of size greater than 5
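The hash, sort and scan steps above map directly onto a few lines of Python using the standard `hashlib` module. In this sketch the document string is formed by joining the whitespace-separated words with single spaces, which is an assumption about what "concatenated together" means; the function name `reference_copies` and the `(doc_id, timestamp, text)` tuple layout are hypothetical.

```python
import hashlib
from itertools import groupby
from operator import itemgetter

def reference_copies(docs, min_group_size=5):
    """docs: iterable of (doc_id, timestamp, text) tuples.

    Returns the id of the earliest document in every exact-duplicate
    group that has more than min_group_size members.
    """
    hashed = []
    for doc_id, timestamp, text in docs:
        # Assumed normalization: words joined by single spaces.
        doc_string = " ".join(text.split())
        digest = hashlib.sha1(doc_string.encode("utf-8")).hexdigest()
        hashed.append((digest, timestamp, doc_id))

    # Sorting puts equal digests next to each other, earliest timestamp first.
    hashed.sort()

    refs = []
    # One linear scan over the sorted list finds the duplicate groups.
    for _, group in groupby(hashed, key=itemgetter(0)):
        group = list(group)
        if len(group) > min_group_size:
            refs.append(group[0][2])
    return refs
```

Sorting makes the whole procedure O(n log n) in the number of documents, and no pairwise text comparison is needed to find exact duplicates.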

  20. Evaluation • Assessors (from the coding lab at the University of Pittsburgh) manually organized documents into near-duplicate clusters • Compare human-human agreement to human-computer agreement

  21. Experimental Results • Comparing with human-human intercoder agreement • Metric: AC1 • A modified version of Kappa
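For readers unfamiliar with AC1, the sketch below computes Gwet's AC1 for two raters and two categories: AC1 = (pa - pe) / (1 - pe), where pa is the observed agreement and pe = 2π(1 - π), with π the average proportion of "yes" decisions across both raters. The slides do not say how cluster assignments were turned into rater decisions; treating every document pair as one "same cluster?" decision is an assumption made here, and the function names are hypothetical.

```python
from itertools import combinations

def pairwise_decisions(clustering, doc_ids):
    """Assumed encoding: one boolean per document pair, True if co-clustered.

    clustering maps doc id -> cluster label.
    """
    return [clustering[a] == clustering[b] for a, b in combinations(doc_ids, 2)]

def ac1(rater1, rater2):
    """Gwet's AC1 for two raters making the same sequence of yes/no decisions."""
    n = len(rater1)
    # Observed agreement: fraction of decisions where the raters match.
    pa = sum(x == y for x, y in zip(rater1, rater2)) / n
    # Average propensity to answer "yes" across both raters.
    pi = (sum(rater1) + sum(rater2)) / (2 * n)
    # Chance agreement under AC1 (at most 0.5, so the division is safe).
    pe = 2 * pi * (1 - pi)
    return (pa - pe) / (1 - pe)
```

Unlike Cohen's Kappa, the chance-agreement term here stays bounded when one category dominates, which is the usual situation when most document pairs are not duplicates.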

  22. Experimental Results • Comparing with other duplicate detection algorithms • Metric: F1

  23. Impact of Instance-level Constraints • Number of Constraints vs. F1.

  24. Impact of Instance-level Constraints • Number of Constraints vs. F1.

  25. Conclusion • Near-duplicate detection on large public comment datasets is practical • Instance-level constrained clustering/semi-supervised clustering • Efficient • Greater control over the clustering • Encourages use of other forms of evidence • Easily applied to other datasets

  26. Thank You! Questions?
