1 / 16

Near-Duplicate Detection for eRulemaking

Near-Duplicate Detection for eRulemaking. Hui Yang, Jamie Callan Language Technologies Institute School of Computer Science Carnegie Mellon University. Stuart Shulman Library and Information Science School of Information Sciences University of Pittsburgh. Duplicates and

renee-frye
Download Presentation

Near-Duplicate Detection for eRulemaking

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Near-Duplicate Detection for eRulemaking Hui Yang, Jamie Callan Language Technologies Institute School of Computer Science Carnegie Mellon University Stuart Shulman Library and Information ScienceSchool of Information Sciences University of Pittsburgh dg.o conference 2006

  2. Duplicates and Near-Duplicates in eRulemaking • U.S. regulatory agencies must solicit, consider, and respond to public comments. • Special interest groups make form letters available for generating comments via email and the Web • Moveon.org, http://www.moveon.org • GetActive, http://www.getactive.org • Modifying a form letter is very easy dg.o conference 2006

  3. Form Letters • Insert screen shot of moveon.org, showing form letter and enter-your-comment-here Form Letter Individual Information Personal Notes dg.o conference 2006

  4. Duplicate - Exact dg.o conference 2006

  5. Near Duplicate - Block Edit dg.o conference 2006

  6. Near Duplicate – Minor Change dg.o conference 2006

  7. Minor Change + Block Edit dg.o conference 2006

  8. Near Duplicate - Block Reordering dg.o conference 2006

  9. Near Duplicate – Key Block dg.o conference 2006

  10. Near-duplicate Detection Strategy • Group Near-duplicates based on • Text similarity • Similar Vocabulary • Similar Word Frequencies • Editing patterns • Metadata • Hints to the clustering algorithm about how to group documents dg.o conference 2006

  11. Must-links • Two instances must be in the same cluster • Created when • complete containment of the reference copy (key block), • word overlap > 95% (minor change). dg.o conference 2006

  12. Cannot-links • Two instances cannot be in the same cluster • Created when two documents • cite different docket identification numbers • People submitted comments to wrong place dg.o conference 2006

  13. Family-links • Two instances are likely to be in the same cluster • Created when two documents have • the same email relayer, • the same docket identification number, • similar file sizes, or • the same footer block. dg.o conference 2006

  14. Experimental Results Comparing with human-human intercoder agreement (measured in AC1) USEPA-OAR-2002-0056 (EPA Mercury dataset) USDOT-2003-16128 (DOT SUV dataset) dg.o conference 2006

  15. Experimental Results Comparing with other duplicate detection Algorithms (measured in F1) dg.o conference 2006

  16. Impact of Instance-level Constraints • Number of Constraints vs. F1. dg.o conference 2006

More Related