
Classification Technology at LexisNexis
SIGIR 2001 Workshop on Operational Text Classification
Mark Wasson, LexisNexis (mark.wasson@lexisnexis.com), September 13, 2001


Presentation Transcript


  1. Classification Technology at LexisNexis SIGIR 2001 Workshop on Operational Text Classification Mark Wasson LexisNexis mark.wasson@lexisnexis.com September 13, 2001

  2. Our Boolean Origins

  3. Our Boolean Origins

  4. The Topic Identification System • The Topic Identification System Model • Term-based Topic Identification (TTI) • Term Mapping System • Company Concept Indexing • Named Entity Indexing (Companies, People, Organizations, Places) • Subject Indexing Prototype (not released) • NEXIS Topical Indexing

  5. Psycholinguistics Features • Propositional Language Model Underlies Surface Forms • Word Concepts • Semantic Priming, Additive up to a Point • Spreading Activation

  6. Terms and Word Concepts • All words and phrases are searchable – no stop words • No automatic morphological or thesaurus expansion • Exception – name variant generation, but subject to human verification • Word Concept: a set of functionally equivalent terms with respect to a given topic; 1 to 100s of terms in a single word concept

  7. Frequency & Weighting • Frequency & weighting at word concept level rather than at individual term level • TTI used chi-square to compare individual word concepts to supervised training set • TTI used stepwise linear regression to test in combination and suggest weights • Allow both positive and negative weights in addition to absolute yes/no Boolean functionality
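As a rough illustration of the chi-square step named on this slide, the sketch below computes the Pearson chi-square statistic for one word concept against a labelled training set. The function name and the counts are invented for illustration; the slides do not describe how TTI actually implemented the test.

```python
def chi_square_2x2(rel_with, rel_without, irr_with, irr_without):
    """Pearson chi-square statistic for a 2x2 table (no continuity correction).

    Rows: relevant / irrelevant training documents.
    Columns: documents with / without the word concept.
    """
    a, b, c, d = rel_with, rel_without, irr_with, irr_without
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        # A row or column is empty, so the test is undefined; treat as no signal.
        return 0.0
    return n * (a * d - b * c) ** 2 / denom


# Hypothetical counts: the concept appears in 40 of 50 relevant documents
# and in 5 of 50 irrelevant ones.
score = chi_square_2x2(40, 10, 5, 45)
print(f"chi-square = {score:.2f}")
```

A larger statistic suggests the word concept separates relevant from irrelevant documents better than chance; a concept spread evenly across both groups scores zero.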

  8. Problem Word Concepts
     5 documents: 3 relevant (G), 2 irrelevant (B)
     W1 in G1, G2, B1
     W2 in G2, G3, B2
     W3 in G1, G3, B1
     Each W by itself produces 67% recall, 67% precision
     W1 + W2 -> 100% recall, 60% precision
     W1 + W3 -> 100% recall, 75% precision
     W2 + W3 -> 100% recall, 60% precision
     W1 + W2 + W3 -> 100% recall, 60% precision
     Also, fewer terms -> faster processing
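The figures on this slide can be reproduced with a few lines of code. The sketch assumes that combining word concepts means a document is retrieved if it contains any of them (a union), which is the reading that matches the slide's numbers; the variable and function names are illustrative only.

```python
from itertools import combinations

relevant = {"G1", "G2", "G3"}
# Which documents each word concept occurs in, as given on the slide.
postings = {
    "W1": {"G1", "G2", "B1"},
    "W2": {"G2", "G3", "B2"},
    "W3": {"G1", "G3", "B1"},
}


def recall_precision(concepts):
    """Recall and precision when a document matches any of the given concepts."""
    retrieved = set().union(*(postings[c] for c in concepts))
    hits = len(retrieved & relevant)
    return hits / len(relevant), hits / len(retrieved)


results = {}
for size in (1, 2, 3):
    for combo in combinations(sorted(postings), size):
        results[combo] = recall_precision(combo)
        recall, precision = results[combo]
        print(" + ".join(combo), f"recall={recall:.0%} precision={precision:.0%}")
```

W1 and W3 share their only false hit (B1), so pairing them adds a relevant document without adding a new irrelevant one. Adding W2 to that pair brings in B2 and gains nothing in recall.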

  9. Looking Up Terms in Documents
     • Count a term extra in key document parts
       • Headlines
       • Leading text
       • Captions
     • Count all potential matches
       • American gets counted for 100s of companies
     • Don’t count a term when part of another
       • Mead in Mead Corp.
       • French in French Fry
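Two of the lookup rules above, extra counts in key document parts and not counting a term that is part of a longer one, can be sketched as follows. This is a minimal illustration, not the LexisNexis implementation: the zone names, the boost multipliers, and the longest-match regular expression are all assumptions, and matching here is case-sensitive.

```python
import re

# Invented multipliers for "count a term extra in key document parts".
ZONE_BOOST = {"headline": 3, "lead": 2, "caption": 2, "body": 1}


def count_terms(text, terms):
    """Count occurrences of each term, preferring the longest term at each position."""
    if not terms:
        return {}
    # Longest alternatives first, so "Mead Corp." is tried before "Mead".
    ordered = sorted(terms, key=len, reverse=True)
    pattern = re.compile(
        r"(?<!\w)(?:" + "|".join(re.escape(t) for t in ordered) + r")(?!\w)"
    )
    counts = {t: 0 for t in terms}
    # finditer yields non-overlapping matches, so text consumed by a longer
    # term is never counted again for a shorter term inside it.
    for match in pattern.finditer(text):
        counts[match.group(0)] += 1
    return counts


def weighted_counts(zones, terms):
    """Sum term counts over document zones, applying each zone's boost."""
    totals = {t: 0 for t in terms}
    for zone, text in zones.items():
        boost = ZONE_BOOST.get(zone, 1)
        for term, n in count_terms(text, terms).items():
            totals[term] += boost * n
    return totals


doc = {
    "headline": "Mead Corp. posts record earnings",
    "body": "Mead Corp. said sales rose. Mead honey wine sold well at the French Fry festival.",
}
terms = ["Mead", "Mead Corp.", "French", "French Fry"]
totals = weighted_counts(doc, terms)
print(totals)
```

In this sample, "Mead" is credited only for its standalone occurrence, and "French" is not credited at all because its only occurrence is inside "French Fry".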

  10. Calculating Topic Scores
      • Summation of frequency * weight across all word concepts
      • Normalize score
      • Compare to threshold
        • Verification range in TTI
        • Major references, strong passing references, weak passing references in indexing tools
      • Add controlled vocabulary term or marker to document if score >= threshold
        • Add score, any associated secondary CVTs
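The scoring steps on this slide can be sketched as below. The slide does not say how scores were normalized or how the verification range was defined, so two things here are assumptions made only for illustration: normalization is done per 100 words of document length, and the verification range is treated as a band around the threshold. All names, weights, and thresholds are hypothetical.

```python
def topic_score(concept_counts, weights, doc_length):
    """Sum frequency * weight over word concepts, then normalize.

    Normalizing per 100 words is an assumption; the slide only says "normalize".
    Weights may be negative, as the Frequency & Weighting slide allows.
    """
    raw = sum(weights.get(concept, 0.0) * n for concept, n in concept_counts.items())
    return 100.0 * raw / max(doc_length, 1)


def decide(score, threshold, verify_margin):
    """Compare a score to the threshold and flag borderline cases.

    Treating the "verification range" as a band around the threshold is an
    assumption, not something the slides define.
    """
    return {
        "assign": score >= threshold,
        "verify": abs(score - threshold) <= verify_margin,
    }


# Hypothetical word-concept frequencies and weights for one document.
counts = {"merger_terms": 4, "sports_terms": 2}
weights = {"merger_terms": 0.5, "sports_terms": -0.25}

score = topic_score(counts, weights, doc_length=200)
decision = decide(score, threshold=0.5, verify_margin=0.1)
print(score, decision)
```

With these invented numbers the raw score is 4 * 0.5 + 2 * (-0.25) = 1.5, which normalizes to 0.75 and clears the threshold without falling in the review band.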

  11. Source-dependent, -independent
      • Similar field functions, different field names and locations
      • Database and file information to guide production processes
      The source specification file allows us to reuse a single topic definition across a wide variety of sources and source types.
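One way to picture the idea of a source specification is a per-source mapping from source-specific field names to the common field functions that a topic definition relies on. The slides do not describe the actual file format, so the structure, source names, and field names below are entirely hypothetical.

```python
# Hypothetical source specifications: field function -> field name in that source.
SOURCE_SPECS = {
    "newswire_a": {"headline": "HL", "lead": "LEAD-PARA", "body": "TEXT"},
    "journal_b": {"headline": "TITLE", "lead": "ABSTRACT", "body": "BODY"},
}


def to_zones(record, source):
    """Translate a source-specific record into source-independent zones."""
    spec = SOURCE_SPECS[source]
    # Missing fields become empty strings so downstream code sees every zone.
    return {role: record.get(field, "") for role, field in spec.items()}


wire = to_zones({"HL": "Mead Corp. earnings", "TEXT": "Sales rose."}, "newswire_a")
journal = to_zones({"TITLE": "Paper markets", "BODY": "Prices fell."}, "journal_b")
print(wire)
print(journal)
```

Under this arrangement the same topic definition can be run on both records, because it only ever refers to "headline", "lead", and "body".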

  12. Manual vs. Automatic
      • Build each definition using iterative manual process
      • Use supervised learning?
        • TTI’s chi-square and regression
        • Cost of creating training samples
      • Automate repetitive, labor-intensive tasks
        • Generate name variants
      • Cheap labor cost – few minutes to 8 hours

  13. Test, Test, Test • Business unit benchmarks prior to adoption • Development process test cases • Internal benchmarks with 3rd party technologies • Sorry, not TREC • Most tests, topics, sources – recall and precision both in the 90-95% range

  14. The End? • TIS Model? 16 years old • TTI? In production for 11 years • Term Mapping? 9 years old • Entity Indexing? 6-7 years old • Topical Indexing? 3 years old • Complemented by SRA NetOwl-based indexing 2 years ago • No movement afoot to replace any of them

  15. Related Papers
      • TTI
        • Leigh, S. (1991). The Use of Natural Language Processing in the Development of Topic Specific Databases. Proceedings of the 12th National Online Meeting.
      • Company Concept Indexing
        • Wasson, M. (2000). Large-scale Controlled Vocabulary Indexing for Named Entities. Proceedings of the ANLP-NAACL 2000 Conference.
