Classification Technology at LexisNexis • SIGIR 2001 Workshop on Operational Text Classification • Mark Wasson, LexisNexis, mark.wasson@lexisnexis.com • September 13, 2001
The Topic Identification System • The Topic Identification System Model • Term-based Topic Identification (TTI) • Term Mapping System • Company Concept Indexing • Named Entity Indexing (Companies, People, Organizations, Places) • Subject Indexing Prototype (not released) • NEXIS Topical Indexing
Psycholinguistic Features • Propositional Language Model Underlies Surface Forms • Word Concepts • Semantic Priming, Additive up to a Point • Spreading Activation
Terms and Word Concepts • All words and phrases are searchable – no stop words • No automatic morphological or thesaurus expansion • Exception – name variant generation, but subject to human verification • Word Concept: a set of functionally equivalent terms with respect to a given topic; 1 to 100s of terms in a single word concept
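As an illustration of the word-concept idea (not LexisNexis code), a word concept can be represented as a named set of functionally equivalent terms; the concept names, terms and the naive matching below are hypothetical.

```python
# A minimal sketch of a "word concept": a named set of functionally
# equivalent terms with respect to one topic. Names and terms are
# hypothetical examples, not LexisNexis data.

word_concepts = {
    "layoff": {"layoff", "layoffs", "downsizing", "job cuts", "workforce reduction"},
    "merger": {"merger", "mergers", "acquisition", "takeover", "buyout"},
}

def concepts_matched(text, concepts):
    """Return the word concepts whose terms appear in the text.
    Matching here is naive lowercase substring matching; the real system
    looks up all searchable words and phrases, with no stop words and no
    automatic morphological or thesaurus expansion."""
    lowered = text.lower()
    return {name for name, terms in concepts.items()
            if any(term in lowered for term in terms)}

print(concepts_matched("The company announced job cuts after the takeover.",
                       word_concepts))
# -> {'layoff', 'merger'} (set order may vary)
```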
Frequency & Weighting • Frequency & weighting at word concept level rather than at individual term level • TTI used chi-square to compare individual word concepts to supervised training set • TTI used stepwise linear regression to test in combination and suggest weights • Allow both positive and negative weights in addition to absolute yes/no Boolean functionality
Problem Word Concepts • 5 documents: 3 relevant (G1, G2, G3), 2 irrelevant (B1, B2) • W1 in G1, G2, B1 • W2 in G2, G3, B2 • W3 in G1, G3, B1 • Each W by itself produces 67% recall, 67% precision • W1 + W2 -> 100% recall, 60% precision • W1 + W3 -> 100% recall, 75% precision • W2 + W3 -> 100% recall, 60% precision • W1 + W2 + W3 -> 100% recall, 60% precision • Also, fewer terms -> faster processing
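The figures on this slide can be reproduced with a few lines of set arithmetic, using the document labels exactly as given above; a document counts as retrieved if any word concept in the combination occurs in it.

```python
# Reproduce the recall/precision figures from the slide.
from itertools import combinations

relevant = {"G1", "G2", "G3"}
occurs_in = {"W1": {"G1", "G2", "B1"},
             "W2": {"G2", "G3", "B2"},
             "W3": {"G1", "G3", "B1"}}

for r in (1, 2, 3):
    for combo in combinations(occurs_in, r):
        retrieved = set().union(*(occurs_in[w] for w in combo))
        hits = retrieved & relevant
        recall = len(hits) / len(relevant)
        precision = len(hits) / len(retrieved)
        print("+".join(combo), f"recall={recall:.0%} precision={precision:.0%}")
# W1+W3 gives 100% recall at 75% precision; adding W2 only hurts precision.
```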
Looking Up Terms in Documents • Count a term extra in key document parts • Headlines • Leading text • Captions • Count all potential matches • American gets counted for 100s of companies • Don’t count a term when part of another • Mead in Mead Corp. • French in French Fry
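One way to picture the lookup rules on this slide is a toy counter that gives extra weight to terms found in the headline and refuses to count a term when it only occurs inside a longer known term (Mead inside Mead Corp.). The field names, boost factor and term list are assumptions for illustration, not the production lookup.

```python
import re

# Toy term counter illustrating two lookup rules from the slide:
#  1) matches in key document parts (here, the headline) count extra;
#  2) a term is not counted when it is part of a longer known term,
#     e.g. "Mead" inside "Mead Corp." or "French" inside "French fry".
# Field names, boost factor and term list are illustrative assumptions.

TERMS = ["Mead Corp.", "Mead", "French fry", "French"]
HEADLINE_BOOST = 2  # assumed extra weight for headline matches

def count_terms(fields):
    counts = {t: 0 for t in TERMS}
    for field, text in fields.items():
        weight = HEADLINE_BOOST if field == "headline" else 1
        covered = []  # character spans already claimed by longer terms
        for term in sorted(TERMS, key=len, reverse=True):  # longest first
            for m in re.finditer(re.escape(term), text):
                span = (m.start(), m.end())
                if any(s <= span[0] and span[1] <= e for s, e in covered):
                    continue  # inside a longer term -> don't count
                covered.append(span)
                counts[term] += weight
    return counts

doc = {"headline": "Mead Corp. profits rise",
       "body": "Mead reported strong sales of French fry packaging."}
print(count_terms(doc))
# -> {'Mead Corp.': 2, 'Mead': 1, 'French fry': 1, 'French': 0}
```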
Calculating Topic Scores • Summation of frequency * weight across all word concepts • Normalize the score • Compare to a threshold • Verification range in TTI • Major references, strong passing references, weak passing references in the indexing tools • Add a controlled vocabulary term or marker to the document if score >= threshold • Add the score and any associated secondary controlled vocabulary terms (CVTs)
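A condensed sketch of the scoring step described on this slide: sum frequency times weight over the word concepts, normalize, then compare against thresholds. The length normalization, the specific weights and the threshold values are assumptions, not the production formula.

```python
# Sketch of topic scoring: sum(frequency * weight) over word concepts,
# normalize, then compare against thresholds. The per-1000-words
# normalization and all numbers here are illustrative assumptions.

def topic_score(concept_freqs, weights, doc_length):
    raw = sum(freq * weights.get(concept, 0.0)
              for concept, freq in concept_freqs.items())
    return 1000.0 * raw / max(doc_length, 1)  # assumed length normalization

def assign_topic(score, major=5.0, passing=2.0):
    """Map a normalized score to an indexing decision. Thresholds are
    invented; the real tools distinguish major references from strong
    and weak passing references."""
    if score >= major:
        return "major reference"
    if score >= passing:
        return "passing reference"
    return None  # below threshold: no controlled vocabulary term added

freqs = {"merger": 6, "layoff": 1, "sports": 2}
weights = {"merger": 1.0, "layoff": 0.5, "sports": -0.75}  # negative weights allowed
score = topic_score(freqs, weights, doc_length=800)
print(score, assign_topic(score))  # 6.25 major reference
```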
Source-dependent, -independent • Similar field functions, different field names and locations • Database and file information to guide production processes • The source specification file allows us to reuse a single topic definition across a wide variety of sources and source types
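The source specification format itself is not described in the talk. Purely as a hypothetical illustration, such a mapping might look like the snippet below, which lets one topic definition find the headline and leading-text fields under whatever names a given source uses.

```python
# Hypothetical illustration only: the actual LexisNexis source
# specification format is not described in the talk. The idea is to map
# source-specific field names onto the field functions the topic
# definitions expect (headline, leading text, captions, body).

SOURCE_SPECS = {
    "newswire-A": {"HED": "headline", "LEDE": "leading_text", "TXT": "body"},
    "magazine-B": {"TITLE": "headline", "ABSTRACT": "leading_text",
                   "CAPTION": "caption", "BODY": "body"},
}

def normalize_fields(source, raw_doc):
    """Rename a document's fields so a single topic definition can be
    reused across many sources and source types."""
    mapping = SOURCE_SPECS[source]
    return {mapping[name]: text for name, text in raw_doc.items() if name in mapping}

print(normalize_fields("newswire-A", {"HED": "Mead Corp. profits rise",
                                      "TXT": "Earnings were up sharply..."}))
```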
Manual vs. Automatic • Build each definition using an iterative manual process • Use supervised learning? • TTI's chi-square and regression • Cost of creating training samples • Automate repetitive, labor-intensive tasks • Generate name variants • Low labor cost: a few minutes to 8 hours
Test, Test, Test • Business unit benchmarks prior to adoption • Development process test cases • Internal benchmarks against 3rd-party technologies • Sorry, not TREC • Across most tests, topics and sources, recall and precision are both in the 90-95% range
The End? • TIS Model? 16 years old • TTI? In production for 11 years • Term Mapping? 9 years old • Entity Indexing? 6-7 years old • Topical Indexing? 3 years old • Complemented by SRA NetOwl-based indexing, adopted 2 years ago • No movement afoot to replace any of them
Related Papers • TTI • Leigh, S. (1991). The Use of Natural Language Processing in the Development of Topic Specific Databases. Proceedings of the 12th National Online Meeting. • Company Concept Indexing • Wasson, M. (2000). Large-scale Controlled Vocabulary Indexing for Named Entities. Proceedings of the ANLP-NAACL 2000 Conference.