Classification Technology at LexisNexis • SIGIR 2001 Workshop on Operational Text Classification • Mark Wasson, LexisNexis, mark.wasson@lexisnexis.com • September 13, 2001
The Topic Identification System • The Topic Identification System Model • Term-based Topic Identification (TTI) • Term Mapping System • Company Concept Indexing • Named Entity Indexing (Companies, People, Organizations, Places) • Subject Indexing Prototype (not released) • NEXIS Topical Indexing
Psycholinguistic Features • Propositional Language Model Underlies Surface Forms • Word Concepts • Semantic Priming, Additive up to a Point • Spreading Activation
Terms and Word Concepts • All words and phrases are searchable – no stop words • No automatic morphological or thesaurus expansion • Exception – name variant generation, but subject to human verification • Word Concept: a set of functionally equivalent terms with respect to a given topic; 1 to 100s of terms in a single word concept
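As an illustration of the word-concept idea (not LexisNexis code), a word concept can be represented as a named set of functionally equivalent terms; the concept names, terms and the naive matching below are hypothetical.

```python
# A minimal sketch of a "word concept": a named set of functionally
# equivalent terms with respect to one topic. Names and terms are
# hypothetical examples, not LexisNexis data.

word_concepts = {
    "layoff": {"layoff", "layoffs", "downsizing", "job cuts", "workforce reduction"},
    "merger": {"merger", "mergers", "acquisition", "takeover", "buyout"},
}

def concepts_matched(text, concepts):
    """Return the word concepts whose terms appear in the text.
    Matching here is naive lowercase substring matching; the real system
    looks up all searchable words and phrases, with no stop words and no
    automatic morphological or thesaurus expansion."""
    lowered = text.lower()
    return {name for name, terms in concepts.items()
            if any(term in lowered for term in terms)}

print(concepts_matched("The company announced job cuts after the takeover.",
                       word_concepts))
# -> {'layoff', 'merger'} (set order may vary)
```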
Frequency & Weighting • Frequency & weighting at word concept level rather than at individual term level • TTI used chi-square to compare individual word concepts to supervised training set • TTI used stepwise linear regression to test in combination and suggest weights • Allow both positive and negative weights in addition to absolute yes/no Boolean functionality
Problem Word Concepts • 5 documents: 3 relevant (G1, G2, G3), 2 irrelevant (B1, B2) • W1 in G1, G2, B1 • W2 in G2, G3, B2 • W3 in G1, G3, B1 • Each W by itself produces 67% recall, 67% precision • W1 + W2 -> 100% recall, 60% precision • W1 + W3 -> 100% recall, 75% precision • W2 + W3 -> 100% recall, 60% precision • W1 + W2 + W3 -> 100% recall, 60% precision • Also, fewer terms -> faster processing
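The figures on this slide can be reproduced with a few lines of set arithmetic, using the document labels exactly as given above; a document counts as retrieved if any word concept in the combination occurs in it.

```python
# Reproduce the recall/precision figures from the slide.
from itertools import combinations

relevant = {"G1", "G2", "G3"}
occurs_in = {"W1": {"G1", "G2", "B1"},
             "W2": {"G2", "G3", "B2"},
             "W3": {"G1", "G3", "B1"}}

for r in (1, 2, 3):
    for combo in combinations(occurs_in, r):
        retrieved = set().union(*(occurs_in[w] for w in combo))
        hits = retrieved & relevant
        recall = len(hits) / len(relevant)
        precision = len(hits) / len(retrieved)
        print("+".join(combo), f"recall={recall:.0%} precision={precision:.0%}")
# W1+W3 gives 100% recall at 75% precision; adding W2 only hurts precision.
```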
Looking Up Terms in Documents • Count a term extra in key document parts • Headlines • Leading text • Captions • Count all potential matches • American gets counted for 100s of companies • Don’t count a term when part of another • Mead in Mead Corp. • French in French Fry
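One way to picture the lookup rules on this slide is a toy counter that gives extra weight to terms found in the headline and refuses to count a term when it only occurs inside a longer known term (Mead inside Mead Corp.). The field names, boost factor and term list are assumptions for illustration, not the production lookup.

```python
import re

# Toy term counter illustrating two lookup rules from the slide:
#  1) matches in key document parts (here, the headline) count extra;
#  2) a term is not counted when it is part of a longer known term,
#     e.g. "Mead" inside "Mead Corp." or "French" inside "French fry".
# Field names, boost factor and term list are illustrative assumptions.

TERMS = ["Mead Corp.", "Mead", "French fry", "French"]
HEADLINE_BOOST = 2  # assumed extra weight for headline matches

def count_terms(fields):
    counts = {t: 0 for t in TERMS}
    for field, text in fields.items():
        weight = HEADLINE_BOOST if field == "headline" else 1
        covered = []  # character spans already claimed by longer terms
        for term in sorted(TERMS, key=len, reverse=True):  # longest first
            for m in re.finditer(re.escape(term), text):
                span = (m.start(), m.end())
                if any(s <= span[0] and span[1] <= e for s, e in covered):
                    continue  # inside a longer term -> don't count
                covered.append(span)
                counts[term] += weight
    return counts

doc = {"headline": "Mead Corp. profits rise",
       "body": "Mead reported strong sales of French fry packaging."}
print(count_terms(doc))
# -> {'Mead Corp.': 2, 'Mead': 1, 'French fry': 1, 'French': 0}
```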
Calculating Topic Scores • Summation of frequency * weight across all word concepts • Normalize the score • Compare to a threshold • Verification range in TTI • Major references, strong passing references, weak passing references in the indexing tools • Add a controlled vocabulary term or marker to the document if score >= threshold • Add the score and any associated secondary controlled vocabulary terms (CVTs)
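A condensed sketch of the scoring step described on this slide: sum frequency times weight over the word concepts, normalize, then compare against thresholds. The length normalization, the specific weights and the threshold values are assumptions, not the production formula.

```python
# Sketch of topic scoring: sum(frequency * weight) over word concepts,
# normalize, then compare against thresholds. The per-1000-words
# normalization and all numbers here are illustrative assumptions.

def topic_score(concept_freqs, weights, doc_length):
    raw = sum(freq * weights.get(concept, 0.0)
              for concept, freq in concept_freqs.items())
    return 1000.0 * raw / max(doc_length, 1)  # assumed length normalization

def assign_topic(score, major=5.0, passing=2.0):
    """Map a normalized score to an indexing decision. Thresholds are
    invented; the real tools distinguish major references from strong
    and weak passing references."""
    if score >= major:
        return "major reference"
    if score >= passing:
        return "passing reference"
    return None  # below threshold: no controlled vocabulary term added

freqs = {"merger": 6, "layoff": 1, "sports": 2}
weights = {"merger": 1.0, "layoff": 0.5, "sports": -0.75}  # negative weights allowed
score = topic_score(freqs, weights, doc_length=800)
print(score, assign_topic(score))  # 6.25 major reference
```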
Source-dependent, -independent • Similar field functions, different field names and locations • Database and file information to guide production processes • The source specification file allows us to reuse a single topic definition across a wide variety of sources and source types
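The source specification format itself is not described in the talk. Purely as a hypothetical illustration, such a mapping might look like the snippet below, which lets one topic definition find the headline and leading-text fields under whatever names a given source uses.

```python
# Hypothetical illustration only: the actual LexisNexis source
# specification format is not described in the talk. The idea is to map
# source-specific field names onto the field functions the topic
# definitions expect (headline, leading text, captions, body).

SOURCE_SPECS = {
    "newswire-A": {"HED": "headline", "LEDE": "leading_text", "TXT": "body"},
    "magazine-B": {"TITLE": "headline", "ABSTRACT": "leading_text",
                   "CAPTION": "caption", "BODY": "body"},
}

def normalize_fields(source, raw_doc):
    """Rename a document's fields so a single topic definition can be
    reused across many sources and source types."""
    mapping = SOURCE_SPECS[source]
    return {mapping[name]: text for name, text in raw_doc.items() if name in mapping}

print(normalize_fields("newswire-A", {"HED": "Mead Corp. profits rise",
                                      "TXT": "Earnings were up sharply..."}))
```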
Manual vs. Automatic • Build each definition using an iterative manual process • Use supervised learning? • TTI's chi-square and regression • Cost of creating training samples • Automate repetitive, labor-intensive tasks • Generate name variants • Low labor cost: a few minutes to 8 hours
Test, Test, Test • Business unit benchmarks prior to adoption • Development process test cases • Internal benchmarks against 3rd-party technologies • Sorry, not TREC • Across most tests, topics and sources, recall and precision are both in the 90-95% range
The End? • TIS Model? 16 years old • TTI? In production for 11 years • Term Mapping? 9 years old • Entity Indexing? 6-7 years old • Topical Indexing? 3 years old • Complemented by SRA NetOwl-based indexing, adopted 2 years ago • No movement afoot to replace any of them
Related Papers • TTI • Leigh, S. (1991). The Use of Natural Language Processing in the Development of Topic Specific Databases. Proceedings of the 12th National Online Meeting. • Company Concept Indexing • Wasson, M. (2000). Large-scale Controlled Vocabulary Indexing for Named Entities. Proceedings of the ANLP-NAACL 2000 Conference.