Noisy Text Analytics: An Exercise in Futility?
Rohini Srihari
Janya, Inc.
www.janyainc.com
8 January 2007
Overview: Noisy Text Analytics
• All text is noisy!
• It does not fit shrink-wrapped processing; adaptation is necessary
• Business and national security interests in processing:
  • Open source data (e.g. web pages)
  • Consumer-generated media (blogs, newsgroups, chat, text messaging, etc.)
• The key is to identify analysis requirements clearly
• It is not necessary to understand everything
Challenging Problems
• Mixed modalities
  • Structured and unstructured; free text cannot be processed in a vacuum; need to correlate information from different sections
  • Text with images, figures
• Improve within-document and cross-document information consolidation
• World models for discourse processing
  • Need to bring in more context; relate text analytics to semantic web activities (DAML/OWL)
  • Dynamic use of online resources
• Adaptive text analytics
  • Extraction requirements are constantly changing, and so is the data!
  • Corpus-based learning
• Flexible architectures
  • Integrating additional preprocessing, handling streaming data, etc.
USMTF Document Structure
OPER/BRAVE CHILD//
MSGID/BDAREP PHASE2/NMJIC/F-0005//
BDAREPID/BEN:1111-22222/REPCOUNT:1//
ICOD/011630ZJAN2002//
BDACELL/NMJIC/TEL:COM 777-666-9999/TEL:DSN 222-9999/SECTEL:999-3333//
GENTEXT/PURPOSE/THIS PHASE 2 BDA REPORT IS AN ALL-SOURCE ASSESSMENT CONTAINING DETAILED PHYSICAL AND FUNCTIONAL DAMAGE ASSESSMENTS, INPUTS TO THE TARGET SYSTEM ASESSMENT, AND COMMENTS ON MUNITION EFFECTIVENESS. PHASE 2 IMAGERY, IF PRODUCED, CAN BE LOCATED ON THE IMAGERY SERVER USING THE KEYWORD 'BDA.'//
Sample Document: Sets, Fields, Free-text field
• The same BDAREP message serves as the example for sets, for fields, and for the free-text field (the GENTEXT/PURPOSE narrative)
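The set and field structure of the sample message can be recovered with very little code. The following is a minimal sketch, assuming (as the sample suggests) that each set ends with "//", that fields inside a set are separated by "/", and that a GENTEXT set carries a title followed by free text. The function name `parse_message` and the dictionary layout are illustrative choices, and the GENTEXT narrative is shortened here.

```python
def parse_message(text):
    """Split a USMTF-style message into sets, and each set into fields.

    Assumption: "//" terminates a set and "/" separates fields.
    """
    sets = []
    for raw in text.split("//"):
        raw = " ".join(raw.split())  # collapse line breaks and extra spaces
        if not raw:
            continue
        set_id, _, rest = raw.partition("/")
        if set_id == "GENTEXT":
            # Free-text set: keep the narrative whole instead of splitting it.
            title, _, narrative = rest.partition("/")
            sets.append({"set": set_id, "title": title, "text": narrative})
        else:
            fields = rest.split("/") if rest else []
            sets.append({"set": set_id, "fields": fields})
    return sets


# Excerpt of the sample message; the GENTEXT narrative is shortened.
SAMPLE = (
    "OPER/BRAVE CHILD// MSGID/BDAREP PHASE2/NMJIC/F-0005// "
    "BDAREPID/BEN:1111-22222/REPCOUNT:1// ICOD/011630ZJAN2002// "
    "GENTEXT/PURPOSE/THIS PHASE 2 BDA REPORT IS AN ALL-SOURCE ASSESSMENT//"
)

parsed = parse_message(SAMPLE)
for entry in parsed:
    print(entry)
```

Keeping the GENTEXT narrative as a single string leaves it available for downstream free-text analysis, while the structured sets can be consumed directly.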
Sample Document: Entity Description/Name Field
TGTELEM/PLANNED:Y/-/TGTEL:C2 OPERATIONS BLDG/TMPAGE:G3/TMGRID:B.5-S.0 //
ELEMDMG/PHYDMG:SVR/CONF:CONF/FUNCDMG:DES/STCHG:Y/MINRECUP:3MON /MAXRECUP:6MON//
GENTEXT/DAMAGE NARRATIVE/ALL-SOURCE INTELLIGENCE CONFIRMS THAT THE C2 OPERATIONS BUILDING HAS SUFFERED SEVERE INTERNAL DAMAGE AND IS FUNCTIONALLY DESTROYED. EXTENSIVE SMOKE FROM INTERNAL FIRES IS CLEARLY VISABLE. NUMEROUS FIRE TRUCKS ARE IN THE FACILITY. COCKPIT VIDEO CONFIRMS FOUR WEAPONS IMPACTING, WITH AT LEAST ONE PENETRATING TO THE BASEMENT OF THE BUILDING. ESTIMATE BIG COUNTRY WILL REQUIRE SIGNIFICANT TIME, AND PROBABLE FOREIGN TECHNICAL ASSISTANCE TO RECONSTITUTE C2 EQUIPMENT//
Sample Document: Reference to Structured Sets from Free Text
• In the same message, the free-text narrative's "THE C2 OPERATIONS BUILDING" refers to the entity named in the structured field TGTEL:C2 OPERATIONS BLDG
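Linking a free-text mention back to a structured field can be illustrated with simple normalization. This is only a sketch under stated assumptions: the abbreviation table `ABBREVIATIONS` is a hypothetical one-entry example, the helper names are invented for illustration, and a real system would need far more robust matching.

```python
import re

# Hypothetical toy abbreviation table; a real one would be much larger.
ABBREVIATIONS = {"BLDG": "BUILDING"}


def normalize(phrase):
    """Uppercase, tokenize on letters and digits, expand known abbreviations."""
    tokens = re.findall(r"[A-Z0-9]+", phrase.upper())
    return " ".join(ABBREVIATIONS.get(tok, tok) for tok in tokens)


def find_references(field_values, narrative):
    """Return the structured fields whose normalized value occurs in the narrative."""
    padded_text = f" {normalize(narrative)} "
    return {
        name: value
        for name, value in field_values.items()
        if f" {normalize(value)} " in padded_text  # whole-token match only
    }


fields = {"TGTEL": "C2 OPERATIONS BLDG", "TMPAGE": "G3"}
narrative = (
    "ALL-SOURCE INTELLIGENCE CONFIRMS THAT THE C2 OPERATIONS BUILDING HAS "
    "SUFFERED SEVERE INTERNAL DAMAGE AND IS FUNCTIONALLY DESTROYED."
)

links = find_references(fields, narrative)
print(links)
```

Padding both strings with spaces keeps the match on whole tokens, so a short value such as G3 cannot match inside a longer token.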
Corpus-Based Learning
• Training phase requires four inputs
  • Document repository (unlabeled training data)
  • Config file1 for DTL Context (how to create unlabeled training data)
  • Seed file (how to label a small amount of the unlabeled training data)
  • Config file2 for Learning Tool
    • How to learn a model
    • How to use the learned model in Semantex
Diagram labels: Seed File, Document Repository, DTL Context, Learning Tool Trainer, Training Data, Config File2, Config File1, Learned Model
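The flow from the four training inputs to a learned model can be sketched in a few functions. This is not Semantex, DTL Context, or the Learning Tool: every function name, the window-based features, the count-based learner, and the toy documents and seeds below are assumptions made purely for illustration.

```python
from collections import Counter, defaultdict


def build_instances(documents, window=1):
    """Stand-in for config file 1: one instance per token, features = neighbours."""
    instances = []
    for doc in documents:
        tokens = doc.lower().split()
        for i, tok in enumerate(tokens):
            context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
            instances.append((tok, context))
    return instances


def label_with_seeds(instances, seeds):
    """Stand-in for the seed file: label only instances whose word is a seed."""
    return [(context, seeds[tok]) for tok, context in instances if tok in seeds]


def train(labeled):
    """Stand-in for config file 2: count context features per label."""
    model = defaultdict(Counter)
    for context, label in labeled:
        model[label].update(context)
    return model


def classify(model, context):
    """Pick the label whose seed contexts share the most features."""
    scores = {label: sum(counts[f] for f in context) for label, counts in model.items()}
    return max(scores, key=scores.get)


# Invented toy repository and seed file.
repository = ["the attack began at dawn", "the table stood at home"]
seeds = {"attack": "event", "table": "nonevent"}

model = train(label_with_seeds(build_instances(repository), seeds))
print(classify(model, ["the", "began"]))  # context of an unseen noun
```

The point of the sketch is the separation of concerns: swapping the seed file or either configuration changes the task without touching the rest of the pipeline.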
Example: Nominal Event Classifier
• Seed file: 95 unambiguous event nominals, 295 unambiguous nonevent nominals
• Repository: News texts processed by Semantex
• Config file (DTL): Look at features surrounding nouns
• Config file (Learning Tool): Learn using a mixture model
Example: Disease Outbreak Classifier
• Seed file: 10 verb types representative of disease outbreak
• Repository: Medical reports processed by Semantex
• Config file (DTL): Look at features surrounding verbs
• Config file (Learning Tool): Learn using distributional similarity
Versatility of the learning tool applied to different tasks
Example: Name Disambiguation
• Are two instances of Tom Smith the same individual?
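Distributional similarity, as named in the disease outbreak example, compares words by the contexts they occur in. The sketch below uses cosine similarity over context-word counts, which is one common formulation and not necessarily the one the Learning Tool uses; the sentences, the seed verb, and the function names are invented for illustration.

```python
import math
from collections import Counter


def context_vector(word, sentences, window=2):
    """Count the words that appear within `window` tokens of `word`."""
    vec = Counter()
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, tok in enumerate(tokens):
            if tok == word:
                vec.update(tokens[max(0, i - window):i])
                vec.update(tokens[i + 1:i + 1 + window])
    return vec


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


# Invented toy corpus; "spread" plays the role of a seed verb.
sentences = [
    "cholera spread through the region",
    "measles erupted through the region",
    "officials met in the capital",
]

seed = context_vector("spread", sentences)
sim_erupted = cosine(seed, context_vector("erupted", sentences))
sim_met = cosine(seed, context_vector("met", sentences))
print(sim_erupted > sim_met)
```

Verbs that share more context with the seed verbs score higher, which is how a small seed list can be expanded to new verb types found in the repository.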
Conclusions
• Dealing with noisy text is not a futile exercise!
• Commercial applications are already available
• Need to specify analysis requirements clearly
• Adapt IE technology appropriately