Noisy Text Analytics: An Exercise in Futility?

Rohini Srihari, Janya, Inc. (www.janyainc.com). 8 January 2007.


Presentation Transcript


  1. Noisy Text Analytics: An Exercise in Futility?
     Rohini Srihari, Janya, Inc. (www.janyainc.com)
     8 January 2007

  2. Overview: Noisy Text Analytics
     • All text is noisy!
     • Does not fit shrink-wrapped processing; adaptation is necessary
     • Business and national security interests in processing:
       • Open source data (e.g. web pages)
       • Consumer-generated media (blogs, newsgroups, chat, text messaging, etc.)
     • Key is to identify analysis requirements clearly
       • Not necessary to understand everything

  3. Challenging Problems
     • Mixed modalities
       • Structured and unstructured; free text cannot be processed in a vacuum; need to correlate information from different sections
       • Text with images, figures
     • Improve within-document information consolidation, cross-document information consolidation
     • World models for discourse processing
       • Need to bring in more context; relate text analytics to semantic web activities (DAML/OWL)
       • Dynamic use of online resources
     • Adaptive text analytics
       • Extraction requirements are constantly changing, and so is the data!
       • Corpus-based learning
     • Flexible architectures
       • Integrating additional preprocessing, handling streaming data, etc.

  4. USMTF Document Structure
     OPER/BRAVE CHILD//
     MSGID/BDAREP PHASE2/NMJIC/F-0005//
     BDAREPID/BEN:1111-22222/REPCOUNT:1//
     ICOD/011630ZJAN2002//
     BDACELL/NMJIC/TEL:COM 777-666-9999/TEL:DSN 222-9999/SECTEL:999-3333//
     GENTEXT/PURPOSE/THIS PHASE 2 BDA REPORT IS AN ALL-SOURCE ASSESSMENT CONTAINING DETAILED PHYSICAL AND FUNCTIONAL DAMAGE ASSESSMENTS, INPUTS TO THE TARGET SYSTEM ASESSMENT, AND COMMENTS ON MUNITION EFFECTIVENESS. PHASE 2 IMAGERY, IF PRODUCED, CAN BE LOCATED ON THE IMAGERY SERVER USING THE KEYWORD 'BDA.'//

  5. Sample Document Sets
     OPER/BRAVE CHILD//
     MSGID/BDAREP PHASE2/NMJIC/F-0005//
     BDAREPID/BEN:1111-22222/REPCOUNT:1//
     ICOD/011630ZJAN2002//
     BDACELL/NMJIC/TEL:COM 777-666-9999/TEL:DSN 222-9999/SECTEL:999-3333//
     GENTEXT/PURPOSE/THIS PHASE 2 BDA REPORT IS AN ALL-SOURCE ASSESSMENT CONTAINING DETAILED PHYSICAL AND FUNCTIONAL DAMAGE ASSESSMENTS, INPUTS TO THE TARGET SYSTEM ASESSMENT, AND COMMENTS ON MUNITION EFFECTIVENESS. PHASE 2 IMAGERY, IF PRODUCED, CAN BE LOCATED ON THE IMAGERY SERVER USING THE KEYWORD 'BDA.'//

  6. Sample Document Fields
     OPER/BRAVE CHILD//
     MSGID/BDAREP PHASE2/NMJIC/F-0005//
     BDAREPID/BEN:1111-22222/REPCOUNT:1//
     ICOD/011630ZJAN2002//
     BDACELL/NMJIC/TEL:COM 777-666-9999/TEL:DSN 222-9999/SECTEL:999-3333//
     GENTEXT/PURPOSE/THIS PHASE 2 BDA REPORT IS AN ALL-SOURCE ASSESSMENT CONTAINING DETAILED PHYSICAL AND FUNCTIONAL DAMAGE ASSESSMENTS, INPUTS TO THE TARGET SYSTEM ASESSMENT, AND COMMENTS ON MUNITION EFFECTIVENESS. PHASE 2 IMAGERY, IF PRODUCED, CAN BE LOCATED ON THE IMAGERY SERVER USING THE KEYWORD 'BDA.'//

  7. Sample Document
     OPER/BRAVE CHILD//
     MSGID/BDAREP PHASE2/NMJIC/F-0005//
     BDAREPID/BEN:1111-22222/REPCOUNT:1//
     ICOD/011630ZJAN2002//
     BDACELL/NMJIC/TEL:COM 777-666-9999/TEL:DSN 222-9999/SECTEL:999-3333//
     GENTEXT/PURPOSE/THIS PHASE 2 BDA REPORT IS AN ALL-SOURCE ASSESSMENT CONTAINING DETAILED PHYSICAL AND FUNCTIONAL DAMAGE ASSESSMENTS, INPUTS TO THE TARGET SYSTEM ASESSMENT, AND COMMENTS ON MUNITION EFFECTIVENESS. PHASE 2 IMAGERY, IF PRODUCED, CAN BE LOCATED ON THE IMAGERY SERVER USING THE KEYWORD 'BDA.'//
     Free-text field
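The layout of the sample message suggests that sets end with `//` and that fields inside a set are separated by `/`, with some fields carrying a `NAME:VALUE` pair. A minimal Python sketch under that assumption follows; it is not a full USMTF parser, and the function names `parse_message` and `split_field` are hypothetical.

```python
def parse_message(text):
    """Split a USMTF-style message into sets, each with an id and its fields.

    Assumes sets are terminated by '//' and fields are separated by '/',
    as in the sample document; real messages may need more careful handling.
    """
    sets = []
    for raw in text.split("//"):
        raw = " ".join(raw.split())  # collapse line breaks and extra spaces
        if not raw:
            continue
        fields = [f.strip() for f in raw.split("/")]
        sets.append({"set_id": fields[0], "fields": fields[1:]})
    return sets


def split_field(field):
    """Split 'NAME:VALUE' into (name, value); unnamed fields get name None."""
    name, sep, value = field.partition(":")
    return (name, value) if sep else (None, field)


# The structured sets from the sample document, copied verbatim.
SAMPLE = (
    "OPER/BRAVE CHILD// MSGID/BDAREP PHASE2/NMJIC/F-0005// "
    "BDAREPID/BEN:1111-22222/REPCOUNT:1// ICOD/011630ZJAN2002// "
    "BDACELL/NMJIC/TEL:COM 777-666-9999/TEL:DSN 222-9999/SECTEL:999-3333//"
)

if __name__ == "__main__":
    for s in parse_message(SAMPLE):
        print(s["set_id"], [split_field(f) for f in s["fields"]])
```

A GENTEXT set parsed this way yields its narrative as the last field, which is the part that would then go to free-text analysis.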

  8. Sample Document
     Entity Description/Name Field
     TGTELEM/PLANNED:Y/-/TGTEL:C2 OPERATIONS BLDG/TMPAGE:G3/TMGRID:B.5-S.0//
     ELEMDMG/PHYDMG:SVR/CONF:CONF/FUNCDMG:DES/STCHG:Y/MINRECUP:3MON/MAXRECUP:6MON//
     GENTEXT/DAMAGE NARRATIVE/ALL-SOURCE INTELLIGENCE CONFIRMS THAT THE C2 OPERATIONS BUILDING HAS SUFFERED SEVERE INTERNAL DAMAGE AND IS FUNCTIONALLY DESTROYED. EXTENSIVE SMOKE FROM INTERNAL FIRES IS CLEARLY VISABLE. NUMEROUS FIRE TRUCKS ARE IN THE FACILITY. COCKPIT VIDEO CONFIRMS FOUR WEAPONS IMPACTING, WITH AT LEAST ONE PENETRATING TO THE BASEMENT OF THE BUILDING. ESTIMATE BIG COUNTRY WILL REQUIRE SIGNIFICANT TIME, AND PROBABLE FOREIGN TECHNICAL ASSISTANCE TO RECONSTITUTE C2 EQUIPMENT//

  9. Sample Document
     Reference to Structured Sets from Free Text
     TGTELEM/PLANNED:Y/-/TGTEL:C2 OPERATIONS BLDG/TMPAGE:G3/TMGRID:B.5-S.0//
     ELEMDMG/PHYDMG:SVR/CONF:CONF/FUNCDMG:DES/STCHG:Y/MINRECUP:3MON/MAXRECUP:6MON//
     GENTEXT/DAMAGE NARRATIVE/ALL-SOURCE INTELLIGENCE CONFIRMS THAT THE C2 OPERATIONS BUILDING HAS SUFFERED SEVERE INTERNAL DAMAGE AND IS FUNCTIONALLY DESTROYED. EXTENSIVE SMOKE FROM INTERNAL FIRES IS CLEARLY VISABLE. NUMEROUS FIRE TRUCKS ARE IN THE FACILITY. COCKPIT VIDEO CONFIRMS FOUR WEAPONS IMPACTING, WITH AT LEAST ONE PENETRATING TO THE BASEMENT OF THE BUILDING. ESTIMATE BIG COUNTRY WILL REQUIRE SIGNIFICANT TIME, AND PROBABLE FOREIGN TECHNICAL ASSISTANCE TO RECONSTITUTE C2 EQUIPMENT//
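In this sample the narrative refers to "THE C2 OPERATIONS BUILDING" while the structured TGTEL field says "C2 OPERATIONS BLDG". One simple way to connect the two could be to expand abbreviations and then look for the field value as a token sequence in the free text. The sketch below is only an illustration of that idea, not the method used in the talk; the `ABBREVIATIONS` table and the function names are assumptions.

```python
import re

# Hypothetical abbreviation table; a real system would need a much larger one.
ABBREVIATIONS = {"BLDG": "BUILDING"}


def normalize(text):
    """Upper-case, tokenize on letters/digits, and expand known abbreviations."""
    tokens = re.findall(r"[A-Z0-9]+", text.upper())
    return [ABBREVIATIONS.get(t, t) for t in tokens]


def find_mention(field_value, narrative):
    """Return the token index where the field value occurs in the narrative, or -1."""
    target = normalize(field_value)
    tokens = normalize(narrative)
    if not target:
        return -1
    n = len(target)
    for i in range(len(tokens) - n + 1):
        if tokens[i:i + n] == target:
            return i
    return -1


if __name__ == "__main__":
    narrative = (
        "ALL-SOURCE INTELLIGENCE CONFIRMS THAT THE C2 OPERATIONS BUILDING "
        "HAS SUFFERED SEVERE INTERNAL DAMAGE AND IS FUNCTIONALLY DESTROYED."
    )
    print(find_mention("C2 OPERATIONS BLDG", narrative))
```

Exact matching after normalization is deliberately strict; noisy narratives with misspellings would likely call for approximate matching instead.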

  10. Cross-Document Entity Profile

  11. Corpus-Based Learning
     • Training phase requires four inputs
       • Document repository (unlabeled training data)
       • Config file1 for DTL Context (how to create unlabeled train data)
       • Seed file (how to label a small amount of unlabeled train data)
       • Config file2 for Learning Tool
         • How to learn a model
         • How to use learned model in Semantex
     [Diagram labels: Seed File, Document Repository, DTL Context, Learning Tool Trainer, Training Data, Config File2, Config File1, Learned Model]
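The slide does not give the formats of the seed file or the DTL configuration, so the following Python sketch only illustrates the general idea of seed-based labeling: words listed in a seed mapping label their occurrences in an unlabeled repository, and the surrounding words become features. The function name, the seed dictionary, and the `L:`/`R:` feature scheme are all assumptions made for illustration.

```python
def build_training_data(sentences, seeds, window=2):
    """Label seed-word occurrences and attach context-window features.

    sentences: list of token lists (the unlabeled repository)
    seeds:     dict mapping a lower-cased word to its label
    window:    number of context tokens taken on each side
    """
    examples = []
    for tokens in sentences:
        for i, tok in enumerate(tokens):
            label = seeds.get(tok.lower())
            if label is None:
                continue  # not a seed word, stays unlabeled
            left = tokens[max(0, i - window):i]
            right = tokens[i + 1:i + 1 + window]
            features = [f"L:{w.lower()}" for w in left]
            features += [f"R:{w.lower()}" for w in right]
            examples.append((features, label))
    return examples


if __name__ == "__main__":
    # Toy repository and seeds, invented purely for this example.
    repo = [
        ["The", "attack", "began", "at", "dawn"],
        ["The", "table", "was", "empty"],
    ]
    seeds = {"attack": "event", "table": "nonevent"}
    for features, label in build_training_data(repo, seeds):
        print(label, features)
```

The labeled examples produced this way would then be handed to whatever learner the second config file selects.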

  12. Example: Nominal Event Classifier
     Seedfile: 95 unambiguous event nominals, 295 unambiguous nonevent nominals
     Repository: News texts processed by Semantex
     Config file (DTL): Look at features surrounding nouns
     Config file (LearningTool): Learn using a mixture model
     Example: Disease Outbreak Classifier
     Seedfile: 10 verb types representative of disease outbreak
     Repository: Medical reports processed by Semantex
     Config file (DTL): Look at features surrounding verbs
     Config file (LearningTool): Learn using distributional similarity
     Versatility of learning tool applied to different tasks
     Example: Name Disambiguation
     • Are two instances of Tom Smith the same individual?
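The disease outbreak example mentions learning by distributional similarity without giving details. As a rough illustration, and not a description of how the Learning Tool works, one common formulation represents each word by counts of its neighboring words and compares words by cosine similarity. The function names and the toy sentences below are invented for this sketch.

```python
import math
from collections import Counter


def context_vector(target, sentences, window=2):
    """Count the words that occur within `window` tokens of `target`."""
    vec = Counter()
    for tokens in sentences:
        toks = [t.lower() for t in tokens]
        for i, tok in enumerate(toks):
            if tok == target:
                vec.update(toks[max(0, i - window):i])
                vec.update(toks[i + 1:i + 1 + window])
    return vec


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values()))
    norm *= math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


if __name__ == "__main__":
    # Toy corpus, invented for illustration only.
    corpus = [
        ["cholera", "spread", "through", "the", "region"],
        ["influenza", "erupted", "through", "the", "region"],
        ["officials", "visited", "a", "museum"],
    ]
    seed = context_vector("spread", corpus)
    for verb in ("erupted", "visited"):
        print(verb, cosine(seed, context_vector(verb, corpus)))
```

Verbs whose contexts resemble those of the seed verbs score higher and become candidates for the same class; the same comparison of context vectors could in principle be applied to two mentions of a name.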

  13. Conclusions
     • Dealing with noisy text is not a futile exercise!
     • Commercial applications are already available
     • Need to specify analysis requirements clearly
     • Adapt IE technology appropriately
