
Information Extraction

Information Extraction System. Converts unstructured text into a form that can be loaded into a database table. Mentions of entities are extracted without deep understanding. Identifies useful/relevant text in a document: a text segment and its associated attributes.

bowen

Presentation Transcript


  1. Information Extraction

  2. Information Extraction System • Converts unstructured text into a form that can be loaded into a database table • Mentions of entities extracted without deep understanding • Identifies useful/relevant text in a document • Text segment and its associated attributes • Example text: "Morita said that to overcome the same currency problems, Japan needs to restructure its economy in order to live less from exports and more from domestic demand." • Entities marked in the example: Morita, Japan

  3. Information Extraction • Names can be identified without a deep parse or complete text understanding • Pattern recognition, machine learning algorithms • Part of a higher-level application • Question answering • Summarization

  4. Information Extraction: Example • Query: "List the news reports of car bombings in Basra and surrounding areas between June 2004 and December 2004." • Needs semantic information • Date format • Subattributes of attributes • Conversion from unstructured to structured form • Natural language questions are open-ended, whereas SQL queries are not • Polysemy/synonymy
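Once reports have been converted to structured records, the query on this slide reduces to a filter over typed fields. The sketch below is only an illustration: the records, the field names, and the set of places counted as "surrounding areas" are all invented assumptions, not data from the presentation.

```python
from datetime import date

# Hypothetical records, as an IE system might emit them (values invented for illustration).
reports = [
    {"event": "car bombing", "city": "Basra", "date": date(2004, 7, 14)},
    {"event": "car bombing", "city": "Baghdad", "date": date(2004, 8, 2)},
    {"event": "car bombing", "city": "Az Zubayr", "date": date(2005, 1, 9)},
    {"event": "drive-by shooting", "city": "Basra", "date": date(2004, 9, 30)},
]

# "Surrounding areas" must be made explicit for a structured query; this set is an assumption.
BASRA_AREA = {"Basra", "Az Zubayr"}


def find_reports(records, event, cities, start, end):
    """Return records matching the event type, the place set, and the inclusive date range."""
    return [
        r for r in records
        if r["event"] == event and r["city"] in cities and start <= r["date"] <= end
    ]


hits = find_reports(reports, "car bombing", BASRA_AREA, date(2004, 6, 1), date(2004, 12, 31))
print(hits)
```

The open-ended parts of the question (what counts as "surrounding", how dates are written in the source text) have to be resolved before a filter like this can run, which is the point the bullets above make.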

  5. IE Applications • Financial, legal, and medical industries produce/use large amounts of text • Medicine • Information extracted from papers • Patients’ response to drugs • Summarize symptoms • Capture gene-drug interactions • Law • Case reports can be mined for • Description of individual cases • Case type(?)-ruling • Court opinion on different case categories • Related cases • Finance • Financial reports or news articles can be mined for • Company revenues • Earnings • Sales/assets over a period

  6. Positions Available

  7. Positions Available • Company: Overwatch • Location: Fort Leavenworth, KS 66027 • Status: Full Time, Employee • Job Category: IT/Software Development • Career Level: Experienced (Non-Manager) • Position Description: • Database Administrator • The overall goal of the TRADOC Intelligence Support Activity is to develop and apply the Contemporary Operational Environment (COE) to training, leader development, and combat developments in order to enhance operational capabilities of Army units. Duty is at TRISA in Fort Leavenworth, KS. • Develop a culture-based data standards database construct incorporating all schema (data model) types to be considered as a candidate for the Army’s common culture data standard, i.e., culture data standards to drive Army non-kinetic simulations/models

  8. Positions Available (cont) • Develop a conceptual schema (data model) consisting of entity classes (representing things of significance in the domain) and relationships (assertions about associations between pairs of entity classes). • Develop a logical schema (data model) consisting of descriptions of tables and columns, object oriented classes, and XML tags, among other things. • Develop a physical schema (data model) consisting of partitions, CPUs, table-spaces, and the like.

  9. Positions Available (cont) • 5-10 years DB administration experience • BS Computer Science minimum; MCDBA certification or any related certification(s) a plus • Expert at designing and integrating all three database schemas (conceptual, logical, and physical). • Familiar with data mining techniques and various methodologies of translating text files into data models (Visual Basic.NET, Advanced knowledge of SQL, MS Access, and Postgres). • Top Secret security clearance with SCI access. • Be willing to learn/ramp-up on BLUFOR and OPFOR/COE doctrine, organization, tactics, techniques, and procedures. • Become familiar with Future Force organization and maneuver concepts. • Excellent briefing and writing skills; familiar with Microsoft Word, EXCEL, Power Point, and ACCESS. • Be a team player/builder.
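The labeled header of the job posting ("Company:", "Location:", and so on) is the easy part of such a document to load into a table. A minimal sketch, assuming the posting arrives as one "Label: value" pair per line; the function name and the field list are illustrative choices, not part of the presentation.

```python
# Header lines of the posting, one "Label: value" pair per line (layout assumed).
POSTING = """Company: Overwatch
Location: Fort Leavenworth, KS 66027
Status: Full Time, Employee
Job Category: IT/Software Development
Career Level: Experienced (Non-Manager)"""

# Labels we want as table columns.
FIELDS = ("Company", "Location", "Status", "Job Category", "Career Level")


def extract_fields(text):
    """Split each line at the first colon and keep only the known labels."""
    record = {}
    for line in text.splitlines():
        label, sep, value = line.partition(":")
        if sep and label.strip() in FIELDS:
            record[label.strip()] = value.strip()
    return record


record = extract_fields(POSTING)
print(record)
```

The free-text requirements (years of experience, clearance level, certifications) carry no such labels, so they would need pattern-based or learned extraction instead.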

  10. Sales Information

  11. Sales Information

  12. Sales Information • HP zv6000 Notebook Featuring AMD Sempron Processor • DELL XPS 3.6 GHz 1GB RAID 0 DVDRW CDRW 20” LCD XP PRO, $500 OFF Reflected in Price: $2,499.99 • DELL 4700 P4 540 3.2 GHz 512MB, 160GB DVD+/-RW, DVD 19” LCD XP Home/Works Suite: $1,199.99 • Dell 8400 P4 3.4 GHz 1 GB, 250GB DVDRW, DVD 19” LCD XP Pro Works Suite: $1,699.99
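Advertisement text like this has no labels at all, but the attributes follow recognizable surface patterns. The sketch below pulls price, clock speed, and memory out of two of the ads with regular expressions. The patterns and the rule "first MB/GB figure is the RAM" are assumptions made for illustration and would not hold for every ad.

```python
import re

ADS = [
    'DELL XPS 3.6 GHz 1GB RAID 0 DVDRW CDRW 20" LCD XP PRO $500 OFF Reflected in Price $2,499.99',
    'DELL 4700 P4 540 3.2 GHz 512MB, 160GB DVD+/-RW, DVD 19" LCD XP Home/Works Suite $1,199.99',
]

PRICE = re.compile(r"\$([\d,]+\.\d{2})")      # requires cents, so "$500 OFF" is skipped
CLOCK = re.compile(r"(\d+(?:\.\d+)?)\s*GHz")
MEMORY = re.compile(r"(\d+)\s*(MB|GB)\b")


def parse_ad(ad):
    """Extract a few attributes from one ad line; missing attributes become None."""
    price = PRICE.search(ad)
    clock = CLOCK.search(ad)
    memory = MEMORY.search(ad)  # first size mentioned; assumed to be RAM
    return {
        "price_usd": float(price.group(1).replace(",", "")) if price else None,
        "cpu_ghz": float(clock.group(1)) if clock else None,
        "memory": "".join(memory.groups()) if memory else None,
    }


rows = [parse_ad(ad) for ad in ADS]
for row in rows:
    print(row)
```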

  13. Intelligence from News Articles v1 BAGHDAD, Iraq – Police reported that insurgents in two separate attacks had killed the head of a Baghdad police station and four officers on Monday. The head of the Balat-al-Shouhada police station, Col. Abdul Kahrim Fahad, and his driver were killed in a drive-by shooting on Monday morning. The attack occurred when Col. Fahad was on his way to the police station in southeastern Baghdad.

  14. Intelligence from News Articles v2 • Insurgents struck Iraqi security forces Monday, killing the head of a Baghdad police station and four other officers in separate attacks, police said. • Col. Abdul Kahrim Fahad, head of the Balat al-Shouhada police station, and his driver were gunned down in a drive-by shooting. • Earlier Monday, a roadside bomb exploded near an Iraqi police patrol in southwestern Baghdad, killing one Iraqi policeman and wounding five other people, including three Iraqi police.

  15. Entity Extraction • Classification problem • Not every word is associated with a semantic class • Two phases • Identify potential entity words • Classify into entity types • Lists of entities • All inclusive list of entities (?) • Names in more than one list (?) • Machine Learning Techniques
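The two phases and the list-based approach on this slide can be sketched in a few lines. This is a toy illustration under stated assumptions: the entity lists are tiny and hand-made, capitalization is used as the only cue for candidate words, and "Washington" is placed in two lists on purpose to show the ambiguity the slide questions.

```python
import re

# Tiny hand-made lists (assumed for illustration); no real list is all-inclusive.
GAZETTEER = {
    "PERSON": {"Morita", "Washington"},
    "LOCATION": {"Japan", "Baghdad", "Washington"},
}


def candidates(text):
    """Phase 1: treat every capitalized word as a potential entity word."""
    return [t for t in re.findall(r"[A-Za-z][A-Za-z-]*", text) if t[0].isupper()]


def classify(token):
    """Phase 2: look the word up in each list.

    An empty result means no semantic class; more than one means ambiguity.
    """
    return [label for label, names in GAZETTEER.items() if token in names]


text = "Morita said that Japan needs to restructure its economy."
entities = [(t, classify(t)) for t in candidates(text)]
print(entities)
print(classify("Washington"))
```

A learned classifier would replace the lookup in `classify` with a model that also uses context, which is how the ambiguous case would normally be resolved.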

  16. Entity Extraction • Tokens/tags • Sentence analysis • Merging of multiple references to the same entity • Extraction • Population of db tables

  17. IE Systems • Tokenization and Tagging • Sentence Analysis • Extractor • Template Generation • Merging • [Pipeline diagram; labels: TEXT, Tokens, POS Tags, Groups, Assigned Entities, Combined Entities]
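The merging and table-population steps can be sketched as follows. The mention list stands in for the output of earlier stages and is hypothetical, the merge rule (same type and same final word) is a deliberately naive assumption, and the table layout is invented for the example.

```python
import sqlite3

# Mentions as an earlier extraction stage might emit them (hypothetical output).
mentions = [
    ("Col. Abdul Kahrim Fahad", "PERSON"),
    ("Col. Fahad", "PERSON"),
    ("Baghdad", "LOCATION"),
    ("Baghdad", "LOCATION"),
]


def merge(mentions):
    """Naive merge: a mention joins an entity of the same type with the same last word."""
    entities = []  # each item: [canonical_name, type, mention_count]
    # Longest names first, so the fullest form becomes the canonical name.
    for name, etype in sorted(mentions, key=lambda m: -len(m[0])):
        last = name.split()[-1]
        for ent in entities:
            if ent[1] == etype and ent[0].split()[-1] == last:
                ent[2] += 1
                break
        else:
            entities.append([name, etype, 1])
    return entities


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE entity (name TEXT, type TEXT, mentions INTEGER)")
conn.executemany("INSERT INTO entity VALUES (?, ?, ?)", merge(mentions))
rows = conn.execute("SELECT name, type, mentions FROM entity ORDER BY name").fetchall()
print(rows)
```

Matching on the last word alone would wrongly merge two different people who share a surname, so a real system needs stronger evidence than this rule.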

  18. Difficulties in Entity Extraction • Words in multiple lists • Boundary problem • Use of conjunction/disjunction • Embedded NEs • Abbreviations • Acronyms
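The acronym difficulty is easy to demonstrate with names that appear earlier in this deck. The check below is a simple first-letters heuristic, assumed here purely for illustration: it accepts "IE" for "Information Extraction" but rejects "TRISA" for "TRADOC Intelligence Support Activity", because that acronym takes two letters from its first word.

```python
def initials(phrase):
    """First letter of each word, upper-cased."""
    return "".join(word[0].upper() for word in phrase.split() if word[0].isalpha())


def is_acronym_of(short, phrase):
    """Heuristic: the short form equals the initials of the long form."""
    return short.upper() == initials(phrase)


print(is_acronym_of("IE", "Information Extraction"))
print(is_acronym_of("TRISA", "TRADOC Intelligence Support Activity"))
print(initials("TRADOC Intelligence Support Activity"))
```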

  19. MUC-6 • Markup Description • The output of the systems to be evaluated will be in the form of SGML text markup. The only insertions allowed during tagging are tags enclosed in angled brackets. No extra whitespace or carriage returns are to be inserted; otherwise, the offset count would change, which would adversely affect scoring. The markup will have the following form: • <ELEMENT-NAME ATTR-NAME="ATTR-VALUE" ...>text-string</ELEMENT-NAME> • Example: • <ENAMEX TYPE="ORGANIZATION">Taga Co.</ENAMEX> • The markup is defined in SGML Document Type Descriptions (DTDs), written for MUC-6 use by personnel at MITRE and maintained by personnel at NRaD. The DTDs enable annotators and system developers to use SGML validation tools to check the correctness of the SGML-tagged texts produced by the annotator or the system. The validation tools are available to MUC-6 participants in the file called muc6-sgml-tools. Annotators are using a software tool provided for MUC-6 by SRA Corporation to assist in generating the answer keys to be used for system training and testing.
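The offset rule above means that stripping the tags must give back the original text exactly, with every tagged span locatable by character position. The sketch below reads markup of the single-attribute form shown in the example and returns the plain text plus spans. It is a regular-expression approximation, not an SGML parser and not the MUC-6 scoring software, and the sentence around "Taga Co." is invented for the demonstration.

```python
import re

# Handles only <ELEMENT TYPE="VALUE">text</ELEMENT> with one attribute (an assumption).
TAG = re.compile(r'<(ENAMEX|TIMEX|NUMEX) TYPE="([A-Z]+)">(.*?)</\1>')


def parse_markup(tagged):
    """Return the untagged text and (start, end, element, type) spans into it."""
    plain, spans = [], []
    pos = 0      # read position in the tagged string
    offset = 0   # write position in the untagged string
    for m in TAG.finditer(tagged):
        before = tagged[pos:m.start()]
        plain.append(before)
        offset += len(before)
        text = m.group(3)
        spans.append((offset, offset + len(text), m.group(1), m.group(2)))
        plain.append(text)
        offset += len(text)
        pos = m.end()
    plain.append(tagged[pos:])
    return "".join(plain), spans


tagged = 'Shares of <ENAMEX TYPE="ORGANIZATION">Taga Co.</ENAMEX> rose.'
text, spans = parse_markup(tagged)
start, end, element, etype = spans[0]
print(text)
print(text[start:end], element, etype)
```

If a system inserted an extra space or line break while tagging, the spans recovered this way would no longer line up with the answer key, which is why the guidelines forbid it.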

  20. MUC-6 • Named Entities (ENAMEX tag element) • This subtask is limited to proper names, acronyms, and perhaps miscellaneous other unique identifiers, which are categorized via the TYPE attribute as follows: • ORGANIZATION: named corporate, governmental, or other organizational entity • PERSON: named person or family • LOCATION: name of politically or geographically defined location (cities, provinces, countries, international regions, bodies of water, mountains, etc.)

  21. MUC-6 • Temporal Expressions (TIMEX tag element) • This subtask is for "absolute" temporal expressions only; explanation is provided in appendix B. The tagged tokens are categorized via the TYPE attribute as follows: • DATE: complete or partial date expression • TIME: complete or partial expression of time of day
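A rough sketch of DATE tagging for one family of absolute expressions, month names followed by an optional day and a year. The pattern is an assumption chosen for illustration; it covers only a small part of what the task definition includes and does not handle TIME at all.

```python
import re

MONTHS = (
    "January|February|March|April|May|June|July|"
    "August|September|October|November|December"
)
# Matches "June 2004" and "June 14, 2004"; other date formats are not covered.
DATE = re.compile(rf"\b(?:{MONTHS})(?: \d{{1,2}},)? \d{{4}}\b")


def tag_dates(text):
    """Wrap each matched date in TIMEX markup without changing any other character."""
    return DATE.sub(lambda m: f'<TIMEX TYPE="DATE">{m.group(0)}</TIMEX>', text)


tagged = tag_dates("between June 2004 and December 2004")
print(tagged)
```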

  22. MUC-6 • Number Expressions (NUMEX tag element) • This subtask is for two useful types of numeric expressions, monetary expressions and percentages. The numbers may be expressed in either numeric or alphabetic form. The task covers the complete expression, which is categorized via the TYPE attribute as follows: • MONEY: monetary expression • PERCENT: percentage
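MONEY and PERCENT expressions in numeric form can be tagged with two patterns, as sketched below. The patterns are assumptions for illustration; numbers written out in words ("twelve percent"), which the task also covers, are not handled, and the sample sentence is invented.

```python
import re

MONEY = r"\$\d[\d,]*(?:\.\d+)?(?: (?:million|billion))?"
PERCENT = r"\d+(?:\.\d+)?(?:%| percent)"
# Named groups let the replacement function read the TYPE from the match itself.
NUMEX = re.compile(f"(?P<MONEY>{MONEY})|(?P<PERCENT>{PERCENT})")


def tag_numex(text):
    """Wrap each monetary or percentage expression in NUMEX markup."""
    def wrap(m):
        return f'<NUMEX TYPE="{m.lastgroup}">{m.group(0)}</NUMEX>'
    return NUMEX.sub(wrap, text)


tagged = tag_numex("Sales rose 12 percent to $2,499.99.")
print(tagged)
```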

  23. Excerpt from MUC-6 dataset

  24. Filled Template
