Information Extraction

Information Extraction. Kuang-hua Chen khchen@ccms.ntu.edu.tw Language & Information Processing System Lab. (LIPS) Department of Library and Information Science National Taiwan University. Outline. Introduction Information extraction Metadata Text processing techniques


Presentation Transcript


  1. Information Extraction Kuang-hua Chen khchen@ccms.ntu.edu.tw Language & Information Processing System Lab. (LIPS) Department of Library and Information Science National Taiwan University

  2. Outline • Introduction • Information extraction • Metadata • Text processing techniques • Message understanding conference • Future research

  3. Information Services • Keyword searching • Information retrieval (Document retrieval) • Information filtering • Information extraction • Information summarization • Information understanding

  4. Information Extraction? • A task that draws information out of documents based on predefined templates. • A predefined template is a collection of attribute-value pairs. • Templates play the role of metadata formats, but in a different guise.
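
The idea of a template as a collection of attribute-value pairs can be sketched in a few lines of Python. This is only an illustration: the class name `Template`, the slot names, and the sample values are assumptions made for the example and do not come from the slides or from any MUC specification.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Template:
    """A predefined template: a fixed set of attributes to be filled from a document."""
    name: str
    slots: Dict[str, Optional[str]] = field(default_factory=dict)

    def fill(self, attribute: str, value: str) -> None:
        # Only attributes declared in advance may be filled.
        if attribute not in self.slots:
            raise KeyError(f"{attribute!r} is not part of template {self.name!r}")
        self.slots[attribute] = value

    def unfilled(self) -> List[str]:
        # Attributes for which no value has been extracted yet.
        return [a for a, v in self.slots.items() if v is None]


# Hypothetical template for a management-change task (slot names are invented).
template = Template(
    "management_change",
    {"person": None, "organization": None, "position": None, "date": None},
)
template.fill("person", "Jane Doe")
template.fill("position", "chief executive officer")
print(template.unfilled())
```

Declaring the slots up front is what makes the template "predefined": an extractor can only fill attributes that the template already names.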

  5. Specificity of an IE Task • Because each task is specific, the kind of information to extract is domain-dependent. • For example • MUC-5: the target documents are news articles about joint ventures and microelectronics • MUC-6: the target documents are news articles about management changes

  6. Templates • User-defined templates • Dynamically customized based on the user’s information need • Studied in information extraction research • Authority-controlled templates • Statically specified by some authority • Studied in metadata research

  7. Metadata • Metadata is data about data • Metadata is used to describe other information based on some rules or policies • Examples • Person: ID card, driver’s license • Book: MARC

  8. Examples of Metadata • GILS • Government Information Locator Service • FGDC • Federal Geographic Data Committee Standard • CIMI • Consortium for the Computer Interchange of Museum Information

  9. Functions of Metadata • Location • Discovery • Documentation • Evaluation • Selection

  10. What Information? • Person • Event • Time • Place • Object • Relationship

  11. MARC • To make it convenient for readers and users to find books in libraries, each book is cataloged in Machine-Readable Cataloging (MARC) format based on the Anglo-American Cataloging Rules, 2nd edition (AACR2). • Take the book “The Electronic Library” by Kenneth E. Dowlin as an example.

  12. MARC record for the example book
      001 83021957 //r91
      005 19911024125216.4
      008 831004s1984 nyua b 00110 eng cam a
      010 83021957 //r91
      020 0918212758 (pbk.) :|c$24.95
      040 DLC|cDLC|dDLC
      050 00 Z678.9|b.D68 1984
      082 00 025/.04|219
      090 Z/678.9/D68/1984///1410222AL/1415924CL/1453410CL/1733896CF
      091 TUL|bAL|bCL|bCL|bCF
      095 TUL|dZ678.9|eD68|y1984|t095|bAL|c1410222
      ... ...
      099 TUL|d|e|y|f|t091|b|c|x|z
      100 10 Dowlin, Kenneth E
      245 14 The electronic library :|bthe promise and the process / |cKenneth E. Dowlin
      260 0 New York, N.Y. :|bNeal-Schuman Publishers,|cc1984
      300 xi, 199 p. :|bill. ;|c23 cm
      440 0 Applications in information management and technology series
      504 Includes bibliographical references and index
      650 0 Libraries|xAutomation
      650 0 Information technology
      910 8'93 D#139 MCL
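
To show how such a record breaks down into attribute-value pairs, here is a minimal Python sketch that splits one displayed field line into a tag, optional indicators, and subfields. It assumes the display convention visible in the record above (a three-digit tag, then possibly one or two indicator digits, then data in which `|` followed by a one-letter code starts a new subfield) and treats the text before the first `|` as subfield `a`. The function name `parse_field` and the indicator heuristic are assumptions for illustration; this is not a parser for the actual MARC exchange format.

```python
import re

# Tag, then (heuristically) one or two indicator digits, then the field data.
FIELD_RE = re.compile(r"^(\d{3}) (?:(\d{1,2}) )?(.*)$")


def parse_field(line):
    """Split one displayed field line into (tag, indicators, subfields)."""
    match = FIELD_RE.match(line.strip())
    if match is None:
        raise ValueError(f"not a field line: {line!r}")
    tag, indicators, data = match.groups()
    first, *rest = data.split("|")
    # Assumption: data before the first delimiter is subfield "a".
    subfields = [("a", first.strip())]
    for chunk in rest:
        if chunk:  # skip empty pieces produced by doubled delimiters
            subfields.append((chunk[0], chunk[1:].strip()))
    return tag, indicators or "", subfields


title_field = parse_field(
    "245 14 The electronic library :|bthe promise and the process / |cKenneth E. Dowlin"
)
print(title_field)
```

The indicator heuristic would misread a field whose data happens to begin with one or two digits followed by a space, so a real system should rely on the fixed positions of the exchange format instead.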

  13. Dublin Core • A simple metadata format • For networked information • Contains 15 elements

  14. Elements of Dublin Core
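
The transcript does not preserve the content of this slide. As a rough sketch, the 15 element names of the Dublin Core Metadata Element Set can be written down in Python and used to describe the example book. The helper `make_record` is a hypothetical name, and the values are copied from the MARC record shown earlier; how MARC fields should map onto Dublin Core elements is an assumption here, not something the slides specify.

```python
# The 15 elements of the Dublin Core Metadata Element Set.
DUBLIN_CORE_ELEMENTS = (
    "title", "creator", "subject", "description", "publisher",
    "contributor", "date", "type", "format", "identifier",
    "source", "language", "relation", "coverage", "rights",
)


def make_record(**values):
    """Build a record with every element present; unknown elements are rejected."""
    unknown = set(values) - set(DUBLIN_CORE_ELEMENTS)
    if unknown:
        raise ValueError(f"not Dublin Core elements: {sorted(unknown)}")
    return {element: values.get(element) for element in DUBLIN_CORE_ELEMENTS}


record = make_record(
    title="The electronic library: the promise and the process",
    creator="Dowlin, Kenneth E",
    subject="Information technology",
    publisher="Neal-Schuman Publishers",
    date="1984",
    language="eng",
)
print([element for element, value in record.items() if value is not None])
```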

  15. Automaticity • Automatic or semi-automatic procedures are needed to “catalog” existing homepages and other untagged documents without large human effort. • Research on information extraction sheds light on how to solve these problems.

  16. Complexity and Automaticity of Metadata Format • [Figure: metadata formats plotted by complexity and automaticity]

  17. Components of IE Systems • Tokenization module • Stemming module • Word segmentation module • Lexical analysis module • Syntactic analysis module • Domain knowledge module

  18. Techniques for Text Processing • Research in natural language processing (NLP) has produced many high-performance analysis systems. • The tokenization module achieves a correct rate of about 98% [Palmer and Hearst, 1994]. • The difficulty here is deciding whether a period is a full stop or part of an abbreviation.
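
The period problem can be illustrated with a small heuristic in Python. This is a deliberately simple sketch, not the method of Palmer and Hearst: it assumes a hand-made abbreviation list (`KNOWN_ABBREVIATIONS`, invented for this example) and treats a period as a full stop only when the token is not a known abbreviation and the next token is capitalized or absent.

```python
# Hypothetical abbreviation list; a real system would need a much larger one.
KNOWN_ABBREVIATIONS = {"dr.", "mr.", "mrs.", "inc.", "co.", "e.g.", "i.e."}


def split_sentences(text):
    """Split text into sentences using a whitespace tokenizer and a simple rule."""
    tokens = text.split()
    sentences, current = [], []
    for i, token in enumerate(tokens):
        current.append(token)
        if not token.endswith("."):
            continue
        next_token = tokens[i + 1] if i + 1 < len(tokens) else None
        is_abbreviation = token.lower() in KNOWN_ABBREVIATIONS
        next_starts_sentence = next_token is None or next_token[0].isupper()
        if not is_abbreviation and next_starts_sentence:
            sentences.append(" ".join(current))
            current = []
    if current:  # trailing words without a final period
        sentences.append(" ".join(current))
    return sentences


result = split_sentences("Dr. Chen works at NTU. He studies information extraction.")
print(result)
```

The rule fails when an abbreviation really does end a sentence, which is one reason the task is harder than it looks.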

  19. Techniques for Text Processing (continued) • The stemming module is also good enough. • Porter algorithm [Porter, 1980] • Two-level morphology [Koskenniemi, 1983] • The lexical analysis module is the part of NLP research that has improved most in recent years. • Probabilistic tagger [Church, 1988] • Rule-based tagger [Brill, 1992] • Hybrid tagger [Voutilainen, 1993] • Finite-state tagger [Kempe, 1997]
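
To give a flavor of rule-based stemming, the sketch below implements only step 1a of the Porter algorithm, the plural-stripping rules (SSES to SS, IES to I, SS unchanged, S removed). It is a fragment for illustration, not a full stemmer; the function name `porter_step_1a` is an assumption, and the remaining steps of the algorithm are omitted.

```python
def porter_step_1a(word):
    """Apply the plural rules of Porter step 1a; the first matching suffix wins."""
    if word.endswith("sses"):
        return word[:-2]   # caresses -> caress
    if word.endswith("ies"):
        return word[:-2]   # ponies -> poni
    if word.endswith("ss"):
        return word        # caress -> caress
    if word.endswith("s"):
        return word[:-1]   # cats -> cat
    return word


for example in ("caresses", "ponies", "caress", "cats", "library"):
    print(example, "->", porter_step_1a(example))
```

The rules are tried from the longest suffix to the shortest, so "caress" is protected by the SS rule before the plain S rule can strip it.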

  20. Word Segmentation • Chinese word segmentation • 將黃大目的確實行動作了解釋 (adapted from an example by Prof. 張俊盛) • 將黃大目的確實行動作了解釋 • Segmentation approaches • CKIP, SINICA • BDC • NLP, NTHU • NLPL, NTU • Take proper nouns into consideration
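
Why the example sentence is hard can be seen with a greedy forward maximum-matching segmenter, sketched below in Python. This is a textbook baseline chosen for illustration, not the method of any system listed on the slide, and the toy lexicon is an assumption. Almost every adjacent pair of characters in the sentence forms a dictionary word, so the greedy strategy commits to the wrong words early; a more plausible reading is 將 / 黃大目 / 的 / 確實 / 行動 / 作了 / 解釋, which treats 黃大目 as a personal name.

```python
def forward_max_match(text, lexicon, max_len=3):
    """Greedy left-to-right segmentation: always take the longest lexicon match."""
    words, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            # A single character is always accepted as a fallback.
            if size == 1 or candidate in lexicon:
                words.append(candidate)
                i += size
                break
    return words


# Toy lexicon (an assumption); it includes the proper noun 黃大目.
LEXICON = {
    "將", "黃大目", "目的", "的", "的確", "確實", "實行",
    "行動", "動作", "作了", "了解", "解釋",
}

segmented = forward_max_match("將黃大目的確實行動作了解釋", LEXICON)
print(" / ".join(segmented))
```

The greedy output strands the final character 釋 on its own, a sign that local longest-match decisions are not enough and that the whole sentence has to be considered.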

  21. Syntactic Analysis • The most challenging work • From the viewpoint of NLP, a correct and complete parse tree is very important • For applications like IR and IE, time is the most critical factor • Balancing time and correctness is important • Partial parsing

  22. Partial Parsing • Fidditch [Hindle, 1983] • Chunker • Rule-based chunker [Abney, 1991] • Probabilistic chunker [Chen and Chen, 1993] • Transformation-based parser [Brill, 1993] • Probabilistic binary parser [Chen, 1998] • Finite-state parser
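
A very small finite-state noun-phrase chunker gives a sense of what partial parsing trades away. The Python sketch below is not a reimplementation of any of the cited systems: it assumes Penn Treebank-style part-of-speech tags as input and recognizes only the pattern "optional determiner, any adjectives, one or more nouns", expressed as a regular expression over one-letter tag classes.

```python
import re

# One letter per token: D = determiner, J = adjective, N = noun, O = other.
NP_PATTERN = re.compile(r"D?J*N+")


def tag_class(tag):
    if tag == "DT":
        return "D"
    if tag.startswith("JJ"):
        return "J"
    if tag.startswith("NN"):
        return "N"
    return "O"


def chunk_noun_phrases(tagged):
    """Return noun-phrase chunks from a list of (word, tag) pairs."""
    classes = "".join(tag_class(tag) for _, tag in tagged)
    return [
        [word for word, _ in tagged[m.start():m.end()]]
        for m in NP_PATTERN.finditer(classes)
    ]


sentence = [
    ("The", "DT"), ("new", "JJ"), ("company", "NN"), ("named", "VBD"),
    ("a", "DT"), ("chief", "JJ"), ("executive", "NN"), ("officer", "NN"),
    ("yesterday", "RB"),
]
chunks = chunk_noun_phrases(sentence)
print(chunks)
```

The chunker runs in a single pass and never builds a full parse tree, which is the speed-for-completeness trade that makes partial parsing attractive for IR and IE.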

  23. Message Understanding Conference • A gathering of researchers in natural language processing • Conference participants must develop NLP systems that perform a variety of information extraction tasks • Each system's performance is evaluated by comparing its output with the output of human linguists
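
Comparing system output with human output is commonly summarized as precision, recall, and their harmonic mean. The Python sketch below uses exact matching of (attribute, value) pairs, which is a simplification and an assumption of this example; the actual MUC scoring rules are more detailed. The sample fills are invented.

```python
def score(system_fills, reference_fills):
    """Precision, recall and F1 over exact-match (attribute, value) pairs."""
    system, reference = set(system_fills), set(reference_fills)
    correct = len(system & reference)
    precision = correct / len(system) if system else 0.0
    recall = correct / len(reference) if reference else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Invented example: what a system extracted versus what a human annotator filled.
system_output = [("person", "Jane Doe"), ("organization", "Acme"), ("date", "1995")]
human_output = [
    ("person", "Jane Doe"), ("organization", "Acme Corp."),
    ("position", "CEO"), ("date", "1995"),
]
precision, recall, f1 = score(system_output, human_output)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Here two of the three system fills are correct and two of the four reference fills are found, so precision is 2/3 and recall is 1/2.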

  24. MUC Tasks • MUC-1 (1987) and MUC-2 (1989) • naval operations • MUC-3 (1991) and MUC-4 (1992) • terrorist activity • MUC-5 (1993) • joint ventures and microelectronics • MUC-6 (1995) • management changes

  25. MUC-6 Tasks • Named Entity (NE) requires only that the system under evaluation identify each bit of pertinent information in isolation from all others. • person names • company names • organization names • location • dates, times, currency • Coreference (CO) requires connecting all references to "identical" entities. • Template Element (TE) requires grouping entity attributes together into entity "objects."
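
Some of these categories, such as dates and currency amounts, follow surface patterns regular enough to sketch with regular expressions. The Python example below is an assumption-laden illustration: the two patterns cover only one English date format and simple dollar amounts, and person, company, and organization names would need lexicons or learned models instead. The sentence is invented.

```python
import re

MONTHS = (
    "January|February|March|April|May|June|July|"
    "August|September|October|November|December"
)

# Hypothetical patterns: one date format and simple dollar amounts only.
PATTERNS = {
    "date": re.compile(rf"(?:{MONTHS}) \d{{1,2}}, \d{{4}}"),
    "money": re.compile(r"\$\d+(?:,\d{3})*(?:\.\d+)?(?: (?:million|billion))?"),
}


def find_entities(text):
    """Return (label, matched text) pairs in order of appearance."""
    found = []
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            found.append((match.start(), label, match.group()))
    return [(label, value) for _, label, value in sorted(found)]


entities = find_entities(
    "The venture, announced on March 3, 1995, is valued at $12.5 million."
)
print(entities)
```

Each match is found in isolation from the others, which mirrors the NE task definition; linking the matches together is left to the coreference and template tasks.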

  26. Results of MUC-6

  27. MUC-7 Tasks (1998) • Named Entity (NE) • Coreference (CO) • Template Element (TE) • Template Relationship (TR) requires identifying relationships between template elements. • Scenario Template (ST) requires identifying instances of a task-specific event and identifying event attributes, including entities that fill some role in the event; the overall information content is captured via interlinked "objects."

  28. Future Research • Dynamic templates gradually shift to static metadata through user studies • High-performance, fast parsing algorithms • Discourse analysis • Summarization as information extraction • Multimedia, intermedia consideration • Multimodal, intermodal consideration
