Information Extraction

Information Extraction Kuang-hua Chen khchen@ccms.ntu.edu.tw Language & Information Processing System Lab. (LIPS) Department of Library and Information Science National Taiwan University

Outline • Introduction • Information extraction • Metadata • Text processing techniques • Message understanding conference • Future research

Information Services • Keyword searching • Information retrieval (Document retrieval) • Information filtering • Information extraction • Information summarization • Information understanding

Information Extraction? • A task that draws information out of documents according to predefined templates. • A predefined template is a collection of attribute-value pairs. • Templates play the role of metadata formats, but in a different guise.
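The attribute-value structure described above can be sketched in Python as follows. This is only an illustration: the `Template` class, its slot names, and the sample values are hypothetical and are not taken from any MUC template definition.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Template:
    """A predefined template: a fixed set of attributes whose values start empty."""
    name: str
    attributes: List[str]
    values: Dict[str, Optional[str]] = field(default_factory=dict)

    def __post_init__(self) -> None:
        # Every attribute exists from the start; extraction only fills in values.
        for attribute in self.attributes:
            self.values.setdefault(attribute, None)

    def fill(self, attribute: str, value: str) -> None:
        """Store a value extracted from a document in one slot."""
        if attribute not in self.values:
            raise KeyError(f"{attribute!r} is not a slot of template {self.name!r}")
        self.values[attribute] = value

    def unfilled(self) -> List[str]:
        """Return the attributes that still have no value."""
        return [a for a in self.attributes if self.values[a] is None]


# Hypothetical slots and values for a management-change template.
template = Template(
    "management_change",
    ["organization", "post", "person_in", "person_out"],
)
template.fill("organization", "Example Corp.")
template.fill("post", "chairman")
print(template.unfilled())
```

Keeping the attribute list fixed while the values vary per document is what makes the template behave like a metadata format.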

Specificity of an IE Task • Because each task is specific, the kind of information to extract is domain-dependent. • For example • MUC-5: the target documents are news articles about joint ventures and microelectronics • MUC-6: the target documents are news articles about management changes

Templates • User-defined templates • Dynamically customized based on the user’s information need • Studied in information extraction research • Authority-controlled templates • Statically specified by some authority • Studied in metadata research

Metadata • Metadata is data about data • Metadata is used to describe other information based on some rules or policies • Examples • Person: ID card, driver’s license • Book: MARC

Examples of Metadata • GILS • Government Information Locator Service • FGDC • Federal Geographic Data Committee Standard • CIMI • Consortium for the Computer Interchange of Museum Information

Functions of Metadata • Location • Discovery • Documentation • Evaluation • Selection

What Information? • Person • Event • Time • Place • Object • Relationship

MARC • To make it easy for readers to find books in libraries, each book is cataloged in Machine-Readable Cataloging (MARC) format according to the Anglo-American Cataloguing Rules, 2nd edition (AACR2). • Take the book “The Electronic Library” by Kenneth E. Dowlin as an example.

001 83021957 //r91
005 19911024125216.4
008 831004s1984 nyua b 00110 eng cam a
010 83021957 //r91
020 0918212758 (pbk.) :|c$24.95
040 DLC|cDLC|dDLC
050 00 Z678.9|b.D68 1984
082 00 025/.04|219
090 Z/678.9/D68/1984///1410222AL/1415924CL/1453410CL/1733896CF
091 TUL|bAL|bCL|bCL|bCF
095 TUL|dZ678.9|eD68|y1984|t095|bAL|c1410222
... ...
099 TUL|d|e|y|f|t091|b|c|x|z
100 10 Dowlin, Kenneth E
245 14 The electronic library :|bthe promise and the process / |cKenneth E. Dowlin
260 0 New York, N.Y. :|bNeal-Schuman Publishers,|cc1984
300 xi, 199 p. :|bill. ;|c23 cm
440 0 Applications in information management and technology series
504 Includes bibliographical references and index
650 0 Libraries|xAutomation
650 0 Information technology
910 8'93 D#139 MCL
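To show how a field of the record above breaks down into tag, indicators, and subfields, here is a small Python sketch. It assumes the display convention seen in this record: a three-digit tag, optional one- or two-digit indicators, and `|x` marking subfield `x`, with the first subfield implicitly `a`. `parse_field` is a hypothetical helper, not part of any MARC library, and its indicator test is a heuristic.

```python
import re
from typing import List, Tuple

FIELD_RE = re.compile(r"^(\d{3})\s+(.*)$")
INDICATOR_RE = re.compile(r"^(\d{1,2})\s+(.*)$")

Subfield = Tuple[str, str]


def parse_field(line: str) -> Tuple[str, str, List[Subfield]]:
    """Split one display line into (tag, indicators, [(code, value), ...])."""
    match = FIELD_RE.match(line.strip())
    if match is None:
        raise ValueError(f"not a MARC display line: {line!r}")
    tag, rest = match.groups()

    if tag < "010":
        # Control fields (001-009) hold one data string and no subfields.
        return tag, "", [("", rest)]

    indicators = ""
    indicator_match = INDICATOR_RE.match(rest)
    if indicator_match:
        # Heuristic: a leading run of one or two digits is read as indicators.
        indicators, rest = indicator_match.groups()

    first, *others = rest.split("|")
    subfields: List[Subfield] = [("a", first.strip())] if first.strip() else []
    # Each remaining chunk starts with its one-character subfield code.
    subfields += [(chunk[0], chunk[1:].strip()) for chunk in others if chunk]
    return tag, indicators, subfields


title = parse_field(
    "245 14 The electronic library :|bthe promise and the process / |cKenneth E. Dowlin"
)
print(title)
```

The heuristic would misread a data field whose content itself begins with one or two digits followed by a space, so a real loader should rely on the record's binary structure rather than on the display form.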

Dublin Core • A simple metadata format • For networked information • Contains 15 elements

Elements of Dublin Core
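For reference, the fifteen elements can be written down as a small Python sketch. The element names are those of the Dublin Core element set. The `make_record` helper and the way the Dowlin book's MARC fields are mapped onto elements are assumptions made for illustration, not an official crosswalk.

```python
from typing import Dict, Iterable, List, Tuple

DUBLIN_CORE_ELEMENTS: List[str] = [
    "title", "creator", "subject", "description", "publisher",
    "contributor", "date", "type", "format", "identifier",
    "source", "language", "relation", "coverage", "rights",
]


def make_record(pairs: Iterable[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Build a record; every element is optional and may repeat."""
    record: Dict[str, List[str]] = {}
    for element, value in pairs:
        if element not in DUBLIN_CORE_ELEMENTS:
            raise KeyError(f"{element!r} is not a Dublin Core element")
        record.setdefault(element, []).append(value)
    return record


# Values taken from the MARC example; the element chosen for each is an assumption.
record = make_record([
    ("title", "The electronic library : the promise and the process"),
    ("creator", "Dowlin, Kenneth E"),
    ("publisher", "Neal-Schuman Publishers"),
    ("date", "1984"),
    ("subject", "Libraries -- Automation"),
    ("subject", "Information technology"),
    ("language", "eng"),
])
print(record["subject"])
```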

Automaticity • Automatic or semi-automatic procedures are needed to “catalog” existing homepages and other untagged documents without large human effort. • Research on information extraction sheds light on how to solve these problems.

Complexity and Automaticity of Metadata Format • [Figure; axes: complexity, automaticity]

Components of IE Systems • Tokenization module • Stemming module • Word segmentation module • Lexical analysis module • Syntactic analysis module • Domain knowledge module

Techniques for Text Processing • Research in natural language processing (NLP) has produced many high-performance analysis systems. • The tokenization module reaches an accuracy of about 98% [Palmer and Hearst, 1994]. • The difficulty in this part is deciding whether a period is a full stop or part of an abbreviation.
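The period ambiguity can be illustrated with a deliberately simple Python sketch. It is a hand-written heuristic, not the method of the cited work: the abbreviation list, the single-letter-initial rule, and the function names are all assumptions chosen for the example.

```python
from typing import List

# Hypothetical, deliberately tiny abbreviation list.
ABBREVIATIONS = {"dr.", "mr.", "mrs.", "prof.", "inc.", "co.", "e.g.", "i.e.", "n.y."}


def is_abbreviation(token: str) -> bool:
    """Treat listed abbreviations and single-letter initials ("E.") as non-final."""
    if token.lower() in ABBREVIATIONS:
        return True
    return len(token) == 2 and token[0].isalpha() and token[1] == "."


def split_sentences(text: str) -> List[str]:
    """Split on whitespace and close a sentence at ., ? or ! unless abbreviated."""
    sentences: List[str] = []
    current: List[str] = []
    for token in text.split():
        current.append(token)
        if token.endswith((".", "?", "!")) and not is_abbreviation(token):
            sentences.append(" ".join(current))
            current = []
    if current:  # trailing words without final punctuation
        sentences.append(" ".join(current))
    return sentences


sentences = split_sentences(
    "Kenneth E. Dowlin wrote the book. It was published in New York, N.Y. in 1984."
)
print(sentences)
```

A list-based rule like this fails whenever an abbreviation really does end a sentence, which is one reason the remaining errors are hard to eliminate.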

Techniques for Text Processing (continued) • The stemming module is also good enough. • Porter algorithm [Porter, 1980] • Two-level morphology [Koskenniemi, 1983] • The lexical analysis module is the part of NLP research that has improved most in recent years. • Probabilistic tagger [Church, 1988] • Rule-based tagger [Brill, 1992] • Hybrid tagger [Voutilainen, 1993] • Finite-state tagger [Kempe, 1997]
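To give a feel for suffix-stripping stemmers, the sketch below implements only step 1a of the Porter algorithm (plural endings) in Python. The remaining steps, which depend on the measure of the stem, are omitted, so this is a fragment for illustration and not a complete stemmer.

```python
def porter_step_1a(word: str) -> str:
    """Apply Porter step 1a: the longest matching suffix rule wins.

    SSES -> SS, IES -> I, SS -> SS, S -> (nothing)
    """
    if word.endswith("sses"):
        return word[:-2]
    if word.endswith("ies"):
        return word[:-2]
    if word.endswith("ss"):
        return word
    if word.endswith("s"):
        return word[:-1]
    return word


for example in ["caresses", "ponies", "caress", "cats", "libraries"]:
    print(example, "->", porter_step_1a(example))
```

The output of a stemmer is a shared index key rather than a dictionary word, which is why forms such as `poni` are acceptable.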

Word Segmentation • Chinese word segmentation • 將黃大目的確實行動作了解釋 (adapted from an example by Prof. 張俊盛) • 將黃大目的確實行動作了解釋 • Segmentation approaches • CKIP, SINICA • BDC • NLP, NTHU • NLPL, NTU • Take proper nouns into consideration
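The example sentence is hard because many adjacent characters form overlapping dictionary words. The Python sketch below uses greedy forward maximum matching, a common baseline that is not claimed to be the method of any system listed above. The lexicon is a hypothetical toy list, and the personal name 黃大目 is added in the second run only to show the effect of proper nouns.

```python
from typing import List, Set


def forward_maximum_match(text: str, lexicon: Set[str], max_len: int = 3) -> List[str]:
    """Greedy left-to-right segmentation: take the longest lexicon word at each position."""
    words: List[str] = []
    i = 0
    while i < len(text):
        for length in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            # A single character is always accepted as a fallback.
            if length == 1 or candidate in lexicon:
                words.append(candidate)
                i += length
                break
    return words


# Hypothetical toy lexicon of two-character words found in the sentence.
LEXICON = {"目的", "的確", "確實", "實行", "行動", "動作", "了解", "解釋"}
sentence = "將黃大目的確實行動作了解釋"

without_name = forward_maximum_match(sentence, LEXICON)
with_name = forward_maximum_match(sentence, LEXICON | {"黃大目"})
print("/".join(without_name))
print("/".join(with_name))
```

Adding one proper noun shifts every later word boundary, so name recognition and segmentation cannot be treated as independent steps.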

Syntactic Analysis • The most challenging work • From the viewpoint of NLP, a correct and complete parse tree is very important • For applications like IR and IE, time is the most critical factor • Balancing time against correctness is important • Partial parsing

Partial Parsing • Fidditch [Hindle, 1983] • Chunker • Rule-based chunker [Abney, 1991] • Probabilistic chunker [Chen and Chen, 1993] • Transformation-based parser [Brill, 1993] • Probabilistic binary parser [Chen, 1998] • Finite-state parser
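As a rough illustration of partial parsing, the Python sketch below finds base noun phrases with a single hand-coded pattern over part-of-speech tags (optional determiner, any adjectives, one or more nouns). It does not reproduce any of the cited systems. The Penn Treebank style tags on the sample sentence were assigned by hand, and the function name is hypothetical.

```python
from typing import List, Tuple

Tagged = Tuple[str, str]  # (word, part-of-speech tag)
NOUN_TAGS = {"NN", "NNS", "NNP", "NNPS"}


def chunk_noun_phrases(tagged: List[Tagged]) -> List[List[str]]:
    """Return word lists matching the pattern DT? JJ* NOUN+ without building a full tree."""
    chunks: List[List[str]] = []
    i, n = 0, len(tagged)
    while i < n:
        j = i
        if tagged[j][1] == "DT":  # optional determiner
            j += 1
        while j < n and tagged[j][1] == "JJ":  # any number of adjectives
            j += 1
        k = j
        while k < n and tagged[k][1] in NOUN_TAGS:  # one or more nouns
            k += 1
        if k > j:
            chunks.append([word for word, _ in tagged[i:k]])
            i = k
        else:
            i += 1  # no noun here; restart the pattern at the next token
    return chunks


# Hand-tagged sample sentence (the tags are an assumption, not tagger output).
sample = [
    ("The", "DT"), ("electronic", "JJ"), ("library", "NN"), ("changed", "VBD"),
    ("the", "DT"), ("cataloging", "NN"), ("process", "NN"), (".", "."),
]
chunks = chunk_noun_phrases(sample)
print(chunks)
```

The scan is linear in the sentence length, which is the trade of completeness for speed that IR and IE applications need.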

Message Understanding Conference • A gathering of researchers in natural language processing • Conference participants must develop NLP systems that perform a variety of information extraction tasks • Each system's performance is evaluated by comparing its output with the output of human linguists
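Comparing system output with a human answer key is usually summarized as precision, recall, and F-measure. The Python sketch below is a simplification that assumes exact matching of (slot, value) pairs and gives no partial credit, so it is not the official MUC scorer. The sample fills are made up for illustration.

```python
from typing import Dict, Set, Tuple

Fill = Tuple[str, str]  # (slot name, extracted value)


def score(system: Set[Fill], key: Set[Fill]) -> Dict[str, float]:
    """Exact-match precision, recall, and balanced F-measure."""
    correct = len(system & key)
    precision = correct / len(system) if system else 0.0
    recall = correct / len(key) if key else 0.0
    if precision + recall == 0:
        f_measure = 0.0
    else:
        f_measure = 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f": f_measure}


# Made-up fills: the answer key has three slots, the system proposes two.
answer_key = {("person", "A. Smith"), ("post", "chairman"), ("organization", "Example Corp.")}
system_output = {("person", "A. Smith"), ("post", "president")}

result = score(system_output, answer_key)
print(result)
```

Here one of two proposed fills is correct (precision 1/2) and one of three key fills is found (recall 1/3), giving F = 2(1/2)(1/3) / (1/2 + 1/3) = 0.4.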

MUC Tasks • MUC-1 (1987) and MUC-2 (1989) • naval operations • MUC-3 (1991) and MUC-4 (1992) • terrorist activity • MUC-5 (1993) • joint ventures and microelectronics • MUC-6 (1995) • management changes

MUC-6 Tasks • Named Entity (NE) requires only that the system under evaluation identify each bit of pertinent information in isolation from all others. • person names • company names • organization names • location • dates, times, currency • Coreference (CO) requires connecting all references to "identical" entities. • Template Element (TE) requires grouping entity attributes together into entity "objects."
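Identifying a piece of information "in isolation" can be as simple as pattern matching for some categories. The Python sketch below marks dollar amounts with an SGML-style `NUMEX` tag of the kind used for MUC named-entity annotation. The regular expression is a hypothetical, narrow pattern for illustration and covers far fewer cases than a real NE system.

```python
import re

# Hypothetical pattern: "$", digits with optional thousands separators,
# an optional decimal part, and an optional "million" or "billion".
MONEY_RE = re.compile(r"\$\d+(?:,\d{3})*(?:\.\d+)?(?: (?:million|billion))?")


def tag_money(text: str) -> str:
    """Wrap each matched money expression in a NUMEX tag."""
    return MONEY_RE.sub(
        lambda match: f'<NUMEX TYPE="MONEY">{match.group(0)}</NUMEX>', text
    )


tagged = tag_money("The paperback costs $24.95.")
print(tagged)
```

Person, company, and organization names cannot be captured this way and generally need lexicons and context, which is why they dominate the difficulty of the NE task.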

Results of MUC-6

MUC-7 Tasks (1998) • Named Entity (NE) • Coreference (CO) • Template Element (TE) • Template Relationship (TR) requires identifying relationships between template elements. • Scenario Template (ST) requires identifying instances of a task-specific event and identifying event attributes, including entities that fill some role in the event; the overall information content is captured via interlinked "objects."

Future Research • Dynamic templates gradually shift to static metadata through user studies • High-performance, fast parsing algorithms • Discourse analysis • Summarization as information extraction • Multimedia, intermedia consideration • Multimodal, intermodal consideration
