Information Extraction. Shih-Hung Wu, Assistant Professor, CSIE, Chaoyang University of Technology. Outline: Information Extraction (Introduction, Applications), Table Reading, Citation Extraction, Chinese Named Entity Recognition.
Information Extraction Shih-Hung Wu Assistant Professor CSIE, Chaoyang University of Technology
Outline • Information Extraction • Introduction • Applications • Table Reading • Citation Extraction • Chinese Named Entity Recognition
Information Extraction • “extracts pieces of information that are salient to the user's needs”
Message Understanding Conferences (MUC) Evaluations • The evaluations provide prepared data and task definitions, as well as fully automated scoring software to measure machine and human performance. • The databases now include named entities, multilingual named entities, attributes of those entities, facts about relationships between entities, and events in which the entities participated. • The multilingual portion was known as the "Multilingual Entity Task (MET)".
Examples The following fictional news story portrays the levels of detail that systems can extract: Fletcher Maddox, former Dean of the UCSD Business School, announced the formation of La Jolla Genomatics together with his two sons. La Jolla Genomatics will release its product Geninfo in June 1999. Geninfo is a turnkey system to assist biotechnology researchers in keeping up with the voluminous literature in all aspects of their field. Dr. Maddox will be the firm's CEO. His son, Oliver, is the Chief Scientist and holds patents on many of the algorithms used in Geninfo. Oliver's brother, Ambrose, follows more in his father's footsteps and will be the CFO of L.J.G. headquartered in the Maddox family's hometown of La Jolla, CA.
Attributes: (tables of entity attributes; contents not preserved)
Events: COMPANY-FORMATION_EVENT, RELEASE-EVENT (event details not preserved)
Information Extraction • Current indicators of the state of the art (item of information: percentile reliability) • Entities: 90 • Attributes: 80 • Facts: 70 • Events: 60
Technical definition of IE • The process of creating database entries by skimming a text and looking for occurrences of a particular class of object or event and for relationships among those objects and events [Russell, Norvig 2003]
Basic IE tasks • Extract addresses from Web pages • target: street, city, state, and zip code • Extract storms from weather reports • target: temperature, wind speed, and precipitation
IE Applications • Competitive intelligence • find instances of corporate mergers and joint ventures • Intelligence gathering • track terrorist activities • report any damage to buildings or infrastructure, as well as the time and location of the event • Health care delivery • summarize patient medical records by extracting diagnoses, symptoms, physical findings, test results, and therapeutic treatments
Technology • Method in literature • Regular expressions • Cascaded finite-state transducers • Our approaches • Ontological domain knowledge • Machine Learning • Hybrid method
Regular expression approach example • From the text • “17in SXGA Monitor for only $249.99” • Extract • ∃m m ∈ ComputerMonitors ∧ Size(m, Inches(17)) ∧ Price(m, $(249.99)) ∧ Resolution(m, 1280 × 1024)
Regular expressions and what they match • [0-9]: any digit from 0 to 9 • [0-9]+: one or more digits • \.[0-9][0-9]: a period followed by two digits • (\.[0-9][0-9])?: a period followed by two digits, or nothing • \$[0-9]+(\.[0-9][0-9])?: $249.99, $1.23, $100000, …
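The price pattern can be tried out directly. The following is a minimal Python sketch, not the method of any particular system: the price expression is the one from the slide with the literal "$" and "." escaped, while the size pattern, the `RESOLUTIONS` lookup table, and the function name `extract_monitor` are assumptions added here for illustration.

```python
import re

# Price pattern from the slide; "$" and "." are escaped so they match literally.
PRICE = re.compile(r"\$[0-9]+(\.[0-9][0-9])?")

# Assumed helper patterns (not from the slides) for the other two attributes.
SIZE = re.compile(r"([0-9]+)\s*in\b")
RESOLUTIONS = {"SXGA": (1280, 1024)}  # hypothetical lookup table

def extract_monitor(text):
    """Return a small record with the size, price, and resolution found in text."""
    price = PRICE.search(text)
    size = SIZE.search(text)
    resolution = next((r for name, r in RESOLUTIONS.items() if name in text), None)
    return {
        "size_inches": int(size.group(1)) if size else None,
        "price": float(price.group(0)[1:]) if price else None,
        "resolution": resolution,
    }

record = extract_monitor("17in SXGA Monitor for only $249.99")
print(record)
```

The record plays the role of the logical description Size(m, Inches(17)) ∧ Price(m, $(249.99)) ∧ Resolution(m, 1280 × 1024).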
Weakness • Which number is the price? • “List price $99.00, special sale price $78.00, shipping $3.00.”
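The ambiguity is easy to reproduce. In the sketch below (Python, illustrative only), the price pattern finds three candidates in the sentence. The second pattern, `LABELED`, is a hypothetical heuristic that keeps the words just before each amount as a label; it is an assumption added here, not a technique from the slides, and it still leaves open which label means "the price".

```python
import re

PRICE = re.compile(r"\$[0-9]+(?:\.[0-9][0-9])?")

text = "List price $99.00, special sale price $78.00, shipping $3.00."

# The plain pattern returns every amount, with no hint of which one is wanted.
all_prices = PRICE.findall(text)
print(all_prices)

# Hypothetical heuristic: capture the words immediately before each amount.
LABELED = re.compile(r"([A-Za-z ]+?)\s*(\$[0-9]+(?:\.[0-9][0-9])?)")
labeled = {label.strip().lower(): amount for label, amount in LABELED.findall(text)}
print(labeled)
```

Even with labels attached, choosing between "list price" and "special sale price" requires knowledge that a single regular expression does not carry.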
Cascaded finite-state transducers approach example • From • “Bridgestone Sports Co. said Friday it has set up a joint venture in Taiwan with a local concern and a Japanese trading house to produce golf clubs to be shipped to Japan.” • Extract • ∃e e ∈ JointVentures ∧ Product(e, “golf clubs”) ∧ Date(e, “Friday”) ∧ Entity(e, “Bridgestone Sports Co.”) ∧ Entity(e, “a local concern”) ∧ Entity(e, “a Japanese trading house”)
Cascaded finite-state transducers • A typical relational extraction system consists of the following five stages: • Tokenization • Complex word handling • Basic group handling • Complex phrase handling • Structure merging
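Tokenization • Word segmentation • 土地公有政策 (“public land ownership policy”) -> 土地|公有|政策 (land | publicly owned | policy) vs. 土地公|有政策 (Earth God | has a policy) • Complex word handling • “Bridgestone Sports Co.” • CapitalizedWord+(“Company”|“Co”|“Inc”|“Ltd”) • “Intel Chairman Andy Grove” can be misrecognized as a place name by a rule such as • CapitalizedWord+(“Grove”|“Forest”|“Village”|…) • Similar traps in Chinese: the personal names 謝深山 and 郝柏村 end in 山 (“mountain”) and 村 (“village”)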
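The two complex-word rules can be written as ordinary regular expressions. The sketch below is a Python approximation: `CAP` is an assumed definition of CapitalizedWord (an uppercase letter followed by letters), which the slides do not spell out.

```python
import re

CAP = r"[A-Z][A-Za-z]*"  # assumed definition of CapitalizedWord

# CapitalizedWord+ ("Company" | "Co" | "Inc" | "Ltd"), with an optional trailing period
COMPANY = re.compile(rf"(?:{CAP}\s+)+(?:Company|Co|Inc|Ltd)\b\.?")

# CapitalizedWord+ ("Grove" | "Forest" | "Village")
PLACE = re.compile(rf"(?:{CAP}\s+)+(?:Grove|Forest|Village)\b")

sentence = "Bridgestone Sports Co. said Friday it has set up a joint venture in Taiwan"
companies = COMPANY.findall(sentence)
print(companies)

# The place rule fires on a person's name: a false positive.
places = PLACE.findall("Intel Chairman Andy Grove")
print(places)
```

The second result shows why such rules need to be tested for precision as well as recall: the suffix "Grove" alone is not enough evidence of a place name.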
Basic group handling • Noun group (NG), verb group (VG), preposition (PR), conjunction (CJ)
1 NG: Bridgestone Sports Co.
2 VG: said
3 NG: Friday
4 NG: it
5 VG: had set up
6 NG: a joint venture
7 PR: in
8 NG: Taiwan
9 PR: with
10 NG: a local concern
11 CJ: and
12 NG: a Japanese trading house
13 VG: to produce
14 NG: golf clubs
15 VG: to be shipped
16 PR: to
17 NG: Japan
Complex phrase handling • Company+ SetUp JointVenture (“with” Company+)? • Structure merging • If the next sentence says something about the same event, the structures extracted from both sentences are merged.
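One way to apply a phrase-level rule such as Company+ SetUp JointVenture ("with" Company+)? is to map each tagged group to a one-letter class symbol and run a regular expression over the symbol string. The Python sketch below does this under several assumptions that are not in the slides: the group list is a simplified version of the example sentence (without "said Friday it" and "in Taiwan"), `COMPANY_GROUPS` is a hand-written lexicon, and conjunctions are simply skipped.

```python
import re

# Simplified tagged groups (assumption: intervening groups removed for clarity).
groups = [
    ("NG", "Bridgestone Sports Co."),
    ("VG", "had set up"),
    ("NG", "a joint venture"),
    ("PR", "with"),
    ("NG", "a local concern"),
    ("CJ", "and"),
    ("NG", "a Japanese trading house"),
]

# Hypothetical lexicon of noun groups that denote companies.
COMPANY_GROUPS = {"Bridgestone Sports Co.", "a local concern", "a Japanese trading house"}

def symbol(tag, text):
    """Map one tagged group to a class symbol: C, S, J, W, or O (other)."""
    if tag == "NG" and text in COMPANY_GROUPS:
        return "C"
    if tag == "VG" and text.endswith("set up"):
        return "S"
    if tag == "NG" and text.endswith("joint venture"):
        return "J"
    if tag == "PR" and text == "with":
        return "W"
    return "O"

# Company+ SetUp JointVenture ("with" Company+)?
RULE = re.compile(r"(C+)SJ(?:W(C+))?")

def match_joint_venture(groups):
    kept = [g for g in groups if g[0] != "CJ"]  # skip conjunctions in this sketch
    symbols = "".join(symbol(tag, text) for tag, text in kept)
    m = RULE.search(symbols)
    if m is None:
        return None
    founders = [kept[i][1] for i in range(*m.span(1))]
    partners = [kept[i][1] for i in range(*m.span(2))] if m.group(2) else []
    return {"event": "JointVenture", "entities": founders + partners}

result = match_joint_venture(groups)
print(result)
```

Because each symbol corresponds to exactly one group, the match spans can be used as indices back into the group list to recover the entity strings.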
A brief remark • IE works well for a restricted domain • The subjects and the ways they are mentioned must be determined in advance
Table Reading • Citation Extraction • Chinese NER
Semantic Search on Internet Tabular Information Extraction for Answering Queries CIKM 2000
Table Reading • Gives an algorithm to interpret tables of the type shown below, where some cells span multiple rows or columns. • An example of an interpretation is: (Attribute)=>(Value) (Adult-Price-Single Room-Economic class)=>35,450
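To make the (Attribute)=>(Value) reading concrete, here is a small Python sketch. It is not the algorithm of the CIKM 2000 paper: it assumes the number of header rows and header columns is already known, copies every spanning cell into all grid slots it covers, and joins the headers above and to the left of each data cell. The table layout, the "Tour class" corner cell, and the second value "32,500" are hypothetical; only the path Adult-Price-Single Room-Economic class and the value 35,450 come from the slide.

```python
# Each cell is (text, rowspan, colspan), listed row by row as in HTML.
rows = [
    [("Tour class", 3, 1), ("Adult", 1, 2)],
    [("Price", 1, 2)],
    [("Single Room", 1, 1), ("Double Room", 1, 1)],
    [("Economic class", 1, 1), ("35,450", 1, 1), ("32,500", 1, 1)],  # 32,500 is made up
]

def expand_spans(rows):
    """Copy each spanning cell's text into every grid slot it covers."""
    grid = {}
    for r, row in enumerate(rows):
        c = 0
        for text, rowspan, colspan in row:
            while (r, c) in grid:  # slot already taken by a rowspan from above
                c += 1
            for dr in range(rowspan):
                for dc in range(colspan):
                    grid[(r + dr, c + dc)] = text
            c += colspan
    return grid

def interpret(rows, header_rows, header_cols):
    """Return {attribute path: value} for every data cell."""
    grid = expand_spans(rows)
    n_rows = max(r for r, _ in grid) + 1
    n_cols = max(c for _, c in grid) + 1
    result = {}
    for r in range(header_rows, n_rows):
        for c in range(header_cols, n_cols):
            path = [grid[(h, c)] for h in range(header_rows)]
            path += [grid[(r, h)] for h in range(header_cols)]
            # Drop consecutive repeats produced by row-spanning headers.
            path = [p for i, p in enumerate(path) if i == 0 or p != path[i - 1]]
            result["-".join(path)] = grid[(r, c)]
    return result

table = interpret(rows, header_rows=3, header_cols=1)
print(table["Adult-Price-Single Room-Economic class"])
```

In real Web tables the header region is not given, which is why the approach in this talk tags cells and recognizes the layout before transforming it.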
Method (flow diagram; labels: HTML Table, Ambiguous Tagging, Relations of Cells, C-I Table, Layout Recognition, Layout Description, Layout Transformation, Layout Transition Rule, Database Table)
Method • Tagging • Layout Identifying • Layout Transformation
Tagging (C = concept, I = instance) • Cell tags: C: Departure Information, C: Departure City, C: Arrival City, I: Departure City, I: Arrival City • Cell relations: Concept vs. Descent Concept; Concept vs. Instance of the Concept; Instance vs. Instance of the same Concept
Four Relations of Table Cells • Relations of Concept - Instances • Concept - Instance of the Concept • Concept - Descent Concept • Concept - Instance of Descent Concept • Instance - Instance of the same Concept
Layout Recognition • C-I Table • Template matching against Layout Descriptions (defined by a Layout Syntax Grammar) • Matched Layout Description
Layout Transformation • Origin Layout Description -> Destination Layout Description
Experiments • 23 tables from 23 web pages • 13 two-dimensional tables, 10 complex tables • A table counts as a success only if nothing is missed; any miss counts as a failure
Conclusion & Future Work • Layout transformation from complex tables to simple tables (1D, 2D). • A general approach • 1. Tagging • 2. Semantic Layout Recognition • 3. Layout Transformation • Ambiguity is reduced by checking cell relations
Reference • Huei-Long Wang, Shih-Hung Wu, I. C. Wang, Cheng-Lung Sung, W. L. Hsu, and W. K. Shih, Semantic Search on Internet Tabular Information Extraction for Answering Queries, Ninth International Conference on Information and Knowledge Management (CIKM-2000), McLean, VA, November 6-11, 2000, pp. 243-249. (EI) • H.-H. Chen, S.-C. Tsai, and J.-H. Tsai, Mining Tables from Large Scale HTML Texts, in Proc. 18th International Conference on Computational Linguistics, Saarbrücken, Germany, July 2000.
Introduction • Integration of the bibliographical information of scholarly publications available on the Internet • Accurate reference metadata extraction from heterogeneous reference sources. • We propose a knowledge-based approach to reference metadata extraction • INFOMAP: ontological knowledge representation framework • Automatically extract the reference metadata.
Phase 1 Reference Data Collection • Journal Spider (journal agent) • collects journal data from the Journal Citation Reports (JCR) indexed by the ISI and from digital libraries on the Web. • Citation data sources • ISI Web of Science • DBLP • CiteSeer • PubMed
Phase 2 Domain Knowledge
INFOMAP • INFOMAP as an ontological knowledge representation framework • extracts important citation concepts from a natural language text. • Features of INFOMAP • represents and matches complicated template structures • hierarchical matching • regular expressions • semantic template matching • frame (non-linear relations) matching • Using INFOMAP, we can extract author, title, journal, volume, number (issue), year, and page information from different kinds of reference formats or styles.
Phase 3 Reference Metadata Extraction Table 1. Examples of different journal reference styles
Phase 4 Knowledge-based Reference Metadata Extraction - Online Service
Citation Extraction: From Text to BibTeX
Figure 3. The system input of knowledge-based RME
W. L. Hsu, "The coloring and maximum independent set problems on planar perfect graphs," J. Assoc. Comput. Machin., (1988), 535-563.
W. L. Hsu, "On the general feasibility test of scheduling lot sizes for several products on one machine," Management Science 29, (1983), 93-105.
W. L. Hsu, "The distance-domination numbers of trees," Operations Research Letters 1, (3), (1982), 96-100.
Figure 5. The system output in BibTeX format
@article{
  Author = {W. L. Hsu},
  Title = {The coloring and maximum independent set problems on planar perfect graphs,"},
  Journal = {J. Assoc. Comput. Machin.},
  Volume = {},
  Number = {},
  Pages = {535-563},
  Year = {1988}
}
@article{
  Author = {W. L. Hsu},
  Title = {On the general feasibility test of scheduling lot sizes for several products on one machine,"},
  Journal = {Management Science},
  Volume = {29},
  Number = {},
  Pages = {93-105},
  Year = {1983}
}
@article{
  Author = {W. L. Hsu},
  Title = {The distance-domination numbers of trees,"},
  Journal = {Operations Research Letters},
  Volume = {1},
  Number = {3},
  Pages = {96-100},
  Year = {1982}
}
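For a single, fixed reference style, the text-to-BibTeX step can be approximated with one regular expression. The Python sketch below handles only the style of the three W. L. Hsu references on this slide; it is not INFOMAP's template matching, which is designed to cope with many styles. The function names and the generated citation key (surname plus year) are assumptions made for this example.

```python
import re

# Pattern for one style only: Author, "Title," Journal [Vol], [(No),] (Year), Pages.
REF = re.compile(
    r'^(?P<author>[^"]+?),\s*'
    r'"(?P<title>[^"]+?),?"\s*'
    r'(?P<journal>.+?)'
    r'(?:\s+(?P<volume>\d+))?,\s*'
    r'(?:\((?P<number>\d{1,3})\),\s*)?'
    r'\((?P<year>\d{4})\),\s*'
    r'(?P<pages>\d+-\d+)\.?$'
)

def parse_reference(line):
    """Return a dict of metadata fields, or None if the line does not fit the style."""
    m = REF.match(line.strip())
    return m.groupdict() if m else None

def to_bibtex(fields):
    """Render the fields as a BibTeX entry; the key format is hypothetical."""
    key = fields["author"].split()[-1].lower() + fields["year"]
    lines = [f"@article{{{key},"]
    for name in ("author", "title", "journal", "volume", "number", "pages", "year"):
        lines.append(f"  {name} = {{{fields[name] or ''}}},")
    lines.append("}")
    return "\n".join(lines)

references = [
    'W. L. Hsu, "The coloring and maximum independent set problems on planar '
    'perfect graphs," J. Assoc. Comput. Machin., (1988), 535-563.',
    'W. L. Hsu, "On the general feasibility test of scheduling lot sizes for '
    'several products on one machine," Management Science 29, (1983), 93-105.',
    'W. L. Hsu, "The distance-domination numbers of trees," '
    'Operations Research Letters 1, (3), (1982), 96-100.',
]

parsed = [parse_reference(r) for r in references]
for fields in parsed:
    print(to_bibtex(fields))
```

Missing volume or issue numbers come back as `None` and are rendered as empty fields. A different style (for example, year before title) would need its own pattern, which is the maintenance burden that a knowledge-based approach aims to reduce.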
Figure 6. The online service of knowledge-based RME (http://bioinformatics.iis.sinica.edu.tw/CitationAgent/), showing the plain-text system input, the system output, and the BibTeX output
Experimental Results and Discussion • Experimental data • We used EndNote to collect Bioinformatics citation data for 2004 from PubMed. • A total of 907 bibliography records were collected from the PubMed digital library on the Web. • Reference testing data was generated for each of the six reference styles (BIOI, ACM, IEEE, APA, MISQ, and JCB). • For each of the six reference styles, 500 records were randomly selected for testing.