CSE 635 Multimedia Information Retrieval Information Extraction
Overview • Introduction to IE • Named Entity tagger • HMM approach • Relationship/Event detection • Text Mining • intelligence applications
Information Extraction • What is IE • The identification of instances of a particular class of events or relationships in a natural language text, and the extraction of the relevant arguments of the event or relationship. (MUC, de facto) • Information Extraction involves the creation of a structured representation (such as a database) of selected information drawn from the text. (Grishman 1997) • identification of key entities, relationships between them, and significant activity involving these entities (Srihari) Goals of IE • transform unstructured text into structured/semi-structured text • automatic template-filling • automatically populate databases • facilitate information discovery • sometimes, what you don’t know is most important; if you know what you are looking for, use a search engine! IE permits information discovery
Information to Intelligence (figure) • Stages shown: Unstructured Data; Entities, relationships, events; Intelligence (text mining, analytics) • Entity labels: People, Company Information, Product • Example headlines: Ronald Brumback Named Pres. & COO of Top Layer Networks; INTC drops X%; Top INTC executive, John Doe, leaves to join Transmeta as VP Engineering; RF Micro Devices Introduces Cellular CDMA LNA and PA Driver Amplifier with Bypass Switch; Microsoft, Lockheed eye federal deals; C-bridge, eXcelon to merge; Transmeta Scores Latest Crusoe Win with Sharp; FedEx to Cut 130 Jobs in Texas • Example questions: What caused INTC shares to drop? What’s new from RFMD?
Levels of Information Extraction MUC identifies the following levels of extraction: • Named Entity Tagging • Bill Gates is the chairman of Microsoft • Relationship Detection: leads to entity profiles • chairman-of(Bill Gates, Microsoft) • Event Detection • executive change • person_in, person_out • company_involved • date • Scenario Extraction • Bombing incident • where • # of casualties • reason • follow-up • events involved: ordered sequentially
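The relationship and event levels above amount to filling structured records. A minimal Python sketch, assuming hypothetical class names (`Relationship`, `ExecutiveChangeEvent`) and using the slot names listed on the slide:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Relationship:
    """A binary relationship between two named entities."""
    relation: str
    arg1: str
    arg2: str


@dataclass
class ExecutiveChangeEvent:
    """Template for the 'executive change' event; unfilled slots stay None."""
    person_in: Optional[str] = None
    person_out: Optional[str] = None
    company_involved: Optional[str] = None
    date: Optional[str] = None

    def filled_slots(self) -> dict:
        # Only the slots the extractor managed to fill.
        return {name: value for name, value in vars(self).items() if value is not None}


# chairman-of(Bill Gates, Microsoft), as on the slide
rel = Relationship("chairman-of", "Bill Gates", "Microsoft")

# A partially filled template: the text gave no outgoing person or date.
event = ExecutiveChangeEvent(person_in="Ronald Brumback",
                             company_involved="Top Layer Networks")
print(rel)
print(event.filled_slots())
```

Leaving slots as `None` reflects that template filling is usually partial: a single sentence rarely supplies every argument of an event.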
Named Entity Tagging Bridgestone Sports Co. said Friday it has set up a joint venture in Hong Kong with a local concern and a Japanese trading house to produce golf clubs to be shipped to Japan. The joint venture, Bridgestone Sports Hong Kong Co., capitalized at 20 million Hong Kong dollars, will start production in January 1990 with production of 20,000 iron and "metal wood" clubs a month. The monthly output will be later raised to 50,000 units, Bridgestone Sports spokesman Tom White said. The new company, based in Kaohsiung, southern Hong Kong, is owned 75 pct by Bridgestone Sports, 15 pct by Union Precision Casting Co. of Hong Kong and the remainder by Taga Co., a company active in trading with Hong Kong, the officials said.
Output of Named Entity Tagger <company>Bridgestone Sports Co.</company> said <date>Friday</date> it has set up a joint venture in <city>Hong Kong</city> with a local concern and a <ethnic>Japanese</ethnic> trading house to produce golf clubs to be shipped to <country>Japan</country>. The joint venture, <company>Bridgestone Sports Hong Kong Co.</company>, capitalized at <money>20 million Hong Kong dollars</money>, will start production in <date>January 1990</date> with production of 20,000 iron and "metal wood" clubs a month. The monthly output will be later raised to 50,000 units, <company>Bridgestone Sports</company> spokesman <man>Tom White</man> said.
Named-Entity Definition • Named-entity is a word or phrase that denotes a proper name such as person, organization, location, product, temporal expression and numerical expression. • Name classes are associated with individual words. • A named-entity is associated with a contiguous word sequence with the same name class.
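The last two points (classes attach to words; an entity is a contiguous run with the same class) can be sketched in Python. The function name `group_entities` is hypothetical, and `NN` is the "not name" class used elsewhere in these slides:

```python
from itertools import groupby


def group_entities(words, name_classes, not_name="NN"):
    """Merge contiguous words that share a name class into one entity."""
    entities = []
    tagged = zip(words, name_classes)
    for name_class, run in groupby(tagged, key=lambda pair: pair[1]):
        run_words = [word for word, _ in run]
        if name_class != not_name:
            entities.append((name_class, " ".join(run_words)))
    return entities


words = "it has set up a joint venture in Hong Kong".split()
tags = ["NN"] * 8 + ["LO", "LO"]
print(group_entities(words, tags))
```

One limitation of this simple grouping: two distinct entities of the same class that are directly adjacent would be merged into one, so a practical tagger needs some boundary marking in addition to the class label.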
Entity Profiles
<Person Profile id=1>:
name: Waleed Alshehri
aliases: Waleed
position: a Saudi commercial pilot
age: mid-20s
gender: MALE
education: Embry-Riddle Aeronautical University; FlightSafety Academy
associations: Satam Al Suqami; Wail Alshehri; Homing Inn; American Flight 11
events involved: <graduated>; <hijacking>; <suicide attack>
descriptors: quiet and private; Middle Eastern backgrounds; another of the eventual hijackers
Event Detection
Event: <MOVEMENT>
who: 23 foreign fighters
whereto: into Pakistan
Location: Pakistan, Afghanistan
When: normalStr=020622 Monday
Snippet: Pakistan said Monday its troops arrested 23 foreign fighters trying to cross from Afghanistan into Pakistan over the weekend.
Event: <CONTRACT>
Money_involved: £5.9 million ($8.9 million)
Who: CVF Team, Thomson-CSF, Lockheed Martin, Raytheon, BMT Defense Services, Defense Procurement Agency
When: normalStr=021100 last November
Snippet: The BAE Systems-led CVF Team and a rival Thomson-CSF group, including Lockheed Martin, Raytheon and BMT Defense Services, were awarded parallel £5.9 million ($8.9 million) contracts by the Defense Procurement Agency last November to undertake first-stage assessment phase work for CVF.
3 Major Approaches to IE • Layout-based • wrapper induction • application focused: e.g. jobs database, processing resumes, etc. • IR-based • “concept” extraction • uses techniques such as pattern matching, proximity, co-occurrence • often seen in Knowledge Management applications (e.g. hardware) • NLP-based • statistical techniques (POS tagging, NE tagging) • grammatical techniques • more sophisticated levels of IE possible
Convergence of NLP-driven and IR-driven Approaches to IE Focus on Recall Focus on Precision
Challenges in IE • Normalization • temporal references (today, last year, during the Olympics …) • spatial references (Buffalo) • Alias resolution • George Bush, President Bush • IBM, “the company” • Verb concepts • kill, murder, assassinate, etc. • Diversity of sources • web documents, e-mail, powerpoint, speech/OCR transcripts • sophisticated pre-processing required • Cross-document information consolidation • Rapid domain porting • Intuitive user interface • should support decision making • work flow, visualization, etc.
Homeland Defense: Track Key Entities Based on Watch Lists Discover Other Related Information
Name-Class Definition
• OR: organization
• CO: company “Bridgestone Sports Co.”, “Bridgestone Sports Hong Kong Co.”, “Bridgestone Sports”
• LO: location
• CI: city “Hong Kong”
• CT: country “Japan”
• PE: person
• MAN: man “Tom White”
• TI: time
• DA: date “Friday”
• NN: not name “said”, “it has set up a joint venture”, “with a local concern and a”, “trading house to produce golf”
Name-Class Tree There are 6 top-level name-classes and 35 sub-type name-classes. • Time -- Hour, Part Day, Duration, Frequency, Age, Day, Month, Season, Year, Decade, Century • Location -- City, Province, Country, Continent, Ocean, Lake, River, Mountain, Road, Region, District, Airport • Organization -- Company, Government, Association, School, Army, Mass Media • Person -- Man, Woman • Product -- Vehicle, Software • Event -- Conference, Exhibition
Application of Named Entity Tagging • Question-Answering System • Q: Where did Bridgestone Sports Co. set up a joint venture? • A: Hong Kong • Q: When did Bridgestone Sports Hong Kong Co. start production? • A: January 1990 • Q: Who is the spokesman for Bridgestone Sports? • A: Tom White
Question Asking Points and Named Entities Where Location Q: Where did Bridgestone Sports Co. set up a joint venture? A: Hong Kong When Time Q: When did Bridgestone Sports Hong Kong Co. start production? A: January 1990 Who Person Q: Who is the spokesman for Bridgestone Sports? A: Tom White
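The mapping from asking point to name class can be sketched as a lookup followed by a filter over tagged entities. This is only an illustration: the names `ASKING_POINT_TO_NAME_CLASS` and `candidate_answers` are hypothetical, and a real question-answering system would also need to match the rest of the question against the text.

```python
# Asking point (question word) -> expected top-level name class, as on the slide.
ASKING_POINT_TO_NAME_CLASS = {
    "where": "Location",
    "when": "Time",
    "who": "Person",
}


def expected_name_class(question):
    """Return the name class implied by the question word, or None."""
    tokens = question.strip().split()
    if not tokens:
        return None
    return ASKING_POINT_TO_NAME_CLASS.get(tokens[0].lower())


def candidate_answers(question, tagged_entities):
    """Keep only the entities whose class matches the asking point."""
    wanted = expected_name_class(question)
    return [text for name_class, text in tagged_entities if name_class == wanted]


entities = [("Location", "Hong Kong"), ("Time", "January 1990"), ("Person", "Tom White")]
print(candidate_answers("Who is the spokesman for Bridgestone Sports?", entities))
```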
Application of Named Entity Tagging (contd.) Support other Information Extraction tasks • Extract Correlated Entities (relationship): entity 1: Tom White (man) • relation: employed by • entity 2: Bridgestone Sports (company) • Extract events: predicate: start • argument 1: Bridgestone Sports Hong Kong Co. (company) • argument 2: production • time: January 1990 (date)
Other Applications of NE • Search engines • text categorization/filtering • data mining
Statistical Model for Named Entity Tagging Given a sequence of words (W), our goal is to find the sequence of name-classes (NC) with maximum Pr(NC|W). For example:
Word sequence: it has set up a joint venture in Hong Kong
Possible name-class sequences:
it has set up a joint venture in Hong Kong
NN NN NN NN NN NN NN NN LO LO
LO NN NN NN NN NN NN NN OR LO
Statistical Model for Named Entity Tagging (contd.) • Construct a manually tagged training corpus. • Extract the necessary statistics from the corpus to build a statistical model that can automatically compute Pr(NC Sequence | W Sequence) for unseen data. • Search for the NC sequence that maximizes the probability Pr(NC Sequence | W Sequence). Figure: the corpus yields the statistical model, which is then used to tag unseen data.
Statistical Model for Named Entity Tagging (contd.) • The training corpus is large enough to provide fairly good unigram and bigram information. • unigram example: Pr(Organization | “US”) • bigram example: Pr(Organization | “US”, “the”) • The training corpus is too small to support any direct estimation beyond bigrams. • Question: how do we evaluate Pr(NC Sequence | Sentence) from the unigram and bigram information above? • One solution: transform the conditional probability into the joint probability of (NC, Sentence) (Bayes’ rule) • Decompose the sentence into bigram sequences (Markov assumption)
Bayes’ Rule Using Bayes’ rule, we have
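The equation from this slide is not preserved in the text. Assuming W is the word sequence and NC the name-class sequence as defined on the earlier slides, the step is presumably the following (a reconstruction, not the original formula):

```latex
\Pr(NC \mid W) = \frac{\Pr(W, NC)}{\Pr(W)}
\qquad\Longrightarrow\qquad
\arg\max_{NC} \Pr(NC \mid W) = \arg\max_{NC} \Pr(W, NC)
```

The denominator Pr(W) is the same for every candidate name-class sequence, so maximizing the conditional probability is equivalent to maximizing the joint probability.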
Markov Assumption By Markov assumption, we have
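The equation from this slide is not preserved in the text. A reconstruction consistent with the bigram statistics described earlier and with the transition distribution p(symbol_n, state_n | symbol_{n-1}, state_{n-1}) of the Hidden Markov Model slide, writing W = w_1 … w_n, NC = nc_1 … nc_n, and taking (w_0, nc_0) to be the start marker <SS>:

```latex
\Pr(W, NC)
  = \prod_{i=1}^{n} \Pr(w_i, nc_i \mid w_1, nc_1, \ldots, w_{i-1}, nc_{i-1})
  \approx \prod_{i=1}^{n} \Pr(w_i, nc_i \mid w_{i-1}, nc_{i-1})
```

The first equality is the chain rule; the approximation is the Markov assumption that each (word, name-class) pair depends only on the immediately preceding pair.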
Markov Assumption (contd.) So the final formula is
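The final formula is likewise missing from the text. Under the assumptions that W = w_1 … w_n, NC = nc_1 … nc_n, and (w_0, nc_0) is the start marker <SS>, it is presumably:

```latex
\widehat{NC}
  = \arg\max_{nc_1 \ldots nc_n} \prod_{i=1}^{n} \Pr(w_i, nc_i \mid w_{i-1}, nc_{i-1})
  = \arg\max_{nc_1 \ldots nc_n} \sum_{i=1}^{n} \log \Pr(w_i, nc_i \mid w_{i-1}, nc_{i-1})
```

The log form is equivalent because the logarithm is monotonic, and it matches the negative additive scores used in the Viterbi search slide.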
Hidden Markov Model • Define a Hidden Markov Model as follows: • An output alphabet Ή = {0, 1, …, V-1} • A state space ф = {1, 2, …, c} • A transition probability distribution between states and associated output symbols, p(symbol_n, state_n | symbol_{n-1}, state_{n-1}). • In the case of named entity tagging, regard the words as the output symbols and the tags as the states. The statistical NE model above is a Hidden Markov Model. • Trellis (figure): after the start state <SS>, each word W1, W2, W3, W4, … has one node per name class (PE, LO, OR).
Statistics Estimation The generation of words and name-classes proceeds in three steps: The maximum likelihood estimates (MLE) of the above probabilities are as follows:
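The slide's own estimation formulas are not preserved in the text. As a generic illustration of maximum likelihood estimation for the bigram distribution Pr(w_i, nc_i | w_{i-1}, nc_{i-1}), the sketch below uses plain relative frequencies from a tagged corpus. The function name `estimate_bigram_model` and the toy corpus are assumptions, and the sketch omits the smoothing or back-off that sparse data would require in practice.

```python
from collections import Counter

START = ("<SS>", "<SS>")  # assumed start marker for (word, name class)


def estimate_bigram_model(tagged_sentences):
    """Relative-frequency estimate of Pr(current pair | previous pair).

    Each sentence is a list of (word, name_class) pairs.
    """
    pair_counts = Counter()     # count(previous pair, current pair)
    context_counts = Counter()  # count(previous pair)
    for sentence in tagged_sentences:
        previous = START
        for word, name_class in sentence:
            current = (word, name_class)
            pair_counts[(previous, current)] += 1
            context_counts[previous] += 1
            previous = current
    return {
        (previous, current): count / context_counts[previous]
        for (previous, current), count in pair_counts.items()
    }


corpus = [
    [("in", "NN"), ("Hong", "LO"), ("Kong", "LO")],
    [("in", "NN"), ("Japan", "LO")],
]
model = estimate_bigram_model(corpus)
print(model[(("in", "NN"), ("Hong", "LO"))])
```

In this toy corpus, ("in", "NN") is followed once by ("Hong", "LO") and once by ("Japan", "LO"), so each continuation gets half of the probability mass.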
Easy and Difficult Cases • Some cases are easy • Matsushita Electric Industrial Co. has reached agreement … • Victor Co. of Japan (JVC) and Sony Corp. ... • Some cases are particularly difficult: • In a factory of Blaupunkt Werke, a Robert Bosch subsidiary, … • Touch Panel Systems, capitalized at 50 million Yen is owned ...
Machine learning vs. handcrafted rules • Handcrafted finite state patterns can be very effective: • <proper-noun>+ <corporate designator> --> <corporation> e.g. Sony Corp. • Problems with handcrafted approach • each new source requires tweaking, i.e. domain porting can be tedious • speech recognition transcript, OCR require modification of rules • rules for different languages are radically different • Machine learning approach more scalable • exception: numerical expressions, other patterns which are very regular, e.g. contact information telephone numbers, URLs, postal addresses, etc.
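The handcrafted pattern `<proper-noun>+ <corporate designator> --> <corporation>` can be approximated with a regular expression. In this sketch a "proper noun" is crudely taken to be any capitalized word, and the designator list is an illustrative assumption rather than the list used by any particular system.

```python
import re

# Illustrative designator list; a real system would use a much longer one.
CORPORATE_DESIGNATORS = ("Corp.", "Co.", "Inc.", "Ltd.")

_designators = "|".join(re.escape(d) for d in CORPORATE_DESIGNATORS)

# One or more capitalized words followed by a corporate designator.
CORPORATION = re.compile(rf"(?:[A-Z][A-Za-z]*\s+)+(?:{_designators})")


def find_corporations(text):
    """Return every substring matching <proper-noun>+ <corporate designator>."""
    return [match.group(0) for match in CORPORATION.finditer(text)]


sentence = "Matsushita Electric Industrial Co. has reached agreement with Sony Corp. today"
print(find_corporations(sentence))
```

The rule also shows why such patterns need tweaking per source: a capitalized sentence-initial word such as "The" would be absorbed into the company name, and a name without a designator (for example "Touch Panel Systems") is missed entirely.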
NE tagger- Bikel et al • PDF file
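Viterbi Search • The Viterbi search algorithm is used to find the NC sequence that maximizes the probability Pr(NC Sequence | W Sequence). • Trellis (figure): after the start state <SS>, each word W1, W2, W3, W4, … has one node per name class (PE, LO, OR). • The best paths reaching the nodes associated with W1 are self-evident. • Three paths reach the node (W2, PE): (PE, PE, -1.0), (LO, PE, -1.5), (OR, PE, -0.95). The best path reaching (W2, PE) is (OR, PE, -0.95). • Compute the best paths reaching the other nodes associated with W2 in the same way. • Keep only the best path reaching each node and continue the same computation for the next word. • Scores shown in the figure: -0.2, -0.8, -1.2, -0.3, -0.9, -0.05. Taken in pairs they sum to the three path scores: -0.2 + -0.8 = -1.0, -1.2 + -0.3 = -1.5, -0.9 + -0.05 = -0.95.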
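The search can be sketched in Python over log scores. This is a minimal illustration, not the slide author's implementation: `viterbi` and `toy_log_score` are hypothetical names, the assignment of the six figure scores to the first step and to transitions into PE is an assumption, and -5.0 is an arbitrary score for transitions the slide does not show.

```python
def viterbi(words, tags, log_score):
    """Return (best_score, best_tag_path) for the word sequence.

    log_score(prev_word, prev_tag, word, tag) should return
    log Pr(word, tag | prev_word, prev_tag).
    """
    if not words:
        return 0.0, []
    # Best path reaching each node of the first word, starting from <SS>.
    best = {tag: (log_score("<SS>", "<SS>", words[0], tag), [tag]) for tag in tags}
    for i in range(1, len(words)):
        new_best = {}
        for tag in tags:
            candidates = [
                (score + log_score(words[i - 1], prev_tag, words[i], tag), path + [tag])
                for prev_tag, (score, path) in best.items()
            ]
            # Keep only the best path reaching this node.
            new_best[tag] = max(candidates, key=lambda candidate: candidate[0])
        best = new_best
    return max(best.values(), key=lambda candidate: candidate[0])


# Toy scores: first-word scores and transitions into PE (assumed reading of the figure).
TOY_SCORES = {
    ("<SS>", "PE"): -0.2, ("<SS>", "LO"): -1.2, ("<SS>", "OR"): -0.9,
    ("PE", "PE"): -0.8, ("LO", "PE"): -0.3, ("OR", "PE"): -0.05,
}


def toy_log_score(prev_word, prev_tag, word, tag):
    # Words are ignored in this toy table; unseen transitions get an arbitrary low score.
    return TOY_SCORES.get((prev_tag, tag), -5.0)


score, path = viterbi(["W1", "W2"], ["PE", "LO", "OR"], toy_log_score)
print(path, round(score, 2))
```

Because only the best path into each node is kept, the work per word is proportional to the square of the number of name classes, rather than growing exponentially with sentence length.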
What next? • We know how to tag NEs locally. What next? • Alias resolution • George W. Bush, President Bush, Bush • Relationship extraction • affiliation • spouse • address • Event Detection • Entity Profiles
Extracting relationships and events • Two major approaches • grammatical • statistical • Grammatical approaches • require SVO parsing, semantic parsing as a first step • followed by specialized relationship and event extraction grammars • Two approaches here also: • one behemoth grammar (CFG) • cascaded, finite state grammars • Statistical approaches • supervised learning approach • unsupervised approach using extraction patterns
Architecture of InfoXtract Engine/Platform (figure) • Document Processor: Source Document, Tokenizer, Zoned Text Document, Token List • Process Manager: Web Server (HTTP Post, HTTP Response, CORBA) • Output Manager: XML Formatted Extracted Document, Document & Error Log • Natural Language Processing, Linguistic Modules: Lexicon Lookup, POS Tagging, Named Entity Detection, Time/location Normalization, Number Normalization, Shallow Parsing, Semantic Parsing, Relationship Detection, Alias/Coreference Linking, Pragmatic Filtering, Profile/Event Merge, Profile/Event Linking • Outputs: NE, CE, SVO, CO, GE, PE, Profile • Knowledge Resources: Lexicon Resources, Grammars, Statistical Model • Legend (module types): FST Module, Procedure or Statistical Model, Hybrid Model • Abbreviations: NE: Named Entity; CE: Correlated Entity; SVO: Subject-Verb-Object; CO: Co-reference; GE: General Event; PE: Pre-defined Event; POS: Part Of Speech; FST: Finite State Transducer
Adapting FSTs for NLP engines • Traditionally, FSTs have operated on character streams, both input and output • primarily used in lexical transducers • InfoXtract tokenizer converts input stream into tokenlist: all subsequent modules operate on tokenlist • tokenlist contains the following information: • linguistic features (POS, semantic class from WordNet etc.) • linguistic structures derived from NLP (e.g., SVO) • information extraction output: NE, relationships, events • pointers to tokens (text offsets) • real objects (text strings) as well as virtual objects • FST grammars operate on tokenlists and can utilize features at several levels • character/string level, structure level • equivalent to tree-walking automata
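To make the tokenlist idea concrete, here is a minimal Python sketch of tokens that carry text offsets plus a feature dictionary, with a matcher that tests features rather than characters. It is not the InfoXtract data structure or API: `Token`, `tokenize`, and `match_feature_pattern` are hypothetical names, and the matcher handles only fixed-length patterns, which is far simpler than a finite state transducer.

```python
from dataclasses import dataclass, field


@dataclass
class Token:
    text: str    # the string itself
    start: int   # character offset where the token begins
    end: int     # character offset just past the token
    features: dict = field(default_factory=dict)  # e.g. {"pos": "NNP", "ne": "CO"}


def tokenize(text):
    """Whitespace tokenizer that records text offsets for each token."""
    tokens, offset = [], 0
    for piece in text.split():
        start = text.index(piece, offset)
        end = start + len(piece)
        tokens.append(Token(piece, start, end))
        offset = end
    return tokens


def match_feature_pattern(tokens, pattern):
    """Return start indices where consecutive tokens satisfy the pattern.

    The pattern is a list of dicts; each dict lists the feature values
    required of the token at that position.
    """
    if not pattern:
        return []
    hits = []
    for i in range(len(tokens) - len(pattern) + 1):
        window = tokens[i:i + len(pattern)]
        if all(
            all(token.features.get(name) == value for name, value in required.items())
            for token, required in zip(window, pattern)
        ):
            hits.append(i)
    return hits


source = "Sony Corp. said"
tokens = tokenize(source)
for token, pos in zip(tokens, ["NNP", "NNP", "VBD"]):
    token.features["pos"] = pos  # features added by a (hypothetical) POS tagger

print(match_feature_pattern(tokens, [{"pos": "NNP"}, {"pos": "NNP"}]))
```

Later modules can keep adding entries to `features` (an NE class, an SVO role), so a grammar can mix string-level and structure-level tests on the same list.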