Information Extraction: A Practical Survey. Mihai Surdeanu, TALP Research Center, Dep. Llenguatges i Sistemes Informàtics, Universitat Politècnica de Catalunya. surdeanu@lsi.upc.es
Overview • What is information extraction? • A “traditional” system and its problems • Pattern learning and classification • Beyond patterns
What is information extraction? • The extraction or pulling out of pertinent information from large volumes of texts. (http://www.itl.nist.gov/iad/894.02/related_projects/muc/index.html) • Information extraction (IE) systems extract concepts, events, and relations that are relevant for a given scenario domain. • But, what is a concept, an event, or a scenario domain? Actual implementations of IE systems varied throughout the history of the task: MUC, Event99, EELD. • The tendency is to simplify the definition (or rather the implementation) of the task.
Information Extraction at the Message Understanding Conferences • Seven MUC conferences, between 1987 and 1998. • Scenario domains driven by template specifications (fairly similar to database schemas), which define the content to be extracted. • Each event fills exactly one template (fairly similar to a database record). • Each template slot contains either text, or pointers to other templates. • The goal was to use IE technology to populate relational databases. Never really happened: • The chosen representation was too complicated. • Did not address real-world problems, but artificial benchmarks. • Systems never achieved good-enough accuracy.
MUC-6 “Management Succession” Example
Text: …Barry Diller was appointed chief executive officer of QVC Network Inc…
MUC-6 template:
<SUCCESSION_EVENT-9301190125-1> :=
    SUCCESSION_ORG: <ORGANIZATION-9301190125-1>
    POST: “chief executive officer”
    IN_AND_OUT: <IN_AND_OUT-9301190125-1> <IN_AND_OUT-9301190125-2>
    VACANCY_REASON: REASSIGNMENT
<IN_AND_OUT-9301190125-1> :=
    IO_PERSON: <PERSON-9301190125-1>
    NEW_STATUS: IN
    ON_THE_JOB: UNCLEAR
    OTHER_ORG: <ORGANIZATION-9301190125-2>
    REL_OTHER_ORG: OUTSIDE_ORG
    COMMENT: “Barry Diller IN”
…
<ORGANIZATION-9301190125-1> :=
    ORG_NAME: “QVC Network Inc.”
    ORG_TYPE: COMPANY
Legend: a slot such as POST holds a text fill; a slot such as SUCCESSION_ORG points to another template.
Information Extraction at DARPA's HUB-4 Event99 • Was planned as a successor of MUC. • Identification and extraction of relevant information dictated by templettes, which are “flat”, simplified templates. Slots are filled only with text; no pointers to other templettes are accepted. • Domains closer to real-world applications are addressed: natural disasters, bombings, deaths, elections, financial fluctuations, illness outbreaks. • The goal was to provide event-level indexing into documents such as news wires and radio and television transcripts. Imagine querying “BOMBING AND Gaza” in news messages and retrieving only the relevant text about bombing events in the Gaza area, classified into templettes. • Event99: A Proposed Event Indexing Task For Broadcast News. Lynette Hirschman et al. (http://citeseer.nj.nec.com/424439.html)
Event99 “Death” Example: Templettes Versus Templates
Text: The sole survivor of the car crash that killed Princess Diana and Dodi Fayed last year in France is remembering more about the accident.
Event99 templette:
<DEATH-CNN3-1> :=
    DECEASED: “Princess [Diana]” / “[Dodi Fayed]”
    MANNER_OF_DEATH: “the car [crash] that killed Princess Diana and Dodi Fayed” / “the [accident]”
    LOCATION: “in [France]”
    DATE: “last [year]”
Compare with the nested MUC-6 “Management Succession” template (<SUCCESSION_EVENT-9301190125-1>, <IN_AND_OUT-9301190125-1>, <ORGANIZATION-9301190125-1>), whose slots may point to other templates.
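The difference between the two representations can be sketched as data structures. This is only an illustration: the class names (`Organization`, `InAndOut`, `SuccessionEvent`, `DeathTemplette`) are hypothetical and belong to neither specification, and IO_PERSON is simplified to a text fill. A MUC-6 template slot may hold a reference to another template object, while an Event99 templette holds only text.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Hypothetical classes for illustration; field names follow the slot labels.

@dataclass
class Organization:
    org_name: str
    org_type: str

@dataclass
class InAndOut:
    io_person: str                               # simplified: text instead of a PERSON template
    new_status: str
    other_org: Optional[Organization] = None     # pointer to another template

@dataclass
class SuccessionEvent:                           # MUC-6 style: nested, pointer-valued slots
    succession_org: Organization
    post: str
    in_and_out: List[InAndOut] = field(default_factory=list)
    vacancy_reason: Optional[str] = None

@dataclass
class DeathTemplette:                            # Event99 style: flat, text-only slots
    deceased: List[str] = field(default_factory=list)
    manner_of_death: List[str] = field(default_factory=list)
    location: Optional[str] = None
    date: Optional[str] = None

qvc = Organization("QVC Network Inc.", "COMPANY")
event = SuccessionEvent(qvc, "chief executive officer",
                        [InAndOut("Barry Diller", "IN")], "REASSIGNMENT")
death = DeathTemplette(
    ["Princess Diana", "Dodi Fayed"],
    ["the car crash that killed Princess Diana and Dodi Fayed", "the accident"],
    "in France", "last year")
```

Reading the organization name from the template requires following a pointer (`event.succession_org.org_name`), whereas every templette slot can be read directly as a string.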
Information Extraction at DARPA´s Evidence Extraction and Link Detection (EELD) Program • IE used as a tool for the more general problem of link discovery: sift through large data collections and derive complex rules from collections of simpler IE patterns. • Example: certain sets of account_number(Person,Account), deposit(Account,Amount), greater_than(Amount,reporting_amount) patterns imply is_a(Person, money_launderer). Note: the fact that Person is a money_launderer is not stated in any form in text! • IE used to identify concepts (typically named entities), events (typically identified by trigger words), and basic entity-entity and entity-event relations. • Simpler IE problem: • No templates or templettes generated. • Not dealing with event merging. • Events always marked by trigger words, e.g. “murder” triggers a MURDER event. • Relations are always intra-sentential. • EELD web portal: http://www.rl.af.mil/tech/programs/eeld/
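The money-laundering rule above can be read as a join over extracted facts. The sketch below is a deliberately naive reading, not the EELD method: the function name, the fact encoding as tuples, and the threshold value are all assumptions made for illustration.

```python
REPORTING_AMOUNT = 10_000  # hypothetical threshold, not taken from the source

def money_launderers(account_number, deposit, reporting_amount=REPORTING_AMOUNT):
    """Apply: account_number(P, A) & deposit(A, X) & greater_than(X, reporting_amount)
    => is_a(P, money_launderer).

    account_number: iterable of (person, account) facts produced by IE.
    deposit: iterable of (account, amount) facts produced by IE.
    """
    flagged = set()
    for person, account in account_number:
        for dep_account, amount in deposit:
            if dep_account == account and amount > reporting_amount:
                flagged.add(person)
    return flagged

facts_accounts = [("John Smith", "ACC-1"), ("Jane Doe", "ACC-2")]
facts_deposits = [("ACC-1", 25_000), ("ACC-2", 300)]
suspects = money_launderers(facts_accounts, facts_deposits)
```

The conclusion is derived from the combination of facts; no single extracted pattern states it, which is the point the slide makes.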
EELD Example
Text: John Smith is the chief scientist of Hardcom Corporation.
    Entities: Person(John Smith), Organization(Hardcom Corporation)
    Events: --
    Relations: person-affiliation(Person(John Smith), Organization(Hardcom Corporation))
Text: The murder of John Smith…
    Entities: Person(John Smith)
    Events: Murder(murder)
    Relations: murder-victim(Person(John Smith), Murder(murder))
Overview • What is information extraction? • A “traditional” system and its problems • Pattern learning and classification • Beyond patterns
Traditional IE Architecture • The Finite State Automaton Text Understanding System (FASTUS) approach: cascaded finite state automata (FSA). • Each FSA level recognizes larger linguistic constructs (from tokens to chunks to clauses to domain patterns), which become the simplified input for the next FSA in the cascade. • Why? Speed. Robustness to unstructured input. Handles data sparsity well. • The FSA cascade is enriched with limited discourse processing components: coreference resolution and event merging. • Most systems in MUC ended up using this architecture: CIRCUS from UMass (which was actually the first to introduce the cascaded FSA architecture), PROTEUS (NYU), PLUM (BBN), CICERO (LCC) and many others. • An ocean of information is available: • FASTUS: A Cascaded Finite-State Transducer for Extracting Information from Natural-Language Text. Jerry R. Hobbs et al. http://www.ai.sri.com/natural-language/projects/fastus-schabes.html • Infrastructure for Open-Domain Information Extraction. Mihai Surdeanu and Sanda Harabagiu. http://www.languagecomputer.com/papers/hlt2002.pdf • Rich IE bibliography maintained by Horacio Rodriguez at: http://www.lsi.upc.es/~horacio/varios/sevilla2001.zip
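The cascade idea can be sketched in a few lines. This is a toy illustration, not FASTUS or CICERO code: the lexicon, the tag set, and the single domain pattern are invented for the example, and regular expressions over one-letter tag codes stand in for the finite state automata. Each level consumes the simplified output of the previous one.

```python
import re

# Hypothetical toy resources; a real system uses far larger lexicons and grammars.
LEXICON = {"a": "DET", "the": "DET", "car": "N", "bomb": "N", "police": "N",
           "station": "N", "wrecked": "V", "destroyed": "V"}
TAG_CODE = {"DET": "D", "N": "N", "V": "V", "UNK": "U"}
CHUNK_RE = re.compile(r"(?P<NP>D?N+)|(?P<VP>V+)")
DESTROY_VERBS = {"wrecked", "destroyed"}

def tag(tokens):
    """Level 1: tokens -> (token, coarse tag) pairs."""
    return [(t, LEXICON.get(t.lower(), "UNK")) for t in tokens]

def chunk(tagged):
    """Level 2: tagged tokens -> (phrase text, NP or VP) chunks."""
    codes = "".join(TAG_CODE[t] for _, t in tagged)
    chunks = []
    for m in CHUNK_RE.finditer(codes):
        words = " ".join(tok for tok, _ in tagged[m.start():m.end()])
        chunks.append((words, m.lastgroup))
    return chunks

def domain_patterns(chunks):
    """Level 3: chunk sequence -> events, using one NP VP NP pattern."""
    codes = "".join("N" if label == "NP" else "V" for _, label in chunks)
    events = []
    for m in re.finditer("NVN", codes):
        (subj, _), (verb, _), (obj, _) = chunks[m.start():m.end()]
        if verb.lower() in DESTROY_VERBS:
            events.append({"BOMB": subj, "DAMAGE": obj})
    return events

tokens = "A car bomb wrecked the police station".split()
events = domain_patterns(chunk(tag(tokens)))
```

Because each level only sees the compact output of the level below, the domain pattern is written over three chunk labels rather than seven tokens.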
Language Computer's CICERO Information Extraction System
Input: Documents
Stand-alone named-entity recognizer:
• known word recognition: recognizes known concepts using lexicons and gazetteers.
• numerical-entity recognition: identifies numerical entities such as money, percents, dates and times (FSA).
• named-entity recognition: identifies named entities such as person, location, and organization names (FSA).
• name aliasing: disambiguates incomplete or ambiguous names.
Remaining components:
• phrasal parser: identifies basic noun, verb, and particle phrases (TBL + FSA).
• phrase combiner: identifies domain-dependent complex noun and verb phrases (FSA).
• entity coreference resolution: detects pronominal and nominal coreference links.
• domain pattern recognition: identifies domain-dependent patterns (FSA).
• event coreference: resolves empty templette slots.
• event merging: merges templettes belonging to the same event.
Output: Templettes/Templates
Walk-Through Example (1/5)
Text: At least seven police officers were killed and as many as 52 other people, including several children, were injured Monday in a car bombing that also wrecked a police station. Kirkuk's police said they had "good information" that Ansar al-Islam was behind the blast.
Target templette:
<BOMBING> :=
    BOMB: “a car bombing”
    PERPETRATOR: “Ansar al-Islam”
    DEAD: “At least seven police officers”
    INJURED: “as many as 52 other people, including several children”
    DAMAGE: “a police station”
    LOCATION: “Kirkuk”
    DATE: “Monday”
Walk-Through Example (2/5)
Lexicon + numerical entities + NER:
At least seven/NUMBER police officers were killed and as many as 52/NUMBER other people, including several children, were injured Monday/DATE in a car bombing that also wrecked a police station. Kirkuk/LOC's police said they had "good information" that Ansar al-Islam/ORG was behind the blast.
Phrasal parser:
At least seven police [officers]/NP were [killed]/VP and as many as 52 other [people]/NP, [including]/VP several [children]/NP, were [injured]/VP [Monday]/NP in a car [bombing]/NP that also [wrecked]/VP a police [station]/NP. [Kirkuk]/NP's [police]/NP [said]/VP [they]/NP [had]/VP "good [information]"/NP that [Ansar al-Islam]/NP [was]/VP behind the [blast]/NP.
Walk-Through Example (3/5)
Complex phrase detection:
At least seven police [officers]/NP were [killed]/VP and as many as 52 other [people, including several children]/NP, were [injured]/VP [Monday]/NP in a car [bombing]/NP that also [wrecked]/VP a police [station]/NP. [Kirkuk]/NP's [police]/NP [said]/VP [they]/NP [had]/VP "good [information]"/NP that [Ansar al-Islam]/NP [was]/VP behind the [blast]/NP.
Entity coreference resolution:
    they -> The police
    the blast -> a car bombing
Walk-Through Example (4/5)
Domain pattern recognition:
At least seven police officers were killed/PATTERN and as many as 52 other people, including several children, were injured Monday in a car bombing/PATTERN{car bombing} that also wrecked a police station/PATTERN. Kirkuk's police said they had "good information" that Ansar al-Islam was behind the blast/PATTERN.
Resulting templettes:
TEMPLETTE
    BOMB: “a car bombing”
    INJURED: “as many as 52 other people, including several children”
    DATE: “Monday”
TEMPLETTE
    BOMB: “a car bombing”
    DAMAGE: “a police station”
TEMPLETTE
    BOMB: “a car bombing”
    PERPETRATOR: “Ansar al-Islam”
TEMPLETTE
    DEAD: “At least seven police officers”
Walk-Through Example (5/5)
Templettes produced by domain pattern recognition:
TEMPLETTE
    BOMB: “a car bombing”
    INJURED: “as many as 52 other people, including several children”
    DATE: “Monday”
TEMPLETTE
    BOMB: “a car bombing”
    DAMAGE: “a police station”
TEMPLETTE
    BOMB: “a car bombing”
    PERPETRATOR: “Ansar al-Islam”
TEMPLETTE
    DEAD: “At least seven police officers”
After event coreference (empty slots resolved):
TEMPLETTE
    BOMB: “a car bombing”
    DEAD: “At least seven police officers”
    DATE: “Monday”
    LOCATION: “Kirkuk”
TEMPLETTE
    BOMB: “a car bombing”
    INJURED: “as many as 52 other people, including several children”
    DATE: “Monday”
    LOCATION: “Kirkuk”
TEMPLETTE
    BOMB: “a car bombing”
    DAMAGE: “a police station”
    DATE: “Monday”
    LOCATION: “Kirkuk”
TEMPLETTE
    BOMB: “a car bombing”
    PERPETRATOR: “Ansar al-Islam”
    DATE: “Monday”
    LOCATION: “Kirkuk”
After event merging:
TEMPLETTE
    BOMB: “a car bombing”
    PERPETRATOR: “Ansar al-Islam”
    DEAD: “At least seven police officers”
    INJURED: “as many as 52 other people, including several children”
    DAMAGE: “a police station”
    DATE: “Monday”
    LOCATION: “Kirkuk”
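The merging step can be illustrated with a greedy slot-unification sketch. This is an assumption about how such a merge could work, not CICERO's actual algorithm: two templettes are treated as compatible when every slot they share has the same fill, and compatible templettes are folded together. The function names are hypothetical.

```python
def compatible(a, b):
    """Two templettes can merge if all slots they share have identical fills."""
    return all(a[slot] == b[slot] for slot in a.keys() & b.keys())

def merge_templettes(templettes):
    """Greedily fold each templette into the first compatible merged templette."""
    merged = []
    for t in templettes:
        for m in merged:
            if compatible(m, t):
                m.update(t)
                break
        else:
            merged.append(dict(t))
    return merged

shared = {"BOMB": "a car bombing", "DATE": "Monday", "LOCATION": "Kirkuk"}
after_event_coreference = [
    {**shared, "DEAD": "At least seven police officers"},
    {**shared, "INJURED": "as many as 52 other people, including several children"},
    {**shared, "DAMAGE": "a police station"},
    {**shared, "PERPETRATOR": "Ansar al-Islam"},
]
result = merge_templettes(after_event_coreference)
```

On the four templettes of the walk-through, the sketch yields a single templette with all seven slots filled. A greedy merge of this kind depends on input order and can over-merge distinct events that happen to share fills, which hints at why event merging is hard in general.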
Coreference for IE • Algorithm detailed in: Recognizing Referential Links: An Information Extraction Perspective. Megumi Kameyama. http://citeseer.nj.nec.com/kameyama97recognizing.html • Three-step algorithm: • Identify all anaphoric entities, e.g. pronouns, nouns, ambiguous named entities. • For each anaphoric entity, identify all possible candidates and sort them according to some salience ordering, e.g. left-to-right traversal in the same sentence, right-to-left traversal in previous sentences. • Extract the first candidate that matches some semantic constraints, e.g. number and gender consistency. Merge the candidate with the anaphoric entity.
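The three steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not Kameyama's implementation: the `Mention` record, its fields, and the hand-assigned number and gender values are hypothetical, and the only constraints checked are number and gender agreement.

```python
from dataclasses import dataclass

@dataclass
class Mention:
    text: str
    sentence: int            # sentence index in the document
    position: int            # token offset inside the sentence
    number: str              # "sg" or "pl"
    gender: str              # "m", "f", "n", or "any"
    anaphoric: bool = False  # step 1: marked as anaphoric or not

def candidates(anaphor, mentions):
    """Step 2: same sentence left-to-right, then previous sentences right-to-left."""
    same = [m for m in mentions
            if m.sentence == anaphor.sentence and m.position < anaphor.position]
    same.sort(key=lambda m: m.position)
    prev = [m for m in mentions if m.sentence < anaphor.sentence]
    prev.sort(key=lambda m: (-m.sentence, -m.position))  # nearest sentence first
    return same + prev

def agrees(a, b):
    gender_ok = a.gender == b.gender or "any" in (a.gender, b.gender)
    return a.number == b.number and gender_ok

def resolve(mentions):
    """Step 3: link each anaphor to the first candidate that passes the constraints."""
    links = {}
    for anaphor in (m for m in mentions if m.anaphoric):
        for cand in candidates(anaphor, mentions):
            if agrees(anaphor, cand):
                links[anaphor.text] = cand.text
                break
    return links

mentions = [
    Mention("a car bombing", 0, 20, "sg", "n"),
    Mention("Kirkuk's police", 1, 0, "pl", "any"),
    Mention("they", 1, 3, "pl", "any", anaphoric=True),
    Mention("the blast", 1, 12, "sg", "n", anaphoric=True),
]
links = resolve(mentions)
```

The mention list is intentionally small. With more candidates in play (for example "good information", which is also singular and neuter), number and gender alone would pick the wrong antecedent for "the blast", so real systems need richer semantic constraints.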
The Role of Coreference in Named Entity Recognition • Classifies unknown named entities that are likely part of a name but cannot be identified as such due to insufficient local context. • Example: “Michigan National Corp./ORG said it will eliminate some senior management jobs… Michigan National/? said the restructuring…” • Disambiguates named entities of ambiguous length and/or ambiguous type. • “Michigan” is changed from LOC to ORG when “Michigan Corp.” appears in the same context. • The text “McDonald's” may contain a person name “McDonald” or an organization name “McDonald's”. A non-deterministic FSA is used to maintain both alternatives until after name aliasing, when one is selected. • Disambiguates headline named entities. • Headlines are typically capitalized, e.g. “McDermott Completes Sale”. • Processing of headlines is postponed until after the body of text is processed. • A “longest-match” approach is used to match the headline sequence of tokens against entities found in the first body paragraph. For example, “McDermott” is labeled as ORG because it matches “McDermott International Inc.” in the first document paragraph. • Over 5 points increase in accuracy (F-measure): from 87.81% to 93.64%.
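The headline step can be sketched as a longest-match lookup. The slide does not spell out the matching criterion, so this sketch assumes that a headline span matches when it is a case-insensitive prefix of an entity found in the first body paragraph; the function name and input format are hypothetical.

```python
def label_headline(headline_tokens, body_entities):
    """Label headline spans using entities from the first body paragraph.

    body_entities: list of (entity_text, label) pairs.
    Returns a list of (headline_span, label) pairs; unmatched tokens are skipped.
    """
    labels, i = [], 0
    while i < len(headline_tokens):
        match = None
        # Try the longest span starting at position i first.
        for length in range(len(headline_tokens) - i, 0, -1):
            span = [t.lower() for t in headline_tokens[i:i + length]]
            for text, label in body_entities:
                if [t.lower() for t in text.split()][:length] == span:
                    match = (length, label)
                    break
            if match:
                break
        if match:
            length, label = match
            labels.append((" ".join(headline_tokens[i:i + length]), label))
            i += length
        else:
            i += 1
    return labels

headline = "McDermott Completes Sale".split()
body = [("McDermott International Inc.", "ORG")]
headline_labels = label_headline(headline, body)
```

Postponing headline processing matters here: the label for “McDermott” only exists once the body has been processed.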
The Good • Relatively good performance with a simple system • F-measures over 75% up to 88% for some simpler Event99 domains • Execution times below 10 seconds per 5KB document • Improvements to the FSA-only approach • Coreference almost doubles the FSA-only performance • More extraction rules add little to the IE performance whereas different forms of coreference add more • Non-determinism used to mitigate the limited power of FSA grammars
The Bad • Needs domain-specific lexicons, e.g. an ontology of bombing devices. Work to automate this process: Learning Dictionaries for Information Extraction by Multi-Level Bootstrapping. Ellen Riloff and Rosie Jones. http://www.cs.utah.edu/~riloff/psfiles/aaai99.pdf (not covered in this presentation) • Domain-specific patterns must be developed, e.g. “<SUBJECT> explode”. • Patterns must be classified: What does the above pattern mean? Is the subject a bomb, a perpetrator, a location? • Patterns cannot cover the flexibility of natural language. Need better models that go beyond the pattern limitations. • Event merging is another NP-complete problem. One of the few stochastic models for event merging: Probabilistic Coreference in Information Extraction. Andrew Kehler. http://ling.ucsd.edu/~kehler/Papers/emnlp97.ps.gz (not covered in this presentation) • All of the above resources are developed manually, which yields high domain development time (more than 40 person hours per domain). This prohibits the use of this approach for “real-time” information extraction.
Overview • What is information extraction? • A “traditional” system and its problems • Pattern learning and classification • Beyond patterns
Automatically Generating Extraction Patterns from Untagged Text • The first system to successfully discover domain patterns was AutoSlog-TS. • Automatically Generating Extraction Patterns from Untagged Text. Ellen Riloff. http://www.cs.utah.edu/~riloff/psfiles/aaai96.pdf • The intuition is that domain-specific patterns will appear more often in documents related to the domain of interest than in unrelated documents.
Weakly-Supervised Pattern Learning Algorithm (1/2) • Separate the training document set into relevant and irrelevant documents (manual process). • Generate all possible patterns in all documents, according to some meta-patterns. Examples below.
Weakly-Supervised Pattern Learning Algorithm (2/2) • Step 3: Rank all generated patterns according to the formula relevance_rate × log2(frequency), where relevance_rate is the fraction of the pattern's instances that occur in relevant documents (as opposed to non-relevant documents), and frequency is the number of times the pattern was seen in relevant documents. • Step 4: Add the top-ranked pattern to the list of learned patterns, and mark all documents where the pattern appears as relevant. • Step 5: Repeat the process from Step 3 for N iterations. Hence the output of the algorithm is N learned patterns.
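The ranking and iteration steps can be sketched as below. This is a minimal reading of the slide, not AutoSlog-TS itself: it assumes relevance_rate is the count in relevant documents divided by the total count, patterns are plain strings, and the function names and input format are hypothetical.

```python
import math
from collections import Counter

def pattern_score(rel_freq, total_freq):
    """relevance_rate * log2(frequency), with frequency counted in relevant docs."""
    return (rel_freq / total_freq) * math.log2(rel_freq)

def learn_patterns(doc_patterns, relevant, n_iterations):
    """doc_patterns: dict doc_id -> list of pattern instances found in that document.
    relevant: doc_ids initially judged relevant (the manual step)."""
    relevant = set(relevant)
    learned = []
    for _ in range(n_iterations):
        rel, total = Counter(), Counter()
        for doc, patterns in doc_patterns.items():
            for p in patterns:
                total[p] += 1
                if doc in relevant:
                    rel[p] += 1
        ranked = [(pattern_score(rel[p], total[p]), p)
                  for p in total if p not in learned and rel[p] > 0]
        if not ranked:
            break
        best = max(ranked)[1]
        learned.append(best)
        # Every document containing the new pattern is now treated as relevant.
        relevant |= {d for d, ps in doc_patterns.items() if best in ps}
    return learned

docs = {
    "d1": ["<subj> exploded", "<subj> exploded", "<subj> said"],
    "d2": ["<subj> exploded", "<subj> said"],
    "d3": ["<subj> said", "<subj> said", "bombing in <np>"],
    "d4": ["<subj> said"],
}
learned = learn_patterns(docs, relevant={"d1", "d2"}, n_iterations=2)
```

In this toy run the second pattern learned is the generic “<subj> said”, which then marks every document as relevant. That drift is exactly the degrading precision that makes the choice of N (the stopping point) important.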
The Good and the Bad • The good • Performance very close to the manually-customized system • The bad • Documents must be separated into relevant/irrelevant by hand • When does the learning process stop? • Pattern classification and event merging still developed by human experts
The ExDisco IE System • Automatic Acquisition of Domain Knowledge for Information Extraction. Roman Yangarber et al. http://www.cs.nyu.edu/roman/Papers/2000-coling-pub.ps.gz • Quasi-automatically separates documents into relevant/non-relevant using a set of “seed” patterns selected by the user, e.g. <company> appoint-verb <person> for the MUC-6 “management succession” domain. • In addition to ranking patterns, ExDisco ranks documents based on how many relevant patterns they contain, which has an immediate application to text filtering.
Counter-Training for Pattern Discovery • Counter-Training in Discovery of Semantic Patterns. Roman Yangarber. http://www.cs.nyu.edu/roman/Papers/2003-acl-countertrain-web.pdf • Previous approaches are iterative learning algorithms, where the output is a continuous stream of patterns with degrading precision. What is the best stopping point? • The approach is to introduce competition among multiple scenario learners (e.g. management succession, mergers and acquisitions, legal actions). Stop when the learners wander in the territories already discovered by others. • Pattern frequency weighted by the document relevance. • Document relevance receives negative weight based on how many patterns from a different scenario it contains. • The learning for each scenario stops when the best pattern has a negative score.
Pattern Classification • Multiple systems perform successful pattern acquisition by now, e.g. “attacked <np>” is discovered for the bombing domain. But what does the <np> actually mean? Is it the victim, the physical target, or something else? • An Empirical Approach to Conceptual Case Frame Acquisition. Ellen Riloff and Mark Schmelzenbach. http://www.cs.utah.edu/~riloff/psfiles/wvlc98.pdf
Pattern Classification Algorithm • Requires 5 seed words per semantic category (e.g. PERPETRATOR, VICTIM etc) • Builds a context for each semantic category by expanding the seed word set with words that appear frequently in the proximity of previous seed words. • Uses AutoSlog to discover domain patterns. • Builds a semantic profile for each discovered pattern based on the overlap between the noun phrases contained in the pattern and the previous semantic contexts. • Each pattern is associated with the best ranked semantic category.
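The last two steps can be sketched as an overlap count. This is only an illustration of the idea, not the algorithm from the cited paper: the scoring (fraction of extracted head nouns found in each category context), the function names, and all the words in the example are hypothetical.

```python
def semantic_profile(extracted_heads, category_contexts):
    """Fraction of a pattern's extracted noun-phrase heads that fall in each
    category's context word set."""
    total = len(extracted_heads)
    if total == 0:
        return {cat: 0.0 for cat in category_contexts}
    return {cat: sum(h in words for h in extracted_heads) / total
            for cat, words in category_contexts.items()}

def classify_pattern(extracted_heads, category_contexts):
    """Associate the pattern with its best-ranked semantic category."""
    profile = semantic_profile(extracted_heads, category_contexts)
    return max(profile, key=profile.get)

# Hypothetical contexts grown from seed words, and hypothetical extractions
# of the pattern "attack on <np>".
contexts = {
    "PHYSICAL_TARGET": {"embassy", "village", "building"},
    "VICTIM": {"mayor", "civilians"},
    "PERPETRATOR": {"guerrillas"},
}
heads = ["embassy", "village", "embassy", "mayor"]
profile = semantic_profile(heads, contexts)
category = classify_pattern(heads, contexts)
```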
Pattern Classification Example Semantic profile for the pattern: attack on <np>
Other Pattern-Learning Systems: RAPIER (1/2) • Relational Learning of Pattern-Match Rules for Information Extraction. Mary Elaine Califf and Raymond J. Mooney. http://citeseer.nj.nec.com/califf98relational.html • Uses Inductive Logic Programming (ILP) to implement a bottom-up generalization of patterns. • Patterns specified with pre-fillers (conditions on the tokens preceding the pattern), fillers (conditions on the tokens included in the pattern), and post-fillers (conditions on the tokens following the pattern) • The only linguistic resource used is a part-of-speech (POS) tagger. No parser (full or partial) used! • More robust to unstructured text. • Applicability limited to simpler domains (e.g. job postings)
Other Pattern-Learning Systems: RAPIER (2/2) • Example phrases: “located in Atlanta, Georgia”; “offices in Kansas City, Missouri”.
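A pre-filler / filler / post-filler pattern covering both example phrases can be sketched as follows. The sketch shows matching only, not the ILP-based learning, and the specific pattern (the word “in”, then at most two NNP tokens, then a comma and an NNP) is one plausible generalization assumed for illustration. The `Constraint` class and `extract` function are hypothetical names.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

Token = Tuple[str, str]  # (word, part-of-speech tag)

@dataclass
class Constraint:
    word: Optional[str] = None   # required lowercase word, if any
    pos: Optional[str] = None    # required POS tag, if any

    def matches(self, token: Token) -> bool:
        w, p = token
        return ((self.word is None or self.word == w.lower())
                and (self.pos is None or self.pos == p))

def extract(tokens: List[Token], pre: List[Constraint], filler: Constraint,
            max_len: int, post: List[Constraint]) -> List[str]:
    """Return fillers of 1..max_len tokens that satisfy `filler`, are preceded
    by tokens matching `pre`, and are followed by tokens matching `post`."""
    results = []
    for start in range(len(pre), len(tokens)):
        before = tokens[start - len(pre):start]
        if not all(c.matches(t) for c, t in zip(pre, before)):
            continue
        for length in range(1, max_len + 1):
            end = start + length
            if end + len(post) > len(tokens):
                break
            if not all(filler.matches(t) for t in tokens[start:end]):
                break
            after = tokens[end:end + len(post)]
            if all(c.matches(t) for c, t in zip(post, after)):
                results.append(" ".join(w for w, _ in tokens[start:end]))
    return results

pre = [Constraint(word="in")]
filler = Constraint(pos="NNP")
post = [Constraint(word=","), Constraint(pos="NNP")]

atlanta = [("located", "VBN"), ("in", "IN"), ("Atlanta", "NNP"),
           (",", ","), ("Georgia", "NNP")]
kansas = [("offices", "NNS"), ("in", "IN"), ("Kansas", "NNP"),
          ("City", "NNP"), (",", ","), ("Missouri", "NNP")]
cities = extract(atlanta, pre, filler, 2, post) + extract(kansas, pre, filler, 2, post)
```

Only words and POS tags are consulted, which matches the slide's point that no parser is needed.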
Other Pattern-Learning Systems • SRV • Toward General-Purpose Learning for Information Extraction. Dayne Freitag. http://citeseer.nj.nec.com/freitag98toward.html • Supervised machine learning based on FOIL. Constructs Horn clauses from examples. • Active learning • Active Learning for Information Extraction with Multiple View Feature Sets. Rosie Jones et al. http://www.cs.utah.edu/~riloff/psfiles/ecml-wkshp03.pdf • Active learning with multiple views. Ion Muslea. http://www.ai.sri.com/~muslea/PS/dissertation-02.pdf • Interactively learn and annotate data to reduce human effort in data annotation.
Overview • What is information extraction? • A “traditional” system and its problems • Pattern learning and classification • Beyond patterns
The Need to Move Beyond the Pattern-Based Paradigm (1/2) • Sentence: The space shuttle Challenger/AGENT_OF_DEATH flew apart over Florida like a billion-dollar confetti killing/MANNER_OF_DEATH six astronauts/DECEASED. • Extracting AGENT_OF_DEATH, MANNER_OF_DEATH, and DECEASED is hard using surface-level information, and easier using full parse trees. • [Figure: full parse tree of the sentence, with node labels S, NP, VP, PP, ADVP, LOC.]
The Need to Move Beyond the Pattern-Based Paradigm (2/2) • Pattern-based systems • Have limited power due to the strict formalism: accuracy < 60% without additional discourse processing. • Were also developed because of historical circumstances: no high-performance full parser was widely available at the time. • Recent NLP developments: • Full syntactic parsing at 90% [Collins, 1997], [Charniak, 2000]. • Predicate-argument frames provide open-domain event representation [Surdeanu et al, 2003], [Gildea and Jurafsky, 2002], [Gildea and Palmer, 2002].
Goal • Novel IE paradigm: • Syntactic representation provided by full parser. • Event representation based on predicate-argument frames. • Entity coreference provides pronominal and nominal anaphora resolution (future work). • Event merging merges similar/overlapping events (future work). • Advantages: • High accuracy due to enhanced syntactic and semantic processing. • Minimal domain customization time because most components are open-domain.
Proposition Bank Overview • A one million word corpus annotated with predicate-argument structures [Kingsbury, 2002]. Currently only predicates lexicalized by verbs. • Numbered arguments from 0 to 5. Typically ARG0 = agent, ARG1 = direct object or theme, ARG2 = indirect object, benefactive, or instrument, but they are predicate dependent! • Functional tags: ARGM-LOC = locative, ARGM-TMP = temporal, ARGM-DIR = direction. • Example: “The futures halt was assailed by Big Board floor traders”: ARG1 (entity assailed) = “The futures halt”, PRED = “assailed”, ARG0 (agent) = “Big Board floor traders”. • [Figure: parse tree of the example sentence, with node labels S, NP, VP, VP, PP, NP.]
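Block Architecture • Input: Documents • Components: named-entity recognizer, syntactic parser, identification of pred-arg structures, entity coreference, mapping pred-arg structures to templettes, event merging • Output: Templettes • [Figure: block diagram; the components are grouped under the labels “open-domain” and “domain-specific”.]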
Walk-Through Example • Sentence: The space shuttle Challenger flew apart over Florida like a billion-dollar confetti killing six astronauts. • ARG0 = “The space shuttle Challenger” -> AGENT_OF_DEATH • PRED = “killing” -> MANNER_OF_DEATH • ARG1 = “six astronauts” -> DECEASED • [Figure: full parse tree of the sentence, with node labels S, NP, VP, PP, ADVP, LOC.]
The Model • Consists of two tasks: (1) identifying parse tree constituents corresponding to predicate arguments, and (2) assigning a role to each argument constituent. • Both tasks are modeled using C5.0 decision tree learning and two sets of features: Feature Set 1, adapted from [Gildea and Jurafsky, 2002], and Feature Set 2, a novel set of semantic and syntactic features. • [Figure: parse tree of “The futures halt was assailed by Big Board floor traders”; Task 1 identifies the argument constituents, Task 2 labels them ARG1, PRED, ARG0.]
Feature Set 1 • Example sentence: “The futures halt was assailed by Big Board floor traders”, with ARG1 = “The futures halt”, PRED = “assailed”, ARG0 = “Big Board floor traders”. • PHRASE TYPE (pt): type of the syntactic phrase as argument. E.g. NP for ARG1. • PARSE TREE PATH (path): path between argument and predicate. E.g. NP S VP VP for ARG1. • PATH LENGTH (pathLen): number of labels stored in the predicate-argument path. E.g. 4 for ARG1. • POSITION (pos): indicates if the constituent appears before the predicate in the sentence. E.g. true for ARG1 and false for ARG0. • VOICE (voice): predicate voice (active or passive). E.g. passive for PRED. • HEAD WORD (hw): head word of the evaluated phrase. E.g. “halt” for ARG1. • GOVERNING CATEGORY (gov): indicates if an NP is dominated by an S phrase or a VP phrase. E.g. S for ARG1, VP for ARG0. • PREDICATE WORD: the verb with morphological information preserved (verb), and the verb normalized to lower case and infinitive form (lemma). E.g. for PRED verb is “assailed”, lemma is “assail”.
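Two of the structural features, path and gov, can be computed from a parse tree as sketched below. The `Node` class and function names are hypothetical, only phrase-level nodes are modeled (no words or POS tags), and the predicate is represented by the VP that directly contains the verb, which is the reading that reproduces the “NP S VP VP” path given for ARG1.

```python
class Node:
    """A phrase node in a parse tree; children get their parent pointer set."""
    def __init__(self, label, children=()):
        self.label = label
        self.children = list(children)
        self.parent = None
        for child in self.children:
            child.parent = self

def ancestors(node):
    """The node itself followed by its ancestors up to the root."""
    chain = [node]
    while node.parent is not None:
        node = node.parent
        chain.append(node)
    return chain

def tree_path(arg, pred):
    """Labels from the argument up to the lowest common ancestor, then down to pred."""
    up, down = ancestors(arg), ancestors(pred)
    common = next(n for n in up if n in down)
    up_part = up[:up.index(common) + 1]
    down_part = down[:down.index(common)][::-1]
    return [n.label for n in up_part + down_part]

def governing_category(np):
    """Label of the closest S or VP ancestor of an NP."""
    node = np.parent
    while node is not None and node.label not in ("S", "VP"):
        node = node.parent
    return node.label if node is not None else None

# (S (NP The futures halt) (VP was (VP assailed (PP by (NP Big Board floor traders)))))
arg1 = Node("NP")
arg0 = Node("NP")
pred_vp = Node("VP", [Node("PP", [arg0])])
root = Node("S", [arg1, Node("VP", [pred_vp])])

path_arg1 = tree_path(arg1, pred_vp)
path_arg0 = tree_path(arg0, pred_vp)
```

Here pathLen is simply `len(path_arg1)`. The published feature also records the direction of each step (up or down); this sketch keeps only the labels, as the slide does.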
Observations about Feature Set 1 • Because most of the argument constituents are prepositional attachments (PP) and relative clauses (SBAR), often the head word (hw) is not the most informative word in the phrase. Example constituents: (PP in (NP last June)), (SBAR that (S (VP occurred (NP yesterday)))), (VP to (VP be (VP declared))). • Due to its strong lexicalization, the model suffers from data sparsity. E.g. hw used < 3%. The problem can be addressed with a back-off model from words to part of speech tags. • The features in set 1 capture only syntactic information, even though semantic information like named-entity tags should help. For example, ARGM-TMP typically contains DATE entities, and ARGM-LOC includes LOCATION named entities. • Feature set 1 does not capture predicates lexicalized by phrasal verbs, e.g. “put up”.
Feature Set 2 (1/2) • CONTENT WORD (cw): lexicalized feature that selects an informative word from the constituent, other than the head. Selection heuristics available in the paper. E.g. “June” for the phrase “in last June”. • PART OF SPEECH OF CONTENT WORD (cPos): part of speech tag of the content word. E.g. NNP for the phrase “in last June”. • PART OF SPEECH OF HEAD WORD (hPos): part of speech tag of the head word. E.g. NN for the phrase “the futures halt”. • NAMED ENTITY CLASS OF CONTENT WORD (cNE): The class of the named entity that includes the content word. 7 named entity classes (from the MUC-7 specification) covered. E.g. DATE for “in last June”.
Feature Set 2 (2/2) • BOOLEAN NAMED ENTITY FLAGS: set of features that indicate if a named entity is included at any position in the phrase: • neOrganization: set to true if an organization name is recognized in the phrase. • neLocation: set to true if a location name is recognized in the phrase. • nePerson: set to true if a person name is recognized in the phrase. • neMoney: set to true if a currency expression is recognized in the phrase. • nePercent: set to true if a percentage expression is recognized in the phrase. • neTime: set to true if a time of day expression is recognized in the phrase. • neDate: set to true if a date temporal expression is recognized in the phrase. • PHRASAL VERB COLLOCATIONS: set of two features that capture information about phrasal verbs: • pvcSum: the frequency with which a verb is immediately followed by any preposition or particle. • pvcMax: the frequency with which a verb is followed by its predominant preposition or particle.
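The two phrasal verb collocation features can be estimated from a POS-tagged corpus as sketched below. Several choices here are assumptions, since the slide does not define them: “frequency” is taken to be relative to the verb's total occurrences, prepositions and particles are identified by the Penn Treebank tags IN and RP, and verbs are keyed by lowercased word form rather than lemma. The function name is hypothetical.

```python
from collections import Counter, defaultdict

PARTICLE_TAGS = {"IN", "RP"}  # assumption: Penn Treebank preposition and particle tags

def phrasal_verb_features(tagged_sentences):
    """Return {verb: {"pvcSum": ..., "pvcMax": ...}} from (word, POS) sentences."""
    verb_count = Counter()
    followers = defaultdict(Counter)
    for sentence in tagged_sentences:
        for i, (word, pos) in enumerate(sentence):
            if not pos.startswith("VB"):
                continue
            verb = word.lower()
            verb_count[verb] += 1
            if i + 1 < len(sentence) and sentence[i + 1][1] in PARTICLE_TAGS:
                followers[verb][sentence[i + 1][0].lower()] += 1
    features = {}
    for verb, n in verb_count.items():
        counts = followers[verb]
        features[verb] = {
            "pvcSum": sum(counts.values()) / n,          # followed by any particle
            "pvcMax": max(counts.values(), default=0) / n,  # followed by the top particle
        }
    return features

corpus = [
    [("put", "VB"), ("up", "RP"), ("posters", "NNS")],
    [("put", "VB"), ("up", "RP"), ("shelves", "NNS")],
    [("put", "VB"), ("on", "IN"), ("hold", "NN")],
    [("put", "VB"), ("it", "PRP"), ("there", "RB")],
    [("declared", "VBD"), ("victory", "NN")],
]
pvc = phrasal_verb_features(corpus)
```

A verb with a high pvcMax, such as “put” with “up” in this toy corpus, is a likely phrasal verb, which addresses the gap noted for Feature Set 1.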
Experiments (1/3) • Trained on PropBank release 2002/7/15, Treebank release 2, both without Section 23. Named entity information extracted using CiceroLite. • Tested on PropBank and Treebank section 23. Used gold-standard trees from Treebank, and named entities from CiceroLite. • Task 1 (identifying argument constituents): • Negative examples: any Treebank phrases not tagged in PropBank. Due to memory limitations, we used ~11% of Treebank. • Positive examples: Treebank phrases (from the same 11% set) annotated with any PropBank role. • Task 2 (assigning roles to argument constituents): • Due to memory limitations we limited the example set to the first 60% of PropBank annotations.