Relation Extraction Pierre Bourreau LSI-UPC PLN-PTM
Plan • Relation Extraction description • Sampling templates • Reducing deep analysis errors… • Conclusion
Relation Extraction Description • Finding relations between entities in a text • Filling pre-defined template slots (sketched below) • One-value-per-field • Multi-value • Depends on the analysis: • Chunking • Tokenization • Sentence parsing…
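As a concrete picture of what "filling pre-defined template slots" means, here is a minimal Python sketch; the field names are hypothetical and only loosely modelled on the workshop-announcement examples that follow.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class WorkshopTemplate:
    # one-value-per-field slots
    workshop_name: Optional[str] = None
    workshop_acronym: Optional[str] = None
    submission_date: Optional[str] = None
    # multi-value slot
    speakers: List[str] = field(default_factory=list)

t = WorkshopTemplate()
t.workshop_acronym = "CoNLL"                 # filled from a mention in the text
t.speakers.append("Dr. Jeffrey D. Hermes")   # multi-value slot can take several fillers
```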
Plan • Relation Extraction description • Sampling templates (Cox, Nicolson, Finkel, Manning) • Reducing deep analysis errors… • Conclusion
First Example: Sampling Templates • Example: workshop announcements • PASCAL corpus • Relations to extract: • Dates of events • Workshop/conference names, acronyms and URLs • Domain knowledge: • Constraints on dates • Constraints on names
PASCAL Corpus: semi-structured corpus • <0.26.4.95.11.09.31.hf08+@andrew.cmu.edu.0> • Type: cmu.andrew.academic.bio • Topic: "MHC Class II: A Target for Specific Immunomodulation of the • Immune Response" • Dates: 3-May-95 • Time: <stime>3:30 PM</stime> • Place: <location>Mellon Institute Conference Room</location> • PostedBy: Helena R. Frey on 26-Apr-95 at 11:09 from andrew.cmu.edu • Abstract: • Seminar: Departments of Biological Sciences • Carnegie Mellon and University of Pittsburgh • Name: <speaker>Dr. Jeffrey D. Hermes</speaker> • Affiliation: Department of Autoimmune Diseases Research & Biophysical Chemistry • Merck Research Laboratories • Title: "MHC Class II: A Target for Specific Immunomodulation of the • Immune Response" • Host/e-mail: Robert Murphy, murphy@a.cfr.cmu.edu • Date: Wednesday, May 3, 1995 • Time: <stime>3:30 p.m.</stime> • Place: <location>Mellon Institute Conference Room</location> • Sponsor: MERCK RESEARCH LABORATORIES • Schedule for 1995 follows: (as of 4/26/95) • Biological Sciences Seminars 1994-1995 • Date Speaker Host • April 26 Helen Salz Javier L~pez • May 3 Jefferey Hermes Bob Murphy • MERCK RESEARCH LABORATORIES
PASCAL Corpus: semi-structured corpus • <1.21.10.93.17.00.39.rf1u+@andrew.cmu.edu.0> • Type: cmu.andrew.org.heinz.great-lake • Topic: Re: PresentationCC: • Dates: 25-Oct-93 • Time: <stime>12:30</stime> • PostedBy: Richard Florida on 21-Oct-93 at 17:00 from andrew.cmu.edu • Abstract: • Folks: • <paragraph> <sentence>Our client has requested that the presentation be postponed until Monday • during regular class-time</sentence>. <sentence>He has been asked to make a presentaion for • the Governor of Michigan and Premier of Ontario tommorrow morning in • Canada, and was afraid he could not catch a plane in time to make our • presentation</sentence>. <sentence>After consulting with Rafael and a sub group of project • managers, it was decided that Monday was the best feasible presentation • alternative</sentence>. <sentence>Greg has been able to secure Room 2503 in Hamburg Hall for • our presentation Monday during regular class-time</sentence>. </paragraph> • <paragraph><sentence>We will meet tommmorow in <location>2110</location> at <stime>12:30</stime> (lunch provided) to finalize • presentation and briefing book</sentence>. <sentence>Also, the client has faxed a list of • reactions and questions for discussion which we should review</sentence>. • <sentence>Thanks very much for your hard work and understanding</sentence>. <sentence>Look forward to • seeing you tommorrow</sentence>.</paragraph> • Richard
Idea • Sampling templates: • Generate all candidate templates • Assign a probability to each of them (see the sketch below) • Relational model: • Constraints on dates: ordering • 1. submission dates • 2. acceptance dates • 3. workshop dates / camera-ready dates • Constraints on names: • Slots: name, acronym, URL • URLs are generated from acronyms
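A rough sketch of the "generate all candidate templates" step, under the assumption that each slot either takes one of its extracted candidate fillers or stays empty; this is an illustration, not the authors' implementation.

```python
from itertools import product

# Toy candidates per slot; None means the slot is left empty.
slot_candidates = {
    "SUB_DATE":  ["May 1", "May 15", None],
    "ACC_DATE":  ["June 1", None],
    "WORK_DATE": ["July 10", None],
}

def enumerate_templates(slot_candidates):
    """Yield every combination of slot fillers as one candidate template."""
    slots = list(slot_candidates)
    for combo in product(*(slot_candidates[s] for s in slots)):
        yield dict(zip(slots, combo))

templates = list(enumerate_templates(slot_candidates))
print(len(templates))   # 3 * 2 * 2 = 12 candidate templates to score
```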
Baselines • CRF • Cliques: maximum size 2 • Decoded with the Viterbi algorithm (sketched below) • Tokens => GATE tokenization • CMM • Same setup • Window of the four previous tokens
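For reference, a compact and generic Viterbi decoder for a first-order model (clique size 2), assuming per-token emission scores and label-transition scores in log space are already available; the actual feature sets of the CRF/CMM baselines are not reproduced here.

```python
import numpy as np

def viterbi(emissions, transitions):
    """emissions: (n_tokens, n_labels); transitions: (n_labels, n_labels); log scores."""
    n, k = emissions.shape
    score = emissions[0].copy()              # best score for each label at position 0
    back = np.zeros((n, k), dtype=int)       # backpointers
    for t in range(1, n):
        cand = score[:, None] + transitions + emissions[t]   # (prev_label, cur_label)
        back[t] = cand.argmax(axis=0)
        score = cand.max(axis=0)
    labels = [int(score.argmax())]
    for t in range(n - 1, 0, -1):            # follow backpointers
        labels.append(int(back[t][labels[-1]]))
    return labels[::-1]
```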
Templates sampling • Token models p(Li | Li-1) or p(Li | Li-1, …, Li-4), trained over 100 documents • Template: • Each slot holds one filler value or none • -> date templates: • SUB_DATE • ACC_DATE • WORK_DATE • CAMREADY_DATE
Templates sampling • Token models p(Li | Li-1) or p(Li | Li-1, …, Li-4), trained over 100 documents • Template: • Each slot holds one filler value or none • -> name templates: • CONF_NAME • CONF_ACRO • CONF_URL • WORK_NAME • WORK_ACRO • WORK_URL
Templates sampling • D: a distribution over these templates, estimated on the training set => LOCAL MODEL (PL) (see the sketch below)
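One way to read this local model, sketched under the assumption that training templates are given as dicts of slot -> filler (or None): estimate how frequent each pattern of present/absent slots is, which gives a distribution over templates.

```python
from collections import Counter

def pattern(template):
    """The set of filled slots, used as the identity of a template pattern."""
    return tuple(sorted(slot for slot, value in template.items() if value is not None))

def estimate_local_model(training_templates):
    counts = Counter(pattern(t) for t in training_templates)
    total = sum(counts.values())
    return {p: c / total for p, c in counts.items()}

P_L = estimate_local_model([
    {"SUB_DATE": "May 1", "ACC_DATE": "June 1", "WORK_DATE": None},
    {"SUB_DATE": "Apr 3", "ACC_DATE": None,     "WORK_DATE": "Jul 9"},
])
print(P_L)   # each observed present/absent pattern gets a probability
```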
Templates scoring: Date Model • PA/P: probability of present/absent fields, estimated from the training data • Po: ordering probability; constraint violations are penalized • Prel = PA/P * Po (sketched below)
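A hedged sketch of that scoring: a presence/absence probability multiplied by an ordering term that penalizes templates violating submission <= acceptance <= workshop. The penalty factor and slot names are assumptions for illustration.

```python
from datetime import date

ORDER = ["SUB_DATE", "ACC_DATE", "WORK_DATE"]
PENALTY = 0.1          # assumed penalty per violated ordering constraint

def ordering_probability(template):
    p_o = 1.0
    filled = [(s, template[s]) for s in ORDER if template.get(s) is not None]
    for (_, earlier), (_, later) in zip(filled, filled[1:]):
        if earlier > later:              # ordering constraint violated
            p_o *= PENALTY
    return p_o

def date_score(template, p_present_absent):
    return p_present_absent * ordering_probability(template)    # Prel = PA/P * Po

t = {"SUB_DATE": date(1995, 5, 1), "ACC_DATE": date(1995, 6, 1), "WORK_DATE": date(1995, 7, 10)}
print(date_score(t, p_present_absent=0.4))   # no violation, so the score is 0.4
```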
Templates scoring: Name Model • Name -> acronym: independent module (likelihood score, Chang 2002): Pnam->acr • Acronym -> URL: empirical probability from training: Pacr->url • Problem: missing entries give an advantage to incomplete templates • PA/P: weights templates (in training, most values are filled) • Prel = Pnam->acr * Pacr->url * PA/P (sketched below)
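The slide credits the name -> acronym score to Chang 2002; the initial-letter heuristic below is only a toy stand-in to make the product Prel = Pnam->acr * Pacr->url * PA/P concrete.

```python
def acronym_match_score(name, acronym):
    """Toy heuristic: 1.0 if the acronym matches the capitalized initials, else a small score."""
    initials = "".join(w[0] for w in name.split() if w[0].isupper())
    return 1.0 if initials.upper() == acronym.upper() else 0.1

def name_score(name, acronym, p_acronym_to_url, p_present_absent):
    return acronym_match_score(name, acronym) * p_acronym_to_url * p_present_absent

print(name_score("International Workshop on Parsing Technologies", "IWPT", 0.6, 0.9))  # 0.54
```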
Results • No gain over the CRF • The CRF accepts variation (e.g. in names) => lower recall • The relational model does not improve the CRF (not shown on the graph) • The CRF's small window => less information in the distribution • Substantial improvement over the CMM (5%)
Plan • Relation Extraction description • Sampling templates • Reducing deep analysis errors (Zhao, Grishman) • Conclusion
Problem • Different levels of syntactic analysis are used for the task: • Tokenization • Chunking • Sentence parsing • … • The more information they give, the less accurate they are • => combine them to correct errors
ACE task… a reminder • Entities: • PERson – ORGanisation – FACility – GeoPoliticalEntity – LOCation – WEApon – VEHicle • Mentions: • NAM (proper name), NOM (nominal), PRO (pronoun) • Relations: • EMP-ORG, PHYS, GPE-AFF, PER-SOC, DISC, ART, Other
Kernels, SVMs … nice properties • Kernel: • A function replacing the scalar (dot) product • Lets us translate the problem into a higher-dimensional space where it is easier to solve • Sums and products of kernels are kernels (see the sketch below) • SVM: • SVMs can pick out the features that give the best separation
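A brief sketch of that closure property and of feeding a composite kernel to an SVM, here via scikit-learn's precomputed-kernel mode; the Gram matrices are toy values, not the paper's relation kernels.

```python
import numpy as np
from sklearn.svm import SVC

def k_sum(K1, K2):
    return K1 + K2          # the sum of two kernels is a kernel

def k_prod(K1, K2):
    return K1 * K2          # so is the element-wise (Schur) product

# Toy Gram matrices for three training examples and their labels.
K1 = np.array([[2.0, 1.0, 0.0], [1.0, 2.0, 1.0], [0.0, 1.0, 2.0]])
K2 = np.eye(3)
y = [0, 1, 1]

clf = SVC(kernel="precomputed").fit(k_sum(K1, K2), y)
print(clf.predict(k_sum(K1, K2)))
```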
The relational model (transcribed as code below) • R = (arg1, arg2, seq, link, path) • arg1, arg2: the two entities to compare • seq = (t1, …, tn): the sequence of intervening tokens • link = (t1, …, tm): same as seq, but restricted to the important words • path: a dependency path… • T = (word, pos, base) • pos: part-of-speech tag • base: morphological base form • E = (tk, type, subtype, mtype) • type: ACE entity type • subtype: a refinement of the type • mtype: mention type (the way the entity is mentioned) • DT = (T, dseq) • dseq = (arc1, …, arcn) • ARC = (w, dw, label, e) • w: the current token • dw: the token connected to w • label: the role label of the arc • e: the direction of the arc
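Transcribed as Python dataclasses for readability (the field names follow the slide; this is not the authors' code):

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Token:                 # T = (word, pos, base)
    word: str
    pos: str                 # part-of-speech tag
    base: str                # morphological base form

@dataclass
class Arc:                   # ARC = (w, dw, label, e)
    w: Token                 # current token
    dw: Token                # token connected to w
    label: str               # role label of the arc (SBJ, OBJ, ...)
    e: int                   # direction of the arc

@dataclass
class DepToken:              # DT = (T, dseq)
    token: Token
    dseq: List[Arc]          # arcs attached to the token

@dataclass
class Entity:                # E = (tk, type, subtype, mtype)
    tk: Token
    type: str                # ACE entity type (PER, ORG, GPE, ...)
    subtype: str             # refinement of the type
    mtype: str               # mention type: NAM / NOM / PRO

@dataclass
class Relation:              # R = (arg1, arg2, seq, link, path)
    arg1: Entity
    arg2: Entity
    seq: List[Token]         # sequence of intervening tokens
    link: List[Token]        # intervening "important" words only
    path: List[Arc]          # dependency path between the arguments
```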
The relational model: example arg1=((“areas”, “NNS”, “area”, dseq), “LOC”, “region”, “NOM”) • arg1.dseq=((OBJ, areas, in, 1), (OBJ, areas, controlled, 1)) path=((OBJ, areas, controlled, 1), (SBJ, controlled, troops, 0))
Kernels (see the sketch below) • 1. Argument kernel: • Matches two tokens, comparing each of their attributes (word, pos, type…) • 2. Bigram kernel: • Matches tokens over a window of size 1 • 3. Link sequence kernel: • Relations often occur in a short context
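A rough illustration of the attribute-matching idea behind these kernels, reusing the Token sketch above: count shared attributes between two tokens, and let a sequence kernel sum such matches over aligned positions. The kernels in the paper are richer than this.

```python
def token_match(t1, t2, attrs=("word", "pos", "base")):
    """Number of attributes two tokens share."""
    return sum(getattr(t1, a) == getattr(t2, a) for a in attrs)

def sequence_kernel(seq1, seq2):
    """Sum token matches over aligned positions of two token sequences."""
    return sum(token_match(a, b) for a, b in zip(seq1, seq2))
```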
Kernels (2) • 4. Dependency path kernel: • How similar are two paths? • 5. Local dependency kernel: • Same as the path kernel, but more informative • Helpful when no dependency path exists (see the sketch below)
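In the same illustrative spirit, a toy path similarity: compare the arcs of two dependency paths position by position (label, direction, and the tokens they involve). The local dependency kernel would compare the arcs attached to each argument instead, which is what makes it usable when no complete path exists.

```python
def arc_match(a1, a2):
    """Crude arc similarity: shared label, shared direction, plus shared token attributes."""
    return (a1.label == a2.label) + (a1.e == a2.e) + token_match(a1.w, a2.w)

def path_kernel(path1, path2):
    return sum(arc_match(a, b) for a, b in zip(path1, path2))
```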
Results: adding info into SVM • The more information we give, the better the result. • Link Sequence Kernel boosts results.
Results: SVM or KNN • SVM performs better overall • The polynomial extension has no effect on KNN • Training problems in the last three cases • … good results on the official ACE task… but the scores are confidential, so no comparison is available
Conclusion • A really simple method • Nice properties of kernels/SVMs • The method is generic! (tested on annotated text) • SVMs seem to handle this task better • … but the two methods are hard to compare, as their goals differ
References • [1] Template Sampling for Leveraging Domain Knowledge in Information Extraction. Cox, Nicolson, Finkel, Manning, Langley. Stanford University. • [2] Extracting Relations with Integrated Information Using Kernel Methods. Zhao, Grishman. New York University. 2005.