CSA3180: Natural Language Processing. Information Extraction 2 Named Entities Question Answering Anaphora Resolution Co-Reference. Introduction. Slides partially based on talk by Lucian Vlad Lita Sheffield GATE Multilingual Extraction slides based on Diana Maynard’s talks
CSA3180: Natural Language Processing. Information Extraction 2 • Named Entities • Question Answering • Anaphora Resolution • Co-Reference CSA3180: Information Extraction II
Introduction • Slides partially based on talk by Lucian Vlad Lita • Sheffield GATE Multilingual Extraction slides based on Diana Maynard’s talks • Anaphora resolution slides based on Dan Cristea slides, with additional input from Gabriela-Eugenia Dima, Oana Postolache and Georgiana Puşcaşu
References • FASTUS system documentation • Robert Gaizauskas, “IE Perspective on Text Mining” • Daniel Bikel, “Nymble: A High Performance Learning Name Finder” • Helena Ahonen-Myka’s notes on FSTs • Javelin system documentation • MUC-7 Overview & Results
Named Entities • Person Name: Colin Powell, Frodo • Location Name: Middle East, Aiur • Organization: UN, DARPA • Domain Specific vs. Open Domain
Anaphora Resolution [Diagram labels: unprocessed text; AR engine; AR-annotated text; annotation tool; AR gold standard; comparison & evaluation; fine-tuning]
Anaphora Resolution • Text: • Nature of discourse • Anaphoric phenomena • Anaphora Resolution Engines: • Models • General AR Frameworks • Knowledge Sources
Anaphora Resolution “Anaphora represents the relation between a ‘proform’ (called an ‘anaphor’) and another term (called an ‘antecedent’), when the interpretation of the anaphor is in a certain way determined by the interpretation of the antecedent.” Barbara Lust, Introduction to Studies in the Acquisition of Anaphora, D. Reidel, 1986
Anaphora Example “It was a bright cold day in April, and the clocks were striking thirteen. Winston Smith, his chin nuzzled into his breast in an effort to escape the vile wind, slipped quickly through the glass doors of Victory Mansions, though not quickly enough to prevent a swirl of gritty dust from entering along with him.” Orwell, 1984 (The slide marks anaphor/antecedent pairs in this passage.)
Anaphora • pronouns (personal, demonstrative, ...) • full pronouns • clitics (RO: dă-mi-l, IT: dammelo) • nouns • definite • indefinite • adjectives, numerals (generally associated with an ellipsis) • In this the play is expressionist₁ in its approach to theme. • But it is also so₁ in its use of unfamiliar devices...
Referential Expressions • mark the noun phrases • for each NP, ask a question about it • keep as REs those NPs that can be naturally referenced in the question • Example: The policeman got in the car in a hurry in order to catch the runaway thief.
Referential Expressions a. John was going down the street looking for Bill’s house. b. He found it at the first corner.
Referential Expressions a. John was going down the street looking for Bill’s house. b. He met him at the first corner.
Referential Expressions The empty anaphor • Gianni diede una mela a Michele. Più tardi, gli diede un’arancia. [Not & Zancanaro, 1996] • John gave an apple to Michelle. Later on, gave her an orange.
Textual Ellipsis The functional (bridge) anaphora • The state of the accumulator is indicated to the user. 30 minutes before the complete discharge, the computer signals for 5 seconds. [Strube & Hahn, 1996]
Events, States, Descriptions • He left without eating₁. Because of this₁, he was starving in the evening. • But, he adds, Priestley is more interested in Johnson living than in Johnson dead₁. In this₁ the play is expressionist in its approach to theme. [Halliday & Hasan, 1976]
Definite/Indefinite NPs • Once upon a time, there was a king and a queen. And the king one day went hunting. • Apollo took out his bow... • Take the elevator to the 4th floor.
Anaphora Resolution • State of the art in Anaphora Resolution: • Identity: 65-80% • Other: much less…
What is so difficult? Nothing – everything is so simple! • John₁ has just arrived. He₁ seems tired. • The girl₁ leaves the trash on the table and wants to go away. The boy₂ tries to hold her₁ by the arm₃₁; she₁ escapes and runs; he₂ calls her₁ back. Caragiale, At the Mansion
What is so difficult? Nothing indeed, but imagine letting the machine go wrong... • There’s a pile of inflammable trash next to your car. You’ll have to get rid of it. • If the baby does not thrive on the raw milk, boil it. [Hobbs, 1997]
What is so difficult? Semantic restrictions • Jeff₁ helped Dick₂ wash the car. He₁ washed the windows as Dick₂ waxed the car. He₁ soaped a pane. • Jeff₁ helped Dick₂ wash the car. He₁ washed the windows as Dick₂ waxed the car. He₂ buffed the hood. [Walker, Joshi & Prince, 1997]
What is so difficult? Semantic correlates • An elephant₁ hit the car with the trunk. The animal₁ had to be taken away so as not to cause further damage. • * An animal₁ hit the car with the trunk. The elephant₁ had to be taken away so as not to cause further damage.
What is so difficult? Long distance recovery (pronominalization) • His re-entry into Hollywood came with the movie “Brainstorm”, • but its completion and release has been delayed by the death of co-star Natalie Wood. • He plays Hugh Hefner of Playboy magazine in Bob Fosse’s “Star 80.” • It’s about Dorothy Stratton, the Playboy Playmate who was killed by her husband. • He also stars in the movie “Class.” Los Angeles Times, July 18, 1983, cited in [Fox, 1986]
What is so difficult? • Gender mismatches: Mr. Chairman..., what is her position upon this issue? (political correctness!!) • Number mismatches: The government discussed ... They...
What is so difficult? Distributed antecedents • John₁ invited Mary₂ to the cinema. After the movie ended, they₃ went to a restaurant. (3 = {1, 2})
What is so difficult? Empty/non-empty anaphors • John gave an apple to Michelle. Later on, gave her an orange. • John gave an apple to Michelle. Later on, he gave her an orange. • John gave an apple to Michelle. Later on, this one asks him for an orange.
Semantics are Essential • Police ... They • Teacher ... She/He • A car ... The automobile • A Mercedes ... The car • A lamp ... The bulb
Semantics are not all • Pronouns - poor semantic features: he [+animate, +male, +singular]; she [+animate, +female, +singular]; it [+inanimate, +singular]; they [+plural] • Gender in Romance languages: Ro. maşină = ea (feminine); Ro. automobil = el (masculine) • Anaphora resolution by concord rules: Un camion a heurté une voiture. Celle-ci a été complètement détruite. (A truck hit a car. It was completely destroyed.) Celle-ci / une voiture: gender match! Celle-ci / un camion: gender mismatch!
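The concord rule on this slide can be sketched as a feature-agreement check. This is a minimal illustration, not a method taken from the slides: the feature names (`animate`, `gender`, `number`), their values, and the helper names `agrees` and `concord_filter` are all assumptions made for the example.

```python
# Agreement features per pronoun. Names and values are illustrative
# assumptions; a feature that is absent counts as "unspecified".
PRONOUN_FEATURES = {
    "he": {"animate": True, "gender": "masc", "number": "sg"},
    "she": {"animate": True, "gender": "fem", "number": "sg"},
    "it": {"animate": False, "number": "sg"},
    "they": {"number": "pl"},
    "celle-ci": {"gender": "fem", "number": "sg"},  # French, grammatical gender
}


def agrees(anaphor_features, candidate_features):
    """True when no feature of the anaphor conflicts with the candidate.

    A feature missing on the candidate is treated as compatible.
    """
    return all(
        candidate_features.get(name, value) == value
        for name, value in anaphor_features.items()
    )


def concord_filter(pronoun, candidates):
    """Keep the candidates whose features agree with the pronoun's."""
    features = PRONOUN_FEATURES[pronoun.lower()]
    return [text for text, cand in candidates if agrees(features, cand)]


candidates = [
    ("un camion", {"gender": "masc", "number": "sg", "animate": False}),
    ("une voiture", {"gender": "fem", "number": "sg", "animate": False}),
]
print(concord_filter("Celle-ci", candidates))  # ['une voiture']
```

With an English "it", which carries no gender feature, both candidates would survive the filter, which is why the slide calls pronoun features "poor".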
Anaphora Resolution [Charniak, 1972] In order to do AR, one has to be able to do everything else. Once everything else is done, AR comes for free.
Anaphora Resolution Most current anaphora resolution systems implement a pipeline architecture with three modules: • Collect: determines the List of Potential Antecedents (LPA): a1, a2, a3, … an. • Filter: eliminates from the LPA the candidates that are incompatible with the referential expression under scrutiny. • Preference: determines the most likely antecedent on the basis of an ordering policy. [Diagram: referential expressions pass through Collect, Filter and Preference in turn]
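The three modules can be sketched as three small functions. This is only an illustration under stated assumptions: the `Mention` record, the fixed look-back `window`, the gender/number filter and the "most recent wins" ordering policy are all hypothetical choices, not something the slides prescribe.

```python
from dataclasses import dataclass


@dataclass
class Mention:
    text: str
    position: int  # order of appearance in the text
    gender: str    # "masc", "fem" or "neut" (assumed feature set)
    number: str    # "sg" or "pl"


def collect(mentions, anaphor, window=10):
    """Collect: mentions that precede the anaphor within a fixed window."""
    return [m for m in mentions if 0 < anaphor.position - m.position <= window]


def filter_candidates(candidates, anaphor):
    """Filter: drop candidates that disagree in gender or number."""
    return [
        m for m in candidates
        if m.gender == anaphor.gender and m.number == anaphor.number
    ]


def prefer(candidates):
    """Preference: a deliberately naive policy, the most recent candidate."""
    return max(candidates, key=lambda m: m.position, default=None)


def resolve(mentions, anaphor):
    return prefer(filter_candidates(collect(mentions, anaphor), anaphor))


# "John was going down the street looking for Bill's house.
#  He found it at the first corner."
mentions = [
    Mention("John", 0, "masc", "sg"),
    Mention("the street", 1, "neut", "sg"),
    Mention("Bill", 2, "masc", "sg"),
    Mention("Bill's house", 3, "neut", "sg"),
]
he = Mention("He", 4, "masc", "sg")
it = Mention("it", 5, "neut", "sg")
print(resolve(mentions, it).text)  # Bill's house
print(resolve(mentions, he).text)  # Bill
```

The second result is wrong for this text (the intended antecedent of "He" is John), which shows why the Preference module needs a richer ordering policy than plain recency.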
Anaphora Resolution Models • [Hobbs, 1976] (pronominal anaphora) • Naïve algorithm: • requires a surface parse tree • navigates the syntactic tree of the anaphor’s sentence and then those of the preceding sentences in order of recency, traversing each tree left-to-right, breadth-first • A semantic approach: • requires a semantic representation of the sentences (logical expressions) • uses a collection of semantic operations (inferences) • the type of pronoun is important
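The traversal order described for the naïve algorithm can be sketched as follows. This is a heavily simplified illustration: it only searches the trees of preceding sentences, most recent first, each breadth-first and left-to-right, and it leaves out the constraints the full algorithm applies inside the anaphor's own sentence. The nested-tuple tree encoding and the `accept` predicate are assumptions made for the example.

```python
from collections import deque


def np_candidates(tree):
    """Yield the NP nodes of one parse tree, breadth-first, left-to-right.

    A tree is a nested tuple (label, child, child, ...); leaves are strings.
    """
    queue = deque([tree])
    while queue:
        node = queue.popleft()
        if isinstance(node, str):
            continue
        label, *children = node
        if label == "NP":
            yield node
        queue.extend(children)


def words(node):
    """Flatten a (sub)tree into its list of words."""
    if isinstance(node, str):
        return [node]
    return [w for child in node[1:] for w in words(child)]


def propose(previous_trees, accept):
    """Return the first acceptable NP, searching the most recent tree first."""
    for tree in reversed(previous_trees):
        for np in np_candidates(tree):
            if accept(np):
                return " ".join(words(np))
    return None


# "John was looking for Bill's house."
sentence = (
    "S",
    ("NP", "John"),
    ("VP", ("V", "was"),
     ("VP", ("V", "looking"),
      ("PP", ("P", "for"),
       ("NP", ("NP", "Bill"), ("POS", "'s"), ("N", "house"))))),
)

# Hypothetical agreement checks standing in for gender/number tests.
is_male_person = lambda np: words(np) in (["John"], ["Bill"])
is_thing = lambda np: words(np)[-1] == "house"

print(propose([sentence], is_male_person))  # John
print(propose([sentence], is_thing))        # Bill 's house
```

Because the search is breadth-first, the subject NP near the root is proposed before the more deeply embedded "Bill", so "He" in a following sentence resolves to John.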
Anaphora Resolution Models • [Lappin & Leass, 1994] (pronominal anaphora) • syntactic structures • an intrasentential syntactic filter • morphological filter (person, number, gender) • detection of pleonastic pronouns • salience parameters (grammatical role, parallelism of grammatical roles, frequency of mention, proximity, sentence recency)
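A salience score built from the parameters listed above might look like the sketch below. The weights, the per-sentence decay factor and the function name `salience` are placeholder assumptions chosen for illustration; they should not be read as the published parameters of the model.

```python
# Placeholder weights, assumed for this example only.
RECENCY_WEIGHT = 100
ROLE_WEIGHT = {"subject": 80, "object": 50, "indirect_object": 40}
PARALLELISM_BONUS = 35
DECAY = 0.5  # a mention loses half its weight per intervening sentence


def salience(mentions, anaphor_sentence, anaphor_role):
    """Score one entity from all of its mentions.

    mentions: list of (sentence_index, grammatical_role) pairs.
    Summing over mentions rewards frequency of mention; the decay
    rewards proximity and sentence recency.
    """
    score = 0.0
    for sentence, role in mentions:
        distance = anaphor_sentence - sentence
        score += (RECENCY_WEIGHT + ROLE_WEIGHT.get(role, 0)) * DECAY ** distance
    # Parallelism: the latest mention has the same role as the anaphor.
    latest_role = max(mentions)[1]
    if latest_role == anaphor_role:
        score += PARALLELISM_BONUS
    return score


# "John was going down the street looking for Bill's house. He ..."
entities = {
    "John": [(0, "subject")],
    "Bill": [(0, "possessor")],
}
scores = {name: salience(m, 1, "subject") for name, m in entities.items()}
best = max(scores, key=scores.get)
print(scores, best)
```

In a full system this ranking would be applied only to the candidates that survive the syntactic and morphological filters.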
Anaphora Resolution Models • [Sidner, 1981], [Grosz & Sidner, 1986] • focus/attention based • give more salience to those semantic entities that are in focus • define where to look for an antecedent in the semantic structure of the preceding text (a stack in G&S’s model)
AR Models: Centering • [Grosz, Joshi, Weinstein, 1983, 1995] • [Brennan, Friedman and Pollard, 1987] • Cf(u) = <e1, e2, ... ek> - an ordered list of forward-looking centers • Cb(u) = ei - the backward-looking center • Cp(u) = e1 - the preferred center • Transition preference: CON > RET > SSH > ASH
Transitions:
                   Cb(u) = Cb(u-1)   Cb(u) ≠ Cb(u-1)
Cb(u) = Cp(u)      CON               SSH
Cb(u) ≠ Cp(u)      RET               ASH
AR Models: Centering [Grosz, Joshi & Weinstein, 1983] a. I haven’t seen Jeff for several days. Cf = (I=[I], [Jeff]); Cb = [I] b. Carl thinks he’s studying for his exams. Cf = ([Carl], he=[Jeff], [Jeff’s exams]); Cb = [Jeff] c. I think he? went to the Cape with Linda.
AR Models: Centering b. Carl thinks he’s studying for his exams. Cf = ([Carl], he=[Jeff], [Jeff’s exams]); Cb = [Jeff] c. I think he? went to the Cape with Linda. • he = Jeff: Cf = (I=[I], he=[Jeff], [the Cape], [Linda]); Cb = [Jeff]: RETAINING • he = Carl: Cf = (I=[I], he=[Carl], [the Cape], [Linda]); Cb = [Carl]: ABRUPT SHIFT • Preferred reading: he = Jeff
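The choice between the two readings of "he" can be reproduced with a small transition classifier. The sketch below uses the usual four-way definition of centering transitions (continuing when Cb is unchanged and equals Cp, retaining when Cb is unchanged but differs from Cp, smooth shift when Cb changes and equals Cp, abrupt shift otherwise); the function names and the dictionary encoding of a reading are assumptions for illustration.

```python
# Preference order of the transitions: CON > RET > SSH > ASH.
PREFERENCE = ["CON", "RET", "SSH", "ASH"]


def transition(cb, cp, previous_cb):
    """Classify the transition into an utterance.

    cb: backward-looking center, cp: preferred center (first element of
    Cf), previous_cb: backward-looking center of the previous utterance.
    """
    if cb == previous_cb:
        return "CON" if cb == cp else "RET"
    return "SSH" if cb == cp else "ASH"


def best_reading(readings, previous_cb):
    """Pick the reading whose transition ranks highest in PREFERENCE."""
    return min(
        readings,
        key=lambda r: PREFERENCE.index(transition(r["cb"], r["cp"], previous_cb)),
    )


# Utterance (c): "I think he went to the Cape with Linda."
# Cb of utterance (b) is Jeff; Cp of (c) is "I" under both readings.
readings = [
    {"he": "Carl", "cb": "Carl", "cp": "I"},
    {"he": "Jeff", "cb": "Jeff", "cp": "I"},
]
for r in readings:
    print(r["he"], transition(r["cb"], r["cp"], "Jeff"))
print(best_reading(readings, "Jeff")["he"])
```

The Jeff reading yields a retaining transition and the Carl reading an abrupt shift, so the preference order selects Jeff.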
Anaphora Resolution Models • [Mitkov, 1998] • knowledge-poor approach • POS tagger, noun phrase rules • 2 previous sentences • definiteness, givenness, lexical reiteration, section heading preference, distance, terms of the field, etc.
General Framework • Build a framework that can easily accommodate any of the existing AR models • Fine-tune the models and experiment with them to improve performance (learning) • Eventually obtain a better model
General Framework [Diagram labels: text; AR-engine; AR-model1, AR-model2, AR-model3]
Co-References • Halliday and Hasan: a semantic relation, not a textual one [Diagram: on the text layer, expressions a and b stand in a co-referential anaphoric relation; on the semantic layer, a evokes center_a and b evokes the same center_a]
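Time and Discourse • Discourse has a dynamic nature [Diagram: three time axes (real time, discourse time, story time); events 1 and 2 appear in the order 1, 2 on the real-time and discourse-time axes and in the order 2, 1 on the story-time axis; tick labels 800, 920, 1000, 1030]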
Resolution Moment Police officer David Cheshire went to Dillard's home. Putting his ear next to Dillard's head, Cheshire heard the music also. [Tanaka, 1999] [Diagram labels: his; Dillard; Cheshire]
Resolution Delay • Sanford and Garrod (1989) • initiation point • completion point • Information is kept in a temporary location of memory
Cataphora – What is there? • The element referred to is anticipated by the referring element • Theories • scepticism • syntactic reality • Example: “From the corner of the divan of Persian saddle-bags on which he was lying, smoking, as was his custom, innumerable cigarettes, Lord Henry Wotton could just catch the gleam of the honey-sweet and honey-coloured blossoms of a laburnum…” Oscar Wilde, The Picture of Dorian Gray
No right reference needed in discourse processing • Introduction of an empty discourse entity • Addition of new features as discourse unfolds • Pronoun anticipation in Romanian: I taught Gabriel to read. = Ro. L-am învățat pe Gabriel să citească.
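Unique directionality in interpretation [Diagram: anaphora (John ... he): “John” introduces the entity (gender = masc, number = sg, sem = person, name = John), and “he” (gender = masc, number = sg) is interpreted against it. Cataphora (he ... John): “he” introduces an underspecified entity (gender = masc, number = sg, ?), which “John” later completes to (gender = masc, number = sg, sem = person, name = John)]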
Automatic Interpretation • necessity for an intermediate level [Diagram: on the text layer, the referential expression (RE) a projects the feature structure fs_a onto the restriction layer; fs_a evokes center_a on the semantic layer]
Three Layer Approach to AR 1. John sold his bicycle 2. although Bill would have wanted it. [Diagram: on the text layer, “his bicycle” projects (no = sg, sem = bicycle, det = yes) and “it” projects (no = sg, sem = ¬human) onto the restrictions layer; both feature structures evoke the same entity (no = sg, sem = bicycle, det = yes) on the semantic layer]
Delayed Interpretation Police officer David Cheshire went to Dillard's home. Putting his ear next to Dillard's head, Cheshire heard the music also. [Diagram over times t0, t1, t2, t3: the text layer holds “Dillard”, “his”, “Dillard” and “Cheshire”; the restriction layer holds fs_Dillard, fs_his with candidates = { , }, fs_Dillard and fs_Cheshire; the semantic layer holds the entities Dillard and Cheshire]
Delayed Interpretation From the corner of the divan of Persian saddle-bags on which he was lying, smoking, as was his custom, innumerable cigarettes, Lord Henry Wotton could just catch the gleam of the honey-sweet and honey-coloured blossoms of a laburnum… [Diagram along the time axis t0, t1, t2: the text layer holds “he”, “his” and “Lord Henry Wotton”; evoking first initiates an underspecified entity (gender = masc, number = sing, sem = person, ?) on the semantic layer; “Lord Henry Wotton” then projects (gender = masc, number = sing, sem = person, name = Lord Henry Wotton) onto the restriction layer, and evoking completes the entity with the same features]
The case of Cataphora 1. Although Bill would have wanted it, 2. John sold his bicycle to somebody else. [Diagram: on the text layer, “it” projects (no = sg, sem = ¬human) onto the restrictions layer and evokes an entity (no = sg, sem = ¬human) on the semantic layer; “his bicycle” then projects (no = sg, sem = bicycle, det = yes) and evokes the same entity, which becomes (no = sg, sem = bicycle, det = yes)]
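The project/evoke mechanism treats anaphora and cataphora alike, and a small sketch makes that visible. Everything below is an assumption made for illustration: feature structures are plain dictionaries of strings, a value written `¬x` means "anything but x", and the functions `compatible`, `merge` and `interpret` are hypothetical names, not part of any system described in the slides.

```python
def compatible(fs, entity):
    """True when no feature of fs conflicts with the entity's features."""
    for key, value in fs.items():
        if key not in entity:
            continue
        known = entity[key]
        if value == known:
            continue
        # A negated value "¬x" conflicts only with the plain value "x".
        if value.startswith("¬") and known != value[1:]:
            continue
        if known.startswith("¬") and value != known[1:]:
            continue
        return False
    return True


def merge(fs, entity):
    """Add the features of fs to the entity; a positive value replaces
    a negated one, otherwise an existing value is kept."""
    for key, value in fs.items():
        known = entity.get(key)
        if known is None or (known.startswith("¬") and not value.startswith("¬")):
            entity[key] = value


def interpret(feature_structures):
    """Each projected feature structure evokes the most recent compatible
    entity, or introduces a new (initially empty) one."""
    entities = []
    for fs in feature_structures:
        target = next((e for e in reversed(entities) if compatible(fs, e)), None)
        if target is None:
            target = {}
            entities.append(target)
        merge(fs, target)
    return entities


his_bicycle = {"no": "sg", "sem": "bicycle", "det": "yes"}
it = {"no": "sg", "sem": "¬human"}

print(interpret([his_bicycle, it]))  # anaphora: "his bicycle ... it"
print(interpret([it, his_bicycle]))  # cataphora: "it ... his bicycle"
```

Both orders end with a single entity carrying no = sg, sem = bicycle, det = yes. In the cataphoric order the entity simply starts out underspecified and is completed when "his bicycle" arrives, so no right-to-left reference is needed.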