Hybrid Systems for Information Extraction and Question Answering Presented by Rani Qumsiyeh
What is Question Answering? • Being able to retrieve the exact piece of information the user is looking for rather than a set of relevant documents. • Who was the president of the US in 2004? George W. Bush
What is Summarization? • "Text summarization can be regarded as the most interesting and promising Natural Language Understanding task computational linguists are currently faced with" (Rodolfo Delmonte) • Summarization means taking a large piece of text and extracting the most important ideas from it. • Example: the story of the three little pigs

Once upon a time there were three little pigs who lived happily in the countryside. But in the same place lived a wicked wolf who fed precisely on plump and tender pigs. The little pigs therefore decided to build a small house each, to protect themselves from the wolf. The oldest one, Jimmy, who was wise, worked hard and built his house with solid bricks and cement. The other two, Timmy and Tommy, who were lazy, settled the matter hastily and built their houses with straw and pieces of wood.

The lazy pigs spent their days playing and singing a song that said, "Who is afraid of the big bad wolf?" And one day, lo and behold, the wolf appeared suddenly behind their backs. "Help! Help!", shouted the pigs, and they started running as fast as they could to escape the terrible wolf, who was already licking his lips thinking of such an inviting and tasty meal. The little pigs eventually managed to reach their small house and shut themselves in, barring the door. They started mocking the wolf from the window, singing the same song, "Who is afraid of the big bad wolf?"

In the meantime the wolf was thinking of a way of getting into the house. He began to observe the house very carefully and noticed it was not very solid. He huffed and puffed a couple of times and the house fell down completely. Frightened out of their wits, the two little pigs ran at breakneck speed towards their brother's house. "Fast, brother, open the door! The wolf is chasing us!" They got in just in time and pulled the bolt. Within seconds the wolf arrived, determined not to give up his meal. Convinced that he could also blow the little brick house down, he filled his lungs with air and huffed and puffed a few times. There was nothing he could do; the house didn't move an inch. In the end he was so exhausted that he fell to the ground. The three little pigs felt safe inside the solid brick house. Grateful to their brother, the two lazy pigs promised him that from that day on they too would work hard.
Could this be automated? • When understanding a text, a human reader or listener makes only parsimonious use of his or her encyclopedic knowledge. • To do this automatically, a system should simulate actual human behavior: access to extra-linguistic knowledge is triggered by contextual factors that are independently present in the text and detected by the system itself. • The simplest approach is the Bag of Words (BOW), sketched below. • For question answering, out of the first n documents retrieved, extract the words of the question along with a certain number of neighboring words. • For summarization, extract all sentences that contain title keywords.
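A minimal sketch of this BOW baseline, assuming a toy stopword list and illustrative function names (this is not the authors' code):

```python
# Toy Bag-of-Words baseline for the two tasks on this slide. The stopword
# list, window size, and function names are illustrative assumptions.

import re

STOPWORDS = {"the", "a", "an", "of", "in", "who", "what", "was", "is", "to"}

def keywords(text):
    """Lowercased content words, minus a small stopword list."""
    return {w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS}

def qa_windows(question, documents, window=5):
    """QA: snippets of `window` words around each question keyword."""
    kws = keywords(question)
    snippets = []
    for doc in documents:
        words = doc.split()
        for i, w in enumerate(words):
            if w.lower().strip(".,!?") in kws:
                snippets.append(" ".join(words[max(0, i - window):i + window + 1]))
    return snippets

def summarize(title, text):
    """Summarization: keep sentences containing at least one title keyword."""
    kws = keywords(title)
    sentences = re.split(r"(?<=[.!?])\s+", text)
    return [s for s in sentences if keywords(s) & kws]

story = "The pigs built houses. The wolf huffed. The little pigs escaped."
print(summarize("The story of the 3 little pigs", story))
# ['The pigs built houses.', 'The little pigs escaped.']
```

Note how crude this is: it matches surface words only, which is exactly why the paper argues for deeper, hybrid processing.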
What is the Problem? • The problem the researchers tackle is taken from P. Bosch's contribution to Herzog & Rollinger (eds.), "Text Understanding in LILOG". • Identifying in a text the "inferentially unstable" concepts, which are to be kept distinct from the "inferentially stable" ones: the latter should be analyzed solely on the basis of linguistic description, while the former should tap extra-linguistic knowledge of the world. • This is identified tout court with contextual reasoning, i.e. performing inferential processes on the basis of linguistic information while keeping the contribution of external knowledge under control, in order to achieve understanding of a text.
Example of the Problem • More information from the query • Bill surprised Hillary with his answer • The pronoun his refers to Bill; hence the answer is Bill's answer. • The same-head problem • The president of Russia visited the president of China • Who visited the president? (both noun phrases share the head president) • The reversible-arguments problem • What do frogs eat? • What eats frogs? (same content words, opposite semantic roles)
The Solution: A Hybrid System • Symbolic processing is defined as computation performed at the word level or at a more abstract level. • Statistical natural-language processing uses stochastic, probabilistic, and statistical methods to resolve some of the ambiguities of text. • Syntactic processing deals with aspects of meaning that can be determined only from the underlying structure, not simply from the linear string of words. • Semantic analysis involves extracting the context-independent aspects of a sentence's meaning. • In order to act and think like a human, a system needs both.
GETARUNS (General Text And Reference UNderstander) • Works in the following way: • Performs semantic analysis on the basis of syntactic parsing. • Performs anaphora resolution. • Builds a quasi-logical form with flat, indexed Augmented Dependency Structures (the Discourse Model). • Uses a centering algorithm to individuate the topics, or discourse centers, which are weighted on the basis of a relevance score. • This logical form can then be used to individuate the best candidate sentences to answer queries or provide appropriate information. A toy sketch of this pipeline follows.
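A toy, stage-by-stage rendering of the pipeline above. The real system is a large Prolog program; every function here is a trivial stand-in whose only purpose is to show how the stages feed one another:

```python
# Stand-in GETARUNS pipeline: parse -> semantics -> anaphora -> DM -> centering.

def parse(text):
    # Stand-in parser: one pseudo f-structure per sentence.
    return [{"sentence": s.strip()} for s in text.split(".") if s.strip()]

def semantic_analysis(f_structures):
    # Stand-in semantics: record each structure's content words.
    for f in f_structures:
        f["words"] = f["sentence"].lower().split()
    return f_structures

def resolve_anaphora(structures):
    # Stand-in: the real MDA binds pronouns to antecedents (later slides).
    return structures

def build_discourse_model(structures):
    # Stand-in DM: flat, indexed facts, one entry per sentence.
    return dict(enumerate(structures))

def centering(dm):
    # Stand-in relevance score: frequency of each word across the DM.
    scores = {}
    for f in dm.values():
        for w in f["words"]:
            scores[w] = scores.get(w, 0) + 1
    return scores

text = "John went into a restaurant. John ordered a meal."
dm = build_discourse_model(resolve_anaphora(semantic_analysis(parse(text))))
print(centering(dm))  # 'john' recurs; real centering weights referents, not raw words
```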
The Parser • A rule-based, deterministic parser. • Uses lookahead and a Well-Formed Substring Table (WFST) to reduce backtracking. • Also implements Finite State Automata for the task of tag disambiguation. • Based on top-down, depth-first search.
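A minimal sketch of the WFST idea, assuming a toy grammar and lexicon (GETARUNS's parser is far richer and additionally uses lookahead and FSA-based tag disambiguation):

```python
# Top-down, depth-first recognition that memoizes results in a
# Well-Formed Substring Table (WFST), so spans are never re-parsed
# on backtracking. Grammar, lexicon, and names are toy assumptions.

GRAMMAR = {
    "S":  [["NP", "VP"]],
    "NP": [["Det", "N"], ["Name"]],
    "VP": [["V", "NP"], ["V"]],
}
LEXICON = {"Det": {"a", "the"}, "N": {"restaurant", "meal"},
           "Name": {"john"}, "V": {"went", "ordered"}}

wfst = {}  # (category, start position) -> set of reachable end positions

def spans(cat, words, start):
    """All end positions where `cat` can span words[start:end]."""
    key = (cat, start)
    if key in wfst:                   # WFST hit: reuse the earlier result
        return wfst[key]
    ends = set()
    if cat in LEXICON:                # terminal category: match one word
        if start < len(words) and words[start] in LEXICON[cat]:
            ends.add(start + 1)
    else:                             # nonterminal: expand each rule top-down
        for rhs in GRAMMAR[cat]:
            positions = {start}
            for sym in rhs:
                positions = {e for p in positions for e in spans(sym, words, p)}
            ends |= positions
    wfst[key] = ends
    return ends

words = "john ordered a meal".split()
print(len(words) in spans("S", words, 0))  # True: the sentence is recognized
```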
Example of the F-Structure produced by the Parser • John went into a restaurant

index:f1
pred:go
lex_form:[np/subj/agent/[human, object],
          pp/obl/locat/[to, in, into]/[object, place]]
voice:active; mood:ind; tense:past
cat:result
subj/agent: index:sn4
            cat:[human]
            pred:'John'
            gen:mas; num:sing; pers:3; spec:def:'0'
            tab_ref:[+ref, -pro, -ana, -class]
obl/locat:  index:sn5
            cat:[place]
            pred:restaurant
            num:sing; pers:3; spec:def:-
            tab_ref:[+ref, -pro, -ana, +class]
qmark:q1
aspect:achiev_tr
rel1:[td(f1_res2)=tr(f1_res2)]
rel2:[included(tr(f1_res2), tes(f1_res2))]
specificity:+
ref_int:[tr(f1_res2)]
qops:qop:q(q1, indefinite)
Building the Discourse Model • A set of entities and the relations between them, as "specified" in a discourse. • Discourse entities can be used as discourse referents. • Entities and relations in a Discourse Model can be interpreted as representations of the cognitive objects of a mental model. • The representation is inspired by Situation Semantics. • Implemented as Prolog facts.
DM and Infons • Any piece of information is added to the DM as an infon:

Infon(Index,
      Relation(Property),
      List of Arguments, with semantic roles,
      Polarity: 1 affirmative, 0 negation,
      Temporal Location Index,
      Spatial Location Index)

• An infon consists of a relation name, its arguments, a polarity (yes/no), and a pair of indexes anchoring the relation to a spatio-temporal location. • Example: meet, (arg1:john, arg2:mary), yes, 22-sept-2008, venice • Each infon has a unique identifier and can be referred to by other infons.
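A hedged sketch of the infon record in Python (the system itself stores these as Prolog facts; the class and field names are illustrative):

```python
# Infon record mirroring the schema on this slide. The dataclass is an
# illustrative assumption; GETARUNS asserts infons as Prolog facts.

from dataclasses import dataclass

@dataclass(frozen=True)
class Infon:
    index: str              # unique identifier, referable by other infons
    relation: str           # relation (property) name
    arguments: tuple        # (semantic_role, filler) pairs
    polarity: int           # 1 = affirmative, 0 = negation
    temporal_loc: str = ""  # temporal location index
    spatial_loc: str = ""   # spatial location index

# The slide's example: meet(john, mary), affirmative, 22-sept-2008, venice.
i1 = Infon("infon22", "meet",
           (("arg1", "john"), ("arg2", "mary")),
           polarity=1, temporal_loc="22-sept-2008", spatial_loc="venice")
print(i1)
```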
Kinds of Infons • Full infons • Situations: sit/6 • Facts: fact/6 • Complex infons: take other sits/facts as arguments • Simplified infons • Entities: ind/2, set/2, class/2 • Cardinalities: card/3 • Membership: in/3 • Spatio-temporal relations: includes/2, during/2, …
Entities, Cardinalities, Membership • Entities are represented in the DM without any commitment about their "existence" in reality. • Individual entities ("John"): ind(infon1, id5). • Extensional plural entities ("his kids"): set(infon2, id6). • Intensional plural entities ("lions"): class(…, id7). • Cardinality (only for sets: "four kids") • card(…, id6, 4). • Membership (between individuals and sets: "one of them") • in(…, id5, id6).
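A small sketch of how these predicates might live in a flat fact store (the store and helper names are assumptions; the real DM holds Prolog facts):

```python
# Entities, cardinality, and membership as flat facts, mirroring the
# ind/2, set/2, class/2, card/3, and in/3 predicates on this slide.

dm = []  # the Discourse Model: a flat list of (predicate, *args) facts

def assert_fact(*fact):
    dm.append(fact)

# "John" -> individual; "his kids" -> set; "lions" -> class
assert_fact("ind",   "infon1", "id5")
assert_fact("set",   "infon2", "id6")
assert_fact("class", "infon3", "id7")
assert_fact("card",  "infon4", "id6", 4)      # "four kids": the set's cardinality
assert_fact("in",    "infon5", "id5", "id6")  # "one of them": id5 is in set id6

def query(predicate):
    """All facts asserted under the given predicate name."""
    return [f for f in dm if f[0] == predicate]

print(query("card"))  # [('card', 'infon4', 'id6', 4)]
```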
Anaphora Resolution • Anaphora is an instance of one expression referring back to another (its antecedent). • Anaphora resolution means identifying which expression an anaphor refers to.
Two Types of Anaphora • Noun/Noun Phrase (i.e. nominal) • He doesn't like this book. Show him a more interesting one. • One refers to the book. • If you want a typewriter, they will provide you with one. • One refers to the typewriter. • Slang disappears quickly, especially the juvenile sort. • Sort refers to slang. • Nominal substitutes also include some indefinite pronouns, such as all, both, some, any, enough, several, none, many, much, (a) few, (a) little, the other, others, another, either, neither, etc. E.g.: • Can you get me some nails? I need some. • Some refers to nails. • Pronoun/Pronoun Phrase (i.e. pronominal) • The Prime Minister of New Zealand visited us yesterday. The visit was the first time she had come to New York since 1998. • She refers to the Prime Minister. • Us refers to the people of New York. • The monkey took the banana and ate it. • It refers to the banana.
How Does It Work? • Anaphora is computed by a Module of Discourse Anaphora (MDA). • It decides, on the basis of the semantic categories attached to predicates and their arguments, whether to bind a pronoun to the locally available antecedent or to the discourse-level one. • It creates a list of candidate discourse arguments, which includes all external pronouns and referential expressions; from this, the algorithm builds a Weighted List of Candidate Arguments of Discourse (WLCAD).
Ontology Behind Anaphora Resolution • On first occurrence of a referring expression: • it is asserted as an INDividual if it is a definite or indefinite expression; • it is asserted as a CLASS if it is quantified or has no determiner; • it is asserted as a SET whenever its cardinality is determined by a digit. • LOCs are kept for the main locations, both spatial and temporal. • On second occurrence of the same nominal head: • the semantic index is recovered from the history list; • if it is definite or indefinite, with a predicative role and no attributes or modifiers, nothing is done; • if its number differs (singular) and the entity present in the DM is a set or a class, nothing happens; • if it has attributes and modifiers which differ and the entity present in the DM has none, nothing happens; • if it is a quantified expression with no cardinality and the entity present in the DM is a set or a class, again nothing happens; • otherwise a new entity is asserted in the DM. A sketch of these rules follows.
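A hedged encoding of these first/second-occurrence rules (the Mention record and its flags are assumptions; the real module operates over full f-structures, and the modifier-comparison rule is omitted here):

```python
# First/second-occurrence decision rules from this slide, over a toy
# Mention record. Flags and names are illustrative assumptions.

from dataclasses import dataclass

@dataclass
class Mention:
    head: str
    definite: bool = False
    indefinite: bool = False
    quantified: bool = False
    digit_cardinality: bool = False
    no_determiner: bool = False
    predicative: bool = False
    singular: bool = True
    modifiers: tuple = ()

history = {}  # nominal head -> kind asserted in the DM

def first_occurrence(m):
    if m.digit_cardinality:                # cardinality given by a digit
        kind = "SET"
    elif m.quantified or m.no_determiner:  # quantified or bare nominal
        kind = "CLASS"
    else:                                  # definite or indefinite expression
        kind = "IND"
    history[m.head] = kind
    return kind

def second_occurrence(m):
    old = history[m.head]                  # index recovered from the history list
    if (m.definite or m.indefinite) and m.predicative and not m.modifiers:
        return old                         # nothing is done
    if m.singular and old in ("SET", "CLASS"):
        return old                         # number mismatch: keep the DM entity
    if m.quantified and not m.digit_cardinality and old in ("SET", "CLASS"):
        return old                         # again nothing happens
    return first_occurrence(m)             # otherwise assert a new entity

print(first_occurrence(Mention("kids", digit_cardinality=True)))  # SET
print(second_occurrence(Mention("kids", quantified=True)))        # SET (kept)
```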
GETARUNS as a QA System • Uses Bag of Words to search through Google. • Builds the Discourse Model for the first five snippets. • Looks for the answer using the Discourse Model. • Retrieves the snippet containing the right answer. A toy sketch of these four steps follows.
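A toy rendering of the four steps; search_snippets is a hypothetical stand-in for the Google query (no real API call is made), and the DM is reduced to a bag of words:

```python
# The four QA steps above, at toy scale. `search_snippets` returns
# hypothetical canned data; a real run would query a search engine.

def search_snippets(query, n=5):
    # Hypothetical stand-in returning the top-n snippets for the query.
    return ["George W. Bush was the president of the US in 2004.",
            "The US held a presidential election in 2004."][:n]

def build_discourse_model(snippet):
    # Toy DM: just the set of words in the snippet.
    return set(snippet.lower().strip(".").split())

def answer(question):
    kws = set(question.lower().strip("?").split())
    best, best_score = None, -1
    for snippet in search_snippets(" ".join(kws)):  # 1. BOW search
        dm = build_discourse_model(snippet)         # 2. build the DM
        score = len(dm & kws)                       # 3. look for the answer in the DM
        if score > best_score:
            best, best_score = snippet, score
    return best                                     # 4. snippet with the best match

print(answer("Who was the president of the US in 2004?"))
```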
Strengths • According to the authors, no other system performs text summarization in this way. • 74% F-measure for anaphora resolution. • Very effective in retrieving the "gist" of a text. • Can answer natural-language questions. • Introduces important algorithms to the NLP community.
Weaknesses • Very slow when dealing with large texts. • When summarizing, it only manages to maintain 73% of the "important" text. • No actual data to test on; if data is lost, can we really use such a system? • Achieves 63% accuracy on question answering. • Cannot answer WHO questions.
Future Work • Consider more than two sentences in advance of the one currently being processed. • Find a way to deal with all types of questions (this work is currently in progress; nothing has been published yet). • Try to increase accuracy, especially in the summarization aspect of the system. • Consider categories of questions to further "pin down" the answer.