CSA350: NLP Algorithms
Sentence Parsing I
• The Parsing Problem
• Parsing as Search
• Top Down/Bottom Up Parsing Strategies
References
• This lecture is largely based on material found in Jurafsky & Martin, chapter 13.
Handling Sentences
• Sentence boundary detection.
• Finite state techniques are fine for certain kinds of analysis:
  • named entity recognition
  • NP chunking
• But FS techniques are of limited use when trying to compute grammatical relationships between parts of sentences.
• We need these to get at meanings.
Grammatical Relationships: e.g. subject
Wikipedia definition: The subject has the grammatical function in a sentence of relating its constituent (a noun phrase) by means of the verb to any other elements present in the sentence, i.e. objects, complements and adverbials.
Grammatical Relationships: e.g. subject
• The dictionary helps me find words.
• Ice cream appeared on the table.
• The man that is sitting over there told me that he just bought a ticket to Tahiti.
• Nothing else is good enough.
• That nothing else is good enough shouldn't come as a surprise.
• To eat six different kinds of vegetables a day is healthy.
Why not use FS techniques for describing NL sentences?
• Descriptive adequacy
  • Some NL phenomena cannot be described within a FS framework.
  • Example: central embedding.
• Notational efficiency
  • The notation does not facilitate 'factoring out' the similarities.
  • To describe sentences of the form subject-verb-object using a FSA, we must describe possible subjects and objects separately, even though almost all phrases that can appear as one can equally appear as the other.
Central Embedding
• The following sentences (the indices mark which noun pairs with which verb):
  • The cat spat (1 1)
  • The cat the boy saw spat (1 2 2 1)
  • The cat the boy the girl liked saw spat (1 2 3 3 2 1)
• Require at least a grammar of the form S → aⁿ bⁿ (n nouns followed by n matching verbs), which is beyond the power of any finite-state machine.
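A minimal DCG sketch of the aⁿbⁿ pattern (not from the original slides; the terminals a and b simply stand for an embedded subject NP and its matching verb):

% aⁿbⁿ: every a that is opened must be closed by exactly one b, innermost first
s --> [a], [b].
s --> [a], s, [b].

% ?- s([a,a,a,b,b,b], []).   % succeeds
% ?- s([a,a,b], []).         % fails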
DCG-style Grammar/Lexicon

% GRAMMAR
s --> np, vp.
s --> aux, np, vp.
s --> vp.
np --> det, nom.
np --> pn.
nom --> noun.
nom --> noun, nom.
nom --> nom, pp.
pp --> prep, np.
vp --> v.
vp --> v, np.

% LEXICON
det --> [that];[this];[a].
noun --> [book];[flight];[meal];[money].
v --> [book];[include];[prefer].
aux --> [does].
prep --> [from];[to];[on].
pn --> ['Houston'];['TWA'].
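Sample queries against this DCG, assuming it has been consulted in a standard Prolog (a category spans the whole input when the remainder argument is []):

% ?- s([book,that,flight], []).                    % succeeds: s --> vp
% ?- phrase(s, [does,this,flight,include,a,meal]). % succeeds: s --> aux, np, vp

Note that the left-recursive rule nom --> nom, pp will send a simple top-down interpreter into a loop if it backtracks into that clause in search of further solutions.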
Definite Clause Grammars
• Prolog based.
• LHS --> RHS1, RHS2, ..., {code}.
• s(s(NP,VP)) --> np(NP), vp(VP), {mk_subj(NP)}.
• Rules are translated into an executable Prolog program.
• No clear distinction between rules for grammar and lexicon.
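For concreteness, the difference-list expansion that the Prolog system performs on a plain DCG rule looks roughly like this (a sketch of the standard translation, not part of the original slides):

% s --> np, vp.   is expanded to approximately:
s(S0, S) :-
    np(S0, S1),   % np consumes a prefix of S0, leaving S1
    vp(S1, S).    % vp consumes a prefix of S1, leaving S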
Parsing Problem
• Given grammar G and sentence A, discover all valid parse trees for G that exactly cover A.
• Example parse tree for "book that flight":
  [S [VP [V book] [NP [Det that] [Nom [N flight]]]]]
Ambiguity: "I shot an elephant in my trousers"
• Reading 1: the elephant is in the trousers (the PP attaches inside the object NP):
  [S [NP I] [VP [V shot] [NP [NP an elephant] [PP in my trousers]]]]
• Reading 2: I was wearing the trousers (the PP attaches to the VP):
  [S [NP I] [VP [V shot] [NP an elephant] [PP in my trousers]]]
Parsing as Search
• Search within a space defined by:
  • Start state
  • Goal state
  • State to state transformations
• Two distinct parsing strategies:
  • Top down
  • Bottom up
• Different parsing strategy, different state space, different problem.
• N.B. Parsing strategy ≠ search strategy.
Top Down
• Each state comprises:
  • a tree
  • an open node
  • an input pointer
• Together these encode the current state of the parse.
• A top down parser tries to build from the root node S down to the leaves by replacing nodes with non-terminal labels with the RHS of corresponding grammar rules.
• Nodes with pre-terminal (word class) labels are compared to input words.
Top Down Search Space
[Figure: the top-down search space, expanding from the start node down to the goal node.]
Bottom Up
• Each state is a forest of trees.
• The start node is a forest of nodes labelled with pre-terminal categories (word classes derived from the lexicon).
• Transformations look for places where the RHS of a rule can fit.
• Any such place is replaced with a node labelled with the LHS of the rule.
Bottom Up Search Space
[Figure: the bottom-up search space, including a failed bottom-up derivation.]
Top Down vs Bottom Up Search Spaces
• Top down
  • For: the space excludes trees that cannot be derived from S.
  • Against: the space includes trees that are not consistent with the input.
• Bottom up
  • For: the space excludes states containing trees that cannot lead to input text segments.
  • Against: the space includes states containing subtrees that can never lead to an S node.
Top Down Parsing - Remarks
• Top-down parsers do well if there is useful grammar-driven control: search can be directed by the grammar.
  • Not too many different rules for the same category.
  • Not too much distance between non-terminal and terminal categories.
• Top-down is unsuitable for rewriting parts of speech (pre-terminals) with words (terminals). In practice that is always done bottom-up, as lexical lookup.
Bottom Up Parsing - Remarks
• It is data-directed: it attempts to parse the words that are there.
• Does well, e.g., for lexical lookup.
• Does badly if there are many rules with similar RHS categories.
• Inefficient when there is great lexical ambiguity (grammar-driven control might help here).
• Empty categories: termination problem unless rewriting of empty constituents is somehow restricted (but then it's generally incomplete).
Basic Parsing Algorithms
• Top Down
• Bottom Up
• See Jurafsky & Martin, Ch. 10.
Recoding the Grammar/Lexicon

% Grammar
rule(s,[np,vp]).
rule(np,[d,n]).
rule(vp,[v]).
rule(vp,[v,np]).

% Lexicon
word(d,the).
word(n,dog).
word(n,cat).
word(n,dogs).
word(n,cats).
word(v,chase).
word(v,chases).
Top Down Depth First Recognition in Prolog

% parse(Category, S0, S): a phrase of Category spans the difference list S0-S
parse(C,[Word|S],S) :-
    word(C,Word).            % e.g. word(n,cat).
parse(C,S1,S) :-
    rule(C,Cs),              % e.g. rule(s,[np,vp]).
    parse_list(Cs,S1,S).

parse_list([],S,S).
parse_list([C|Cs],S1,S) :-
    parse(C,S1,S2),
    parse_list(Cs,S2,S).
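Sample queries, assuming the rule/2 and word/2 facts from the previous slide are loaded (the whole input is recognised as a category when the remainder is []):

% ?- parse(s, [the,dog,chases,the,cat], []).
%    succeeds: s -> np vp, np -> d n, vp -> v np
% ?- parse(np, [the,cats], Rest).
%    Rest = []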
Bottom Up Shift/Reduce Algorithm
• Two data structures:
  • input string
  • stack
• Repeat until input is exhausted:
  • Shift word to stack.
  • Reduce stack using grammar and lexicon until no further reductions are possible.
• Unlike top down, the algorithm does not require the category to be specified in advance. It simply finds all possible trees.
Shift/Reduce Operation

Step  Action   Stack       Input
0     (start)              the dog barked
1     shift    the         dog barked
2     reduce   d           dog barked
3     shift    dog d       barked
4     reduce   n d         barked
5     reduce   np          barked
6     shift    barked np
7     reduce   v np
8     reduce   vp np
9     reduce   s
Shift/Reduce Implementation

parse(S,Res) :-
    sr(S,[],Res).

% sr(Sent, Stack, Result)
sr(S,Stk,Res) :-
    shift(Stk,S,NewStk,S1),
    reduce(NewStk,RedStk),
    sr(S1,RedStk,Res).
sr([],Res,Res).

% shift(Stack, Sent, NewStack, NewSent)
shift(X,[H|Y],[H|X],Y).

reduce(Stk,RedStk) :-
    brule(Stk,Stk2),
    reduce(Stk2,RedStk).
reduce(Stk,Stk).

% grammar (rules stored backwards, stack-first)
brule([vp,np|X],[s|X]).
brule([n,d|X],[np|X]).
brule([np,v|X],[vp|X]).
brule([v|X],[vp|X]).

% interface to lexicon
brule([Word|X],[C|X]) :- word(C,Word).
Shift/Reduce Operation
• Words are shifted to the beginning of the stack, which ends up in reverse order.
• The reduce step is simplified if we also store the rules backward, so that the rule s → np vp is stored as the fact brule([vp,np|X],[s|X]).
• The term [a,b|X] matches any list whose first and second elements are a and b respectively.
• The first argument directly matches the stack to which this rule applies.
• The second argument is what the stack becomes after reduction.
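A sample session, assuming the word/2 facts and the shift/reduce code above are loaded; the answers shown as comments are my reading of how this code behaves (not output quoted from the lecture), and they illustrate the conflict problem discussed on the next slide, since the greedy reduce step fires vp -> v before the object NP has been shifted:

% ?- parse([the,dog,chases], Res).
%    first answer: Res = [s]
% ?- parse([the,dog,chases,the,cat], Res).
%    first answer: Res = [np,s]   (not a complete parse: chases was reduced
%    to vp, and vp np to s, before "the cat" was shifted)
% ?- parse([the,dog,chases,the,cat], [s]).
%    still succeeds, but only after backtracking into reduce/2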
Shift/Reduce Parser
• Standard implementations do not perform backtracking (e.g. NLTK).
• Only one result is returned even when the sentence is ambiguous.
• May fail to find a parse even when the sentence is grammatical, because of:
  • Shift/Reduce conflicts
  • Reduce/Reduce conflicts
Handling Conflicts
• Shift-reduce parsers may employ policies for resolving such conflicts, e.g.
  • For Shift/Reduce conflicts:
    • Prefer shift
    • Prefer reduce
  • For Reduce/Reduce conflicts:
    • Choose the reduction which removes most elements from the stack