CSA2050 Introduction to Computational Linguistics. Parsing I.
Why Is Syntax Important?
• The presidential candidate who was extremely popular smiled broadly.
• How many presidential candidates are implied?
• 1 or >1?
CSA2050 - Parsing I
Why Is Syntax Important?
• The presidential candidate, who was extremely popular, smiled broadly.
• How many presidential candidates are implied?
• 1 or >1?
Why Is Syntax Important?
• The presidential candidate, who was extremely popular, smiled broadly.
• The presidential candidate who was extremely popular smiled broadly.
…because the syntactic structure has an important bearing on the meaning.
PP Attachment
• The policemen saw a burglar with a gun
• The policemen saw a burglar with a telescope
• PP can modify V or N
• In the first case, it modifies N (the burglar has the gun)
• In the second, it modifies V (the seeing is done with the telescope)
PP modifies V
(S (NP (D The) (N policemen))
   (VP (V saw)
       (NP (D the) (N burglar))
       (PP (P with) (NP (D a) (N telescope)))))
PP modifies N
(S (NP (D The) (N policemen))
   (VP (V saw)
       (NP (D a) (N burglar)
           (PP (P with) (NP (D a) (N gun))))))
Issue
• In general, how can we determine whether a prepositional phrase modifies the preceding noun or the verb?
• A knowledge-based approach must encode, for example,
  – burglars often have guns
  – people can see things with a telescope
  – plus a lot of other things
• Statistical approach
PP Attachment – Statistical Approach
• The Prepositional Phrase Attachment Corpus, included with NLTK as ppattach, makes it possible for us to study this question systematically.
• It is derived from the IBM-Lancaster Treebank of Computer Manuals and the Penn Treebank.
• It distils only the essential information about PP attachment.
Corpus Example Sentence
• Original
  – Four of the five surviving workers have asbestos-related diseases, including three with recently diagnosed cancer.
• including three with recently diagnosed cancer
  versus
• including three by adding two and one
Distilled Information in Corpus
• Original
  – Four of the five surviving workers have asbestos-related diseases, including three with recently diagnosed cancer.
• ppattach corpus
  i/d   head verb   head of obj   prep   head of pp’s np   N or V
  16    including   three         with   cancer            N
Further examples
• 47830 allow visits between families N
• 47830 allow visits on peninsula V
• 42457 acquired interest in firm N
• 42457 acquired interest in 1986 V
Etc.
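The distilled records shown on the slides can be read with a few lines of Python. This is a minimal sketch, not NLTK's own corpus reader: `PPRecord` and `parse_record` are hypothetical names, and the field order (sentence id, head verb, head of object, preposition, head of the PP's NP, attachment) is assumed from the slide layout.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class PPRecord:
    """One distilled PP-attachment record (hypothetical helper class)."""
    sent_id: str
    verb: str
    noun1: str       # head of the object NP
    prep: str
    noun2: str       # head of the NP inside the PP
    attachment: str  # "N" or "V"


def parse_record(line):
    """Split one whitespace-separated record into its six fields."""
    fields = line.split()
    if len(fields) != 6 or fields[5] not in {"N", "V"}:
        raise ValueError(f"not a PP-attachment record: {line!r}")
    return PPRecord(*fields)


# Records copied from the slides.
LINES = [
    "16 including three with cancer N",
    "47830 allow visits between families N",
    "47830 allow visits on peninsula V",
    "42457 acquired interest in firm N",
    "42457 acquired interest in 1986 V",
]

records = [parse_record(line) for line in LINES]

# Tally attachments per preposition, the raw material for a statistical approach.
by_prep = Counter((r.prep, r.attachment) for r in records)

print(records[0])
print(by_prep[("in", "N")], by_prep[("in", "V")])
```

With only five records the tally says nothing statistically; on the full corpus the same counts could serve as a crude per-preposition baseline.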
Minimal Pair Extraction
• NLTK contains primitives that allow us to extract minimal pairs, where we hold NP1, PREP and NP2 constant and get different attachments depending on the verb, e.g.
  received (NP offer) (PP from group)    V
  rejected (NP offer (PP from group))    N
• receive x from y: the from-phrase belongs with the verb
• reject x: the verb takes no from-phrase, so the PP attaches to the noun
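The grouping idea behind minimal pair extraction can be sketched without NLTK. The code below is an illustration under stated assumptions: `minimal_pairs` is a hypothetical helper, and the input is a hand-typed list of (verb, noun1, prep, noun2, attachment) tuples taken from the slides rather than the real corpus.

```python
from collections import defaultdict

# (verb, noun1, prep, noun2, attachment) tuples copied from the slides.
TUPLES = [
    ("received", "offer", "from", "group", "V"),
    ("rejected", "offer", "from", "group", "N"),
    ("allow", "visits", "between", "families", "N"),
    ("acquired", "interest", "in", "firm", "N"),
]


def minimal_pairs(tuples):
    """Return {(noun1, prep, noun2): {attachment: [verbs]}} for every
    noun1/prep/noun2 triple that occurs with BOTH attachments."""
    groups = defaultdict(lambda: defaultdict(set))
    for verb, noun1, prep, noun2, attachment in tuples:
        groups[(noun1, prep, noun2)][attachment].add(verb)
    return {
        key: {attachment: sorted(verbs) for attachment, verbs in by_attachment.items()}
        for key, by_attachment in groups.items()
        if {"N", "V"} <= set(by_attachment)
    }


pairs = minimal_pairs(TUPLES)
for (noun1, prep, noun2), verbs in pairs.items():
    print(noun1, prep, noun2, verbs)
```

Only the "offer from group" triple survives the filter, because it is the only one seen with both a V and an N attachment.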
Why Syntactic Structure?
• Helps to make explicit how a sentence says who did what to whom
  The fierce dog bit the man
• Key idea is to identify noun phrases around the verb
  <noun group> <verb> <noun group>
• We can do this in terms of sequences of POS tags, e.g. D JJ* N
• But there are limitations to this approach
  The child with a fierce dog bit the man
• Here the child is doing the biting, but D JJ* N still matches immediately before “bit”, so the pattern wrongly makes the fierce dog the one doing the biting.
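The limitation can be reproduced with an ordinary regular expression over the tag sequence. This is a sketch only: the tags are assigned by hand using the slide's D/JJ/N labels (plus P and V, which are assumed here), and `noun_group_before_verb` is a hypothetical helper.

```python
import re

# D JJ* N over a space-separated tag string, anchored at the end of the
# material that precedes the verb.
NOUN_GROUP_BEFORE_VERB = re.compile(r"\bD(?: JJ)* N$")


def noun_group_before_verb(tagged):
    """Return the words matched by D JJ* N directly before the first verb."""
    words = [word for word, _ in tagged]
    tags = [tag for _, tag in tagged]
    verb_at = tags.index("V")
    prefix = " ".join(tags[:verb_at])
    match = NOUN_GROUP_BEFORE_VERB.search(prefix)
    if match is None:
        return None
    # Each space before the match start corresponds to one skipped token.
    first = prefix[:match.start()].count(" ")
    return " ".join(words[first:verb_at])


# Hand-tagged examples from the slide.
FIERCE_DOG = [("The", "D"), ("fierce", "JJ"), ("dog", "N"),
              ("bit", "V"), ("the", "D"), ("man", "N")]
CHILD = [("The", "D"), ("child", "N"), ("with", "P"), ("a", "D"),
         ("fierce", "JJ"), ("dog", "N"), ("bit", "V"), ("the", "D"), ("man", "N")]

print(noun_group_before_verb(FIERCE_DOG))  # correct subject
print(noun_group_before_verb(CHILD))       # "a fierce dog": the wrong biter
```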
Constituency
• We could repair this with a more complex regular expression such as DT JJ* NN (IN DT JJ* NN)*
• But this is defeated by
  The seagull that attacked the child with the fierce dog bit the man
• The basic problem is that we need a richer notion of constituency: how the words fit together to form a noun phrase.
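The repaired pattern and its failure can be checked the same way. The sketch below assumes hand-assigned Penn-style tags (WDT for "that", VBD for the past-tense verbs, IN for "with"); `subject_guesses` is a hypothetical helper that reports every noun group the pattern finds directly before a VBD.

```python
import re

# DT JJ* NN (IN DT JJ* NN)* immediately followed by a past-tense verb.
NP_THEN_VERB = re.compile(r"\b(DT(?: JJ)* NN(?: IN DT(?: JJ)* NN)*) VBD\b")


def subject_guesses(tagged):
    """Return (noun group, verb) pairs proposed by the tag pattern."""
    words = [word for word, _ in tagged]
    tag_string = " ".join(tag for _, tag in tagged)
    guesses = []
    for match in NP_THEN_VERB.finditer(tag_string):
        first = tag_string[:match.start(1)].count(" ")  # index of first token
        length = match.group(1).count(" ") + 1          # tokens in the group
        group = " ".join(words[first:first + length])
        guesses.append((group, words[first + length]))
    return guesses


CHILD = [("The", "DT"), ("child", "NN"), ("with", "IN"), ("a", "DT"),
         ("fierce", "JJ"), ("dog", "NN"), ("bit", "VBD"), ("the", "DT"), ("man", "NN")]
SEAGULL = [("The", "DT"), ("seagull", "NN"), ("that", "WDT"), ("attacked", "VBD"),
           ("the", "DT"), ("child", "NN"), ("with", "IN"), ("the", "DT"),
           ("fierce", "JJ"), ("dog", "NN"), ("bit", "VBD"), ("the", "DT"), ("man", "NN")]

print(subject_guesses(CHILD))    # the whole subject is found
print(subject_guesses(SEAGULL))  # the child is proposed as the biter, not the seagull
```

The pattern has no way to know that "the child with the fierce dog" sits inside a relative clause, which is the constituency information the slide says is missing.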
Recursion – Central Embedding
• The dog barked
• The dog the cat scratched barked
• The dog the cat the horse liked scratched barked.
• The dog the cat the horse the man rode liked scratched barked.
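The four sentences follow one recursive recipe: each noun is paired with a verb, and a further noun-verb pair may be nested between them. A small sketch makes the nesting explicit; `embed` and `sentence` are hypothetical names, and the word lists are taken from the slide.

```python
NOUNS = ["dog", "cat", "horse", "man"]
VERBS = ["barked", "scratched", "liked", "rode"]


def embed(depth, level=0):
    """Words of 'the NOUN [embedded clause] VERB' for the given level.

    The noun at each level is paired with the verb at the same level,
    and the next level is nested between the two.
    """
    inner = embed(depth, level + 1) if level < depth else []
    return ["the", NOUNS[level]] + inner + [VERBS[level]]


def sentence(depth):
    """Return the sentence with `depth` embedded clauses."""
    if not 0 <= depth < len(NOUNS):
        raise ValueError("depth must be between 0 and 3 for this word list")
    return " ".join(embed(depth)).capitalize() + "."


for depth in range(4):
    print(sentence(depth))
```

Because every noun must be matched by a verb in mirror order, the sentences have the shape of nested brackets, which a flat tag pattern cannot track to arbitrary depth.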
Chomsky Hierarchy
CFG Review
A CFG is a 4-tuple (N, Σ, P, S), where:
• N is a set of non-terminal symbols (the category labels);
• Σ is a set of terminal symbols (e.g., lexical items);
• P is a set of productions of the form A → α, where
  – A is a non-terminal, and
  – α is a string of symbols from (N ∪ Σ)* (i.e., a string of terminals and/or non-terminals);
• S is the start symbol, a member of N.
• A derivation of a string from a non-terminal A in N is the result or trace of successively applying individual productions in P, starting from A.
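The 4-tuple maps directly onto a small data structure. The following is a sketch under stated assumptions: the `CFG` class, `validate` and `rewrite` are hypothetical names, and the toy grammar is assembled from the symbols used in the derivation slides, with NP as start symbol.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CFG:
    """A context-free grammar as the 4-tuple (N, Sigma, P, S)."""
    nonterminals: frozenset  # N
    terminals: frozenset     # Sigma
    productions: tuple       # P: pairs (A, alpha), alpha a tuple of symbols
    start: str               # S

    def validate(self):
        """Check the side conditions of the definition."""
        if self.start not in self.nonterminals:
            raise ValueError("start symbol must be a non-terminal")
        if self.nonterminals & self.terminals:
            raise ValueError("N and Sigma must be disjoint")
        symbols = self.nonterminals | self.terminals
        for lhs, rhs in self.productions:
            if lhs not in self.nonterminals:
                raise ValueError(f"left-hand side {lhs!r} is not a non-terminal")
            if any(symbol not in symbols for symbol in rhs):
                raise ValueError(f"unknown symbol in {rhs!r}")


def rewrite(grammar, form, position, production):
    """Apply one production to the symbol at `position` of a sentential form."""
    lhs, rhs = production
    if production not in grammar.productions or form[position] != lhs:
        raise ValueError("production does not apply here")
    return form[:position] + rhs + form[position + 1:]


# Toy grammar (an assumption, built from the derivation example).
TOY = CFG(
    nonterminals=frozenset({"NP", "PP", "Det", "N", "P"}),
    terminals=frozenset({"the", "a", "dog", "telescope", "with"}),
    productions=(
        ("NP", ("Det", "N", "PP")), ("NP", ("Det", "N")), ("PP", ("P", "NP")),
        ("Det", ("the",)), ("Det", ("a",)),
        ("N", ("dog",)), ("N", ("telescope",)), ("P", ("with",)),
    ),
    start="NP",
)
TOY.validate()

step1 = rewrite(TOY, (TOY.start,), 0, ("NP", ("Det", "N", "PP")))
step2 = rewrite(TOY, step1, 0, ("Det", ("the",)))
print(step1, step2)
```

A derivation is then just the sequence of forms produced by repeated calls to `rewrite`.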
Different Derivations for the Same Sentence
Derivation 1
  NP
  Det N PP
  the N PP
  the dog PP
  the dog P NP
  the dog with NP
  the dog with Det N
  the dog with a N
  the dog with a telescope
Derivation 2
  NP
  Det N PP
  Det N P NP
  Det N with NP
  the N with NP
  the N with a N
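Different derivations of the same string can correspond to a single tree: they differ only in the order in which non-terminals are expanded. The sketch below reads off a leftmost and a rightmost derivation from one tree. The tree encoding and the `derivation` function are assumptions for illustration, and the rightmost order is simply one alternative, not the slide's Derivation 2.

```python
# A tree node is (label, [children]); a leaf is a plain string.
TREE = ("NP", [
    ("Det", ["the"]),
    ("N", ["dog"]),
    ("PP", [
        ("P", ["with"]),
        ("NP", [("Det", ["a"]), ("N", ["telescope"])]),
    ]),
])


def label(node):
    """Return the symbol a node contributes to a sentential form."""
    return node if isinstance(node, str) else node[0]


def derivation(tree, leftmost=True):
    """List the sentential forms obtained by always expanding the
    leftmost (or rightmost) non-terminal of the current form."""
    frontier = [tree]
    forms = [" ".join(label(node) for node in frontier)]
    while True:
        open_positions = [i for i, node in enumerate(frontier)
                          if not isinstance(node, str)]
        if not open_positions:
            return forms
        i = open_positions[0] if leftmost else open_positions[-1]
        frontier[i:i + 1] = frontier[i][1]  # replace the node by its children
        forms.append(" ".join(label(node) for node in frontier))


left = derivation(TREE, leftmost=True)
right = derivation(TREE, leftmost=False)
print("\n".join(left))
print("\n".join(right))
```

The leftmost run reproduces Derivation 1 step by step, and both runs end in the same string after the same number of rule applications.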
What Does Context Free Mean?
• LHS of rule is just one symbol.
• Can have
  NP -> Det N
• Cannot have
  X NP Y -> X Det N Y
Grammar Symbols
• Symbols of the grammar fall into three categories:
  1. Non-terminal symbols
  2. Terminal symbols
  3. Parts of speech
• We will sometimes not distinguish between 2 and 3
Technical Aspects of CFGs
• Rules of the form LHS -> RHS
  – LHS comprises exactly one NT symbol
  – RHS is any combination of NT and T symbols
• Finite State (type 3) grammars have different restrictions
  – LHS comprises exactly one NT symbol
  – RHS is a combination of T symbols with at most one NT
  – Right linear grammar: NT must come at extreme right
  – Left linear grammar: NT must come at extreme left
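These restrictions are mechanical enough to check rule by rule. The sketch below is illustrative only: `rule_kinds` is a hypothetical helper, rules are given as tuples of symbols, and a rule whose RHS has no non-terminal at all is counted as both right and left linear.

```python
def rule_kinds(lhs, rhs, nonterminals):
    """Return the set of rule formats that lhs -> rhs satisfies.

    lhs and rhs are tuples of symbols; `nonterminals` is the set N.
    """
    kinds = set()
    if len(lhs) == 1 and lhs[0] in nonterminals:
        kinds.add("context-free")
        nt_positions = [i for i, symbol in enumerate(rhs) if symbol in nonterminals]
        # At most one NT, and if present it sits at the extreme right / left.
        if not nt_positions or nt_positions == [len(rhs) - 1]:
            kinds.add("right-linear")
        if not nt_positions or nt_positions == [0]:
            kinds.add("left-linear")
    return kinds


N = {"S", "NP", "VP", "Det", "N", "A", "B"}

print(rule_kinds(("NP",), ("Det", "N"), N))                 # context-free only
print(rule_kinds(("X", "NP", "Y"), ("X", "Det", "N", "Y"), N))  # none: LHS too long
print(rule_kinds(("A",), ("a", "B"), N))                    # also right-linear
print(rule_kinds(("A",), ("B", "a"), N))                    # also left-linear
```

A grammar is finite state in this sense when all of its rules are right-linear, or all of them are left-linear; mixing the two directions is not enough.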
A Simple Grammar + Lexicon
grammar:
  S → NP VP
  NP → N
  VP → V NP
lexicon:
  V → kicks
  N → John
  N → Bill
tree for "John kicks Bill":
  (S (NP (N John)) (VP (V kicks) (NP (N Bill))))
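Because this grammar has no recursion, the full set of sentences it generates can be listed exhaustively. A minimal sketch, with the dictionary encoding and the `expand` function as assumptions:

```python
from itertools import product

# Grammar and lexicon from the slide, encoded as dictionaries.
GRAMMAR = {"S": [["NP", "VP"]], "NP": [["N"]], "VP": [["V", "NP"]]}
LEXICON = {"V": ["kicks"], "N": ["John", "Bill"]}


def expand(symbol):
    """Yield every word sequence derivable from `symbol`.

    Terminates only because this grammar is non-recursive.
    """
    if symbol in LEXICON:
        for word in LEXICON[symbol]:
            yield [word]
        return
    for rhs in GRAMMAR[symbol]:
        # Combine every expansion of each right-hand-side symbol.
        for parts in product(*(list(expand(child)) for child in rhs)):
            yield [word for part in parts for word in part]


sentences = sorted(" ".join(words) for words in expand("S"))
print(sentences)
```

The grammar licenses four sentences, including "John kicks John"; nothing in the rules rules that out.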
Grammar versus Parser
• A grammar/lexicon defines a relation between sentences generated by the grammar and their respective syntactic structures.
• The grammar does not tell us how to actually go about discovering the structure of a sentence.
• A parsing algorithm is an effective procedure for carrying out that discovery.
• A parser implements a parsing algorithm.
• Recursive descent parsing.
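To make the grammar/parser distinction concrete, here is a minimal recursive descent sketch for the simple "John kicks Bill" grammar. It is one possible top-down formulation, not necessarily the one used in the course: the function names are hypothetical, trees are nested tuples, and the approach would loop forever on a left-recursive rule such as NP → NP PP.

```python
GRAMMAR = {"S": [["NP", "VP"]], "NP": [["N"]], "VP": [["V", "NP"]]}
LEXICON = {"kicks": "V", "John": "N", "Bill": "N"}


def parse(symbol, words, i):
    """Yield (tree, next_i) for every way `symbol` can cover words[i:next_i]."""
    if symbol not in GRAMMAR:  # a part of speech: consult the lexicon
        if i < len(words) and LEXICON.get(words[i]) == symbol:
            yield (symbol, words[i]), i + 1
        return
    for rhs in GRAMMAR[symbol]:  # try each rule for this non-terminal
        yield from parse_sequence(rhs, words, i, (symbol,))


def parse_sequence(symbols, words, i, partial):
    """Parse the symbols of one right-hand side from left to right."""
    if not symbols:
        yield partial, i
        return
    for child, j in parse(symbols[0], words, i):
        yield from parse_sequence(symbols[1:], words, j, partial + (child,))


def parse_sentence(text):
    """Return all S trees that cover the whole input."""
    words = text.split()
    return [tree for tree, end in parse("S", words, 0) if end == len(words)]


print(parse_sentence("John kicks Bill"))
print(parse_sentence("John kicks"))  # no parse: the object NP is missing
```

The grammar and lexicon are plain data; all procedural knowledge about how to find the structure lives in the three functions.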