Statistical methods in NLP Diana Trandabat 2013-2014
The sentence as a string of words. E.g. "I saw the lady with the binoculars" = string a b c d e b f
The relations of parts of a string to each other may be different: "I saw the lady with the binoculars" is structurally ambiguous. Who has the binoculars?
[I] saw the lady [with the binoculars] = [a] b c d [e b f]
I saw [the lady with the binoculars] = a b [c d e b f]
How can we represent the difference? By assigning them different structures. We can represent structures with 'trees': I read the book
a. I saw the lady with the binoculars
Tree: [S [NP I] [VP [V saw] [NP [NP the lady] [PP with the binoculars]]]]
= I saw [the lady with the binoculars]
b. I saw the lady with the binoculars
Tree: [S [NP I] [VP [VP [V saw] [NP the lady]] [PP with the binoculars]]]
= I [saw the lady] with the binoculars
birds fly
Tree: [S [NP [N birds]] [VP [V fly]]]
Syntactic rules: S → NP VP, NP → N, VP → V
With birds = a and fly = b, the tree [S [NP birds] [VP fly]] yields ab = string.
Equivalently, with abstract symbols: [S [A a] [B b]] yields ab, via S → A B, A → a, B → b.
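The rewriting idea behind these rules can be sketched in a few lines of Python, illustrating the derivation S ⇒ A B ⇒ a B ⇒ a b (the helper names are mine, not from the slides):

```python
# Toy rewriting: derive a string from the start symbol S using the
# rules on the slide (S -> A B, A -> a, B -> b).
RULES = {"S": ["A", "B"], "A": ["a"], "B": ["b"]}

def derive(symbols):
    """Expand every non-terminal, left to right, until only terminals remain."""
    out = []
    for sym in symbols:
        if sym in RULES:
            out.extend(derive(RULES[sym]))  # rewrite the non-terminal
        else:
            out.append(sym)                 # terminals are kept as-is
    return out

print("".join(derive(["S"])))  # ab
```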
Rules. Assumption: natural language grammars are rule-based systems. What kind of grammars describe natural language phenomena? What are the formal properties of grammatical rules?
Chomsky, N. (1957). Syntactic Structures. The Hague: Mouton.
Chomsky, N. and G.A. Miller (1958). Finite-state languages. Information and Control 1, 99-112.
Chomsky, N. (1959). On certain formal properties of grammars. Information and Control 2, 137-167.
Rules in Linguistics
1. PHONOLOGY
/s/ → [θ] / V ___ V
Rewrite /s/ as [θ] when /s/ occurs in the context V ___ V.
With: V = auxiliary nodes, [θ] = terminal nodes.
Rules in Linguistics
2. SYNTAX
S → NP VP
VP → V
NP → N
Rewrite S as NP VP in any context.
With: S, NP, VP = auxiliary nodes; V, N = terminal nodes.
SYNTAX (phrase/sentence formation)
Sentence: The boy kissed the girl
subject = noun phrase (art + noun); predicate = verb phrase (verb + noun phrase)
S → NP VP
VP → V NP
NP → ART N
Chomsky Hierarchy
0. Type 0 (recursively enumerable) languages. Only restriction on rules: the left-hand side cannot be the empty string (*Ø → …).
1. Context-Sensitive languages: Context-Sensitive (CS) rules.
2. Context-Free languages: Context-Free (CF) rules.
3. Regular languages: regular (right- or left-linear) rules.
0 ⊇ 1 ⊇ 2 ⊇ 3, where a ⊇ b means a properly includes b (a is a superset of b), i.e. b is a proper subset of a.
Generative power: Type 0 (with only the restriction above on rules) is the most powerful system; Type 3 (regular languages) is the least powerful.
Superset/subset relation (Venn diagram, e.g. S1 = {a, b} inside S2 = {a, b, c, d, f, g}): S1 is a subset of S2; S2 is a superset of S1.
Rule Type – 3
Name: Regular. Example: Finite State Automata (Markov-process Grammar).
Rule type: a) right-linear: A → x B or A → x, with A, B = auxiliary nodes and x = terminal node; b) left-linear: A → B x or A → x.
Generates: a^m b^n with m, n ≥ 1. Cannot guarantee that there are as many a's as b's; no embedding.
A regular grammar for natural language sentences S →the A A → cat B A → mouse B A → duck B B → bites C B → sees C B → eats C C → the D D → boy D → girl D → monkey the cat bites the boy the mouse eats the monkey the duck sees the girl
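Since this regular grammar has no recursion, its language is finite and can be enumerated outright; a sketch (the variable names are mine, not from the slides):

```python
from itertools import product

# The regular grammar above only offers a choice of words at three
# points (A, B, D), so its whole finite language can be listed directly.
subjects = ["cat", "mouse", "duck"]    # A -> cat/mouse/duck B
verbs = ["bites", "sees", "eats"]      # B -> bites/sees/eats C
objects_ = ["boy", "girl", "monkey"]   # D -> boy/girl/monkey

sentences = [f"the {s} {v} the {o}"
             for s, v, o in product(subjects, verbs, objects_)]

print(len(sentences))   # 27
print(sentences[0])     # the cat bites the boy
```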
Regular grammars
Grammar 1: A → a, A → a B, B → b A
Grammar 2: A → a, A → B a, B → A b
Grammar 3: A → a, A → a B, B → b, B → b A
Grammar 4: A → a, A → B a, B → b, B → A b
Grammar 5: S → a A, S → b B, A → a S, B → b b S, S → ε
Grammar 6: A → A a, A → B a, B → b, B → A b, A → a
Grammars: non-regular
Grammar 6: S → A B, S → b B, A → a S, B → b b S, S → ε
Grammar 7: A → a, A → B a, B → b, B → b A
Finite-State Automaton (diagram): NP --article--> NP1; NP1 --adjective--> NP1 (loop); NP1 --noun--> NP2.
Equivalent rules:
NP → article NP1
NP1 → adjective NP1
NP1 → noun NP2
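These transitions can be run directly as a recogniser; a minimal sketch, assuming the input is a sequence of POS tags rather than words:

```python
# The NP rules above as a finite-state automaton: start state NP,
# accepting state NP2, transitions labelled with POS tags.
TRANSITIONS = {
    ("NP", "article"): "NP1",
    ("NP1", "adjective"): "NP1",   # the adjective loop
    ("NP1", "noun"): "NP2",
}

def accepts(tags):
    state = "NP"
    for tag in tags:
        state = TRANSITIONS.get((state, tag))
        if state is None:          # no transition: reject
            return False
    return state == "NP2"

print(accepts(["article", "noun"]))                            # True
print(accepts(["article", "adjective", "adjective", "noun"]))  # True
print(accepts(["noun"]))                                       # False
```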
A parse tree: S is the root node; NP, VP, N, V, DET are non-terminal nodes; the words they dominate are terminal nodes (the leaves).
Rule Type – 2
Name: Context Free. Example: Phrase Structure Grammars / Push-Down Automata.
Rule type: A → γ, with A = auxiliary node, γ = any number of terminal or auxiliary nodes.
Recursiveness (centre embedding) allowed: A ⇒ … A …
CF Grammar. A Context Free grammar consists of:
a) a finite terminal vocabulary VT
b) a finite auxiliary vocabulary VA
c) an axiom S ∈ VA
d) a finite number of context-free rules of the form A → γ, where A ∈ VA and γ ∈ {VA ∪ VT}*
In natural language syntax S is interpreted as the start symbol for sentence, as in S → NP VP.
Natural language. Is English regular or CF? If centre embedding is required, then it cannot be regular.
Centre Embedding:
1. [The cat] [likes tuna fish] = a b
2. The cat the dog chased likes tuna fish = a a b b
3. The cat the dog the rat bit chased likes tuna fish = a a a b b b
4. The cat the dog the rat the elephant admired bit chased likes tuna fish = a a a a b b b b
Pattern: ab, aabb, aaabbb, aaaabbbb = a^n b^n
With bracketing: [The cat] [likes tuna fish] = a b; [The cat] [the dog] [chased] [likes ...] = a a b b
Centre embedding
[S [NP the cat] [VP likes tuna]], with [the cat] = a and [likes tuna] = b: string = ab
[S [NP the cat [S [NP the dog] [VP chased]]] [VP likes tuna]]: a a b b = aabb
[S [NP the cat [S [NP the dog [S [NP the rat] [VP bit]]] [VP chased]]] [VP likes tuna]]: a a a b b b = aaabbb
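The growing a^n b^n pattern above can be checked with a single counter, which is exactly what a finite-state device lacks; a minimal sketch (not from the slides):

```python
# Recognising a^n b^n needs counting: one counter (the minimal form of a
# push-down store) suffices, but no fixed number of states does.
def is_anbn(s):
    depth = 0
    seen_b = False
    for ch in s:
        if ch == "a":
            if seen_b:
                return False    # an 'a' after a 'b' breaks the pattern
            depth += 1
        elif ch == "b":
            seen_b = True
            depth -= 1
            if depth < 0:       # more b's than a's so far
                return False
        else:
            return False
    return seen_b and depth == 0

for s in ["ab", "aabb", "aaabbb", "aabbb", "abab"]:
    print(s, is_anbn(s))   # True True True False False
```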
Natural language 2. More Centre Embedding:
1. If S1, then S2 (a … a)
2. Either S3, or S4 (b … b)
Sentence with embedding: If either the man is arriving today or the woman is arriving tomorrow, then the child is arriving the day after.
a = [if, b = [either the man is arriving today], b = [or the woman is arriving tomorrow], a = then the child is arriving the day after] = abba
CS languages. The following language cannot be generated by a CF grammar (by the pumping lemma): a^n b^m c^n d^m.
Swiss German: a string of dative nouns (e.g. aa), followed by a string of accusative nouns (e.g. bbb), followed by a string of dative-taking verbs (cc), followed by a string of accusative-taking verbs (ddd) = aabbbccddd = a^n b^m c^n d^m
Swiss German: Jan sait das (Jan says that) …
mer em Hans es Huus hälfed aastriiche
we Hans/DAT the house/ACC helped paint
'we helped Hans paint the house' = a b c d
More generally: NP_dat NP_dat NP_acc NP_acc V_dat V_dat V_acc V_acc = a a b b c c d d
Context Free Grammars (CFGs). Sets of rules expressing how symbols of the language fit together, e.g.
S -> NP VP
NP -> Det N
Det -> the
N -> dog
What Does Context Free Mean?
• LHS of rule is just one symbol.
• Can have: NP -> Det N
• Cannot have: X NP Y -> X Det N Y
Grammar Symbols • Non Terminal Symbols • Terminal Symbols • Words • Preterminals
Non Terminal Symbols
• Symbols which have definitions
• Symbols which appear on the LHS of rules:
S -> NP VP
NP -> Det N
Det -> the
N -> dog
Non Terminal Symbols
• The same Non Terminal can have several definitions:
S -> NP VP
NP -> Det N
NP -> N
Det -> the
N -> dog
Terminal Symbols
• Symbols which appear in the final string
• Correspond to words
• Are not defined by the grammar
S -> NP VP
NP -> Det N
Det -> the
N -> dog
Parts of Speech (POS)
• NT symbols which produce terminal symbols are sometimes called pre-terminals:
S -> NP VP
NP -> Det N
Det -> the
N -> dog
• Sometimes we are interested in the shape of sentences formed from pre-terminals: Det N V, Aux N V D N
CFG - formal definition
A CFG is a tuple (N, Σ, R, S)
• N is a set of non-terminal symbols
• Σ is a set of terminal symbols disjoint from N
• R is a set of rules, each of the form A → γ, where A is a non-terminal and γ ∈ (N ∪ Σ)*
• S is a designated start-symbol
CFG - Example
grammar: S → NP VP, NP → N, VP → V NP
lexicon: V → kicks, N → John, N → Bill
N = {S, NP, VP, N, V}
Σ = {kicks, John, Bill}
R = (see opposite)
S = "S"
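The example grammar is small enough to enumerate exhaustively; the following sketch (my own encoding and helper names, not from the slides) derives every sentence it generates:

```python
from itertools import product

# The example grammar and lexicon: each non-terminal maps to a list of
# alternative right-hand sides, each a list of symbols.
RULES = {
    "S":  [["NP", "VP"]],
    "NP": [["N"]],
    "VP": [["V", "NP"]],
    "N":  [["John"], ["Bill"]],
    "V":  [["kicks"]],
}

def expand(symbol):
    """Return every terminal string (as a word list) derivable from symbol."""
    if symbol not in RULES:                      # terminal symbol
        return [[symbol]]
    results = []
    for rhs in RULES[symbol]:
        # combine every expansion of every right-hand-side symbol
        for parts in product(*(expand(s) for s in rhs)):
            results.append([w for part in parts for w in part])
    return results

for words in expand("S"):
    print(" ".join(words))   # John kicks John, John kicks Bill, ...
```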
Exercise
• Write grammars that generate the following languages, for m > 0: (ab)^m, a^n b^m, a^n b^n
• Which of these are Regular?
• Which of these are Context Free?
(ab)^m for m > 0
S -> a b
S -> a b S

S -> a X
X -> b Y
Y -> a b
Y -> S
a^n b^m
S -> A B
A -> a
A -> a A
B -> b
B -> b B

S -> a AB
AB -> a AB
AB -> B
B -> b
B -> b B
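The first two exercise languages are regular and can be checked with plain regular expressions; a^n b^n cannot. A sketch (the CF grammar S -> a S b, S -> a b quoted in the comment is the standard textbook answer, not shown on the slide):

```python
import re

# (ab)^m and a^n b^m are regular, so regular expressions recognise them;
# a^n b^n (equal counts) is context-free but not regular.  The classic
# CF grammar for it is S -> a S b, S -> a b.
ab_m  = re.compile(r"^(?:ab)+$")   # (ab)^m, m > 0
an_bm = re.compile(r"^a+b+$")      # a^n b^m, n, m > 0

def anbn(s):
    """a^n b^n, n > 0: n a's followed by exactly n b's."""
    n = len(s) // 2
    return n > 0 and len(s) == 2 * n and s == "a" * n + "b" * n

print(bool(ab_m.match("ababab")))    # True
print(bool(an_bm.match("aabbb")))    # True
print(anbn("aabb"), anbn("aabbb"))   # True False
```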