Statistical methods in NLP

Statistical methods in NLP. Diana Trandabat, 2013-2014. The sentence as a string of words, e.g. "I saw the lady with the binoculars": string = a b c d e b f. The relations of parts of a string to each other may be different: "I saw the lady with the binoculars" is structurally ambiguous.

Presentation Transcript


  1. Statistical methods in NLP Diana Trandabat 2013-2014

  2. The sentence as a string of words. E.g. "I saw the lady with the binoculars": string = a b c d e b f

  3. The relations of parts of a string to each other may be different: "I saw the lady with the binoculars" is structurally ambiguous. Who has the binoculars?

  4. [I] saw the lady [with the binoculars] = [a] b c d [e b f]; I saw [the lady with the binoculars] = a b [c d e b f]

  5. How can we represent the difference? By assigning them different structures. We can represent structures with 'trees'. I read the book

  6. a. I saw the lady with the binoculars: [S [NP I] [VP [V saw] [NP [NP the lady] [PP with the binoculars]]]], i.e. I saw [the lady with the binoculars]

  7. b. I saw the lady with the binoculars: [S [NP I] [VP [VP saw the lady] [PP with the binoculars]]], i.e. I [saw the lady] with the binoculars

  8. birds fly: [S [NP [N birds]] [VP [V fly]]]. Syntactic rules: S → NP VP; NP → N; VP → V

  9. [S [NP birds] [VP fly]], with birds = a and fly = b: ab = string

  10. [S [A a] [B b]] = ab. Rules: S → A B; A → a; B → b

  11. Rules. Assumption: natural language grammars are rule-based systems. What kind of grammars describe natural language phenomena? What are the formal properties of grammatical rules?

  12. The Chomsky Hierarchy

  13. Chomsky (1957) Syntactic Structures. The Hague: Mouton. Chomsky, N. and G.A. Miller (1958) Finite-state languages. Information and Control 1, 99-112. Chomsky (1959) On certain formal properties of grammars. Information and Control 2, 137-167.

  14. Rules in Linguistics 1. PHONOLOGY: /s/ → [θ] / V ___ V. Rewrite /s/ as [θ] when /s/ occurs in the context V ___ V. With: V = auxiliary node, θ = terminal node

  15. Rules in Linguistics 2. SYNTAX: S → NP VP; VP → V; NP → N. Rewrite S as NP VP in any context. With: S, NP, VP = auxiliary nodes; V, N = terminal nodes

  16. SYNTAX (phrase/sentence formation). Sentence: The boy kissed the girl. Subject + predicate = noun phrase + verb phrase = (art + noun) + (verb + noun phrase). Rules: S → NP VP; VP → V NP; NP → ART N

  17. Chomsky Hierarchy: 0. Type 0 (recursively enumerable) languages. Only restriction on rules: left-hand side cannot be the empty string (*Ø → …….). 1. Context-Sensitive languages: Context-Sensitive (CS) rules. 2. Context-Free languages: Context-Free (CF) rules. 3. Regular languages: regular (right-linear or left-linear) rules. 0 ⊃ 1 ⊃ 2 ⊃ 3, where a ⊃ b means a properly includes b (a is a superset of b), i.e. b is a proper subset of a

  18. Generative power: 0. Type 0 (recursively enumerable) languages (only restriction on rules: left-hand side cannot be the empty string, *Ø → …….) is the most powerful system. 3. Type 3 (regular languages) is the least powerful

  19. Superset/subset relation: S1 = {a, b}, S2 = {a, b, c, d, f, g}. S1 is a subset of S2; S2 is a superset of S1

  20. Rule Type – 3. Name: Regular. Example: Finite State Automata (Markov-process Grammar). Rule type: a) right-linear: A → x B or A → x, with A, B = auxiliary nodes and x = terminal node; b) or left-linear: A → B x or A → x. Generates: a^m b^n with m, n ≥ 1. Cannot guarantee that there are as many a's as b's; no embedding

  21. A regular grammar for natural language sentences: S → the A; A → cat B; A → mouse B; A → duck B; B → bites C; B → sees C; B → eats C; C → the D; D → boy; D → girl; D → monkey. Example sentences: the cat bites the boy; the mouse eats the monkey; the duck sees the girl
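The grammar on this slide can be run mechanically. The following is a minimal Python sketch, assuming the rules are encoded as a dictionary from each auxiliary node to (word, next node) pairs; the names `RULES` and `generate` are illustrative and not part of the slides.

```python
# Right-linear rules from the slide: each entry is (terminal word, next node).
# A next node of None means the rule has the form A -> x and the derivation stops.
RULES = {
    "S": [("the", "A")],
    "A": [("cat", "B"), ("mouse", "B"), ("duck", "B")],
    "B": [("bites", "C"), ("sees", "C"), ("eats", "C")],
    "C": [("the", "D")],
    "D": [("boy", None), ("girl", None), ("monkey", None)],
}

def generate(node="S"):
    """Yield every sentence derivable from `node` as a list of words."""
    for word, nxt in RULES[node]:
        if nxt is None:
            yield [word]
        else:
            for rest in generate(nxt):
                yield [word] + rest

sentences = [" ".join(words) for words in generate()]
print(len(sentences))   # 27 sentences: 3 subjects x 3 verbs x 3 objects
print(sentences[0])     # the cat bites the boy
```

Because every rule emits one word and moves to at most one further node, the grammar behaves like a finite-state automaton: the language is finite here only because no node can be revisited.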

  22. Regular grammars. Grammar 1: A → a; A → a B; B → b A. Grammar 2: A → a; A → B a; B → A b. Grammar 3: A → a; A → a B; B → b; B → b A. Grammar 4: A → a; A → B a; B → b; B → A b. Grammar 5: S → a A; S → b B; A → a S; B → b b S; S → ε. Grammar 6: A → A a; A → B a; B → b; B → A b; A → a

  23. Grammars: non-regular. Grammar 6: S → A B; S → b B; A → a S; B → b b S; S → ε. Grammar 7: A → a; A → B a; B → b; B → b A
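What separates the regular grammars from the non-regular ones on these two slides is the shape of the rules: all right-linear or all left-linear, never a mixture. The sketch below checks that shape in Python. The function name `linearity` and the list-of-pairs encoding are assumptions made for illustration. It tests only the form of the rules; a grammar that fails the test may still happen to generate a regular language.

```python
def linearity(rules, nonterminals):
    """Return 'right-linear', 'left-linear', 'both' or 'neither'.

    rules: list of (lhs, rhs) pairs, where rhs is a list of symbols.
    """
    # Right-linear: a nonterminal may appear only as the last symbol of a rule body.
    right = all(s not in nonterminals for _, rhs in rules for s in rhs[:-1])
    # Left-linear: a nonterminal may appear only as the first symbol of a rule body.
    left = all(s not in nonterminals for _, rhs in rules for s in rhs[1:])
    if right and left:
        return "both"
    if right:
        return "right-linear"
    if left:
        return "left-linear"
    return "neither"

NT = {"S", "A", "B"}
grammar_1 = [("A", ["a"]), ("A", ["a", "B"]), ("B", ["b", "A"])]
grammar_2 = [("A", ["a"]), ("A", ["B", "a"]), ("B", ["A", "b"])]
# The two grammars listed as non-regular:
with_s_ab = [("S", ["A", "B"]), ("S", ["b", "B"]), ("A", ["a", "S"]),
             ("B", ["b", "b", "S"]), ("S", [])]
grammar_7 = [("A", ["a"]), ("A", ["B", "a"]), ("B", ["b"]), ("B", ["b", "A"])]

for name, g in [("Grammar 1", grammar_1), ("Grammar 2", grammar_2),
                ("S -> A B grammar", with_s_ab), ("Grammar 7", grammar_7)]:
    print(name, linearity(g, NT))
```

Grammar 7 fails because it combines the left-linear rule A → B a with the right-linear rule B → b A; the other non-regular grammar fails because S → A B has two auxiliary nodes on the right-hand side.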

  24. Finite-State Automaton: NP --article--> NP1 --noun--> NP2, with an adjective loop on NP1

  25. Transitions: NP --article--> NP1; NP1 --adjective--> NP1; NP1 --noun--> NP2. Equivalent rules: NP → article NP1; NP1 → adjective NP1; NP1 → noun NP2
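The noun-phrase automaton can be simulated with a transition table. This is a minimal Python sketch; it assumes that NP is the start state and NP2 the only accepting state, which the slides suggest but do not state, and the names `TRANSITIONS` and `accepts` are illustrative.

```python
# (current state, word category) -> next state, taken from the three rules above.
TRANSITIONS = {
    ("NP", "article"): "NP1",
    ("NP1", "adjective"): "NP1",
    ("NP1", "noun"): "NP2",
}
FINAL = {"NP2"}  # assumption: NP2 is the accepting state

def accepts(categories, start="NP"):
    """Return True if the sequence of word categories is a valid noun phrase."""
    state = start
    for cat in categories:
        state = TRANSITIONS.get((state, cat))
        if state is None:  # no transition defined: reject
            return False
    return state in FINAL

print(accepts(["article", "noun"]))                            # True
print(accepts(["article", "adjective", "adjective", "noun"]))  # True
print(accepts(["article", "adjective"]))                       # False
```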

  26. A parse tree: [S [NP N] [VP V [NP DET N]]]. S is the root node; NP and VP are non-terminal nodes; the leaves N, V, DET, N are terminal nodes

  27. Rule Type – 2. Name: Context Free. Example: Phrase Structure Grammars / Push-Down Automata. Rule type: A → γ, with A = auxiliary node and γ = any number of terminal or auxiliary nodes. Recursiveness (centre embedding) allowed: A → α A β

  28. CF Grammar. A Context Free grammar consists of: a) a finite terminal vocabulary V_T; b) a finite auxiliary vocabulary V_A; c) an axiom S ∈ V_A; d) a finite number of context free rules of form A → γ, where A ∈ V_A and γ ∈ (V_A ∪ V_T)*. In natural language syntax S is interpreted as the start symbol for sentence, as in S → NP VP

  29. Natural language. Is English regular or CF? If centre embedding is required, then it cannot be regular. Centre Embedding: 1. [The cat] [likes tuna fish] = a b. 2. The cat the dog chased likes tuna fish = a a b b. 3. The cat the dog the rat bit chased likes tuna fish = a a a b b b. 4. The cat the dog the rat the elephant admired bit chased likes tuna fish = a a a a b b b b. Pattern: ab, aabb, aaabbb, aaaabbbb

  30. 1. [The cat] [likes tuna fish] = a b. 2. [The cat] [the dog] [chased] [likes ...] = a a b b

  31. Centre embedding: [S [NP the cat] [VP likes tuna]], with the cat = a and likes tuna = b: ab

  32. [S [NP [NP the cat] [S [NP the dog] [VP chased]]] [VP likes tuna]], with the cat = a, the dog = a, chased = b, likes tuna = b: aabb

  33. [S [NP [NP the cat] [S [NP [NP the dog] [S [NP the rat] [VP bit]]] [VP chased]]] [VP likes tuna]], with the cat, the dog, the rat = a and bit, chased, likes tuna = b: aaabbb
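The centre-embedding pattern ab, aabb, aaabbb corresponds to the context-free rules S → a S b and S → a b. A short Python sketch of that correspondence follows; the function names `derive`, `is_anbn` and `embed` are illustrative, and the rule pair is the textbook grammar for this pattern rather than one given on the slides.

```python
def derive(n):
    """Apply S -> a S b (n - 1 times) and then S -> a b, for n >= 1."""
    return "ab" if n == 1 else "a" + derive(n - 1) + "b"

def is_anbn(s):
    """Check membership in a^n b^n (n >= 1) by rebuilding the expected string."""
    n = len(s) // 2
    return n >= 1 and s == "a" * n + "b" * n

def embed(subjects, predicates):
    """Centre-embed clauses: subjects in order, then their predicates in reverse."""
    return " ".join(subjects + predicates[::-1])

print(derive(3))  # aaabbb
print(embed(["the cat", "the dog", "the rat"],
            ["likes tuna fish", "chased", "bit"]))
# the cat the dog the rat bit chased likes tuna fish
```

The recogniser has to compare the number of a's with the number of b's, and that number is unbounded. A finite-state automaton has only a fixed number of states to remember it with, which is the informal reason the pattern is not regular.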

  34. Natural language 2. More Centre Embedding: 1. If S1, then S2 (a a). 2. Either S3, or S4 (b b). Sentence with embedding: If either the man is arriving today or the woman is arriving tomorrow, then the child is arriving the day after. a = [if b = [either the man is arriving today] b = [or the woman is arriving tomorrow]] a = [then the child is arriving the day after] = abba
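The abba pattern is a nesting dependency: each "if" must be closed by a "then" and each "either" by an "or", innermost first. A stack is enough to check this, which is what a push-down automaton provides. The sketch below is a toy Python check under the simplifying assumption that the four words occur only as these paired markers; `PAIRS` and `well_nested` are illustrative names.

```python
PAIRS = {"if": "then", "either": "or"}  # opener -> required closer
CLOSERS = set(PAIRS.values())

def well_nested(tokens):
    """Return True if every opener is closed by its partner in nested order."""
    stack = []
    for tok in tokens:
        if tok in PAIRS:
            stack.append(PAIRS[tok])      # remember which closer we now expect
        elif tok in CLOSERS:
            if not stack or stack.pop() != tok:
                return False              # wrong closer, or nothing to close
    return not stack                      # nothing may be left open

sentence = ("if either the man is arriving today or the woman is arriving "
            "tomorrow then the child is arriving the day after")
print(well_nested(sentence.split()))      # True
```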

  35. CS languages. The following language cannot be generated by a CF grammar (by the pumping lemma): a^n b^m c^n d^m. Swiss German: a string of dative nouns (e.g. aa), followed by a string of accusative nouns (e.g. bbb), followed by a string of dative-taking verbs (cc), followed by a string of accusative-taking verbs (ddd) = aabbbccddd = a^n b^m c^n d^m

  36. Swiss German: Jan sait das (Jan says that) … mer em Hans es Huus hälfed aastriiche. Gloss: we Hans/DAT the house/ACC helped paint. Translation: we helped Hans paint the house. Pattern: a b c d. NPdat NPdat NPacc NPacc Vdat Vdat Vacc Vacc = a a b b c c d d
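The cross-serial pattern a^n b^m c^n d^m is easy to check in a general-purpose language even though no context-free grammar generates it. The Python sketch below assumes n, m ≥ 1 (the slides do not state the bounds); the name `cross_serial` is illustrative.

```python
import re

def cross_serial(s):
    """Return True if s has the form a^n b^m c^n d^m with n, m >= 1."""
    # The block order a..b..c..d is a regular property and can be checked by a regex.
    match = re.fullmatch(r"(a+)(b+)(c+)(d+)", s)
    if match is None:
        return False
    a, b, c, d = (len(group) for group in match.groups())
    # The two crossing count constraints are what takes the language beyond CF.
    return a == c and b == d

print(cross_serial("aabbbccddd"))  # True
print(cross_serial("aabbccdd"))    # True
print(cross_serial("aabbbcddd"))   # False
```

The dependencies cross (a with c, b with d) rather than nest, so a single stack cannot track both counts at once.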

  37. Context Free Grammars (CFGs): sets of rules expressing how symbols of the language fit together, e.g. S -> NP VP; NP -> Det N; Det -> the; N -> dog

  38. What Does Context Free Mean? • LHS of rule is just one symbol. • Can have: NP -> Det N • Cannot have: X NP Y -> X Det N Y
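The "just one symbol on the LHS" condition can be tested directly on a rule written as text. A minimal Python sketch, assuming rules use the `->` arrow and whitespace-separated symbols as on the slide (the function name is illustrative):

```python
def is_context_free_rule(rule):
    """Return True if the left-hand side of 'LHS -> RHS' is a single symbol."""
    lhs, _, _rhs = rule.partition("->")
    return len(lhs.split()) == 1

print(is_context_free_rule("NP -> Det N"))          # True
print(is_context_free_rule("X NP Y -> X Det N Y"))  # False
```

The second rule rewrites NP only when it stands between X and Y, so its applicability depends on context; that dependence is exactly what a context-free rule rules out.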

  39. Grammar Symbols • Non Terminal Symbols • Terminal Symbols • Words • Preterminals

  40. Non Terminal Symbols • Symbols which have definitions • Symbols which appear on the LHS of rules: S -> NP VP; NP -> Det N; Det -> the; N -> dog

  41. Non Terminal Symbols • Same Non Terminals can have several definitions: S -> NP VP; NP -> Det N; NP -> N; Det -> the; N -> dog

  42. Terminal Symbols • Symbols which appear in final string • Correspond to words • Are not defined by the grammar: S -> NP VP; NP -> Det N; Det -> the; N -> dog

  43. Parts of Speech (POS) • NT Symbols which produce terminal symbols are sometimes called pre-terminals: S -> NP VP; NP -> Det N; Det -> the; N -> dog • Sometimes we are interested in the shape of sentences formed from pre-terminals: Det N V; Aux N V D N

  44. CFG - formal definition. A CFG is a tuple (N, Σ, R, S) • N is a set of non-terminal symbols • Σ is a set of terminal symbols disjoint from N • R is a set of rules each of the form A → γ where A is non-terminal • S is a designated start-symbol

  45. CFG - Example. Grammar: S → NP VP; NP → N; VP → V NP. Lexicon: V → kicks; N → John; N → Bill. N = {S, NP, VP, N, V}; Σ = {kicks, John, Bill}; R = the grammar and lexicon rules just listed; S = "S"
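The four components of the example CFG map directly onto simple data structures. The Python sketch below writes them out and enumerates the language; the variable names and the helper `expand` are illustrative. The enumeration terminates only because this particular grammar has no recursive rule.

```python
from itertools import product

N = {"S", "NP", "VP", "N", "V"}        # non-terminal symbols
SIGMA = {"kicks", "John", "Bill"}      # terminal symbols, disjoint from N
R = {                                  # rules: left-hand side -> list of right-hand sides
    "S": [["NP", "VP"]],
    "NP": [["N"]],
    "VP": [["V", "NP"]],
    "V": [["kicks"]],
    "N": [["John"], ["Bill"]],
}
START = "S"                            # designated start symbol

def expand(symbol):
    """Return every terminal string (tuple of words) derivable from `symbol`."""
    if symbol in SIGMA:
        return [(symbol,)]
    results = []
    for rhs in R[symbol]:
        # Combine the expansions of the right-hand-side symbols in order.
        for parts in product(*(expand(s) for s in rhs)):
            results.append(tuple(word for part in parts for word in part))
    return results

language = sorted(" ".join(words) for words in expand(START))
print(language)
# ['Bill kicks Bill', 'Bill kicks John', 'John kicks Bill', 'John kicks John']
```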

  46. Exercise • Write grammars that generate the following languages, for m > 0: (ab)^m, a^n b^m, a^n b^n • Which of these are Regular? • Which of these are Context Free?

  47. (ab)^m for m > 0: S -> a b; S -> a b S

  48. (ab)^m for m > 0. First rule set: S -> a b; S -> a b S. Second rule set: S -> a X; X -> b Y; Y -> a b; Y -> S
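One way to check an exercise answer is to enumerate the short strings a grammar derives and compare them with a regular expression. The Python sketch below does this for the two rule sets for (ab)^m. The helper `language_up_to` is an illustrative name, and its pruning assumes that every derivation step either adds a terminal or moves towards one, which holds for these grammars but not for arbitrary ones. Read on its own, the second rule set appears to produce abab as its shortest string, so the sketch also tries a variant with the rules X -> b and X -> b S; that variant is an assumption for comparison and is not taken from the slides.

```python
import re
from collections import deque

def language_up_to(rules, start, max_len):
    """Breadth-first leftmost derivation; return terminal strings of length <= max_len."""
    seen, found = set(), set()
    queue = deque([(start,)])
    while queue:
        form = queue.popleft()
        # Find the leftmost nonterminal (any symbol that has rules), if there is one.
        idx = next((i for i, s in enumerate(form) if s in rules), None)
        if idx is None:
            found.add("".join(form))
            continue
        for rhs in rules[form[idx]]:
            new = form[:idx] + tuple(rhs) + form[idx + 1:]
            # Prune forms that already contain more than max_len terminals.
            if sum(s not in rules for s in new) <= max_len and new not in seen:
                seen.add(new)
                queue.append(new)
    return found

first = {"S": [["a", "b"], ["a", "b", "S"]]}
second = {"S": [["a", "X"]], "X": [["b", "Y"]], "Y": [["a", "b"], ["S"]]}
variant = {"S": [["a", "X"]], "X": [["b"], ["b", "S"]]}  # hypothetical, not from the slides

for name, rules in [("first", first), ("second", second), ("variant", variant)]:
    strings = sorted(language_up_to(rules, "S", 6), key=len)
    assert all(re.fullmatch(r"(ab)+", s) for s in strings)
    print(name, strings)
```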

  49. a^n b^m: S -> A B; A -> a; A -> a A; B -> b; B -> b B

  50. a^n b^m. First rule set: S -> A B; A -> a; A -> a A; B -> b; B -> b B. Second rule set: S -> a AB; AB -> a AB; AB -> B; B -> b; B -> b B
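The second rule set for a^n b^m is right-linear, with AB read as a single auxiliary symbol, so it can be run as a recogniser that consumes one terminal per step. The Python sketch below encodes it that way; the (terminal, next symbol) encoding, the empty terminal for the unit rule AB -> B, and the name `derives` are assumptions made for illustration.

```python
import re

# Each rule is (terminal, next symbol); None as next symbol ends the derivation.
RULES = {
    "S": [("a", "AB")],
    "AB": [("a", "AB"), ("", "B")],   # ("", "B") encodes the unit rule AB -> B
    "B": [("b", None), ("b", "B")],
}

def derives(symbol, s):
    """Return True if `symbol` can derive exactly the string s."""
    for terminal, nxt in RULES[symbol]:
        if not s.startswith(terminal):
            continue
        rest = s[len(terminal):]
        if nxt is None:
            if rest == "":
                return True
        elif derives(nxt, rest):
            return True
    return False

for s in ["ab", "aaab", "abbb", "ba", "aba", ""]:
    # The grammar and the regular expression a+b+ should agree on every string.
    print(repr(s), derives("S", s), bool(re.fullmatch(r"a+b+", s)))
```

Since a^n b^m places no constraint between n and m, a regular grammar is enough; it is only a^n b^n, with matching counts, that needs context-free power.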
