Formal Languages, Grammars, Regex, & Automata. Shallow Processing Techniques for NLP Ling 570 October 3, 2011. Roadmap. Motivation: Defining a language Formal languages Regular languages Regular expressions, formally Formal grammars: Regular grammars, Context-free grammars
Formal Languages, Grammars, Regex, & Automata Shallow Processing Techniques for NLP Ling 570 October 3, 2011
Roadmap • Motivation: • Defining a language • Formal languages • Regular languages • Regular expressions, formally • Formal grammars: • Regular grammars, Context-free grammars • Finite-State Automata
What’s a Language? • How can we define a language? • Strings in a language: • He put the map on the table. • The quick brown fox jumped over the lazy dog. • The sky is blue. • Strings not in the language: • *Green furiously colorless sleep ideas. • *The the on the. • *sdfsdfoiumerwweokc.
What’s in a Language? • What are all the pronunciations of words in a language? • Some sound sequences in a language: • ah b aw • ah b aw t • t ah m ey t ow • t ah m aa t ow • Some sound sequences not in the language: • k k t p k • m g p n aa
Defining Language • A language is defined as all and only those acceptable strings in the language. • How can we describe the language? • Enumerate? • Problems • Languages are infinitely productive • Inefficient • Misses basic regularities
Better Definitions • Grammars: • Start symbol • Expand with rewrite rules • Stop at word strings • Automata: • Start in start state • Transition to other states • Continue until a final state is reached • Generate/recognize strings in the language • Reject those not in the language
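The two views above can be sketched side by side. The following is a minimal illustration, not a definitive implementation: the toy grammar, the state numbers, and the names `GRAMMAR`, `generate`, `TRANSITIONS`, and `recognize` are all assumptions made up for this example. The grammar expands a start symbol with rewrite rules until only words remain; the automaton walks from a start state and accepts only if it ends in a final state.

```python
from itertools import product

# Hypothetical toy grammar (an assumption for illustration only).
# Each non-terminal maps to a list of right-hand sides.
# It is non-recursive, so exhaustive expansion terminates.
GRAMMAR = {
    "S": [["NP", "VP"]],
    "NP": [["the", "N"]],
    "N": [["sky"], ["map"]],
    "VP": [["is", "ADJ"]],
    "ADJ": [["blue"], ["old"]],
}


def generate(symbol):
    """Expand `symbol` with rewrite rules until only words remain."""
    if symbol not in GRAMMAR:  # a word (terminal): stop expanding
        return [[symbol]]
    results = []
    for rhs in GRAMMAR[symbol]:
        # Combine every expansion of each right-hand-side symbol.
        for parts in product(*(generate(sym) for sym in rhs)):
            results.append([word for part in parts for word in part])
    return results


# Hypothetical automaton for the same sentences: (state, word) -> next state.
TRANSITIONS = {
    (0, "the"): 1,
    (1, "sky"): 2, (1, "map"): 2,
    (2, "is"): 3,
    (3, "blue"): 4, (3, "old"): 4,
}
START, FINAL = 0, {4}


def recognize(words):
    """Accept iff the word sequence leads from START to a final state."""
    state = START
    for word in words:
        if (state, word) not in TRANSITIONS:
            return False  # no transition: reject
        state = TRANSITIONS[(state, word)]
    return state in FINAL


sentences = [" ".join(words) for words in generate("S")]
print(sentences)
print(recognize("the sky is blue".split()))
print(recognize("the the on the".split()))
```

Every string the grammar generates is accepted by the automaton, and a string such as "the the on the" is rejected because no transition exists for it.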
Acoustic Model • P(signal|words) • words -> phones + phones -> vector quantization • Words -> phones • Pronunciation dictionary lookup • Multiple pronunciations? • Probability distribution • Dialect variation: tomato • + Coarticulation • Product along path • [Figure: weighted pronunciation network for “tomato”; phone labels t, ow, ax, m, ey, aa; arc probabilities 0.5, 0.5, 0.2, 0.5, 0.5, 0.8]
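To make "product along path" concrete, here is a small sketch. It assumes a simplified network in which each position offers alternative phones with a probability; the assignment of probabilities to arcs below is an illustrative assumption, not the layout of the lecture figure, and `SLOTS` and `path_probabilities` are hypothetical names.

```python
from itertools import product
from math import prod

# Assumed pronunciation network for "tomato": one list of
# (phone, probability) alternatives per position. The probabilities
# are illustrative, not taken from the original figure.
SLOTS = [
    [("t", 1.0)],
    [("ax", 0.8), ("ow", 0.2)],
    [("m", 1.0)],
    [("ey", 0.5), ("aa", 0.5)],
    [("t", 1.0)],
    [("ow", 1.0)],
]


def path_probabilities(slots):
    """Map each full phone sequence to the product of its arc probabilities."""
    result = {}
    for path in product(*slots):
        phones = " ".join(phone for phone, _ in path)
        result[phones] = prod(p for _, p in path)
    return result


probs = path_probabilities(SLOTS)
for phones, p in sorted(probs.items(), key=lambda item: -item[1]):
    print(f"{phones}: {p:.2f}")
```

Because the alternatives at each position sum to 1, the probabilities of all complete paths also sum to 1, so the result is a distribution over pronunciations.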
Pronunciation Example • Observations: 0/1
Formal Languages • A model that can recognize/generate all and only the strings of a formal language acts as a definition of that language • Alphabet: Finite set of symbols • Σ = {a, b, c} • String: Finite sequence of symbols from the alphabet • “aababc” • Empty string: ε • Formal language: Set of strings defined over an alphabet • {aa, bb, cc, aaaa, bbbb} • {a^n b^n | n > 0} • Empty set ∅
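These definitions map directly onto ordinary data structures. The sketch below is one possible encoding, assuming Python sets for alphabets and languages and Python strings for strings over the alphabet; the names `SIGMA`, `is_string_over`, and `anbn_prefix` are hypothetical.

```python
# Alphabet: a finite set of symbols.
SIGMA = {"a", "b", "c"}

# The empty string (epsilon) is simply the string of length zero.
EMPTY_STRING = ""


def is_string_over(s, alphabet):
    """True iff every symbol of s comes from the alphabet."""
    return all(symbol in alphabet for symbol in s)


# A finite language can be listed explicitly as a set of strings.
FINITE_LANGUAGE = {"aa", "bb", "cc", "aaaa", "bbbb"}

# The empty language contains no strings at all, not even epsilon.
EMPTY_LANGUAGE = set()


def anbn_prefix(max_n):
    """First members of the infinite language {a^n b^n | n > 0}."""
    return ["a" * n + "b" * n for n in range(1, max_n + 1)]


print(is_string_over("aababc", SIGMA))
print(is_string_over(EMPTY_STRING, SIGMA))
print(anbn_prefix(3))
```

An infinite language such as {a^n b^n | n > 0} cannot be stored as a set, which is why it is given here by a rule that produces its members on demand.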
Kleene Closure • L^2 = L L • L^n = L^{n-1} L • L* = {ε} ∪ L^1 ∪ L^2 ∪ … • E.g. • L = {a, b} • L^2 = {aa, ab, ba, bb} • L* = {ε, a, b, aa, aaa, aaaa, abab, …}
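The closure can be computed directly for small cases. This is a sketch under one assumption: since L* is infinite whenever L contains a non-empty string, the hypothetical helper `kleene_star_upto` only builds the union up to a chosen power.

```python
from itertools import product


def concat(l1, l2):
    """Language concatenation: every x in l1 followed by every y in l2."""
    return {x + y for x, y in product(l1, l2)}


def power(lang, n):
    """L^n, with L^0 = {epsilon} and L^n = L^(n-1) L."""
    result = {""}
    for _ in range(n):
        result = concat(result, lang)
    return result


def kleene_star_upto(lang, max_power):
    """Finite approximation of L*: the union of L^0 .. L^max_power."""
    result = set()
    for n in range(max_power + 1):
        result |= power(lang, n)
    return result


L = {"a", "b"}
print(sorted(power(L, 2)))
print(sorted(kleene_star_upto(L, 2), key=lambda s: (len(s), s)))
```

For L = {a, b}, the approximation up to power 3 contains 1 + 2 + 4 + 8 = 15 strings.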
Regular Languages • Closed under • Concatenation: L1 L2 • Union/Disjunction: L1 ∪ L2 • Kleene star: L1* • Also • Intersection: If L1 and L2 are R.L.s, then L1 ∩ L2 is an R.L. • Difference: If L1 and L2 are R.L.s, then L1 - L2 is an R.L. • Complementation: If L1 is an R.L., then Σ* - L1 is an R.L. • Reversal: If L1 is an R.L., then its reversal L1^R is an R.L.
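Two of these closure properties have short constructive arguments on deterministic finite automata, sketched below. The `DFA` class and the two example machines are assumptions made for this illustration; the sketch requires both automata to be complete (a transition for every state and symbol) and to share one alphabet. Complementation swaps final and non-final states, and intersection runs both machines in parallel on pairs of states.

```python
from itertools import product


class DFA:
    """A complete deterministic finite automaton (hypothetical helper)."""

    def __init__(self, states, alphabet, delta, start, finals):
        self.states = set(states)
        self.alphabet = set(alphabet)
        self.delta = dict(delta)  # (state, symbol) -> state
        self.start = start
        self.finals = set(finals)

    def accepts(self, s):
        state = self.start
        for symbol in s:
            if symbol not in self.alphabet:
                return False  # not a string over the alphabet
            state = self.delta[(state, symbol)]
        return state in self.finals


def complement(d):
    """Sigma* - L(d): swap final and non-final states."""
    return DFA(d.states, d.alphabet, d.delta, d.start, d.states - d.finals)


def intersection(d1, d2):
    """Product construction: track a pair of states, one per machine."""
    states = set(product(d1.states, d2.states))
    delta = {
        ((p, q), symbol): (d1.delta[(p, symbol)], d2.delta[(q, symbol)])
        for (p, q) in states
        for symbol in d1.alphabet
    }
    finals = {(p, q) for (p, q) in states if p in d1.finals and q in d2.finals}
    return DFA(states, d1.alphabet, delta, (d1.start, d2.start), finals)


# Example machines over {a, b}: an even number of a's, and strings ending in b.
even_a = DFA({0, 1}, "ab",
             {(0, "a"): 1, (1, "a"): 0, (0, "b"): 0, (1, "b"): 1}, 0, {0})
ends_b = DFA({"n", "y"}, "ab",
             {("n", "a"): "n", ("n", "b"): "y",
              ("y", "a"): "n", ("y", "b"): "y"}, "n", {"y"})

both = intersection(even_a, ends_b)
print(both.accepts("aab"), both.accepts("ab"))
print(complement(even_a).accepts("ab"))
```

Difference follows from these two, since L1 - L2 is the intersection of L1 with the complement of L2.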
Regular Languages? • Any finite set of strings? • {x x^R} • {a*b*} • {a^n b^n | n > 0} • {a^n b^n c^n | n > 0}
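A quick experiment helps build intuition for the first few cases. The sketch below uses Python's `re` module; the word list and the name `in_anbn` are assumptions for illustration. A finite set of strings can always be written as an alternation of its members, and a*b* is itself a regular expression. By contrast, {a^n b^n | n > 0} is the standard textbook example of a non-regular language, so it is checked here by counting rather than with a regular expression.

```python
import re

# Any finite set of strings is regular: join its members with "|".
FINITE = ["aa", "bb", "cc"]
finite_re = re.compile("|".join(re.escape(word) for word in FINITE))

# a*b*: any number of a's followed by any number of b's.
astar_bstar = re.compile(r"a*b*")


def in_anbn(s):
    """Membership in {a^n b^n | n > 0}, which needs the two counts to agree."""
    n = len(s) // 2
    return n > 0 and s == "a" * n + "b" * n


for s in ["", "aabb", "aab", "ba"]:
    print(repr(s), bool(astar_bstar.fullmatch(s)), in_anbn(s))
```

The string "aab" shows the gap: a*b* accepts it, but it is not of the form a^n b^n because the counts differ.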
Regular Expressions (as a Formal Language) • ε is a regular expression • a is a regular expression, for each symbol a ∈ Σ • If r1 and r2 are regular expressions, then • r1 r2 is a regular expression, • r1 | r2 is a regular expression, • and r1* is a regular expression
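The inductive definition translates almost line by line into a recursive matcher. The sketch below is one way to do it, not a description of how regex engines work in practice: it assumes the usual base case for single alphabet symbols, and it encodes expressions as nested tuples, a hypothetical representation chosen only for this example.

```python
# Regular expressions as nested tuples (an encoding assumed for this sketch):
#   ("eps",)           the empty string
#   ("sym", "a")       a single alphabet symbol
#   ("cat", r1, r2)    concatenation r1 r2
#   ("alt", r1, r2)    disjunction r1 | r2
#   ("star", r1)       Kleene star r1*


def ends(r, s, i):
    """Return every position j such that r matches s[i:j]."""
    kind = r[0]
    if kind == "eps":
        return {i}
    if kind == "sym":
        return {i + 1} if i < len(s) and s[i] == r[1] else set()
    if kind == "cat":
        return {k for j in ends(r[1], s, i) for k in ends(r[2], s, j)}
    if kind == "alt":
        return ends(r[1], s, i) | ends(r[2], s, i)
    if kind == "star":
        # Repeat r1 zero or more times; stop once no new position appears.
        reached, frontier = {i}, {i}
        while frontier:
            step = {k for j in frontier for k in ends(r[1], s, j)}
            frontier = step - reached
            reached |= frontier
        return reached
    raise ValueError(f"unknown regex node: {kind!r}")


def matches(r, s):
    """True iff r matches the whole string s."""
    return len(s) in ends(r, s, 0)


# The expression a b* c, built from the constructors above.
AB_STAR_C = ("cat", ("sym", "a"), ("cat", ("star", ("sym", "b")), ("sym", "c")))

print(matches(AB_STAR_C, "ac"), matches(AB_STAR_C, "abbc"))
print(matches(AB_STAR_C, "abb"))
```

The star case terminates because positions are bounded by the length of the string, so the set of reached positions can only grow a finite number of times.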
Basic Regular Expressions • Examples: • ab*c • a (0|1) b • C? V N?, where C is consonant, V is vowel, N is nasal • Others: • a+: 1 or more • a?: 0 or 1 • . : wildcard (any single character) • [0123]: disjunction (any one of 0, 1, 2, 3) • [^0123]: negated disjunction (any character except 0, 1, 2, 3)
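These patterns can be tried out with Python's `re` module. In the sketch below, the character classes standing in for C, V, and N are rough assumptions over lowercase letters, not a phonological inventory, and `SYLLABLE` is a hypothetical name.

```python
import re

# ab*c: an a, any number of b's, then a c.
print(bool(re.fullmatch(r"ab*c", "abbbc")))
print(bool(re.fullmatch(r"ab*c", "abcb")))

# a(0|1)b: an a, then 0 or 1, then a b.
print(bool(re.fullmatch(r"a(0|1)b", "a1b")))
print(bool(re.fullmatch(r"a(0|1)b", "a2b")))

# C? V N?: optional consonant, a vowel, optional nasal.
# These letter classes are illustrative stand-ins for sound classes.
C = "[bcdfghjklmnpqrstvwxyz]"
V = "[aeiou]"
N = "[mn]"
SYLLABLE = re.compile(f"{C}?{V}{N}?")
for candidate in ["a", "an", "ban", "bna"]:
    print(candidate, bool(SYLLABLE.fullmatch(candidate)))

# Other operators: +, ?, ., [..] and [^..]
print(re.findall(r"a+", "caaat"))
print(re.findall(r"[0123]", "2049"))
print(re.findall(r"[^0123]", "2049"))
```

`re.fullmatch` is used so that a pattern must cover the whole string; `re.search` or `re.findall` would instead look for the pattern anywhere inside it.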
More Complex RegEx • Examples: • \d+ dollars: matches “10 dollars”, “105 dollars”, etc. • Escape: \ turns off the special meaning of the following character; \\ matches a literal backslash (“backslashitis”)
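In Python, the escaping problem is usually handled with raw strings, which keep backslashes from being interpreted twice. The sketch below uses made-up sample text to show `\d+ dollars`, an escaped special character, and a literal backslash.

```python
import re

text = "It costs 10 dollars, or 105 dollars with shipping."

# \d+ matches one or more digits.
amounts = re.findall(r"\d+ dollars", text)
print(amounts)

# A backslash turns off the special meaning of "$" and ".".
prices = re.findall(r"\$\d+", "pay $5 or $20")
dots = re.findall(r"\.", "3.14")
print(prices, dots)

# Matching a literal backslash: the regex needs \\ (two characters).
# Without a raw string, Python itself also needs each one escaped.
path = "C:\\temp"  # the actual string is C:\temp
print(re.findall(r"\\temp", path))
print("\\\\" == r"\\")
```

The last line shows why raw strings are preferred: the same two-character pattern takes four backslashes in an ordinary string literal.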
Searching for ‘the’ • Idea: • /the/ • Idea 2: • /[Tt]he/ • Idea 3: • /\b[Tt]he\b/ • Balancing: • Improving coverage (lower miss rate, aka Type 2 error) • Improving precision (lower false alarm, aka Type 1 error)
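The trade-off between the three ideas shows up on a single sentence. The sample text below is made up for this illustration; each pattern is run with `re.findall` so that both the misses and the false alarms are visible.

```python
import re

text = "The other day the cat sat there; then the theme changed."

# Idea 1: misses capitalized "The" and fires inside other, there, then, theme.
idea1 = re.findall(r"the", text)

# Idea 2: now catches "The", but still fires inside longer words.
idea2 = re.findall(r"[Tt]he", text)

# Idea 3: word boundaries remove the matches inside longer words.
idea3 = re.findall(r"\b[Tt]he\b", text)

for name, hits in [("the", idea1), ("[Tt]he", idea2), (r"\b[Tt]he\b", idea3)]:
    print(f"/{name}/ -> {len(hits)} matches: {hits}")
```

Idea 2 improves coverage (one fewer miss) and Idea 3 improves precision (four fewer false alarms), leaving exactly the three occurrences of the word.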
Equivalences • Every regular language can be obtained from a regular expression. • Every regular expression can be associated with a regular language.
Representation:Formal Grammars • A formal grammar is a concise description of a formal language • Grammars: 4-tuple • A set of terminal symbols: Σ • A set of non-terminal symbols: N