Linguistics Methodology meets Language Reality: the quest for robustness, scalability, and portability in (spoken) language applications Bob Carpenter SpeechWorks International
The Standard Cliché(s) • Moore’s Cliché: • Exponential growth in computing power and memory will continue to open up new possibilities • The Internet Cliché: • With the advent and growth of the world-wide web, an ever increasing amount of information must be managed
More Standard Clichés • The Convergence Cliché: • Data, voice and video networking will be integrated over a universal network, that: • includes land lines and wireless; • includes broadband and narrowband • likely implementation is IP (internet protocol) • The Interface Cliché: • The three forces above (growth in computing power, information online, and networking) will both enable and require new interfaces • Speech will become as common as graphics
Some Comp Ling Clichés • The Standard Linguist’s Cliché • But it must be recognized that the notion “probability of a sentence” is an entirely useless one, under any known interpretation of this term. • Noam Chomsky, 1969 [essay on Quine] • The Standard Engineer’s Cliché • Anytime a linguist leaves the group the recognition rate goes up. • Fred Jelinek, 1988 [address to DARPA]
The “Theoretical Abstraction” • mature, monolingual, native language speaker • idealized to complete knowledge of language • static, homogenous language community • all speakers learn identical grammars • “competence” (vs. “performance”) • “performance” is a natural class • wetware “implementation” follows theory in divorcing “knowledge of language” from processing • assumes the existence and innateness of a “language faculty”
The Explicit Methodology • “Empirical” basis is binary grammaticality judgements • “intuitive” (to a “properly” trained linguist) • innateness and the “language faculty” • appropriate for phonetics through dialogue • in practice, very little agreement at boundaries and no standard evaluations of theories vs. data • Models of particular languages • by grammars that generate formal languages • low priority for transformationalists • high priority for monostratalists/computationalists
The Holy Grail of Linguistics • A grammar meta-formalism in which • all and only natural language grammars (idealized as above) can be expressed • assumed to correspond to the “language faculty” • Grail is sought by every major camp of linguist • Explains why all major linguistic theories look alike from any perspective outside of a linguistics department • The expedient abstractions have become an end in themselves
But, Applications Require • Robustness • acoustic and linguistic variation • disfluencies and noise • Scalability • from embedded devices to palmtops to clients to servers • across tasks from simple to complex • system-initiative form-filling to mixed initiative dialogue • Portability • simple adaptation to new tasks and new domains • preferably automated as much as possible
The $64,000 Question • How do humans handle unrestricted language so effortlessly in real time? • Unfortunately, the “classical” linguistic assumptions and methodology completely ignore this issue • Psycholinguistics has uncovered some baselines: • lexicon (and syntax?): highly parallel • time course of processing: totally online • information integration: <= 200ms for all sources • But is short on explanations
(AI) Success by Stupidity • Jaime Carbonell’s Argument (ECAI, mid 1990s) • Apparent “intelligence” because they’re too limited to do anything wrong: “right” answer hardcoded • Typical in Computational NL Grammars • lexicon limited to demo • rules limited to common ones (eg: no heavy shift) • Scaling up usually destroys this limited “success” • 1,000,000s of “grammatical” readings with large grammars
My Favorite Experiments (I) • Mike Tanenhaus et al. (Univ. Rochester) • Head-Mounted Eye Tracking • Instruction: “Pick up the yellow plate” • Eyes track semantic resolution (~200 ms tracking time) • Clearly shows that understanding is online
My Favorite Experiments (II) • Garden Paths are Context Sensitive • Crain & Steedman (U. Connecticut & U. Edinburgh) • if noun is not unique in context, postmodification is much more likely than if noun picks out unique individual • Garden Paths are Frequency and Agreement Sensitive • Tanenhaus et al. • The horse raced past the barn fell. (raced likely past tense) • The horses brought into the barn fell. (brought likely participle, and less likely activity for horses)
Stats: Explanation or Stopgap • A Common View • Statistics are some kind of approximation of underlying factors requiring further explanation. • Steve Abney’s Analogy (AT&T Labs) • Statistical Queueing Theory • Consider traffic flows through a toll gate on a highway. • Underlying factors are diverse, and explain the actions of each driver, their cars, possible causes of flat tires, drunk drivers, etc. • Statistics is more insightful [explanatory] in this case as it captures emergent generalizations • It is a reductionist error to insist on low-level account
Competence vs. Performance • What is computed vs. how it is computed • The what can be traditional grammatical structure • Not all structures are computed, regardless of the how • Define what probabilistically, independently of how
Algebraic vs. Statistical • False Dichotomy • All statistical systems have an algebraic basis, even if trivial • The Good News: • Best statistical systems have best linguistic conditioning (most “explanatory” in traditional sense) • Statistical estimators far less significant than the appropriate linguistic conditioning • Rest of the talk provides examples of this
Bayesian Statistical Modeling • Concerned with prior and posterior probabilities • Allows updates of reasoning • Bayes’ Law: P(A,B) = P(A|B) P(B) = P(B|A) P(A) • Eg: Source/Channel Model for Speech Recognition • Ws: sequence of words • As: sequence of acoustic observations • Compute ArgMax_Ws P(Ws|As): • ArgMax_Ws P(Ws|As) = ArgMax_Ws P(As|Ws) P(Ws) / P(As) = ArgMax_Ws P(As|Ws) P(Ws) • (P(As) is the same for every Ws, so it drops out of the ArgMax) • P(As|Ws): acoustic model • P(Ws): language model
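The source/channel decision rule can be sketched in a few lines. This is only an illustration: the candidate strings and all scores below are made-up numbers, and `decode` is a hypothetical helper, not part of any recognizer described in the talk.

```python
import math

def decode(candidates, acoustic_logprob, lm_logprob):
    """Return the word string Ws maximizing log P(As|Ws) + log P(Ws).

    P(As) is omitted because it is constant across candidates.
    """
    return max(candidates, key=lambda ws: acoustic_logprob[ws] + lm_logprob[ws])

# Hypothetical scores for one utterance (illustrative values only).
candidates = ["flights from Boston", "lights from Boston"]
acoustic_logprob = {
    "flights from Boston": math.log(0.30),  # P(As|Ws)
    "lights from Boston": math.log(0.35),
}
lm_logprob = {
    "flights from Boston": math.log(0.010),  # P(Ws)
    "lights from Boston": math.log(0.001),
}

best = decode(candidates, acoustic_logprob, lm_logprob)
print(best)
```

In this toy case the acoustic model slightly prefers "lights", but the language model prior outweighs it, so the product favors "flights from Boston".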
Simple Bayesian Update Example • Monty Hall’s Let’s Make a Deal • Three curtains with prize behind one, no other info • Contestant chooses one of three • Monty then opens curtain of one of others that does not have the prize • if you choose curtain 2, then one of curtain 1 or 3 must not contain prize • Monty then lets you either keep your first guess, or change to the remaining curtain he didn’t open. • Should you switch, stay, or doesn’t it matter?
Answer • Yes! You should switch. • Why? Consider possibilities: [Table: prize behind / you select] • Switch: P(win) = 2/3 • Stay: P(win) = 1/3
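The 2/3 vs. 1/3 answer can be checked by enumerating every (prize, first choice) pair. The sketch below assumes the prize location and the first guess are independent and uniform, which is how the puzzle is stated above; the function name is just for illustration.

```python
from fractions import Fraction
from itertools import product

def monty_hall():
    """Enumerate all (prize, choice) pairs, each with probability 1/9."""
    curtains = (1, 2, 3)
    stay = Fraction(0)
    switch = Fraction(0)
    for prize, choice in product(curtains, curtains):
        p = Fraction(1, 9)
        if choice == prize:
            # First guess was right: staying wins, switching loses.
            stay += p
        else:
            # First guess was wrong: Monty must open the only other empty
            # curtain, so the remaining curtain hides the prize.
            switch += p
    return stay, switch

stay_win, switch_win = monty_hall()
print(stay_win, switch_win)
```

Staying wins only in the three cases where the first guess was already right; in the other six, Monty's forced reveal hands the prize to the switcher.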
Defaults via Bayesian Inference • Bayesian Inference provides an explanation for “rationality” of default reasoning • Reason by choosing an action to maximize expected payoff given some knowledge • ArgMax_Action Payoff(Action) * P(Action|Knowledge) • Given additional information update to Knowledge’ • ArgMax_Action Payoff(Action) * P(Action|Knowledge’) • Chosen action may be different, as in Let’s Make a Deal • Inferences are not logically sound, but are “rational” • Bayesian framework integrates partiality and uncertainty of background knowledge
Example: Allophonic Variation • English Pronunciation (M. Riley & A. Ljolje, AT&T) • Derived from TIMIT with phoneme/phone labels • orthographic: bottle • phonological: / b aa t ax l / (ARPAbet phonemes) • phonetic: 0.75 [ b aa dx el ] (TIMITbet phones) • 0.13 [ b aa t el ] • 0.10 [ b aa dx ax l ] • 0.02 [ b aa t ax l ] • Allophonic variation is non-deterministic
Eg: Allophonic Variation (cont’d) • Simple statistical model (simplified w/o insertion) • Estimate probability of phones given phonemes: P(a1,…,aM|p1,…,pM) = P(a1|p1,…,pM) * P(a2|p1,…,pM,a1) * … * P(aM|p1,…,pM,a1,…,aM-1) • Approximate phoneme context to +/- k phonemes • Approximate phone history to 0 or 1 phones • 0: … P(aJ|pJ-k,…,pJ,…,pJ+k) … • 1: … P(aJ|pJ-k,…,pJ,…,pJ+k, aJ-1) … • Uses word boundary marker and stress
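A minimal sketch of the history-0 approximation, estimated by plain relative frequency. It assumes one-to-one aligned phoneme/phone sequences (no insertion or deletion), uses `#` as an assumed word boundary marker, and ignores stress and the decision-tree clustering. The training tokens are invented for illustration and are not TIMIT counts.

```python
from collections import Counter, defaultdict

def train(aligned, k=1):
    """Count phones per phoneme context window of +/- k phonemes.

    aligned: list of (phonemes, phones) pairs of equal length.
    Returns counts[context][phone].
    """
    counts = defaultdict(Counter)
    for phonemes, phones in aligned:
        padded = ["#"] * k + list(phonemes) + ["#"] * k  # word boundary marker
        for j, phone in enumerate(phones):
            context = tuple(padded[j:j + 2 * k + 1])
            counts[context][phone] += 1
    return counts

def prob(counts, context, phone):
    """Relative-frequency estimate of P(phone | context)."""
    total = sum(counts[context].values())
    return counts[context][phone] / total if total else 0.0

# Invented toy data: four tokens of "bottle", three flapped and one not.
phonemes = ["b", "aa", "t", "ax", "l"]
aligned = [(phonemes, ["b", "aa", "dx", "ax", "l"])] * 3 + \
          [(phonemes, ["b", "aa", "t", "ax", "l"])]

counts = train(aligned, k=1)
p_flap = prob(counts, ("aa", "t", "ax"), "dx")
print(p_flap)
```

The history-1 variant would simply add the previous phone to the context tuple; unseen contexts return 0.0 here, which is exactly the sparsity problem that smoothing has to address.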
Eg: Allophonic Variation (concl’d) • Cluster phonological features using decision trees • Sparse data smoothed by decision trees over standard features (+/- stop, voicing, aspiration, etc.) • Conditional entropy: 1.5 bits without context, 0.8 bits with context • Most likely allophone correct 85.5%; correct allophone in top 5, 99% • Average 17 pronunciations/word to get 95% • Robust: handles multiple pronunciations • Scalable: to whole of English pronunciation • Portable: easy to move to new dialects with training • K. Knight (ISI): similar techniques for Japanese pronunciation of English words!
Example: Co-articulation • HMMs have been applied to speech since mid-70s • Two major recent improvements, the first being simply more training data and cycles • Second is: Context-dependent triphones • Instead of one HMM per phoneme/phone, use one per context-dependent triphone • example: t-r+u ‘an r preceded by t and followed by u’ • crucially clustered by phonological features to overcome sparsity
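The `left-center+right` naming can be illustrated with a short helper that expands a phone string into context-dependent triphone labels. The `sil` boundary symbol is an assumption made for this sketch; the clustering step mentioned above is not shown.

```python
def triphones(phones, boundary="sil"):
    """Expand a phone sequence into left-center+right triphone labels."""
    padded = [boundary] + list(phones) + [boundary]
    return [
        f"{padded[i - 1]}-{padded[i]}+{padded[i + 1]}"
        for i in range(1, len(padded) - 1)
    ]

labels = triphones(["t", "r", "u"])
print(labels)  # ['sil-t+r', 't-r+u', 'r-u+sil']
```

With N phones there are up to N^3 possible labels, most of them rare or unseen in training, which is why clustering by phonological features matters.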
Exploratory Data Analysis (Trendier: data mining; Trendiest: information harvesting) • Specious Argument: A statistical model won’t help explain linguistic processes. • Counter 1: Abney’s anti-reductionist argument • But even if you don’t believe that: • Counter 2: In “other sciences” (pace linguistic tradition), statistics is used to discover regularities • Allophone example: “had your” pronunciation • / d / is 51% likely to realize as [ jh ], 37% as [ d ] • if / d / realizes as [ jh ], / y / deletes 84% • if / d / realizes as [ d ], / y / deletes 10%
Balancing Gricean Maxims • Grice gives us conflicting maxims: • quantity (exactly as informative as required) • quality (try to make your contribution true) • manner (be perspicuous; eg. avoid ambiguity, be brief) • Manner pulls in opposite directions • quality without ambiguity lengthens statements • quantity and (part of) manner require brevity • Balance by estimating a multidimensional “goodness” metric for generation
Gricean Balance (cont’d) • Consider problem for aggregation in generation • Every student ran slowly or every student walked quickly. • Aggregates to: Every student ran slowly or walked quickly. • This reduces sentence length, shortens clause length, and increases ambiguity. • These tradeoffs need to be balanced
Collins’ Head/Dependency Parser • Michael Collins 1998 UPenn PhD thesis • Parses WSJ with ~90% constituent precision/recall • Generative model of tree probabilities • Clever Linguistic Decomposition and Training • P(RootCat, HeadTag, HeadWord) • P(DaughterCat|MotherCat, HeadTag, HeadWord) • P(SubCat|MotherCat, DaughterCat, HeadTag, HeadWord) • P(ModifierCat, ModifierTag, ModifierWord | SubCat, MotherCat, DaughterCat, HeadTag, HeadWord, Distance)
Eg: Collins’ Parser (cont’d) • Distance encodes heaviness • Adjunct vs. Complement modifiers distinguished • Head Words and Tags model lexical variation and word-word attachment preferences • Also conditions punctuation, coordination, UDCs • 12,000 word vocabulary plus unknown word attachment model (by Collins) and tag model (by A. Ratnaparkhi, another 1998 UPenn thesis) • Smoothed by backing off words to categories • Trivial statistical estimators; power is conditioning
Computational Complexity • Wide coverage linguistic grammars generate millions of readings • But Collins’ parser runs faster than real time on a notebook on unseen sentences of length up to 100 • How? Pruning. • Collins found that tighter statistical estimates of tree likelihoods, with more features and more complex grammars, ran faster because a tighter beam could be used • (E. Charniak & S. Caraballo at Brown have really pushed the envelope here)
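A rough sketch of the pruning idea: keep only hypotheses whose log probability lies within a fixed beam of the current best. This is a generic beam-threshold filter written for illustration, not Collins’ actual pruning strategy; the labels and scores are invented.

```python
def prune(hypotheses, beam_width):
    """Keep (label, log_prob) pairs within beam_width of the best log_prob."""
    if not hypotheses:
        return []
    best = max(score for _, score in hypotheses)
    return [(h, s) for h, s in hypotheses if s >= best - beam_width]

# Invented chart-cell contents: candidate constituents with log probabilities.
cell = [("NP", -1.0), ("VP", -2.5), ("S", -9.0)]

wide = prune(cell, beam_width=3.0)    # keeps NP and VP
tight = prune(cell, beam_width=1.0)   # keeps NP only
print(wide, tight)
```

The sharper the model's estimates, the larger the score gap between good and bad hypotheses, so a narrow beam discards more candidates without losing the correct one.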
Complexity (cont’d) • Collins’ parser is not complete in the usual sense • But neither are humans (eg. garden paths) • Can trade speed for accuracy in statistical parsers • Syntax is not processed autonomously • Humans can’t parse without context, semantics, etc. • Even phone or phoneme detection is very challenging, especially in a noisy environment • Top-down expectations and knowledge of likely bottom-up combinations prune the vast search space on line • Question is how to combine it with other factors
N-best and Word Graphs • Speech recognizers can return n-best histories • flights from Boston today • flights from Austin today • flights for Boston to pay • lights for Boston to pay • Can also return a packed word graph of histories; sum of path log probs equals acoustics / word-string joint log prob • [Figure: word graph over the words flights, lights, from, for, Boston, Austin, today, to, pay]
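The relation between a packed word graph and an n-best list can be sketched as follows. The graph topology and arc probabilities are invented for illustration, and the exhaustive path enumeration is only practical for tiny graphs; a real system would use a lazy k-best search instead.

```python
import heapq
import math

def n_best(graph, start, end, n):
    """Enumerate all paths in an acyclic word graph; return the n best.

    graph maps node -> list of (word, log_prob, next_node).
    A path's score is the sum of its arc log probabilities.
    """
    paths = []
    stack = [(start, [], 0.0)]
    while stack:
        node, words, logp = stack.pop()
        if node == end:
            paths.append((logp, " ".join(words)))
            continue
        for word, arc_logp, nxt in graph.get(node, []):
            stack.append((nxt, words + [word], logp + arc_logp))
    return heapq.nlargest(n, paths)

# Invented arc probabilities for a graph over the example words.
L = math.log
graph = {
    0: [("flights", L(0.7), 1), ("lights", L(0.3), 1)],
    1: [("from", L(0.8), 2), ("for", L(0.2), 2)],
    2: [("Boston", L(0.6), 3), ("Austin", L(0.4), 3)],
    3: [("today", L(0.8), 5), ("to", L(0.2), 4)],
    4: [("pay", L(1.0), 5)],
}

top2 = n_best(graph, start=0, end=5, n=2)
for logp, words in top2:
    print(f"{logp:.3f}  {words}")
```

Ten arcs here encode sixteen distinct word strings, which is the point of packing: the graph grows far more slowly than the list of hypotheses it represents.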
Probabilistic Graph Processing • The architecture we’re exploring in the context of spoken dialogue systems involves: • Speech recognizers that produce probabilistic word graph output • A tagger that transforms a word graph into a word/tag graph with scores given by joint probabilities • A parser that transforms a word/tag graph into a graph-based chart (as in CKY or chart parsing) • Allows each module to rescore output of previous module’s decision • Apply this architecture to speech act detection, dialogue act selection, and in generation
Prices rose sharply after hours • 15-best as a word/tag graph + minimization • [Figure: word/tag graph; arc labels include prices:NN, prices:NNS, rose:VBD, rose:VBP, rose:NN, rose:NNP, sharply:RB, after:IN, after:RB, hours:NNS]
Challenge: Beat n-grams • Backed off trigram models estimated from 300M words of WSJ provide best language models • We know there is more to language than two words of history • Challenge is to find out how to model it.
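For reference, a toy trigram language model is sketched below. Note that it uses fixed-weight linear interpolation of trigram, bigram, and unigram relative frequencies, a simpler relative of the backed-off estimators mentioned above, not a backoff model itself. The class name, weights, and two-sentence corpus are all assumptions for illustration.

```python
from collections import Counter

class InterpolatedTrigram:
    """Trigram model smoothed by fixed-weight linear interpolation."""

    def __init__(self, sentences, weights=(0.6, 0.3, 0.1)):
        self.w3, self.w2, self.w1 = weights
        self.tri, self.bi, self.uni = Counter(), Counter(), Counter()
        self.hist2, self.hist1 = Counter(), Counter()
        self.total = 0
        for sentence in sentences:
            toks = ["<s>", "<s>"] + sentence.split() + ["</s>"]
            for i in range(2, len(toks)):
                u, v, w = toks[i - 2], toks[i - 1], toks[i]
                self.tri[(u, v, w)] += 1
                self.bi[(v, w)] += 1
                self.uni[w] += 1
                self.hist2[(u, v)] += 1  # two-word history count
                self.hist1[v] += 1       # one-word history count
                self.total += 1

    @staticmethod
    def _ratio(num, den):
        return num / den if den else 0.0

    def prob(self, w, u, v):
        """P(w | u v): only the last two words of history are used."""
        p3 = self._ratio(self.tri[(u, v, w)], self.hist2[(u, v)])
        p2 = self._ratio(self.bi[(v, w)], self.hist1[v])
        p1 = self._ratio(self.uni[w], self.total)
        return self.w3 * p3 + self.w2 * p2 + self.w1 * p1

lm = InterpolatedTrigram(["prices rose sharply", "prices rose slowly"])
p = lm.prob("sharply", "prices", "rose")
print(p)
```

The `prob` signature makes the limitation explicit: nothing before the two-word history can influence the prediction, which is the gap a more linguistic model would have to fill.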
Conclusions • Need ranking of hypotheses for applications • Beam can reduce processing time to linear • need good statistics to do this • More linguistic features are better for stat models • can induce the relevant ones and weights from data • linguistic rules emerge from these generalizations • Using acoustic / word / tag / syntax graphs allows the propagation of uncertainty • ideal is totally online (model is compatible with this) • approximation allows simpler modules to do first pruning
Plugs • Run, don’t walk, to read: • Steve Abney. 1996. Statistical methods and linguistics. In J. L. Klavans and P. Resnik, eds., The Balancing Act. MIT Press. • Mark Seidenberg and Maryellen MacDonald. 1999. A probabilistic constraints approach to language acquisition and processing. Cognitive Science. • Dan Jurafsky and James H. Martin. 2000. Speech and Language Processing. Prentice-Hall. • Chris Manning and Hinrich Schütze. 1999. Foundations of Statistical Natural Language Processing. MIT Press.