Statistical techniques in NLP Vasileios Hatzivassiloglou University of Texas at Dallas
Learning • Central to statistical NLP • In most cases, supervised methods are used, with a separate training set • Unsupervised methods (clustering) recalculate the entire model on new data
Parameterized models • Assume that the observed (training) data D is described by a given distribution • This distribution, possibly with some parameters $\theta$, is our model $\mu$ • We want to maximize the likelihood function, $P(D \mid \theta)$ or $P(D \mid \mu)$
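Maximum likelihood estimation • Find the $\theta$ that maximizes $P(D \mid \theta)$, i.e., $\hat{\theta} = \arg\max_{\theta} P(D \mid \theta)$ • Example: Binomial distribution, where D is the number of successes in N trials and m is the success probability • $P(D \mid m) = \binom{N}{D} m^{D} (1 - m)^{N - D}$ • Setting the derivative with respect to m to zero gives $m = D/N$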
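The binomial example can be checked numerically. The sketch below is only an illustration: the counts (7 successes in 20 trials) and the function names are made up, and a grid search stands in for the analytical solution. It evaluates the binomial likelihood at many candidate values of m and confirms that the maximum sits at D/N.

```python
from math import comb

def binomial_likelihood(m, successes, trials):
    """P(D | m): probability of `successes` out of `trials` when each trial succeeds with probability m."""
    return comb(trials, successes) * m**successes * (1 - m) ** (trials - successes)

def mle_by_grid(successes, trials, steps=1000):
    """Scan candidate values of m on an evenly spaced grid and return the most likely one."""
    candidates = [i / steps for i in range(steps + 1)]
    return max(candidates, key=lambda m: binomial_likelihood(m, successes, trials))

# Hypothetical data: 7 successes in 20 trials, so D/N = 0.35.
best_m = mle_by_grid(successes=7, trials=20)
print(best_m)
```

The grid search is only there to make the maximization visible; in practice the closed form m = D/N is used directly.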
Smoothing • MLE assigns zero probability to unseen events • Example: trigrams in part of speech tagging (23% unseen) • Solution: smoothing (small probabilities for unseen data)
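As a minimal sketch of the idea, the code below compares plain MLE with add-one (Laplace) smoothing on a tiny made-up set of tag trigrams. Add-one is just one simple smoothing scheme, chosen here for brevity; the slide does not say which scheme was used, and the data and function names are assumptions.

```python
from collections import Counter

def mle_probs(counts, vocabulary):
    """Relative frequencies: unseen events get probability zero."""
    total = sum(counts.values())
    return {event: counts[event] / total for event in vocabulary}

def add_one_probs(counts, vocabulary):
    """Add-one smoothing: pretend every event was seen once more than it was."""
    total = sum(counts.values())
    return {event: (counts[event] + 1) / (total + len(vocabulary)) for event in vocabulary}

# Hypothetical training data: three observed tag trigrams, one possible trigram never seen.
observed = Counter(["DT NN VB", "DT NN VB", "DT JJ NN"])
vocabulary = ["DT NN VB", "DT JJ NN", "NN VB DT"]

mle = mle_probs(observed, vocabulary)
smoothed = add_one_probs(observed, vocabulary)
print(mle["NN VB DT"], smoothed["NN VB DT"])
```

The unseen trigram moves from probability zero to a small positive value, at the cost of shaving some mass off the observed ones.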
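Bayesian learning • It is often impossible to solve $\arg\max_{\theta} P(D \mid \theta)$ analytically • Bayes decision rule: choose the $\theta$ that maximizes $P(\theta \mid D)$ (minimum error rate) • But it may be hard to calculate $P(\theta \mid D)$ directly • Use Bayes' rule: $P(\theta \mid D) = \frac{P(D \mid \theta)\, P(\theta)}{P(D)}$ • Naïve Bayes: treat the observed features $d_1, \ldots, d_n$ of D as conditionally independent, so that $P(D \mid \theta) = \prod_{i=1}^{n} P(d_i \mid \theta)$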
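A small sketch of a Naïve Bayes classifier may make the decision rule concrete. Everything here is illustrative: the training examples, the labels, and the use of add-one smoothing for word probabilities are assumptions, not details from the slides. The classifier picks the label that maximizes the prior times the product of per-word likelihoods, computed in log space.

```python
from collections import Counter, defaultdict
from math import log

def train(examples):
    """examples: list of (label, words). Returns the counts needed for classification."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for label, words in examples:
        label_counts[label] += 1
        word_counts[label].update(words)
        vocabulary.update(words)
    return label_counts, word_counts, vocabulary

def classify(words, model):
    """Return the label maximizing log P(label) + sum of log P(word | label)."""
    label_counts, word_counts, vocabulary = model
    total = sum(label_counts.values())
    best_label, best_score = None, float("-inf")
    for label, count in label_counts.items():
        score = log(count / total)
        denominator = sum(word_counts[label].values()) + len(vocabulary)
        for word in words:
            # Add-one smoothing (an assumption) keeps unseen words from zeroing the product.
            score += log((word_counts[label][word] + 1) / denominator)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical context words for two senses of "bank".
model = train([
    ("money", ["deposit", "loan", "account"]),
    ("money", ["loan", "interest"]),
    ("river", ["water", "shore", "fishing"]),
    ("river", ["water", "boat"]),
])
print(classify(["loan", "account"], model))
print(classify(["water", "boat"], model))
```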
Examples • Gale et al 1992, 90% sense disambiguation accuracy (choose between “bank/money” and “bank/river”) • Hanks and Rooth 1990, prepositional phrase attachment • He ate pasta with cheese • He ate pasta with a fork • Both rely on observable features (nearby words, the verb)
Markov models • A stochastic process follows a sequence of states over time with some transition probabilities • If the process is stationary and with limited memory, we have a Markov chain • The model can be visible, or with hidden states (HMM)
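Example: N-gram language models • Result for a word depends only on the word and a limited number of neighbors • Part-of-speech tagging: maximize $P(t_1 \ldots t_n \mid w_1 \ldots w_n)$ over tag sequences $t_1 \ldots t_n$ for the words $w_1 \ldots w_n$ • With Bayes' rule, the chain rule, and independence assumptions, this becomes maximizing $\prod_{i=1}^{n} P(w_i \mid t_i)\, P(t_i \mid t_{i-N+1} \ldots t_{i-1})$ • Use an HMM for automatically adjusting back-off smoothing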
Example: Speech recognition • Need to find the correct sequence of words given the acoustic signal • Language model (N-gram) accounts for dependencies between words • Acoustic model maps from the visible (phonemes) to the hidden (words) level • HMM combines both • The Viterbi algorithm finds the most likely hidden sequence
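The Viterbi step can be sketched on a much smaller HMM than a speech recognizer. The example below uses a toy two-tag tagging model instead of phonemes and words; all probabilities, state names, and the function signature are invented for illustration.

```python
def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most probable hidden state sequence and its probability."""
    # best[t][s] = (probability of the best path ending in state s at time t, previous state)
    best = [{s: (start_p[s] * emit_p[s][observations[0]], None) for s in states}]
    for obs in observations[1:]:
        column = {}
        for s in states:
            column[s] = max(
                (best[-1][prev][0] * trans_p[prev][s] * emit_p[s][obs], prev)
                for prev in states
            )
        best.append(column)
    # Follow the back-pointers from the best final state.
    last = max(states, key=lambda s: best[-1][s][0])
    path = [last]
    for column in reversed(best[1:]):
        path.append(column[path[-1]][1])
    path.reverse()
    return path, best[-1][last][0]

# Hypothetical model parameters.
states = ["NOUN", "VERB"]
start_p = {"NOUN": 0.8, "VERB": 0.2}
trans_p = {"NOUN": {"NOUN": 0.3, "VERB": 0.7}, "VERB": {"NOUN": 0.8, "VERB": 0.2}}
emit_p = {
    "NOUN": {"dogs": 0.4, "chase": 0.1, "cats": 0.5},
    "VERB": {"dogs": 0.1, "chase": 0.8, "cats": 0.1},
}

path, probability = viterbi(["dogs", "chase", "cats"], states, start_p, trans_p, emit_p)
print(path, probability)
```

Dynamic programming keeps only the best path into each state at each step, so the cost grows linearly with the sequence length rather than exponentially.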
Expectation-Maximization • In general, we can iteratively estimate complex models with hidden parameters • Define a quality function Q as the conditional likelihood of the model given all parameters • Estimate Q from an initial choice for $\theta$ (expectation step) • Choose the new $\theta$ that maximizes Q (maximization step), and repeat until convergence
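The iterative procedure above can be illustrated with a standard toy problem: each batch of coin tosses comes from one of two biased coins, but which coin was used is hidden. This is a sketch under stated assumptions (equal prior on the two coins, made-up head counts and starting values, hypothetical function names), not the method of any system cited in the slides.

```python
from math import comb, log

def binomial(heads, tosses, theta):
    """Probability of `heads` out of `tosses` for a coin with bias theta."""
    return comb(tosses, heads) * theta**heads * (1 - theta) ** (tosses - heads)

def em_two_coins(head_counts, tosses, theta_a, theta_b, iterations=20):
    """Estimate the biases of two coins when the coin behind each batch is unobserved."""
    log_likelihoods = []
    for _ in range(iterations):
        heads_a = tails_a = heads_b = tails_b = 0.0
        total = 0.0
        for heads in head_counts:
            # Expectation step: how responsible is each coin for this batch?
            like_a = binomial(heads, tosses, theta_a)
            like_b = binomial(heads, tosses, theta_b)
            resp_a = like_a / (like_a + like_b)
            resp_b = 1.0 - resp_a
            heads_a += resp_a * heads
            tails_a += resp_a * (tosses - heads)
            heads_b += resp_b * heads
            tails_b += resp_b * (tosses - heads)
            total += log(0.5 * like_a + 0.5 * like_b)
        log_likelihoods.append(total)
        # Maximization step: re-estimate each bias from its expected counts.
        theta_a = heads_a / (heads_a + tails_a)
        theta_b = heads_b / (heads_b + tails_b)
    return theta_a, theta_b, log_likelihoods

# Hypothetical data: heads observed in five batches of 10 tosses each.
theta_a, theta_b, log_likelihoods = em_two_coins(
    [5, 9, 8, 4, 7], tosses=10, theta_a=0.6, theta_b=0.5
)
print(theta_a, theta_b)
```

The recorded log-likelihood never decreases from one iteration to the next, which is the property that makes the procedure usable even though it may stop at a local maximum.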
Example: PCFG parsing • Probabilistic context-free grammars • Likelihood of each rule (e.g., VP → V or VP → V NP) is a basic parameter • Combined probability of the entire tree gives the quality function • The inside-outside algorithm, the tree analogue of forward-backward, gives the solution • Lexicalization (Collins, 1996, 1997)
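To show how rule probabilities combine into a tree probability, here is a minimal sketch. The grammar, its probabilities, and the tuple encoding of trees are all assumptions made for the example; it covers only scoring a given tree, not parsing or parameter estimation.

```python
def tree_probability(tree, rule_probs):
    """Multiply the probabilities of all rules used in a parse tree.

    A tree is a tuple (label, child, child, ...); a child that is a plain
    string is a word (leaf).
    """
    label, *children = tree
    right_hand_side = tuple(c if isinstance(c, str) else c[0] for c in children)
    probability = rule_probs[(label, right_hand_side)]
    for child in children:
        if not isinstance(child, str):
            probability *= tree_probability(child, rule_probs)
    return probability

# Hypothetical grammar: probabilities of rules with the same left side sum to 1.
rule_probs = {
    ("S", ("NP", "VP")): 1.0,
    ("VP", ("V",)): 0.3,
    ("VP", ("V", "NP")): 0.7,
    ("NP", ("he",)): 0.4,
    ("NP", ("pasta",)): 0.6,
    ("V", ("ate",)): 1.0,
}

tree = ("S", ("NP", "he"), ("VP", ("V", "ate"), ("NP", "pasta")))
p = tree_probability(tree, rule_probs)
print(p)
```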
Example: Machine Translation • The noisy channel model (Brown et al., 1991) • Input in one language (e.g., English) is garbled into another (e.g., French) • Estimate probabilities of each word or phrase generating words or phrases in the other language and how many of them (fertility) • A similar approach: Transliteration (Knight, 1998)
Linear regression • Predict output as a linear combination of input variables • Choose weights that minimize the sum of squared residuals (least squares) • Can be computed efficiently via a matrix decomposition and inversion
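A short sketch of the least-squares fit, assuming NumPy is available. The data are made up so that the true weights are known, and two routes are shown: `np.linalg.lstsq`, and an explicit QR decomposition followed by solving the resulting triangular system.

```python
import numpy as np

# Hypothetical inputs; the targets follow y = 1 + 2*x1 - 1*x2 exactly.
X = np.array([[1.0, 2.0], [2.0, 0.0], [3.0, 1.0], [4.0, 3.0]])
y = np.array([1.0, 5.0, 6.0, 6.0])

# Add a column of ones so the first weight acts as the intercept.
A = np.column_stack([np.ones(len(X)), X])

# Route 1: library least-squares solver.
weights, *_ = np.linalg.lstsq(A, y, rcond=None)

# Route 2: QR decomposition, then solve R w = Q^T y.
Q, R = np.linalg.qr(A)
weights_qr = np.linalg.solve(R, Q.T @ y)

print(weights, weights_qr)
```

Working through a decomposition avoids forming and inverting the normal-equations matrix directly, which is better conditioned numerically.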
Log-linear regression • Ideal output is 0 or 1 • Because the distribution changes from normal to binomial, a transformed LS fit is not accurate • Solution: Use an intermediate predictor $\eta = \sum_i w_i x_i$, with $P(y = 1) = \frac{e^{\eta}}{1 + e^{\eta}}$ • The fit can be computed with iteratively reweighted least squares
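The following sketch fits a 0/1 output by iteratively reweighted least squares, assuming NumPy and assuming the logistic function as the link between the linear predictor and the output probability. The data set and function name are hypothetical; the classes deliberately overlap so that the fit has a finite solution.

```python
import numpy as np

def fit_logistic_irls(X, y, iterations=25):
    """Fit logistic regression weights by iteratively reweighted least squares."""
    w = np.zeros(X.shape[1])
    for _ in range(iterations):
        eta = X @ w                      # intermediate linear predictor
        p = 1.0 / (1.0 + np.exp(-eta))   # predicted probability of output 1
        weights = p * (1.0 - p)          # per-example weights for this round
        # Weighted least-squares (Newton) step: solve (X^T W X) delta = X^T (y - p).
        hessian = X.T @ (X * weights[:, None])
        w = w + np.linalg.solve(hessian, X.T @ (y - p))
    return w

# Hypothetical data: one input variable plus an intercept column.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([0.0, 0.0, 1.0, 0.0, 1.0, 1.0])
X = np.column_stack([np.ones_like(x), x])

w = fit_logistic_irls(X, y)
probs = 1.0 / (1.0 + np.exp(-(X @ w)))
print(w, probs)
```

Each round is an ordinary weighted least-squares problem; the weights change because they depend on the current predictions.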
Examples • Text categorization for information retrieval (Yang, 1998) • Many types of sentence/word classification • cue words (Passonneau and Litman, 1993) • prosodic features (Pan and McKeown, 1999)
Singular-value decomposition • A technique for reducing dimensionality; data points are projected onto a lower-dimensional space • Given matrix A (n×m), find matrices T (n×k), S (k×k), and D (k×m) so that their product approximates A (exactly, when k equals the rank of A) • S is a diagonal matrix holding the top k singular values of A • Projection is achieved by multiplying $T^{\top}$ and A • Application: Latent Semantic Indexing
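A minimal sketch of the truncated decomposition, assuming NumPy. The matrix is a made-up term-by-document count table, and the names T, S, and D follow the slide; the projection is taken to be the transpose of T times A, which is an assumption about the symbol missing from the slide.

```python
import numpy as np

# Hypothetical term-by-document counts (n = 4 terms, m = 4 documents).
A = np.array([
    [2.0, 1.0, 0.0, 0.0],
    [1.0, 2.0, 0.0, 0.0],
    [0.0, 0.0, 3.0, 1.0],
    [0.0, 0.0, 1.0, 3.0],
])

k = 2
T_full, singular_values, D_full = np.linalg.svd(A, full_matrices=False)

T = T_full[:, :k]                # n x k
S = np.diag(singular_values[:k]) # k x k, top k singular values
D = D_full[:k, :]                # k x m

A_k = T @ S @ D                  # rank-k approximation of A
projected = T.T @ A              # each document as a k-dimensional point

print(projected.shape)
print(np.linalg.norm(A - A_k))
```

The approximation error equals the root of the summed squares of the discarded singular values, so keeping the largest k loses the least.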
Methods without an explicit probability model • Use empirical techniques to directly provide output without calculating a model • Decision trees: Each node is associated with a decision on one of the input features • The tree is built incrementally by choosing features with the most discriminatory power
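One way to measure "discriminatory power" is information gain, the reduction in label entropy after splitting on a feature. The sketch below assumes that measure (the slide does not name one) and uses a tiny invented data set about prepositional phrase attachment; feature and function names are hypothetical.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy of a list of class labels, in bits."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def information_gain(rows, labels, feature):
    """Entropy of the labels minus the weighted entropy after splitting on `feature`."""
    remainder = 0.0
    for value in {row[feature] for row in rows}:
        subset = [label for row, label in zip(rows, labels) if row[feature] == value]
        remainder += len(subset) / len(labels) * entropy(subset)
    return entropy(labels) - remainder

def best_feature(rows, labels):
    """Pick the feature a tree builder would test at the next node."""
    return max(rows[0], key=lambda feature: information_gain(rows, labels, feature))

# Hypothetical examples: does "with X" attach to the verb or to the noun?
rows = [
    {"verb": "ate", "object_is_tool": "yes"},
    {"verb": "ate", "object_is_tool": "no"},
    {"verb": "cut", "object_is_tool": "yes"},
    {"verb": "cut", "object_is_tool": "no"},
]
labels = ["verb", "noun", "verb", "noun"]

chosen = best_feature(rows, labels)
print(chosen, information_gain(rows, labels, chosen))
```

Applying the same choice recursively to each subset of the data grows the tree one node at a time.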
Variations on decision trees • Shrinking to prevent over-training • Decision lists (Yarowsky 1997) use only the top feature for accent restoration
Rule induction • Similar to decision trees, but the rules are allowed to vary and contain different operators • Examples: RIPPER (Cohen 1996), transformation-based learning (Brill 1996), genetic algorithms (Siegel 1998)
Methods without explicit model • k-Nearest Neighbor classification • Neural networks • Genetic algorithms
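Of the three methods listed, k-nearest neighbor is the simplest to sketch: no model is fitted, and the training examples themselves are consulted at prediction time. The points, labels, Euclidean distance, and majority vote below are illustrative assumptions.

```python
from collections import Counter
from math import dist

def knn_classify(query, examples, k=3):
    """Label `query` by majority vote among its k closest training examples."""
    nearest = sorted(examples, key=lambda example: dist(query, example[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical two-dimensional feature vectors with class labels.
examples = [
    ((1.0, 1.0), "a"), ((1.0, 2.0), "a"), ((2.0, 1.0), "a"),
    ((6.0, 6.0), "b"), ((7.0, 6.0), "b"), ((6.0, 7.0), "b"),
]

print(knn_classify((2.0, 2.0), examples))
print(knn_classify((6.0, 5.0), examples))
```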
Support vector machines • Find the hyperplane that maximizes the margin, i.e., the distance to the nearest training points (the support vectors) • Non-linear transformation: From the original space to a separable space via a kernel function • Text categorization (Joachims, 1997), OCR (Burges and Vapnik, 1996), Speech recognition (Schmidt, 1996)
Classification issues • Two or many classes • Classifier confidence, probability of membership in each class • Training / test set distributions • Balance of training data across classes
When to use each method? • Probabilistic models depend on distributional assumptions • Linear models (and SVD) assume a normal data distribution, and generalized linear models a Poisson, binomial, or negative binomial distribution • Markov models capture limited dependencies • Rule-based models allow for multi-way classification more easily than linear/log-linear ones • For many applications, it is important to get a confidence estimate