400 likes | 790 Views
Word Prediction. Words do not randomly appear in text. The probability of a word appearing in a text is to a large degree related to the words that have appeared before it. e.g. I’d like to make a collect ...
E N D
Word Prediction • Words do not randomly appear in text. The probability of a word appearing in a text is to a large degree related to the words that have appeared before it. • e.g. I’d like to make a collect ... • Call is the most likely next word, but other words such as telephone, international... are also possible. Other (very common) words are unlikely (e.g. dog, house).
Word Prediction • Word prediction is very useful for applications such as: • Speech recognition: It is possible to select between words that are hard for a speech recognizer to distinguish. • Augmentative communication for the disabled: Speech generation systems can become more effective. • Spelling error detection: • They are leaving in about 15 minuets. • He is trying to fine out. • Word prediction is also related to the problem of computing the probability of a sentence.
Counting Words in Corpora • Text corpus: A collection of text (or speech). • Brown corpus: 1M words of text. • Switchboard corpus: 3 million words of speech. • Word counting in corpora: • Punctuation may count as words or not. • Are “don’t”,”O’Reilly”,“non-carbonated” one or two words. • Are “They” and “they” different or the same word. • Many of these choices depend on the application.
Words in Corpora • Tokens: Total numbers of running words in corpus (possibly including punctuation). • Types: Number of distinct words in corpus. • Wordform: The inflected word as it appears in the corpus. e.g. cat =/= cats. • Lemma: Set of word formshaving the same stem and the same word sense. e.g. cat and cats are not distinguished. • Switchboard corpus: 2.4M tokens, 20,000 types • Brown corpus: 1M tokens, 61,805 types, 37,851 lemmata
Simple N-Grams • Probabilistic models of word sequence • Simplest model: Every word may follow any other word. All words have equal probability. • More complex: The probability of appearance of each word depends on its frequency in the corpus. e.g. • the appears 69,971 times in Brown corpus (7%) • rabbit appears 12 times (0.001%) • But suppose we have the sentence: • Just then the white...
Markov Assumption • The probability of the appearance of a word depends on the words that have appeared before it. P(rabbit | Just then the white) • Impossible to calculate this probability from a corpus. The exact word sequence would have to appear in the corpus. • Markov simplifying assumption: we approximate the probability of a word given all the previous words with the probability given only the previous word. P(rabbit | Just then the white)= P(rabbit | white) • First used by Markov (1913) to calculate if the upcoming letter in Pushkin’s Eugene Onegin would be a vowel or a consonant
Bigrams, Trigrams ... N-grams • When looking only in the previous word we call the model a bigram model (or first order Markov model). • When looking two words back we have a trigram model (or second order Markov model). • Generally, looking N-1 words back, we get an N-gram model (or N-1th order Markov model). • The probability of a sentence is calculated by multiplying the (approximated) probability of each word in the sentence. • e.g: P(I want to eat Chinese food)=P(I | <s>)P(want | I)P(to | want)P(eat | to)P(Chinese | eat)P(food | Chinese)
N-gram Probability Calculation • We count the appearances of a bigram in our training corpus and then we normalize by dividing by the number of appearances of the first word. P(Wn | Wn-1)=C(Wn-1 Wn) / C(Wn-1) • Problem with N-grams is that many valid N-grams will be missing from the finite training corpus. Many bigrams that should have same probability will have zero, or very small probability.
I: 3437 , want:1215 , to:3256 , eat: 938 , Chinese: 213 , food: 1506 , lunch: 459
Smoothing • Smoothing (or discounting) is the process of assigning some probability to zero or low probability N-grams. • Add-one smoothing is the adding of one appearance to every possible N-gram. This method doesn’t work well because it takes too much probability away from likely N-grams and shares it among many impossible N-grams. e.g. P(to | want) changes from 0.65 to 0.28
Smoothing • Witten-Bell Discounting: We use the count of N-grams we have only seen once to help us estimate the count of things we have not yet seen. We compute the probability of seeing a new N-gram for the first time by counting the N-grams in the corpus that have appeared at least once. This moves far less probability from likely to unlikely events and gives better models than the add-one smoothing.
Training and Test sets • The statistical parameters of a model are trained on a training corpus and then the quality of the model is tested on a test set. • Selection of training corpus is important. If it is too specific it might not generalize well to new sentences. If it is too general it may not be reflective of the domain of the application where we want to use it. • Selection of the test sentences is also critical. If the test sentence comes from the training corpus then the performance will be artificially high. If the domain of the test corpus is different than that of the training corpus then the statistical model will not reflect the properties of the test set. • Usually we have the same corpus and we divide it into a training set and a test set.
N-grams and Training Corpora • Example of artificially generated sentences showing the performance of N-grams for various N and the dependence of the model on the domain training corpus. • Works of Shakespeare • Unigram: Will rash been and by I the me loves gentle me not slavish page, the and hour, ill let • Bigram: Why dost stand forth thy canopy, forsooth; he is this palpable hit the King Henry. Live king. Follow. • Trigram: Sweet Prince, Falstaff shall die. Harry of Monmouth’s grave. • Quadrigram: Enter Leonato’s brother Antonio, and the rest, but seek the weary beds of people sick.
N-grams and Training Corpora • Similar sentences generated from the Wall Street journal corpus. • Unigram: Months the my and issue of wear foreign new exchange’s September were recession exchange new endorsed a acquire to six executives. • Bigram: Last December through the way to preserve the Hudson corporation N. B. E. C. Taylor would seem to complete the major central planners ... • Trigram: They also point to ninety nine point six billion dollars from two hundrent oh six three per cent of the rates of interest ...
Context-Sensitive Spelling Error Correction • Spelling errors which result in real words cannot be found without looking at the context of a word. • According to empirical studies between 15% to 40% of spelling errors result in real words. • Local errors are errors that can be detected by looking at the immediate surrounding of the words. Global errors require looking at a larger context.
Context-Sensitive Spelling Error Correction • Generate every possible misspelling of a word that results in a real word. These are generated based on typographical transformations (letter insertion, deletion, substitution) or based on lists of homophones (e.g. piece – peace) • Then given all these possible words we select the word that maximizes the N-gram probability of the sentence that contains the word in question.
Part of Speech Tagging (POST) • POST is the process of assigning to each word in a text its part of speech or in general its word class . • Input: A text and a set of possible word classes (tagset). • Output: The text, where next to each input word its word class has been marked. A good post is often the first step for many applications such as: • Speech Generation and Recognition • Machine Translation • Information Retrieval • Computational Lexicography • Word sense disambiguation • Prepositional phrase attachment disambiguation
Word Classes and Tagsets • For English there is a number of different tagset differing in the number of word classes contained in each. • Brown Corpus tagset: 87 • Penn Treebank tagset: 45 • C5 (British National Corpus): 67 • C7: 146 • For Greek different tagsets have been used ranging from 58 to 584 word classes. Example of Sentence with POS annotation: The/DT grand/JJ jury/NN commented/VBD on/IN a/DT number/NN of/IN other/JJ topics/NNS ./.
Word Classes Number of Words 1 WC 35340 2 WC 3760 3 WC 264 4 WC 61 5 WC 12 6 WC 2 7 WC 1 (still) Ambiguity in POST • Ambiguity is a serious problem for POST, i.e. a word may belong to more than one word classes. E.g. Book/VB that/DT flight/NN. Book: VB(verb) ή NN(noun) That: DET(determiner) ή IN (subordinative conjuction) • Many of the most commonly occurring words are ambiguous: only 11.5% of the lemmata in the Brown Corpus are ambiguous, but 40% of the word tokens are ambiguous.
POST Algorithms • Rule-Based: They use a large pool of hand written rules which are used for word class disambiguation. • Stochastic and Probabilistic: They use a training corpus to calculate the probability that a certain word belongs to a certain word class, depending on the context of the word in the text. • Transformation-based Learning (Brill algorithm): Combination of rule-based and stochastic approaches.
Rule-Based POST • Usually two-stages: • Use a dictionary or morphological analysis to assign to each word all its possible POS. • Use a list of hand-written disambiguation rules to select only one POS for each word. • ENGTWOL tagger: • First Stage: Uses a 2-level morphological analyses to find possible word classes for each word. Also uses some additional syntactic information (e.g. Verb subcategorisation) • Second Stage: Uses about 1,100 constraints to rule out incorrect POS
ENGTWOL Rules • Rules used to eliminate tags inconsistent with the context. E.g: ADVERBIAL-THAT RULE Given input: “that” If (+1 A/ADV/QUANT); // If the next word is adj, adverb or quantifier (+2 SENT-LIM); // and following is a sentence boundary (NOT –1 SVOC/A); // and the previous is not a verb that allows adj as //complements (e.g. consider) then eliminate non-ADV tags else eliminate ADV tag • This rule checks if the word “that” has an adverbial reading: e.g. This isn’t that odd.
Stochastic Part-of-Speech Tagging • The intuition is to pick the most-likely tag for each word. • We select the tag that maximizes the probability: P(word|tag)*P(tag|previous n tags) • Markov simplifying assumption: P(tag|previous n tags)=P(tag|previous tag) so we want to maximize P(word|tag)* P(tag|previous tag) • The probabilities P(word|tag) and P(tag1|tag2) have been calculated by counting occurrences of words and tags in annotated training corpora.
Stochastic POST Example • Find the POS of the word “race” in the sentence: Secreteriat is expected to race tomorrow. • Is race a noun (NN) or a verb (VB). Assuming we know that the correct tagging for to is TO we have to calculate: P(VB|TO)P(race|VB) and P(NN|TO)P(race|VB) • From frequency counts of the Brown and the Switchboard corpus we have: P(NN|TO) = .021, P(VB|TO) = .34, P(race|NN)=.00041, P(race|VB)=.00003 So, P(VB|TO)P(race|VB)=.00001 > P(NN|TO)P(race|NN)=.000007 • To tag an entire sentence instead of just one word, we use a Hidden Markov Model
Transformation-Based POST • Like rule-based systems it is using rules to decide which word class will be selected for each word. However, like stochastic systems, it learns these rules from a preannotated training corpus. • An ordered list of transformations is kept by the algorithm. These rules are applied on the text we want to tag. Each transformation is composed of a rewrite rule and an activation condition. e.g. • Rewrite Rule: Change the tagging from MD (modal) to NN (noun) • Activation Condition: The tagging of the previous word is DT (determiner) • The/DT can/MD rusted/VB ./. The/DT can/ΝΝ rusted/VB ./. .
Learning of Transformation Rules • Initially the training text is given as input without any tagging and an initial tagging is performed. e.g. most likely tags. • In each training stage a number of possible transformations (based on certain patterns) is performed on the text Of these transformations we select the transformation that produced the annotation that is closest to the correct annotation of the training text. • The training continues iteratively until no transformation can be found that causes any improvement in the tagging of the text.
Allowed Transformations • The transformations may be lexicalized or not. Non Lexicalized: Change the tagging of a word from A to B if: • The previous word is tagged as Z • One of the two following words is tagged as Z • The previous word is tagged as Z and the next as W Lexicalized: Change the tagging of a word from A to B if: • The previous word is w • One of the two following words is w • The previous word is w, the next word is x and the previous word is tagged as Z
Accuracy of Tagging • Factors affecting accuracy: • Amount of training data available. • Tag set. • Difference between the training corpus and dictionary and the corpus of application. • Number of unknown words (coverage of the dictionary). • A ‘dumb’ tagger that always assigns the most common part of speech can get accuracy of about 90% (for English). Most taggers report accuracies between 95-97%. • These number can be somewhat misleading. 97% means probability of 63% to get every word in a 15 word sentence correct. When POS is just preprocessing for other applications, very high accuracy is required.
Collocations • A collocation is an expression of two or more words that correspond to a conventional way of saying things. e.g. strong tea, weapons of mass destruction, make up … • Various definitions. e.g. (Choueka, 1988) • A collocation is defined as a sequence of two or more consecutive words, that has characteristics of a syntactic and semantic unit and whose exact and unambiguous meaning or connotation cannot be derived directly from the meaning or connotation of its components. • Adjacency of words is not always necessary (e.g. knock...door).
Applications of Collocations • Language Generation: To make the generated language sound natural. • Computational Lexicography: To identify important collocations for listing in dictionaries. • Parsing: For preferring parses containing natural collocations. • Corpus Linguistic Research: For studying the social phenomena as they are expressed in the language.
Criteria for Collocations • Non-compositionality: The meaning of a collocation is not a straightforward composition of the meanings of its parts. e.g. white in white wine, white hair and white woman refers to somewhat different colors depending on the collocation. • Non-substitutability: We cannot substitute other words for the components of a collocation. E.g. we cannot say yellow wine instead of white wine. • Non-modifiability: Most collocation cannot be freely modified with additional lexical material. E.g. we cannot say kicked the wooden bucket.
Discovering Collocations • We can search for collocations based on their frequency in a corpus 80.871 of the 13.689 of a 58.841 in the 13.361 by the 26.430 to the 13.183 with the 21.842 on the 12.622 from the 21.839 for the 11.428 New York 18.568 and the 10.007 he said 16.121 that the 9.775 as a 15.630 at the 9.231 is a 15.494 to be 8.753 has been 13.899 in a 8.573 for a Most common bigrams from the New York Times corpus (14M words)
Discovering Collocations • But such bigrams are not really collocations. A better approach is to filter based on part-of-speech. 11.487 New York A-N 7.261 United States A-N 5.412 Los Angeles N-N 3.301 last year A-N 3.191 Saudi Arabia N-N 2.699 last week A-N 2.514 vice president A-N
Mean and Variance • Often collocations are not consecutive words. E.g knock and door in • She knocked on his door • A man knocked on the metal front door • They knocked at the door • We need to count frequencies of co-occurrence of the words in a window (usually 3 to 4 words to the side of each word). • We can also calculate the mean and the variance of the distances between the words. Low variance means that the words have a relatively fixed structural relationship (i.e. they usually occur at the same distance)
Hypothesis Testing • Based on frequency it is possible that coocurrance of two frequently occurring words might be accidental.Hypothesis testing is a method for testing if this co-occurrence is accidental. We formulate a null hypothesis that these co-occurrences are accidental and compute the probability p of the words occurring together if the hypothesis is true. If p is too low we reject the hypothesis (i.e. the words are related). • Various test have been proposed in the statistics literature for determining the probability of the words occurring together: t test, testing of differences, Person’s chi-test, likelihood ration. • Mutual Information is a measure of word relation based on information theory. It tells us how much the information we have about the occurrence of a word increases when we have information about the occurrence of another word.