Word Prediction in Hebrew: Preliminary and Surprising Results
Yael Netzer, Meni Adler, Michael Elhadad
Department of Computer Science, Ben Gurion University, Israel
Outline
• Objectives and example
• Methods of Word Prediction
• Hebrew Morphology
• Experiments and Results
• Conclusions?
Word Prediction - Objectives
• Ease word insertion in textual software:
• by guessing the next word
• by giving a list of possible options for the next word
• by completing a word given a prefix
• General idea: guess the next word given the previous ones. [Input w1 w2] [guess w3]
Word Prediction Example
I s_____
I s_____ (verb? adverb?)
I s_____ (verb: sang? maybe. singularized? hopefully not)
I saw a _____
I saw a _____ (noun / adjective)
I saw a b____
I saw a b____ (brown? big? bear? barometer?)
I saw a bird in the _____
I saw a bird in the _____ (semantics would help here)
I saw a bird in the z____
I saw a bird in the z____ (obvious?)
Statistical Methods
• Statistical information
• Unigrams: probability of isolated words
• Independent of context; offer the most likely words as candidates
• More complex language models (Markov models)
• Given w1..wn, determine the most likely candidate for wn+1
• The most common method in applications is the unigram (see references in [Garay-Vitoria and Abascal, 2004])
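To make the n-gram approach concrete, here is a minimal sketch (not the authors' code) of unigram/bigram prediction; the toy corpus and function names are illustrative assumptions.

```python
from collections import Counter, defaultdict

def train(tokens):
    # Count isolated-word frequencies (unigrams) and next-word
    # frequencies conditioned on the previous word (bigrams).
    unigrams = Counter(tokens)
    bigrams = defaultdict(Counter)
    for prev, cur in zip(tokens, tokens[1:]):
        bigrams[prev][cur] += 1
    return unigrams, bigrams

def predict(unigrams, bigrams, prev_word, menu_size=5):
    # Use the bigram distribution when the context was seen in training,
    # otherwise back off to context-independent unigram frequencies.
    dist = bigrams.get(prev_word) or unigrams
    return [word for word, _ in dist.most_common(menu_size)]

tokens = "i saw a bird in the tree and i saw a cat".split()
unigrams, bigrams = train(tokens)
print(predict(unigrams, bigrams, "saw"))  # ['a']
```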
Syntactic Methods
• Syntactic knowledge
• Consider sequences of part-of-speech tags: [Article] [Noun] predicts [Verb]
• Phrase structure: [Noun Phrase] predicts [Verb]
• Syntactic knowledge can be statistical or based on hand-coded rules
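A toy sketch of the hand-coded-rule variant, assuming an invented table of which part of speech may follow which; the table and names are for illustration only.

```python
# Invented POS-sequence rules: which categories may follow a given category.
ALLOWED_NEXT = {
    "article": {"noun", "adjective"},
    "adjective": {"noun", "adjective"},
    "noun": {"verb", "preposition"},
}

def filter_by_pos(candidates, prev_pos, pos_of):
    # Keep only candidates whose POS is licensed after the previous word.
    allowed = ALLOWED_NEXT.get(prev_pos, set())
    return [w for w in candidates if pos_of(w) in allowed]

pos_of = {"bear": "noun", "brown": "adjective", "sang": "verb"}.get
print(filter_by_pos(["bear", "brown", "sang"], "article", pos_of))
# ['bear', 'brown']
```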
Semantic Methods
• Semantic knowledge
• Assign semantic categories to words
• Find a set of rules that constrain the possible candidates for the next word
• [eat verb] predicts [word of category food]
• Not widely used in word prediction, mostly because it requires complex hand coding and is too inefficient for real-time operation
Word Prediction Knowledge Sources
• Corpora: texts and frequencies
• Vocabularies (can be domain-specific)
• Lexicons with syntactic and/or semantic knowledge
• User's history
• Morphological analyzers
• Unknown-word models
Evaluation of Word Prediction
• Keystroke savings: 1 - (# of actual keystrokes / # of expected keystrokes)
• Time savings
• Overall satisfaction
• Cognitive overload (length of the choice list vs. accuracy)
• A predictor is considered adequate if it achieves a high hit ratio while requiring few selections.
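A one-line rendering of the keystroke-savings formula above, assuming "expected keystrokes" means typing the message out in full:

```python
def keystroke_savings(actual, expected):
    # 1 - (# of actual keystrokes / # of expected keystrokes)
    return 1 - actual / expected

# e.g. 42 keypresses for a message that would take 100 to type in full:
print(f"{keystroke_savings(42, 100):.0%}")  # 58%
```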
Work in non-English Languages
• Languages with rich morphology: n-gram-based methods offer quite reasonable prediction [Trost et al. 2005] but can be improved with more sophisticated syntactic/semantic tools
• Suggestions for inflected languages (e.g. Basque):
• Use two lexicons: stems and suffixes
• Add syntactic information to dictionaries and grammatical rules to the system; offer stems and suffixes
• Combine these two approaches: offer inflected nouns
Motivation for Hebrew
• We need word prediction for Hebrew
• No previously published research on word prediction for Hebrew is known
• We wanted to test our morphological analyzer in a useful application
Initial Hypothesis
Word prediction in Hebrew will be complicated; morphological and syntactic knowledge will be needed.
Hebrew Ambiguity
• Unvocalized writing: most vowels are "dropped": inherent → inhrnt
• Affixation: prepositions and possessives are attached to nouns: in her note → inhrnt; in her net → inhrnt
• Rich morphology:
• 'inhrnt' could be inflected into different forms according to singular/plural and masculine/feminine properties: inhrnti, inhrntit, inhrntiot
• Other morphological properties may leave 'inherent' unmodified (construct/absolute forms for noun compounding)
Ambiguity Level
• These variations create a high level of ambiguity.
• English lexicon: inherent → inherent.adj
• With Hebrew word-formation rules, inhrnt →
• in.prep her.pro.fem.poss note.noun
• in.prep her.pro.fem net.noun
• inherent.adj.masc.absolute
• inherent.adj.masc.construct
• Part-of-speech tagset:
• Hebrew: theoretically ~300K; in practice ~3.6K distinct forms
• English: 45-195 tags
• Number of possible morphological analyses per word:
• English: 1.4 (average # words / sentence: 12)
• Hebrew: 2.7 (average # words / sentence: 18)
(Real Hebrew) Morphological Ambiguity
• בצלם bzlm
• בְּצֶלֶם bzelem (name of an association)
• בְּצַלֵּם b-zalem (while taking a picture)
• בְּצָלָם bzalam (their onion)
• בְּצִלָּם b-zila-m (under their shades)
• בְּצַלָּם b-zalam (in a photographer)
• בַּצַּלָּם ba-zalam (in the photographer)
• בְּצֶלֶם b-zelem (in an idol)
• בַּצֶּלֶם ba-zelem (in the idol)
Morphological Analysis
Given a written form, recover the following information:
• Lexical category (part of speech): noun, verb, adjective, adverb, preposition…
• Inflectional properties: gender, number, person, tense, status…
• Affixes
• Prefixes: מ ש ה ו כ ל ב (prepositions, conjunctions, definiteness)
• Pronoun suffixes: accusative, possessive, nominative
Morphological Analysis
Example: given the form בצלם, propose the following analyses:
• בְּצֶלֶם: בצלם proper-noun
• בְּצַלֵּם: בצלם verb, infinitive
• בְּצָלָם: בצל-ם noun, singular, masculine
• בְּצִלָּם: ב-צל-ם noun, singular, masculine
• בְּצַלָּם / בְּצֶלֶם:
• ב-צלם noun, singular, masculine, absolute
• ב-צלם noun, singular, masculine, construct
• בַּצַּלָּם / בַּצֶּלֶם: ב-צלם noun, definite, singular, masculine
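One natural way to hold the output of such an analyzer is a map from each surface form to its list of candidate analyses. This sketch mirrors the בצלם example above; the field names are invented for illustration and are not the analyzer's actual API.

```python
from dataclasses import dataclass

@dataclass
class Analysis:
    segmentation: str  # prefix/stem/suffix split of the surface form
    pos: str
    features: str

ANALYSES = {
    "בצלם": [
        Analysis("בצלם", "proper-noun", ""),
        Analysis("בצלם", "verb", "infinitive"),
        Analysis("בצל-ם", "noun", "singular, masculine"),
        Analysis("ב-צל-ם", "noun", "singular, masculine"),
        Analysis("ב-צלם", "noun", "singular, masculine, absolute"),
        Analysis("ב-צלם", "noun", "singular, masculine, construct"),
        Analysis("ב-צלם", "noun", "definite, singular, masculine"),
    ],
}

print(len(ANALYSES["בצלם"]), "candidate analyses for one written form")
```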
Morphological Disambiguation
A difficult task in Hebrew: given a written form, select in context the correct morphological analysis out of all possible analyses.
We have developed a successful* system to perform morphological disambiguation in Hebrew [Adler et al., ACL06, ACL07, ACL08].
*93% accuracy for POS tagging and 90% for full morphological analysis; this system was used in the experiments below.
Word Prediction in Hebrew
• We looked at word prediction as a sample task to show off the quality of our morphological disambiguator
• But first… we checked a simple baseline
Baseline: n-gram Methods
• Check n-gram methods (unigram, bigram, trigram)
• Four sizes of selection menu: 1, 5, 7 and 9
• Various training sets of 1M, 10M and 27M words to learn the n-gram probabilities
• Various genres
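The keystroke figures can be obtained by replaying a test corpus against the predictor. A rough simulation sketch, assuming a predict(context_words, typed_prefix, menu_size) interface (an extension of the n-gram predictor sketched earlier that also respects the typed prefix):

```python
def simulate_keystrokes(words, predict, menu_size):
    # Replay the message: before each keypress, consult a prediction menu;
    # selecting the target word from the menu costs one keystroke.
    keystrokes = 0
    for i, word in enumerate(words):
        typed = ""
        for ch in word:
            if word in predict(words[:i], typed, menu_size):
                keystrokes += 1  # menu selection replaces the rest of the word
                break
            typed += ch
            keystrokes += 1      # type the next character
        keystrokes += 1          # space / word delimiter
    return keystrokes
```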
Prediction Results Using n-grams Only
[Table: keystrokes needed to enter a message, in % (smaller is better)]
For the trigram model trained on the 27M-word corpus: very good results!
Adding Syntactic Information
P(wn | w1,…,wn-1) = λ1·P(wn-i,…,wn | LM) + λ2·P(w1,…,wn | μ)
• μ is the morpho-syntactic HMM (the morphological disambiguator)*
• Combine P(w1,…,wn | μ) with the probabilistic language model LM in order to rank each candidate word given the previously typed words.
• For example, if the user typed I saw and the next-word candidates are {him, hammer}, we use the HMM model to calculate p(I saw him | μ) and p(I saw hammer | μ), in order to tune the probability given by the n-gram model.
*Trained on a 1M-word corpus.
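A minimal sketch of this interpolation, with placeholder lm_prob and hmm_prob functions standing in for the trained models and the λ weights chosen arbitrarily:

```python
def rank_candidates(prefix, candidates, lm_prob, hmm_prob, lam1=0.7, lam2=0.3):
    # Score each candidate by a weighted mix of the n-gram LM probability
    # and the morpho-syntactic HMM probability of the extended prefix.
    def score(word):
        return lam1 * lm_prob(prefix, word) + lam2 * hmm_prob(prefix + [word])
    return sorted(candidates, key=score, reverse=True)

# e.g. for the "I saw" example above:
# rank_candidates(["I", "saw"], ["him", "hammer"], lm_prob, hmm_prob)
```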
Results with Morpho-syntactic Knowledge
[Table: results when modeling sequences of parts of speech with morphological features, alongside the results without syntactic knowledge]
Some Notes on the Results
• n-grams perform very well (high level of keystroke savings)
• High rates across all genres
• And, as expected:
• Better prediction when trained on more data
• Better prediction with trigrams
• Better prediction with a larger selection window
• Morpho-syntactic information did not improve the results (in fact, it hurt!)
Conclusion
• Statistical data alone, on a language with rich morphology, yields good results:
• keystrokes needed drop to as low as 29% with nine word proposals
• 34% with seven proposals
• 54% with a single proposal
• Syntactic information did not improve the prediction.
• Possible explanation: morphology did not help because p(w1,…,wn | μ) is computed over an unfinished sentence.
תודה Thank you
Technical Information
• N-grams: CMU
• Storage: Berkeley DB, used to store the word-prediction knowledge (n-gram mappings)
• More questions on the technology: meni.adler@gmail.com