

  1. Handling of missing values in lexical acquisition Núria Bel Universitat Pompeu Fabra

  2. By Automatic Lexical Information Acquisition we try to find out how to build repositories of language-dependent lexical information automatically. Many technologies behind applications (MT, IE, automatic summarization, sentiment analysis, opinion mining, question answering, etc.) need this information to work. Example entries borrowed from the MT system Incyta (Metal family):

  ("paralelo" AST ALO "paralel" ATR POST CL (PF-AS PM-OS SF-A SM-O) FC (NPP) LY AMENTE MC ("a") PLC (NG) PRED (ESTAR SER) TA (OBJ-P REL) AUTHOR "juan" DATE "31-Aug-99" SITE "FB52")

  ("fiesta" NST ALO "fiest" CL (PF-AS SF-A) GD (F) KN MS PLC (NF) TYN (ABS) AUTHOR "juan" DATE "28-Aug-99" SITE "FB52")

  3. Cue-Based Lexical Acquisition • Differences in the distribution of certain contexts separate words of different classes (Harris, 1951). • For example: some / *many mud ("mud" is a mass noun: it occurs with "some" but not with "many"). • Words (types) can be represented as a collection of contexts, where occurrence or non-occurrence in each context is taken as a hint or cue for classifying the word as belonging to a particular class, as in the sketch below.
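A minimal sketch (not from the talk) of what such cue extraction could look like, assuming hypothetical regular-expression cues over raw text; the cue names and patterns are made up for illustration:

    import re

    # Hypothetical cues: each maps a cue name to a text pattern. A match of
    # the pattern counts as one observation of that cue for the target word.
    CUES = {
        "mass_det": r"\b(some|much|little)\s+{w}s?\b",   # mass-noun determiners
        "count_det": r"\b(many|several|few)\s+{w}s?\b",  # count-noun determiners
        "plural": r"\b{w}s\b",                           # naive plural form
    }

    def cue_vector(word, corpus_text):
        """Count how often each cue context is observed for `word`."""
        counts = []
        for name, template in CUES.items():
            pattern = template.format(w=re.escape(word))
            counts.append(len(re.findall(pattern, corpus_text, re.IGNORECASE)))
        return counts

    text = ("There was some mud on the road. "
            "Many cars drove past, and some cars stopped.")
    print(cue_vector("mud", text))  # [1, 0, 0]: only the mass-noun cue fires
    print(cue_vector("car", text))  # [1, 1, 2]: count-noun cues fire as well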

  4. A word's occurrences are represented as vectors and used to train a classifier:

  @data
  15,2,8,4,0,8,1,0,1,0,0,0,0,0

  Each value is the number of times the word has been observed in one of the defined contexts. Non-occurrence in a particular context is as informative as occurrence. We use supervised classifiers (Support Vector Machines, Decision Trees) to predict the class (Abstract, Mass, etc.) of new words, as in the sketch below.
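A minimal sketch of this training step, assuming scikit-learn; the first vector is the one shown above, while the other vectors, the labels, and the new word's counts are made up for illustration:

    from sklearn.svm import SVC
    from sklearn.tree import DecisionTreeClassifier

    # Each row is the cue-count vector of one word (type); each label says
    # whether that word belongs to the target class (e.g. Mass).
    X_train = [
        [15, 2, 8, 4, 0, 8, 1, 0, 1, 0, 0, 0, 0, 0],  # vector from the slide
        [0, 0, 1, 0, 3, 0, 0, 2, 0, 1, 0, 0, 4, 0],
        [7, 1, 0, 2, 0, 5, 0, 0, 2, 0, 1, 0, 0, 0],
        [0, 0, 0, 1, 2, 0, 0, 3, 0, 0, 0, 1, 2, 1],
    ]
    y_train = ["mass", "not_mass", "mass", "not_mass"]

    svm = SVC(kernel="linear").fit(X_train, y_train)
    tree = DecisionTreeClassifier().fit(X_train, y_train)

    # Predict the class of a new, unseen word from its cue counts.
    x_new = [[3, 0, 2, 1, 0, 2, 0, 0, 1, 0, 0, 0, 0, 0]]
    print(svm.predict(x_new), tree.predict(x_new))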

  5. Cues, classification and state-of-the-art results • Merlo and Stevenson (2001) selected very specific cues for classifying verbs into a number of Levin (1993)-based verbal classes: animacy of the subject, passives, etc. • Baldwin (2005) used general features, such as the POS tags of neighboring words, for type classification. • Joanis et al. (2007) used the frequency of filled syntactic positions or slots, tense and voice of occurring verbs, etc., to describe the whole system of English verbal classes. • Results are difficult to compare, but accuracies are around 70%.

  6. The problem: missing values

  7. The sparse data problem • Joanis and Stevenson (2003), Joanis et al. (2007) and Korhonen et al. (2008) mention that they have to face the problem of sparse data: many of the types/words are low in frequency and supply very little information. • Most words will appear very few times (Zipf distribution) and will therefore show few cues. • Yallop et al. (2005) calculated that in the 100M-word British National Corpus, out of a total of 124,120 distinct adjectives, 70,246 occur only once. • The cues we can use as information are mutually exclusive in any single occurrence: an adjective can be prenominal or postnominal, but if it occurs only once, it will show only one of these cues, the others having a zero value. • Even for types that occur more than once, the optional nature and variety of the contexts of occurrence also give rise to missing values.

  8. Zero values and learning • Zero values create not only the problem of having enough information to decide, but also a further uncertainty when learning from the data. • A zero value could indeed be a negative value, i.e. the cue is that the context has not been observed; but it could also be that the cue simply was not observed in the examined corpus, for various reasons. • When there are many zero values, the cue loses its predictive power because of this uncertainty. • Katz (1987) and Baayen and Sproat (1996), among others, acknowledged the importance of preprocessing low-frequency events, and Joanis et al. (2007) also decided to smooth the data, even when working with more than 1,000 occurrences per verb in the BNC.

  9. Our smoothing experiment: Harmonization based on linguistic information

  10. Intuitively: how likely is it that a 0 is just an unobserved feature and not a true 0, given the values of the other observations? To classify Abstract/Concrete nouns in English: • Cue 1 is a suffix such as "-ness", "-ism", ... for Abstract (Light, 1996) • Cue 2 is a determiner such as "such", "little", "much", ... for Abstract • Cue 3 is an adjective like "big", "small", ... for Concrete

  P(cue_1=1 | [0,1,0]) = P(abstract=yes | [0,1,0]) * P(cue_1=1 | abstract=yes)
                       + P(abstract=no | [0,1,0]) * P(cue_1=1 | abstract=no)
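A worked instance of the formula with made-up probabilities (none of the numbers below come from the talk):

    # Observed cue vector: cue_1 unseen, cue_2 seen, cue_3 unseen.
    obs = [0, 1, 0]

    # Hypothetical estimates, purely for illustration:
    p_abstract_given_obs = 0.8    # P(abstract=yes | [0,1,0])
    p_cue1_given_abstract = 0.6   # P(cue_1=1 | abstract=yes)
    p_cue1_given_concrete = 0.05  # P(cue_1=1 | abstract=no)

    # Law of total probability over the class:
    p_cue1 = (p_abstract_given_obs * p_cue1_given_abstract
              + (1 - p_abstract_given_obs) * p_cue1_given_concrete)
    print(p_cue1)  # 0.8*0.6 + 0.2*0.05 = 0.49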

  11. We use the information of the observed features to assess the likelihood of a particular unobserved cue. • Harmonization is replacing 0 values with the likelihood of their being 1, given the other cues observed. • BUT ... in order to get P(cue_1=1 | [0,1,0]) we need to have P(cue_n | class) for all cues in the vector.

  12. The challenge: how to get P(cue_n | class) with so many 0's in the data? By estimating P(cue_n | class) with linguistic information:

                Abstract  Concrete
  Suffix=no        0.5       1.0
  Suffix=yes       0.5       0.0
  SC_Adj=no        1.0       0.5
  SC_Adj=yes       0.0       0.5

  "The probability of being Concrete and having suffix '-ness' is 0."
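A minimal sketch of the full harmonization step under these estimates. The P(class | observed cues) factor is computed naive-Bayes style with an assumed uniform class prior; that modeling choice is an illustration, not necessarily the talk's exact procedure:

    # Linguistically estimated P(cue=1 | class), from the table above.
    P_CUE_GIVEN_CLASS = {
        "abstract": {"suffix": 0.5, "sc_adj": 0.0},
        "concrete": {"suffix": 0.0, "sc_adj": 0.5},
    }
    CLASS_PRIOR = {"abstract": 0.5, "concrete": 0.5}  # assumed uniform
    CUES = ["suffix", "sc_adj"]

    def p_class_given_obs(obs):
        """Naive-Bayes-style P(class | observed cues), using only the
        cues that were actually observed (value 1)."""
        scores = {}
        for cls, prior in CLASS_PRIOR.items():
            score = prior
            for cue, value in zip(CUES, obs):
                if value == 1:
                    score *= P_CUE_GIVEN_CLASS[cls][cue]
            scores[cls] = score
        total = sum(scores.values())
        return {cls: s / total for cls, s in scores.items()}

    def harmonize(obs):
        """Replace each 0 with P(cue=1 | the other observed cues)."""
        posterior = p_class_given_obs(obs)
        return [value if value != 0 else
                sum(posterior[cls] * P_CUE_GIVEN_CLASS[cls][cue]
                    for cls in posterior)
                for cue, value in zip(CUES, obs)]

    # Suffix observed once: the class is pinned to Abstract, and the SC_Adj
    # zero stays a true 0 because P(SC_Adj=1 | abstract) = 0.
    print(harmonize([1, 0]))  # [1, 0.0]
    # Nothing observed: fall back to the priors.
    print(harmonize([0, 0]))  # [0.25, 0.25]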

  13. Harmonization effects in Spanish Mass experiment

  14. Results of the experiments (accuracy, %)

                   Spanish Mass      English Abstract
  Experiment       DT       SVM      DT       SVM
  Mean             74.2     63.8     57.8     61.0
  Trimmed mean     77.5     67.4     55.6     61.0
  Frequency        79.9     79.1     61.4     64.1
  Harmonized       82.8     80.7     76.1     70.1
  Baseline         74.8              61.5

  15. Error Analysis & Future Work • Frequency information, which helped to filter out noise, has been neutralized. • Future work: how to handle missing values and noise together.

  16. Thanks for your attention!
