Word Association Norms, Mutual Information, and Lexicography. Kenneth Ward Church, Patrick Hanks. Computational Linguistics, March 1990.
Abstract
• Word association (in psycholinguistics)
  - doctor … nurse (quicker response!)
• A statistical description of linguistic phenomena
  - semantic relations: doctor and nurse
  - lexico-syntactic constraints: verbs and prepositions
• Mutual information, association ratio
  - an objective measure obtained from large corpora
Meaning and Association
• Classification of words (in linguistics) is based on
  - meanings
  - co-occurrence with other words
  ex) bank: money, notes, loan, account, …, of England
      bank: river, swim, boat, …, of the Rhine
• A search for delicate word classes
  - dates back to 1948: verb patterns (Hornby's Advanced Learner's Dictionary)
• Recently available facilities
  - computational storage, NLP for large-scale text
  - now possible to see what company our words do keep!
Practical Applications
• Constraining the language model
  - speech recognition or OCR
  ex) the OCR system assigned equal probability to farm and form
      … federal (farm, form) credit …
      … some (farm, form) of …
• Providing disambiguation cues for parsing
• Retrieving texts from large databases (IR)
• Enhancing the productivity of lexicographers
Word Association and Psycholinguistics
• Reaction-time experiment (in psycholinguistics, 1975)
  - subjects classify successive letter strings as words or non-words
  - subjects pronounce the letter strings
  ex) BREAD --- BUTTER (faster)
      NURSE --- BUTTER
• Subjective method (1964)
  - empirical estimates of word association norms
  - a few thousand subjects are asked to write down the word that comes to mind after "doctor"
  ex) doctor: nurse, sick, health, … (70 words) … 200 words
An Information-Theoretic Measure (1)
• Mutual information [Fano1961]
  - I(x,y) = log2( p(x,y) / (p(x) p(y)) )
  - p(x) = f(x)/N, p(x,y) = f_w(x,y)/N
  - N: size of the corpus
    (15 million words, 1987 AP; 36 million words, 1988 AP; 8.6 million words, tagged corpus)
  - w: window size (5) [Table 1]
• What MI says about two words (x, y)
  - genuine association: MI(x,y) >> 0
  - no relationship: MI(x,y) ≈ 0
  - complementary distribution: MI(x,y) << 0
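The three regimes on this slide can be checked numerically. The sketch below is a minimal illustration, not the authors' code: it assumes the standard pointwise form I(x,y) = log2( p(x,y) / (p(x) p(y)) ) with probabilities estimated as counts over N, and the function name and all counts are hypothetical.

```python
import math


def mutual_information(f_x, f_y, f_xy, n):
    """I(x, y) = log2( p(x, y) / (p(x) * p(y)) ), with p = count / n.

    Algebraically this is log2( n * f_xy / (f_x * f_y) ), which avoids
    dividing three times by n.
    """
    if min(f_x, f_y, f_xy, n) <= 0:
        raise ValueError("all counts must be positive")
    return math.log2(n * f_xy / (f_x * f_y))


# Hypothetical counts in a corpus of one million words (not from the paper).
n = 1_000_000
strong = mutual_information(100, 200, 50, n)         # pair seen far more often than chance
neutral = mutual_information(1_000, 1_000, 1, n)     # pair seen exactly as often as chance
avoiding = mutual_information(10_000, 10_000, 1, n)  # pair seen far less often than chance

print(strong > 0, neutral == 0.0, avoiding < 0)
```

Note that the third case needs two very frequent words before a single joint occurrence counts as "less than chance", which is why strongly negative scores are hard to observe in practice.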
An Information-Theoretic Measure (2)
• Association ratio (AR)
  - an alternative measure of word association norms
  - based on MI
  - more objective and less costly than the subjective method
  - easy to scale up to a large portion of the language
  ex) doctor: dentists, nurses, treating, treat, hospitals, …
• Association ratio vs. MI
  - the joint probability is not symmetric: it encodes linear precedence
  ex) [Table 2]: asymmetry reflects biases ranging from sexism to syntax
  - frequency counting method: takes the window size into account
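The asymmetry comes from how pairs are counted: f(x,y) counts x followed by y, not y followed by x. The sketch below shows this on a toy sentence. It assumes that "within a window of w words" means x plus the next w - 1 words, and it omits any window-size normalization; both are assumptions, and the helper names and the toy text are made up for illustration.

```python
import math
from collections import Counter


def window_pair_counts(tokens, w=5):
    """Count ordered pairs (x, y) where y occurs among the w - 1 words after x."""
    counts = Counter()
    for i, x in enumerate(tokens):
        for y in tokens[i + 1 : i + w]:
            counts[(x, y)] += 1
    return counts


def association_ratio(x, y, tokens, w=5):
    """log2( N * f(x, y) / (f(x) * f(y)) ); None if the ordered pair never occurs."""
    n = len(tokens)
    word_counts = Counter(tokens)
    f_xy = window_pair_counts(tokens, w)[(x, y)]
    if f_xy == 0:
        return None
    return math.log2(n * f_xy / (word_counts[x] * word_counts[y]))


# Toy text, invented for this example.
tokens = "doctors and nurses treat patients while doctors and nurses rest".split()
pairs = window_pair_counts(tokens)

print(pairs[("doctors", "nurses")], pairs[("nurses", "doctors")])  # 2 1
print(association_ratio("doctors", "nurses", tokens))
print(association_ratio("nurses", "doctors", tokens))
```

Because the two ordered counts differ, the two scores differ as well, which plain (symmetric) mutual information would not allow.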
Characteristics of the Association Ratio
• Association ratio
  - when large: the same effect as the subjective method [Table 3]
  - AR threshold = 3.0 in this paper
  - MI << 0 can rarely be observed
  ex) p(x) = p(y) = 10^-5, so p(x)p(y) = 10^-10
      MI << 0 requires p(x,y) << 10^-10
      in practice, a probability below 10^-7 cannot be observed
  - so, compensate for the window size! [Table 3]
    divide f(x,y) by the window size
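The arithmetic behind "MI << 0 can rarely be observed" can be spelled out. The sketch below assumes a corpus of 15 million words (the 1987 AP figure quoted earlier in the deck; using it for this example is an assumption) and takes the probabilities from the slide.

```python
import math

n = 15_000_000      # assumed corpus size (1987 AP corpus)
p_x = p_y = 1e-5    # word probabilities from the slide

chance_p_xy = p_x * p_y          # joint probability if x and y were independent
expected_count = n * chance_p_xy # how often the pair should appear by chance
smallest_observable_p = 1 / n    # probability estimate for a single occurrence

print(f"chance p(x,y)        = {chance_p_xy:.1e}")
print(f"expected pair count  = {expected_count:.4f}")
print(f"smallest observed p  = {smallest_observable_p:.1e}")

# Even one occurrence of the pair yields a strongly positive score.
mi_if_seen_once = math.log2(smallest_observable_p / chance_p_xy)
print(f"MI after one sighting = {mi_if_seen_once:.2f}")
```

Under these assumptions the pair is expected far less than once, so a count of zero is what chance predicts, while a single sighting already pushes the score well above zero. Reliable negative scores would need a much larger corpus or much more frequent words.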
Lexico-Syntactic Regularities
• Identifying phrasal verbs
  ex) set up, set off, adhere to, … [Table 4]
• Phrasal verbs involving "to"
  - the preposition "to" is confused with the infinitive marker "to"
  - preprocess: compute associations over tags using a POS-tagged corpus (8.6 million words)
  ex) prep. "to": 768 verbs (alluding, amounted, relating, …)
      infin. "to": 551 verbs (obligated, trying, compelled, …)
• Associations between verbs and arguments
  - preprocess the corpus (44 million words) with a parser [Table 5, 6]
  - collect SVO (subject, verb, object) triples (4 million)
  - measure with the association ratio
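The last step, scoring verb-argument pairs from parsed triples, might look like the sketch below. It assumes the triples are already available as (subject, verb, object) tuples and that N is the number of triples; the function name, that choice of N, and the triples themselves are illustrative assumptions, not data from the paper.

```python
import math
from collections import Counter


def verb_object_association(triples):
    """Score each (verb, object) pair with log2( N * f(v, o) / (f(v) * f(o)) )."""
    n = len(triples)
    verb_counts = Counter(v for _, v, _ in triples)
    object_counts = Counter(o for _, _, o in triples)
    pair_counts = Counter((v, o) for _, v, o in triples)
    return {
        (v, o): math.log2(n * f_vo / (verb_counts[v] * object_counts[o]))
        for (v, o), f_vo in pair_counts.items()
    }


# Made-up triples standing in for parser output.
triples = [
    ("she", "drink", "tea"),
    ("he", "drink", "tea"),
    ("they", "drink", "water"),
    ("we", "save", "water"),
    ("he", "save", "money"),
    ("she", "save", "money"),
]

scores = verb_object_association(triples)
for pair in sorted(scores, key=scores.get, reverse=True):
    print(pair, round(scores[pair], 2))
```

Pairs such as drink/tea, which co-occur more often than their separate frequencies predict, rise to the top; pairs at chance level score zero.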
Applications in Lexicography (1)
• Large machine-readable corpora
  - only recently available
• Computational tools
  - still rather primitive: concordancing programs [Fig 1]
• Lexicographers in the '80s
  - are given the concordances of a word
  - mark up senses with colored pens
  - write syntactic descriptions and definitions
  ex) take, save, from: thousands of concordance lines!
      save: 666 lines from the 1988 AP corpus
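For readers who have not used one, a concordancing program lists every occurrence of a keyword with a few words of context on each side. A minimal keyword-in-context sketch follows; the function name, the context width, and the sample sentence are assumptions made for illustration.

```python
def concordance(tokens, keyword, width=4):
    """Return one (left context, keyword, right context) tuple per occurrence."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() == keyword:
            left = " ".join(tokens[max(0, i - width) : i])
            right = " ".join(tokens[i + 1 : i + 1 + width])
            lines.append((left, tok, right))
    return lines


# Invented sample text.
text = "They tried to save the forests from loggers and to save money"
kwic = concordance(text.split(), "save")

for left, key, right in kwic:
    # Right-align the left context so the keyword lines up in one column.
    print(f"{left:>25}  {key}  {right}")
```

With hundreds of such lines per word, the output is easy to produce but slow to read, which is the bottleneck the slide describes.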
Applications in Lexicography (2)
• Association between content and function words
  - save ~ from [Table 7]
• Helps categorize concordance lines [Fig 2]
  - save ~ from pattern: 65 of the 666 lines
  - how well do the 65 lines fit in with all uses of save?
  - invented semantic tags can be suggested from the AR
  ex) save the forests [ENV]
      save the lake [ENV]
      save the planet [ENV]
  - helps to choose a set of semantic tags
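Once the association ratio has flagged save ~ from as a pattern, the matching concordance lines can be pulled out mechanically. The sketch below is one hedged way to do that: the list of verb forms, the window of x plus the next w - 1 words, and the sample lines are all assumptions made for illustration.

```python
SAVE_FORMS = {"save", "saves", "saved", "saving"}  # assumed inflections of "save"


def has_save_from(tokens, w=5):
    """True if "from" occurs among the w - 1 words after a form of "save"."""
    for i, tok in enumerate(tokens):
        if tok.lower() in SAVE_FORMS:
            following = [t.lower() for t in tokens[i + 1 : i + w]]
            if "from" in following:
                return True
    return False


# Invented concordance lines.
lines = [
    "a plan to save the forests from logging",
    "volunteers worked to save the lake",
    "he saved the company from bankruptcy",
    "it will save taxpayers money",
]

save_from = [ln for ln in lines if has_save_from(ln.split())]
others = [ln for ln in lines if not has_save_from(ln.split())]

print(save_from)
print(others)
```

Splitting the lines this way gives the lexicographer a smaller, more homogeneous group to tag first, with the remainder left for separate inspection.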
Conclusions
• Psycholinguistic notion of word association
• MI ---> association ratio (AR)
• AR encodes very interesting patterns
  - semantic relations: doctor ~ nurse
  - lexico-syntactic constraints: save ~ from
• AR helps a lexicographer organize concordance lines
• Weak points of the association ratio
  - only distributional evidence: semantics is compositional!
  ex) AR favors set ~ for over set ~ down
  - extremely superficial
    natural similarities: picture ~ photograph
    cluster words into syntactic classes without a tagger or parser