Assignments
• Basic idea is to choose a topic of your own, or to take a study found in the literature
• Report is in two parts
  • Description of problem and review of relevant literature (not just the study you are going to replicate, but related things too)
  • Description and discussion of your own results
• First part (1000–1500 words) due on Friday 25 April
• Second part (1500–2000 words) due on Friday 9 May
• No overlap allowed with LELA30122 projects
  • Though you are free to use that list of topics for inspiration
  • See LELA30122 WebCT page, “project report”
Church et al. 1991
K Church, W Gale, P Hanks, D Hindle (1991) Using Statistics in Lexical Analysis, in U Zernik (ed) Lexical Acquisition: Exploiting on-line resources to build a lexicon. Hillsdale, NJ: Lawrence Erlbaum, pp. 115–164.
Background
• Corpora were becoming more widespread and bigger
• Computers were becoming more powerful
• But tools for handling them were still relatively primitive
• Use of corpora for lexicology
• Written for the First International Workshop on Lexical Acquisition, Detroit 1989
• In fact there was no “Second IWLA”
• But this paper (and others in the collection) became much cited and well known
The problem
• Assuming a lexicographer has at their disposal a reference corpus of considerable size, …
• A typical concordance listing only works well with
  • words with just two or three major sense divisions
  • preferably well distinct
  • and generating only a pageful of hits
• Even then, the information you are interested in may not be in the immediate vicinity
The solution
• Information Retrieval faces a comparable problem (overwhelming data), and suggests a solution:
  • Choose an appropriate statistic to highlight information “hidden” in the corpus
  • Preprocess the corpus to highlight properties of interest
  • Select an appropriate unit of text to constrain the information extracted
Mutual Information
• MI: a measure of similarity
• Compares the joint probability of observing two words together with the probabilities of observing them independently (chance):
  I(x;y) = log2 [ P(x,y) / (P(x)P(y)) ]
• If there is a genuine association, I(x;y) >> 0
• If no association, P(x,y) ≈ P(x)P(y), so I(x;y) ≈ 0
• If complementary distribution, I(x;y) << 0
• (A small worked sketch follows)
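A minimal sketch of the computation in Python; the counts in the example call are hypothetical, not figures from the paper:

    import math

    def mutual_information(f_xy, f_x, f_y, n):
        # I(x;y) = log2( P(x,y) / (P(x)P(y)) ), with probabilities
        # estimated as relative frequencies in a corpus of n words
        return math.log2((f_xy / n) / ((f_x / n) * (f_y / n)))

    # hypothetical counts in a 44.3m-word corpus
    print(mutual_information(f_xy=150, f_x=1984, f_y=3514, n=44_300_000))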
[Table: top ten MI-scoring pairs for strong y and powerful y; data from AP corpus, N = 44.3m words]
Mutual Information
• Can be used to demonstrate a strong association
• Counts can be based on immediate neighbourhood, as in the previous slide, or on co-occurrence within a window (to left or right or both), or within the same sentence, paragraph, etc (see the sketch below)
• MI shows strongly associated word pairs, but cannot show the difference between, eg, strong and powerful
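As a sketch of window-based counting (the window size and the token list are arbitrary assumptions); the resulting counts feed directly into the MI sketch above:

    from collections import Counter

    def window_cooccurrences(tokens, target, window=2):
        # count words occurring within +/- `window` positions of `target`
        counts = Counter()
        for i, tok in enumerate(tokens):
            if tok == target:
                for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
                    if j != i:
                        counts[tokens[j]] += 1
        return counts

    # eg f_xy = window_cooccurrences(tokens, "strong")["support"]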
t-test
• A measure of dissimilarity
• How to explain the relative strength of collocations such as
  • strong tea ~ powerful tea
  • powerful car ~ strong car
• The less usual combination is either rejected, or has a marked contrastive meaning
• Use the example of {strong|powerful} support, because tea is rather infrequent in the AP corpus
{strong|powerful} support
• MI can’t help: very difficult to get a value for I(powerful;support) << 0 because of the size of the corpus
• Say x and y both occur about 10 times per 1m words in a corpus
• P(x) = P(y) = 10^-5 and chance P(x)P(y) = 10^-10
• I(powerful;support) << 0 means P(x,y) << 10^-10
  • ie much less than 1 in 10,000,000,000
• Hard to say with confidence (see the check below)
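A quick check of why, using the slide's own numbers (a sketch; treating the pair as an immediate bigram is an assumption):

    # expected number of co-occurrences under independence
    n = 44_300_000            # AP corpus size
    p_x = p_y = 1e-5          # each word ~10 occurrences per 1m words
    expected = n * p_x * p_y  # ~0.0044 co-occurrences expected by chance
    # even at chance the expected count is far below 1, so observing
    # zero co-occurrences cannot establish that P(x,y) << 10**-10
    print(expected)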
Rephrase the question
• Can’t ask “what doesn’t collocate with powerful?”
• Also, can’t show that powerful support is less likely than chance: in fact it isn’t
  • I(powerful;support) = 1.74
  • ie about 3 times greater than chance (2^1.74 ≈ 3.3)!
• Instead, try to compare what words are more likely to appear after strong than after powerful
• Show that strong support is relatively more likely than powerful support
t-test
• Null hypothesis (H0)
  • H0 says that there is no significant difference between the scores
• H0 can be rejected if the difference is at least 1.65 standard deviations
  • 95% confidence
  • ie we can be reasonably sure the difference is real
t-test
• Comparison of powerful support with chance is not significant
  • t = 0.99 (less than 1 sd!)
• But if we compare powerful support with strong support, t = –13
• Strongly suggests there is a difference (a sketch of both t-scores follows)
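A sketch of both t-scores in Python; the Poisson approximation (variance ≈ count) follows the standard collocation t-score, and the counts in the example are hypothetical:

    import math

    def t_vs_chance(f_xy, f_x, f_y, n):
        # bigram count against the independence baseline;
        # variance of the count approximated by the count itself
        expected = f_x * f_y / n
        return (f_xy - expected) / math.sqrt(f_xy)

    def t_compare(f_x_w, f_y_w):
        # compare counts of w after x with counts of w after y,
        # eg f(strong support) vs f(powerful support)
        return (f_x_w - f_y_w) / math.sqrt(f_x_w + f_y_w)

    print(t_compare(175, 10))  # hypothetical counts, t ~ 12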
How is this useful?
• Helps lexicographers recognize significant patterns
• Especially useful for learners’ dictionaries, to make explicit the difference in distribution between near synonyms
  • eg what is the difference between a strong nation and a powerful nation?
  • Strong as in strong defense, strong economy, strong growth
  • Powerful as in powerful posts, powerful figure, powerful presidency
Taking advantage of POS tags
• Looking at context in terms of POS rather than lexical items may be more informative
• Example: how can we distinguish to as an infinitive marker from to as a preposition?
• Look at the words which immediately precede to (sketched below)
  • able to, began to, … vs back to, according to, …
• t-score can show that they have a different distribution
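A sketch of how such counts might be gathered from a POS-tagged corpus; using NLTK's Brown corpus (where infinitival to is tagged TO and prepositional to is tagged IN) is an assumption, not the paper's setup:

    from collections import Counter
    from nltk.corpus import brown  # requires nltk.download('brown')

    before_inf, before_prep = Counter(), Counter()
    tagged = brown.tagged_words()
    for (w1, t1), (w2, t2) in zip(tagged, tagged[1:]):
        if w2.lower() == "to":
            if t2 == "TO":        # infinitive marker
                before_inf[w1.lower()] += 1
            elif t2 == "IN":      # preposition
                before_prep[w1.lower()] += 1
    print(before_inf.most_common(10))
    print(before_prep.most_common(10))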
• A similar investigation works with the subordinate conjunction that (fact that, say that, that the, that he) and the demonstrative pronoun that (that of, that is, in that, to that)
• Look at both the preceding and the following word
• Distribution is so distinctive that this process can help us to spot tagging errors
subordinate conjunction
t       w that/cs   w that/dt   w
14.19      227          2       so/cs

demonstrative pronoun
t       w that/cs   w that/dt   w
–12.25       1        151       of/in
If your corpus is parsed
• Looking for word sequences can be limiting
• More useful if you can extract things like subjects and objects of verbs
• (Can be done to some extent by specifying POS tags within a window, but that’s very noisy)
• Assuming you can easily extract, eg, Ss, Vs, and Os … (see the sketch below)
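A sketch of what such extraction might look like today with a dependency parser; spaCy and its small English model are assumptions, and any parsed corpus would do:

    import spacy

    nlp = spacy.load("en_core_web_sm")  # model must be installed separately
    doc = nlp("The committee rejected the controversial proposal.")
    for tok in doc:
        if tok.dep_ == "nsubj":   # subject of its head verb
            print("S:", tok.text, " V:", tok.head.text)
        elif tok.dep_ == "dobj":  # direct object of its head verb
            print("O:", tok.text, " V:", tok.head.text)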
What is an appropriate unit of text?
• Mostly we have looked at neighbouring words, or words within a defined context
• Bigger discourse units can also provide useful information
• eg taking the entire text as the unit:
  • How do stories that mention food differ from stories that mention water?
• More subtle distinctions can be brought out in this way
  • What’s the difference between a boat and a ship?
• Notice how immediately neighbouring words won’t necessarily tell much of a story
• But words found in stories that mention boats/ships help to characterize the difference in distribution, and give a clue as to the difference in meaning
• Notice that the human lexicographer still has to interpret the data
Word-sense disambiguation
• The article also shows how you can distinguish two senses of bank
• Identify words which occur in the same text as bank and river on the one hand, and bank and money on the other (a sketch follows)
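A minimal sketch of this document-as-unit counting (the stories variable and its tokenization are assumptions); the resulting counts can then be compared with the t_compare sketch above:

    from collections import Counter

    def story_counts(stories, anchor, partner):
        # count all words in stories (token lists) containing both
        # the anchor (bank) and the disambiguating partner word
        counts = Counter()
        for story in stories:
            if anchor in story and partner in story:
                counts.update(story)
        return counts

    # river_counts = story_counts(stories, "bank", "river")
    # money_counts = story_counts(stories, "bank", "money")
    # then t_compare(river_counts[w], money_counts[w]) for each w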
bank (river) vs bank (money)

Words drawn towards bank&river:
t      bank&river   bank&money   w
6.63        45            4      river
4.90        28           13      River
4.01        20           13      water
3.57        16           11      feet
3.46        23           39      miles
3.44        21           32      near
3.27        12            5      boat
3.06        14           16      south
2.83         8            1      fisherman
2.83        21           49      along
2.76        11           12      border
2.74        17           35      area
2.72         9            6      village
2.71         7            0      drinking
2.70        16           32      across
2.66         9            7      east
2.58         7            2      century
2.53        10           13      missing

Words drawn towards bank&money:
t        bank&river   bank&money   w
-15.95        6           467      money
-10.70        2           199      Bank
-10.60        0           134      funds
-10.46        0           131      billion
-10.13        0           124      Washington
-10.13        0           124      Federal
 -9.43        0           110      cash
 -9.03        1           134      interest
 -8.79        1           129      financial
 -8.79        0            98      Corp
 -8.38        1           121      loans
 -8.17        0            87      loan
 -7.57        0            77      amount
 -7.44        0            75      fund
 -7.38        1           102      William
 -7.31        1           101      company
 -7.25        1           101      account
 -7.25        0            72      deposits
Bank vs bank

Words drawn towards Bank (capitalised):
t       Bank   bank   w
35.02   1324     24   Gaza
34.03   1301     36   Palestinian
33.60   1316     48   Israeli
33.18   1206     26   Strip
32.98   1204     29   Palestinians
32.68   1339     72   Israel
31.56   4116   1284   Bank
31.13   1151     47   occupied
30.79   1104     40   Arab
27.97    867     21   territories

Words drawn towards bank (lower case):
t        Bank   bank   w
-36.48   1284   3362   bank
-10.93    900   1161   money
-10.43    624    859   federal
 -9.59    586    786   company
 -8.47    282    430   accounts
 -8.26    544    693   central
 -8.21    408    554   cash
 -8.21    675    816   business
 -7.74    546    676   loans
 -7.54     52    140   robbery