
Association Measures



  1. Association Measures

  2. Reminder: Contingency Tables

3. General Remarks • we will only use data from contingency tables • we will consider each pair type on its own, independently from all other pair types (→ no distributional information) • we won't distinguish between relational and positional cooccurrences

4. Association Measures (AMs) • goal: assign association score to each pair type = strength of association between components • high score = strong association • association in a statistical sense, but there is no precise definition • positive vs. negative association ("colourless green ideas")

5. Using Association Scores • absolute values (cut-off threshold) • input for higher-order statistics (AMs are first-order statistics) → scores should be meaningful • ranking of collocation candidates → only relative scores matter • rank collocates of given base → one marginal frequency fixed → only two free parameters

  6. First Steps: Proportions • Workshop on Mechanized Documentation (Washington, 1964)

7. First Steps: Proportions • proportions between 0 and 1 • high proportion = strong (directional) association • need to combine two proportions into a single association score • average (P1 + P2) / 2 is not useful • f=1, f1=1, f2=1000 → avg. = 0.5005 • f=50, f1=100, f2=100 → avg. = 0.5 → need more "conservative" weighting

  8. First Steps: Proportions • harmonic mean • geometric mean • minimum • Jaccard
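All four measures can be written directly in terms of the joint frequency f = O11 and the marginal frequencies f1 and f2. A minimal Python sketch (function name and test frequencies are illustrative, not from the slides):

```python
import math

def proportion_measures(f, f1, f2):
    """Combine the directional proportions P1 = f/f1 and P2 = f/f2
    into a single score, using the four measures from this slide."""
    p1, p2 = f / f1, f / f2
    return {
        "harmonic mean (Dice)": 2 * f / (f1 + f2),       # = 2*p1*p2 / (p1 + p2)
        "geometric mean":       f / math.sqrt(f1 * f2),  # = sqrt(p1 * p2)
        "minimum":              min(p1, p2),
        "Jaccard":              f / (f1 + f2 - f),
    }

# the two cases from slide 7, which the plain average scored 0.5005 vs. 0.5
print(proportion_measures(1, 1, 1000))    # all four far below 0.5
print(proportion_measures(50, 100, 100))  # all four at or near 0.5
```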

9. First Steps: Proportions • coefficients range from 0 to 1 • 1 = total (positive) association • interpretation of lower scores is less clear • positive vs. negative association? • which score for no association? • what is "no association"?? → random combinations

10. Expected Frequencies • assume that types u and v cooccur only by chance • f1(u) occs. of u and f2(v) occs. of v spread randomly over N tokens • each instance of u has a chance of f2(v)/N to cooccur with a v → expected # of cooccurrences: E11 = f1(u) · f2(v) / N

11. Expected Frequencies • expected frequencies for all cells of the contingency table: Eij = Ri · Cj / N • assuming random combinations (→ statistical independence)
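A sketch of the computation (helper name is mine); as the next slide notes, the expected table reproduces the row and column sums of the observed one:

```python
def expected_frequencies(O11, O12, O21, O22):
    """Expected cell frequencies under independence: Eij = Ri * Cj / N."""
    R1, R2 = O11 + O12, O21 + O22  # row sums
    C1, C2 = O11 + O21, O12 + O22  # column sums
    N = R1 + R2
    return [[R1 * C1 / N, R1 * C2 / N],
            [R2 * C1 / N, R2 * C2 / N]]

print(expected_frequencies(50, 50, 50, 850))
# [[10.0, 90.0], [90.0, 810.0]] -- E11 = 100 * 100 / 1000 = 10
```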

  12. Expected Frequencies • comparison of expected against observed frequencies • note that row and column sums are the same for both tables

13. Mutual Information • compares O11 with E11 • ratio O11/E11 ranges from 0 to ∞ • 1 = no association (O11 = E11) • usually logarithmic values: MI = log (O11/E11) • range: −∞ to +∞ • 0 = no assoc., < 0 neg., > 0 pos. • used in English lexicography
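A one-line sketch using base-2 logarithms (the base is a convention and only rescales the scores):

```python
import math

def mutual_information(O11, E11):
    """(Pointwise) MI: 0 = no association, > 0 positive, < 0 negative."""
    return math.log2(O11 / E11)
```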

14. Low-Frequency Pairs & Random Variation • large amount of low-frequency data (consequence of Zipf's law) • a simple (invented) example • A: f=50, f1=100, f2=100, N=1000 → O11=50, E11=10, MI = log 5 • B: f=1, f1=1, f2=1, N=1000 → O11=1, E11=0.001, MI = log 1000
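Reproducing both cases with the base-2 sketch above (the slide leaves the log base open):

```python
print(mutual_information(50, 10))    # case A: log2 5    ~ 2.32
print(mutual_information(1, 0.001))  # case B: log2 1000 ~ 9.97, ranked far above A
```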

15. Low-Frequency Pairs & Random Variation • three problems with case B • how meaningful is a single example? (not very much, actually) • could well be a spelling mistake or noise from automatic processing • we want to make generalisations (from particular corpus to "language") → this is the domain of statistics: draw inferences about population (= language) from a sample (= corpus)

16. The Statistical Model: Random Sample • assumption: corpus data is a random sample from the language → base data is a random sample from all coocs. in the language

17. The Statistical Model: Random Sample • random sample of size N is described by random variables Ui and Vi (i = 1..N), representing the labels of the i-th bigram token • notation: U and V as "prototypes" • for a given pair type (u,v), contingency table can be computed from Ui and Vi → random variables X11, X12, X21, X22

18. The Statistical Model: Random Sample • population parameters π11, π12, π21, π22 for pair type (u,v) • observed frequencies O11, O12, O21, O22 represent one particular realisation of the sample • theory of random samples predicts distribution of X11, X12, X21, X22 from assumptions about the population parameters π11, π12, π21, π22

19. The Statistical Model: Random Sample

20. Two Footnotes • vector notation for cont. tables • population ≠ general language • restricted to domain(s), genre(s), ... covered by source corpus • e.g. black box in computer science vs. newspapers vs. cooking

  21. The Sampling Distribution • multinomial sampling distribution • each individual cell count Xij has a binomial distribution (but these are not independent)
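A simulation sketch of this model (population parameters invented for illustration): sampled tables are multinomial draws, and each cell on its own behaves like a binomial with mean N·πij:

```python
import numpy as np

rng = np.random.default_rng(1)
pi = [0.01, 0.09, 0.09, 0.81]  # hypothetical pi11, pi12, pi21, pi22; sum to 1
N = 1000

# 10000 sampled contingency tables (X11, X12, X21, X22)
tables = rng.multinomial(N, pi, size=10000)

# the X11 column is binomial B(N, pi11): mean ~ N*pi11 = 10, variance ~ 9.9
print(tables[:, 0].mean(), tables[:, 0].var())
```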

22. The Sampling Distribution • given assumptions about the population parameters, we can compute the likelihood of the observed contingency table • relatively high likelihood = consistent with assumptions • relatively low likelihood = evidence against assumptions (inversely proportional to likelihood)

23. Adequacy of the Statistical Model • particular sequence of pair tokens is irrelevant, only the overall frequencies matter (→ sufficiency) • randomness assumption (random sample from fixed population) • independence of pair tokens • constancy of population parameters • violations problematic only when they affect sampling distribution

24. Adequacy of the Statistical Model • three causes of non-randomness • local dependencies (e.g. syntax) → usually not problematic • inhomogeneity of source corpus (speakers, domains, topics, ...) → mixture population • repetition / clustering of bigrams → can be a serious problem (does not affect segment-based data if clustered within segments)

25. Making Assumptions about the Population Parameters • population parameters (π, π1, π2) are unknown • best guess from observation: MLE = maximum-likelihood estimate

26. Making Assumptions about the Population Parameters • conditional probabilities with MLE • Dice coefficient etc. are MLE for population characteristics • MI is MLE for log(π / (π1 · π2)) → unreliable for small frequencies
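A sketch of the MLEs and of MI as a plug-in estimate. The example table corresponds to case A from slide 14, filled out so that f1 = f2 = 100 and N = 1000:

```python
import math

def mle(O11, O12, O21, O22):
    """Maximum-likelihood estimates of pi (joint) and pi1, pi2 (marginals)."""
    N = O11 + O12 + O21 + O22
    return O11 / N, (O11 + O12) / N, (O11 + O21) / N

p, p1, p2 = mle(50, 50, 50, 850)
# MI is the plug-in (MLE) estimate of log(pi / (pi1 * pi2))
print(math.log2(p / (p1 * p2)))  # = log2(0.05 / 0.01) = log2 5, as on slide 14
```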

27. The Null Hypothesis • null hypothesis H0: no association = independence of instances, i.e. P(U=u ∧ V=v) = P(U=u) · P(V=v) • not all parameters determined • MLE maximises probability of observed data under H0

28. Likelihood Measures • probability of observed data under H0 (with MLE) • probability of a single cell: X11 should be the most "informative"

29. Likelihood Measures • small likelihood values = strong association • computed probabilities are often extremely small • use negative base-10 logarithm → more convenient scale → high scores indicate strong association
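One common instantiation scores the single cell X11 by its binomial likelihood under H0; a sketch (function name is mine):

```python
import math
from scipy.stats import binom

def binomial_likelihood_measure(O11, E11, N):
    """-log10 of the binomial likelihood of O11 under H0 (p = E11/N);
    higher scores = smaller likelihood = stronger association."""
    return -math.log10(binom.pmf(O11, N, E11 / N))

print(binomial_likelihood_measure(50, 10, 1000))  # case A from slide 14
```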

  30. Problems of Likelihood Measures • three reasons for low likelihood • observed data is inconsistent with the null hypothesis because of strong association • association may also be negative (fewer coocs. than expected) • observed data is consistent, but probability mass is spread across many similar contingency tables

31. Problems of Likelihood Measures • high frequency = low likelihood • e.g. binomial likelihood • O11=1, E11=1 → L = 0.3679 • O11=1000, E11=1000 → L = 0.0126 • O11=4, E11=1 → L ≈ 0.0126 • need to "normalise" likelihood • NB: likelihood association measures often have good empirical results nonetheless

  32. Likelihood Ratios • simplest normalisation technique • divide maximum probability of data under H0 by unconstrained maximum probability • suggested by Dunning (1993)
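A sketch of the resulting statistic in its usual form G² = 2 Σ Oij · log(Oij / Eij), i.e. −2 log of the likelihood ratio:

```python
import math

def log_likelihood_ratio(O11, O12, O21, O22):
    """Dunning's log-likelihood statistic for a 2x2 contingency table."""
    R1, R2 = O11 + O12, O21 + O22
    C1, C2 = O11 + O21, O12 + O22
    N = R1 + R2
    E = [R1 * C1 / N, R1 * C2 / N, R2 * C1 / N, R2 * C2 / N]
    O = [O11, O12, O21, O22]
    # convention: a zero cell contributes 0 * log 0 = 0
    return 2 * sum(o * math.log(o / e) for o, e in zip(O, E) if o > 0)

print(log_likelihood_ratio(50, 50, 50, 850))  # case A: large value = strong evidence
```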

  33. Statistical Hypothesis Tests • compute probability of group of outcomes instead of single one • observed contingency table is grouped with all tables that provide at least the same amount of evidence against H0 • total probability is known as the p-value or significance • problem: ranking of cont. tables

34. Asymptotic Tests • asymptotic tests define ranking of contingency tables explicitly • compute test statistic from data • higher values = more evidence against H0 • can use test statistic as an AM • theory: approximation of p-value associated with test statistic (accurate in the limit N → ∞)

35. Asymptotic Tests • standard test for independence is Pearson's chi-squared test • limiting distribution = χ² distribution with df = 1 • number of degrees of freedom was subject of a long debate
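For a 2×2 table the statistic has a familiar closed form; a sketch that also converts it to the asymptotic p-value via the χ² distribution:

```python
from scipy.stats import chi2

def pearson_chi_squared(O11, O12, O21, O22):
    """Pearson's X2 for a 2x2 table, plus the asymptotic p-value (df = 1)."""
    R1, R2 = O11 + O12, O21 + O22
    C1, C2 = O11 + O21, O12 + O22
    N = R1 + R2
    x2 = N * (O11 * O22 - O12 * O21) ** 2 / (R1 * R2 * C1 * C2)
    return x2, chi2.sf(x2, df=1)

print(pearson_chi_squared(50, 50, 50, 850))  # X2 ~ 197.5, p ~ 0
```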

  36. Two-Sided Tests • chi-squared test is two-sided, i.e. no difference between positive and negative association • ignore small number of pairs with (non-total) negative association • or convert to one-sided test:reject H0 only when O11 > E11 • p-value is usually divided by 2

37. Yates' Continuity Correction • Pearson's chi-squared test approximates discrete binomial distributions of each cell by continuous normal distribution (→ "normal theory") • estimating probabilities P(Xij ≤ k) from normal distribution introduces systematic errors

  38. Yates' Continuity Correction

  39. Yates' Continuity Correction

  40. Yates' Continuity Correction • generic form of Yates' continuity correction for contingency tables • usefulness is still controversial (criticised as too conservative) • applicability for chi-squared test is generally accepted
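For the 2×2 case the correction amounts to shrinking |O11·O22 − O12·O21| by N/2 before squaring; a sketch:

```python
def chi_squared_yates(O11, O12, O21, O22):
    """Pearson's X2 with Yates' continuity correction for a 2x2 table."""
    R1, R2 = O11 + O12, O21 + O22
    C1, C2 = O11 + O21, O12 + O22
    N = R1 + R2
    d = max(abs(O11 * O22 - O12 * O21) - N / 2, 0.0)  # never below zero
    return N * d ** 2 / (R1 * R2 * C1 * C2)

print(chi_squared_yates(50, 50, 50, 850))  # slightly smaller than uncorrected X2
```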

  41. Asymptotic Tests • different form of chi-squared test (comparison of two binomials) is equivalent to independence test • special eq. with Yates' correction

42. Asymptotic Tests • can also use log-likelihood ratio as a test statistic (two-sided) • limiting distribution is found to be χ² distribution with df = 1 • more conservative than Pearson's chi-squared test • Dunning (1993) showed that Pearson's test over-estimates evidence against H0 (simulation)

43. Something I'd Rather Not Mention • Church & Hanks: O11 and E11 are both random variables • H0: expected values are equal • assume normal distribution with unknown variance • compare O11 and E11 with Student's t-test, estimating unknown variance from the observed data

44. Something I'd Rather Not Mention • one-sided test • statistical model is questionable • limiting distribution: t-distribution with df ≈ N • even more conservative than log-likelihood (low-frequency data)
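The resulting t-score, with the variance of O11 estimated by O11 itself; a sketch of the standard form:

```python
import math

def t_score(O11, E11):
    """Church & Hanks t-score: (O11 - E11) / sqrt(O11)."""
    return (O11 - E11) / math.sqrt(O11)

print(t_score(50, 10))  # ~ 5.66 for case A
```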

45. Exact Tests • problem: how to establish ranking of contingency tables • solution: reduce set of alternatives • if we consider only the cell X11, the difference X11 − E11 gives a sensible ranking: binomial test
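A sketch of the one-sided binomial test, computing P(X11 ≥ O11) under B(N, E11/N):

```python
from scipy.stats import binom

def binomial_test_pvalue(O11, E11, N):
    """Exact one-sided p-value P(X11 >= O11) under H0."""
    return binom.sf(O11 - 1, N, E11 / N)  # sf(k) = P(X > k)

print(binomial_test_pvalue(50, 10, 1000))  # case A: vanishingly small p-value
```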

46. Exact Tests • another solution: marginal frequencies do not provide evidence for or against H0 (→ "ancillary" statistics) • condition on fixed row and column sums R1, R2, C1, C2 • conditional hypergeometric distribution does not depend on parameters π1 and π2

  47. Exact Tests • X11 is the only free parameter • we can use X11 – E11 for ranking • Fisher's exact test (Pedersen 1996) • computationally expensive • numerical difficulties
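In practice the heavy lifting can be delegated to a library; a sketch with scipy's implementation (one-sided, ranking tables by X11 as above):

```python
from scipy.stats import fisher_exact

# case A as a 2x2 table; alternative='greater' gives the one-sided test
odds_ratio, p = fisher_exact([[50, 50], [50, 850]], alternative='greater')
print(p)
```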

  48. Comparing Hypothesis Tests • Fisher's test is now widely accepted as most appropriate • tends to be conservative • log-likelihood gives good approximation of "correct" p-values(slightly less conservative) • chi-squared over-estimates • t-score far too conservative

49. Other Approaches to Measuring Association • information-theoretic (MI, entropy) → equivalent to log-likelihood • combined measures ("boosting") • conservative estimates instead of MLE (confidence intervals) • hypothesis tests with different null hypothesis: π = C · π1 · π2 • mixture of conservative estimates and hypothesis tests?

  50. Implementation • one-sided vs. two-sided tests • need special software to obtain p-values for asymptotic tests • numerical accuracy • beware of zero frequencies!
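On the zero-frequency warning in particular: terms of the form Oij · log(Oij / Eij) must use the convention 0 · log 0 = 0, or the computation fails. A sketch of the guard:

```python
import math

def g2_term(o, e):
    """One cell's contribution to G2, with the convention 0 * log 0 = 0;
    without the guard, a zero observed frequency raises a math domain error."""
    return o * math.log(o / e) if o > 0 else 0.0
```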
