Statistical Measures for Corpus Profiling. Michael P. Oakes, University of Sunderland. Corpus Profiling Workshop, 2008.
Contents • Why study differences between corpora? (Kilgarriff, 2001) • Case Study in parsing (Sekine, 1997). • Words and “countable linguistic features”. • Overall differences between corpora and contributions of individual features: • Information theory • Chi-squared test • Factor Analysis • “Gold standard” comparison of measures (Kilgarriff, 2001).
Why study differences between corpora? • Kilgarriff (2001), “Comparing Corpora”, Int. J. Corpus Linguistics 6(1), pp. 97-133. • Taxonomise the field: how does a new corpus stand in relation to existing ones? • If an interesting result is found for one corpus, for which other corpora does it hold? • Is a new corpus sufficiently different from those you already have to be worth acquiring? • Difficulty in porting a new corpus to an existing NLP system: time and cost are measurable.
Different Text Types • Englishes of the world, e.g. US vs. UK (Hofland and Johansson, 1982) • Social differentiation, e.g. gender, age, social class (Rayson, Leech and Hodges, 1997), diachronic, geographical location. • Stylometry, e.g. disputed authorship • Genre analysis, e.g. science fiction, e-shop (Santini, 2006) • Sentiment analysis (Westerveld, 2008). • Relevant vs. non-relevant documents? Probabilistic IR. • Statistical techniques exist to discriminate between these text types. Here the interest is in the types of language per se, rather than their amenability to NLP tools.
Words and countable linguistic features • Bits of words e.g. 2-grams (Kjell, 1994) • Words (many studies) • Linguistic features for Factor Analysis (Biber, 1995) e.g. questions, past participles. • Phrase rewrite rules (Sekine 1997, Baayen, van Halteren and Tweedie, 1996). • Any countable feature characteristic of one corpus as opposed to another. • Not hapax legomena, Semitisms in the New Testament.
Domain independence of parsing (Sekine, 1997) • Used 8 genres from the Brown Corpus, chosen to give equal amounts of fiction (KLNP) and non-fiction (ABEJ). • Characterised domains by the production rules which fire. • From these data, produced a matrix of the cross entropy of grammar across domains. • Average-linkage clustering of the domains, based on the cross-entropy matrix, gave intuitively reasonable results. • Evaluated the effect of (training / test) corpus difference on parser performance. • Discussed the size of the training corpus.
Overall differences between corpora and contributions of individual features. • Vocabulary richness (e.g. type/token ratio, Yule’s K Characteristic, V2/N) is a characteristic of the entire corpus. Puts all corpora on a linear scale. • The techniques we will look at (chi-squared, information theoretic and factor analysis) can both give a value for the overall difference between two corpora, and quantify the contributions made by individual features.
Measures of Vocabulary Richness • Yule’s K characteristic: K = 10000 × (M2 − M1) / M1², where M1 = number of tokens and M2 = (V1 × 1²) + (V2 × 2²) + (V3 × 3²) + …, with Vi the number of word types that occur exactly i times. • Example values of K: Gerson 35.9, Kempis 59.7, De Imitatione Christi 84.2 • Heaps’ Law: vocabulary size as a function of text size, M = kT^b. The parameters k and b could discriminate texts, and allow them to be plotted in two dimensions. • Entropy is also a form of vocabulary richness (but both common and rare words make high individual contributions).
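The vocabulary richness measures on this slide can be sketched in a few lines of Python. This is a minimal illustration rather than the code behind the figures quoted above: the function names, the whitespace tokenisation and the toy sentence are all assumptions, and the Heaps' Law fit uses ordinary least squares on log-transformed values, which is one common choice among several.

```python
import math
from collections import Counter

def yules_k(tokens):
    """Yule's K = 10000 * (M2 - M1) / M1^2 for a list of word tokens."""
    freqs = Counter(tokens)              # word -> frequency
    spectrum = Counter(freqs.values())   # i -> V_i, number of types seen i times
    m1 = len(tokens)                     # M1: number of tokens
    m2 = sum(v_i * i * i for i, v_i in spectrum.items())
    return 10000 * (m2 - m1) / (m1 * m1)

def type_token_ratio(tokens):
    """Number of distinct word types divided by number of tokens."""
    return len(set(tokens)) / len(tokens)

def fit_heaps(points):
    """Fit M = k * T^b to (T, M) pairs by least squares on log T and log M.

    points: (text size in tokens, vocabulary size) measured at several
    prefixes of a text. Returns (k, b).
    """
    xs = [math.log(t) for t, _ in points]
    ys = [math.log(m) for _, m in points]
    n = len(points)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    b = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
    k = math.exp(mean_y - b * mean_x)
    return k, b

# Hypothetical toy text: 13 tokens, 8 types.
toy = "the cat sat on the mat and the dog sat on the log".split()
k_value = yules_k(toy)
ttr = type_token_ratio(toy)

# Synthetic points that lie exactly on M = 2 * T^0.5.
k, b = fit_heaps([(100, 20), (400, 40), (10000, 200)])
```

K and the type/token ratio each give one number per corpus, which is why they place corpora on a single linear scale, whereas the fitted pair (k, b) gives a point in two dimensions.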
The chi-squared test (Oakes and Farrow, 2006): (O - E)² / E values for three words in five balanced corpora (Σ (O-E)²/E = 414916.8)
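As a rough sketch of how the per-word, per-corpus (O - E)² / E values and their sum can be computed, assuming a words-by-corpora contingency table in which the expected frequency is (row total × column total) / grand total. The function name and the toy counts are hypothetical and are not the data from Oakes and Farrow (2006). In a real study the table would cover the whole vocabulary (or include an "all other words" row) so that the column totals equal the corpus sizes.

```python
def chi_squared_contributions(counts):
    """Return ({word: [(O - E)^2 / E per corpus]}, overall chi-squared).

    counts: dict mapping each word to a list of observed frequencies,
    one per corpus. Assumes every row and column total is non-zero.
    """
    n_corpora = len(next(iter(counts.values())))
    col_totals = [sum(row[j] for row in counts.values())
                  for j in range(n_corpora)]
    grand_total = sum(col_totals)

    contributions = {}
    for word, row in counts.items():
        row_total = sum(row)
        cells = []
        for j, observed in enumerate(row):
            expected = row_total * col_totals[j] / grand_total
            cells.append((observed - expected) ** 2 / expected)
        contributions[word] = cells

    overall = sum(sum(cells) for cells in contributions.values())
    return contributions, overall

# Hypothetical two-corpus example (say, a UK and a US sample).
toy_counts = {"colour": [30, 10], "color": [10, 30]}
cells, chi2 = chi_squared_contributions(toy_counts)
```

The overall sum measures how far the corpora differ as a whole, while the largest individual cells point to the words that are most characteristic of a particular corpus.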
Measures from Information Theory (Dagan et al., 1997) • Kullback-Leibler (KL) divergence (also called relative entropy), used as a measure of semantic similarity by Dagan et al. (1997). • Meaning in coding theory • Problems: the value is infinite if any word has frequency 0 in corpus B and >0 in corpus A, and the measure is not symmetrical. • Dagan et al. (1997): Information Radius.
Information Radius • L (Fiction: detective) and P (Fiction: romance): 0.180 • A (Press reportage) and B (Press editorial): 0.257 • J (Academic prose) and P (Fiction: romance): 0.572
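A minimal sketch of KL divergence and the information radius, to show why the latter avoids both problems noted for KL divergence. It assumes the common definition IRad(p, q) = D(p || m) + D(q || m) with m = (p + q) / 2, and logarithms to base 2. The slides do not say which convention produced the values quoted above, so the numbers from this sketch are not directly comparable with them. Function names and the toy distributions are hypothetical.

```python
import math

def kl_divergence(p, q):
    """D(p || q) in bits; p and q are aligned lists of probabilities."""
    total = 0.0
    for p_i, q_i in zip(p, q):
        if p_i > 0:
            if q_i == 0:
                return math.inf   # word seen in A but never in B
            total += p_i * math.log2(p_i / q_i)
    return total

def information_radius(p, q):
    """D(p || m) + D(q || m), where m is the average of p and q.

    m is non-zero wherever p or q is non-zero, so the result is always
    finite, and swapping p and q gives the same value.
    """
    m = [(p_i + q_i) / 2 for p_i, q_i in zip(p, q)]
    return kl_divergence(p, m) + kl_divergence(q, m)

# Hypothetical relative frequencies over a shared three-word vocabulary.
p = [0.5, 0.5, 0.0]
q = [0.5, 0.0, 0.5]

kl_pq = kl_divergence(p, q)          # infinite: second word is absent from q
irad_pq = information_radius(p, q)
irad_qp = information_radius(q, p)
```

Under these assumptions the information radius ranges from 0 for identical distributions to 2 bits for distributions with no words in common.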
Factor Analysis • Decathlon analogy: running, jumping and throwing. • Biber (1988): groups of countable features which consistently co-occur in texts are said to define a “linguistic dimension”. • Such features are said to have positive loadings with respect to that dimension, but dimensions can also be defined by features which are in “complementary distributions”, i.e. negatively loaded. • Example: at one pole is “many pronouns and contractions”, near which lie conversational texts and panel discussions. At the other pole, “few pronouns and contractions”, lie scientific texts and fiction.
Evaluation of Measures (Kilgarriff 2001) • Reference corpus made up of known proportions of two corpora: 100% A, 0% B; 90% A, 10% B; 80% A, 20% B … • This gives a set of “gold standard” judgements: subcorpus 1 is more like subcorpus 2 than subcorpus 3, etc. • Compare machine ranking of corpora with the gold standard ranking using Spearman’s rank correlation coefficient.
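The final comparison step can be sketched as follows. This is an illustrative implementation of Spearman's rank correlation using the shortcut formula 1 - 6Σd² / (n(n² - 1)), which assumes there are no tied values. The gold-standard gaps and the measured distances below are invented numbers, not results from Kilgarriff (2001).

```python
def spearman_rho(xs, ys):
    """Spearman's rank correlation for two equal-length lists without ties."""
    def ranks(values):
        # Rank 1 goes to the smallest value.
        order = sorted(range(len(values)), key=lambda i: values[i])
        result = [0] * len(values)
        for rank, index in enumerate(order, start=1):
            result[index] = rank
        return result

    rank_x, rank_y = ranks(xs), ranks(ys)
    n = len(xs)
    d_squared = sum((a - b) ** 2 for a, b in zip(rank_x, rank_y))
    return 1 - 6 * d_squared / (n * (n * n - 1))

# Gold standard: how far apart two subcorpora are in their mix of A and B
# (0.1 means, for example, 90% A / 10% B compared with 100% A / 0% B).
gold_gap = [0.1, 0.2, 0.3, 0.4, 0.5]

# Hypothetical distances assigned to the same pairs by some measure.
measured = [0.02, 0.05, 0.04, 0.11, 0.20]

rho = spearman_rho(gold_gap, measured)
```

A measure whose ranking of subcorpus pairs agrees perfectly with the known mixing proportions would score 1. Real scores often contain ties, in which case average ranks and the Pearson correlation of the ranks should be used instead of the shortcut formula.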
Conclusions • Some measures allow comparisons of entire corpora, others enable the identification of typical features. • Different measures allow different kinds of maps: vocabulary richness allows ranking of corpora on a linear scale, Heaps’ Law a 2D map of two parameters. Information theoretic measures give the (dis)similarity between two corpora, best viewed using clustering. With Factor Analysis, you don’t know what the dimensions are until you’ve done it. • Maps enable contours of application success.