250 likes | 374 Views
Measuring Distance between Language Varieties. Adam Kilgarriff, Jan Pomikalek , Pavel Rychly , Vit Suchomel Supported by EU Project PRESEMT. How to compare language varieties. Qualitative Quantitative Quantitative means corpus Corpus represents variety Compare corpora. My big question.
E N D
Measuring Distance between Language Varieties Adam Kilgarriff, JanPomikalek, PavelRychly, VitSuchomel Supported by EU Project PRESEMT
Kilgarriff: Measuring How to compare language varieties • Qualitative • Quantitative • Quantitative means corpus • Corpus represents variety • Compare corpora
Kilgarriff: Measuring My big question • How to compare corpora • How else can corpus methods/corpus linguistics be scientific • Roles • How do varieties contrast • How do corpora contrast • When we don’t know if they are different • Find bugs in corpus construction
Kilgarriff: Measuring Corpus comparison • Qualitative • Quantitative
Kilgarriff: Measuring Qualitative • Take keyword lists • [a-z]{3,} • Lemma if lemmatisation identical, else word • C1 vs C2, top 100/200 • C2 vs C1, top 100/200 • study
Kilgarriff: Measuring Qualitative: example,OCC and OEC • OEC: general reference corpus • OCC: writing for children Look at fiction only Top 200 keywords (each way) what are they?
Kilgarriff: Measuring Do it • Sketch Engine does the grunt work • It’s ever so interesting
Kilgarriff: Measuring Quantitative • Methods, evaluation • Kilgarriff 2001, Comparing Corpora, Int J Corp Ling • Then: • not many corpora to compare • Now: • Many • Ad hoc, from web • First question: is it any good, how does it compare • Let’s make it easy: offer it in Sketch Engine
Kilgarriff: Measuring Original method • C1 and C2: • Same size, by design • Put together, find 500 highest freq words • For each of these words • Freqs: f1 in C1, f2 in C2, mean=(f1+f2)/2 • (f1-f2)2/mean (chi-square statistic) • Sum • Divide by 500: CBDF
Kilgarriff: Measuring Evaluated • Known-similarity corpora • Shows it worked • Used to set parameter (500) • CBDF better than alternative measures tested
Kilgarriff: Measuring Adjustments for SkE • Problem: non-identical tokenisation • Some awkward words: can’t • undermine stats as one corpus has zero • Solution • commonest 5000 words in each corpus • intersection only • commonest 500 in intersection
Kilgarriff: Measuring Adjustments for SkE • Corpus size highly variable • Chi-square not so dependable • Also not consistent with our keyword lists • Link to keyword lists – link quant to qual • Keyword lists • nf = normalised (per million) frequencies • Keyword lists: nf1+k/nf2+k • Default value for k=100 • We use: if nf1>nf2, nf1+k/nf2+k, else nf2+k/nf1+k • Evaluated on Known-Sim Corpora • as good as/better than chi-square
Kilgarriff: Measuring What’s missing • Heterogeneity • “how similar is BNC to WSJ”? • We need to know heterogeneity before we can interpret • The leading diagonal • 2001 paper: randomising halves • Inelegant and inefficient • Depended on standard size of document
Kilgarriff: Measuring New definition, method (Pavel) • Heterogeneity (def) • Distance between most different partitions • Cluster to find ‘most different partitions’ • Bottom-up clustering • until largest cluster has over one third of data • Rest: the other partition • Problem • nxn distance matrix where n > 1 million • Solution: do it in steps
Kilgarriff: Measuring Summary • Corpus comparison • Qualitative: use keywords • Quantitative • On beta • Heterogeneity (to complete the task) to follow (soon)
Kilgarriff: Measuring Simple maths for keywordsThis word is twice as common in this text type as that
Kilgarriff: Measuring • Intuitive • Nearly right but: • How well matched are corpora • Not here • Burstiness • Not here • Can’t divide by zero • Commoner vs.rarer words
You can’t divide by zero • Standard solution: add one • Problem solved Kilgarriff: Measuring
High ratios more common for rarer words • some researchers: grammar, grammar words • some researchers: lexis content words • No right answer • Slider? Kilgarriff: Measuring
Solution: don’t just add 1, add n • n=1 • n=100 Kilgarriff: Measuring
Solution • n=1000 • Summary Kilgarriff: Measuring