Measuring Distance between Language Varieties

Adam Kilgarriff, Jan Pomikalek, Pavel Rychly, Vit Suchomel. Supported by EU Project PRESEMT.


Presentation Transcript


  1. Measuring Distance between Language Varieties • Adam Kilgarriff, Jan Pomikalek, Pavel Rychly, Vit Suchomel • Supported by EU Project PRESEMT

  2. Kilgarriff: Measuring How to compare language varieties • Qualitative • Quantitative • Quantitative means corpus • Corpus represents variety • Compare corpora

  3. My big question • How to compare corpora • How else can corpus methods/corpus linguistics be scientific • Roles • How do varieties contrast • How do corpora contrast • When we don’t know if they are different • Find bugs in corpus construction

  4. Corpus comparison • Qualitative • Quantitative

  5. Qualitative • Take keyword lists • [a-z]{3,} • Lemma if lemmatisation identical, else word • C1 vs C2, top 100/200 • C2 vs C1, top 100/200 • Study them
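
The `[a-z]{3,}` filter on slide 5 can be sketched as below. This is only an illustration: it assumes the pattern must match the whole item (so items with capitals, digits, apostrophes, or fewer than three letters are dropped), and the function name and toy counts are hypothetical.

```python
import re
from collections import Counter

# Assumption: the slide's pattern is applied as a full match to each item.
CANDIDATE = re.compile(r"[a-z]{3,}")

def keyword_candidates(counts):
    """Keep only items made of three or more lowercase ASCII letters."""
    return Counter({w: f for w, f in counts.items() if CANDIDATE.fullmatch(w)})

# Toy frequency list (invented for illustration).
toy_counts = Counter({"the": 50, "of": 30, "dragon": 4, "Harry": 3, "can't": 2, "x1": 1})
filtered = keyword_candidates(toy_counts)
print(sorted(filtered))  # ['dragon', 'the']
```

The filtered lists from C1 and C2 would then be scored against each other in both directions to obtain the two top-100/200 keyword lists.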

  6. Qualitative: example, OCC and OEC • OEC: general reference corpus • OCC: writing for children • Look at fiction only • Top 200 keywords (each way): what are they?

  8. Do it • Sketch Engine does the grunt work • It’s ever so interesting

  9. Quantitative • Methods, evaluation • Kilgarriff 2001, Comparing Corpora, Int J Corp Ling • Then: not many corpora to compare • Now: many, ad hoc, from the web • First question: is it any good, how does it compare? • Let’s make it easy: offer it in Sketch Engine

  10. Original method • C1 and C2: same size, by design • Put together, find the 500 highest-frequency words • For each of these words: freqs f1 in C1, f2 in C2, mean = (f1+f2)/2 • (f1-f2)^2/mean (chi-square statistic) • Sum • Divide by 500: CBDF
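
The steps on slide 10 can be sketched in a few lines. This follows the per-word formula exactly as the slide gives it and assumes the two corpora are already the same size. The function name `cbdf`, the `n_words` parameter, and the toy counts are illustrative. Dividing by the number of words actually used (rather than a fixed 500) is an assumption made so that small toy inputs work.

```python
from collections import Counter

def cbdf(counts1, counts2, n_words=500):
    """Chi-square-style distance between two same-size corpora (slide 10)."""
    joint = counts1 + counts2                       # pool the two corpora
    top = [w for w, _ in joint.most_common(n_words)]
    total = 0.0
    for w in top:
        f1, f2 = counts1[w], counts2[w]
        mean = (f1 + f2) / 2                        # > 0 for any pooled word
        total += (f1 - f2) ** 2 / mean              # formula as on the slide
    # Assumption: divide by the number of words used (500 on the slide).
    return total / len(top)

# Toy corpora of equal size (115 tokens each), invented for illustration.
c1 = Counter({"the": 100, "cat": 10, "dog": 5})
c2 = Counter({"the": 100, "cat": 5, "dog": 10})
print(cbdf(c1, c2, n_words=3))
print(cbdf(c1, c1, n_words=3))  # 0.0: a corpus is at distance zero from itself
```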

  11. Evaluated • Known-similarity corpora • Shows it worked • Used to set parameter (500) • CBDF better than alternative measures tested

  12. Adjustments for SkE • Problem: non-identical tokenisation • Some awkward words: can’t • These undermine the stats, as one corpus has zero • Solution: • commonest 5000 words in each corpus • intersection only • commonest 500 in intersection
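
A minimal sketch of the word-selection fix on slide 12. The slide does not say how "commonest" is ranked inside the intersection, so this sketch ranks by summed raw frequency, which is an assumption (for corpora of very different sizes, normalised frequencies may be a better choice). The function and parameter names are hypothetical.

```python
from collections import Counter

def shared_top_words(counts1, counts2, per_corpus=5000, final=500):
    """Pick comparison words that are frequent in BOTH corpora."""
    top1 = {w for w, _ in counts1.most_common(per_corpus)}
    top2 = {w for w, _ in counts2.most_common(per_corpus)}
    shared = top1 & top2                            # intersection only
    # Assumption: rank shared words by summed frequency, ties alphabetical.
    ranked = sorted(shared, key=lambda w: (-(counts1[w] + counts2[w]), w))
    return ranked[:final]

# Toy example: one tokeniser keeps "can't" whole, the other splits it.
a = Counter({"the": 100, "can't": 20, "cat": 10})
b = Counter({"the": 90, "ca": 20, "n't": 20, "cat": 12})
print(shared_top_words(a, b, per_corpus=4, final=2))  # ['the', 'cat']
```

Because "can't", "ca" and "n't" each occur in only one corpus, they never reach the comparison list, so tokenisation differences cannot produce zero counts.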

  13. Adjustments for SkE • Corpus size highly variable • Chi-square not so dependable • Also not consistent with our keyword lists • Link to keyword lists: link quant to qual • Keyword lists • nf = normalised (per million) frequencies • Keyword score: (nf1+k)/(nf2+k) • Default value: k=100 • We use: if nf1>nf2, (nf1+k)/(nf2+k), else (nf2+k)/(nf1+k) • Evaluated on known-similarity corpora • As good as/better than chi-square
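
The per-word score on slide 13 can be written out as follows. The function names are illustrative. The slide does not state how the per-word scores are combined into a single corpus distance, so that step is left out rather than guessed.

```python
def per_million(freq, corpus_size):
    """Normalised frequency (nf): occurrences per million tokens."""
    return freq * 1_000_000 / corpus_size

def keyword_score(nf1, nf2, k=100):
    """Keyword-list score: (nf1 + k) / (nf2 + k)."""
    return (nf1 + k) / (nf2 + k)

def symmetric_score(nf1, nf2, k=100):
    """Put the larger frequency on top, so the score is never below 1."""
    if nf1 > nf2:
        return keyword_score(nf1, nf2, k)
    return keyword_score(nf2, nf1, k)

print(per_million(50, 2_000_000))   # 25.0
print(keyword_score(300, 100))      # 2.0
print(symmetric_score(100, 300))    # 2.0
print(symmetric_score(0, 0))        # 1.0
```

Using normalised frequencies removes the dependence on corpus size, and the symmetric form gives the same value whichever corpus is treated as C1.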

  16. What’s missing • Heterogeneity • “How similar is BNC to WSJ?” • We need to know heterogeneity before we can interpret • The leading diagonal • 2001 paper: randomising halves • Inelegant and inefficient • Depended on standard size of document

  17. New definition, method (Pavel) • Heterogeneity (def): distance between most different partitions • Cluster to find ‘most different partitions’ • Bottom-up clustering • until largest cluster has over one third of data • Rest: the other partition • Problem: n×n distance matrix where n > 1 million • Solution: do it in steps
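
A naive sketch of the heterogeneity definition on slide 17, for small inputs only. Several things here are assumptions rather than the authors' method: "data" is measured in tokens, clusters are compared by pooling their counts, and `rel_freq_distance` is a simple stand-in distance, not the measure from the earlier slides. The sketch recomputes all pairwise distances in every round, which is exactly the cost the slide calls a problem. The stepwise solution is not described in the slides and is not attempted.

```python
from collections import Counter
from itertools import combinations

def rel_freq_distance(c1, c2):
    """Stand-in distance: L1 difference of relative frequencies."""
    n1, n2 = sum(c1.values()), sum(c2.values())
    return sum(abs(c1[w] / n1 - c2[w] / n2) for w in set(c1) | set(c2))

def heterogeneity(docs, distance=rel_freq_distance):
    """Distance between the two 'most different' partitions of a corpus."""
    clusters = [Counter(d) for d in docs]           # assumes non-empty docs
    if len(clusters) < 2:
        return 0.0
    total = sum(sum(c.values()) for c in clusters)
    # Bottom-up: merge the closest pair until one cluster holds > 1/3 of tokens.
    while max(sum(c.values()) for c in clusters) <= total / 3:
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda p: distance(clusters[p[0]], clusters[p[1]]))
        merged = clusters[i] + clusters[j]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)]
        clusters.append(merged)
    largest = max(clusters, key=lambda c: sum(c.values()))
    rest = Counter()                                # everything else, pooled
    for c in clusters:
        if c is not largest:
            rest += c
    return distance(largest, rest)

# Toy corpus: two "cat" documents and two "dog" documents.
mixed = [{"cat": 10}, {"cat": 9, "dog": 1}, {"dog": 10}, {"dog": 9, "cat": 1}]
uniform = [{"cat": 5, "dog": 5}] * 4
print(heterogeneity(mixed))
print(heterogeneity(uniform))  # 0.0
```

On the toy data the mixed corpus scores far higher than the uniform one, which is the behaviour a heterogeneity measure needs before a cross-corpus distance can be interpreted.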

  18. Summary • Corpus comparison • Qualitative: use keywords • Quantitative • On beta • Heterogeneity (to complete the task) to follow (soon)

  19. Simple maths for keywords • “This word is twice as common in this text type as that”

  20. Intuitive • Nearly right, but: • How well matched are the corpora? (not covered here) • Burstiness (not covered here) • Can’t divide by zero • Commoner vs. rarer words

  21. You can’t divide by zero • Standard solution: add one • Problem solved

  22. High ratios more common for rarer words • Some researchers: grammar, grammar words • Some researchers: lexis, content words • No right answer • Slider?

  23. Solution: don’t just add 1, add n • n=1 • n=100

  24. Solution • n=1000 • Summary
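
The effect of the added constant in slides 21 to 24 can be seen on a toy example. The three words and their per-million frequencies are invented for illustration. Only the ratio (nf1+n)/(nf2+n) comes from the slides.

```python
def ratio(nf1, nf2, n):
    """Simple-maths keyword ratio with n added to both frequencies."""
    return (nf1 + n) / (nf2 + n)

# Invented per-million frequencies (focus corpus, reference corpus).
toy = {"rare": (10, 0), "mid": (500, 100), "common": (60000, 40000)}

def ranking(n):
    """Words ordered from highest to lowest ratio for a given n."""
    return sorted(toy, key=lambda w: ratio(*toy[w], n), reverse=True)

for n in (1, 100, 1000):
    print(n, ranking(n))
# 1 ['rare', 'mid', 'common']
# 100 ['mid', 'common', 'rare']
# 1000 ['common', 'mid', 'rare']
```

With n=1 the rare word tops the list even though it occurs only ten times per million. Raising n moves the focus towards commoner words, which is why n can act as the "slider" between lexis-oriented and grammar-oriented keyword lists.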
