Corpus Comparison Methods for Language Studies

Learn about comparing corpora using frequency profiling, statistical tests, and semantic analysis for applications in information retrieval and social differentiation studies in English. Explore methodologies and key findings in corpus linguistics.

Presentation Transcript


  1. Comparing Corpora using Frequency Profiling • Paul Rayson and Roger Garside • UCREL research group, Computing Department, Lancaster University, UK • www.comp.lancs.ac.uk/ucrel/

  2. Comparing Corpora • Brown versus LOB (Hofland & Johansson, 1982) • Comparison at word form or annotation level • Information retrieval and extraction applications

  3. Two main types • Type 1: sample corpus v. larger ‘standard’ normative corpus • Type 2: two (roughly) equal-sized corpora

  4. Main issues of concern • representativeness (balance) • homogeneity within the corpora • comparability of the corpora • reliability of statistical tests

  5. Statistics • Chi-squared unreliable • Mann-Whitney (Kilgarriff 1996) • Log-likelihood (Dunning 1993)

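  6. Method • O1 = a, O2 = b: observed frequencies of the item in corpus 1 and corpus 2 • N1 = c, N2 = d: total number of words in corpus 1 and corpus 2 • Expected values: E1 = c*(a+b) / (c+d), E2 = d*(a+b) / (c+d) • LL = 2*((a*log (a/E1)) + (b*log (b/E2)))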
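
The formula on slide 6 can be sketched in Python as follows. This is a minimal illustration rather than the authors' implementation: the function name `log_likelihood` is hypothetical, the logarithm is assumed to be the natural log, and a term with zero observed frequency is assumed to contribute 0 (the usual convention, since x*log(x) tends to 0 as x tends to 0).

```python
import math


def log_likelihood(a, b, c, d):
    """Log-likelihood for one item.

    a, b: observed frequencies of the item in corpus 1 and corpus 2
    c, d: total number of words in corpus 1 and corpus 2
    """
    # Expected frequencies if the item were spread evenly across both corpora.
    e1 = c * (a + b) / (c + d)
    e2 = d * (a + b) / (c + d)

    total = 0.0
    for observed, expected in ((a, e1), (b, e2)):
        # Assumption: a zero observed frequency contributes nothing to the sum.
        if observed > 0:
            total += observed * math.log(observed / expected)
    return 2 * total
```

An item with the same relative frequency in both corpora scores 0; the score grows as the relative frequencies diverge, so larger values mark items that distinguish one corpus from the other.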

  7. Application (REVERE) • Systems engineering application • User interview transcripts, standards documents, user manuals • POS tagged with CLAWS • Semantic analysis • Wmatrix retrieval tool • Frequency profiling and KWIC

  8. Air traffic control • Ethnographic studies at ATC centre • Verbatim transcripts of observations and interviews with controllers • Unstructured reports • 103 pages

  9. Key semantic categories
     Log-likelihood  Semantic tag  Word sense (and examples from the text)
     3366            S7.1          power, organising (‘controller’, ‘chief’)
     2578            M5            flying (‘plane’, ‘flight’, ‘airport’)
     988             O2            general objects (‘strip’, ‘holder’, ‘rack’)
     643             O3            electrical equipment (‘radar’, ‘blip’)
     535             Y1            science and technology (‘PH’)
     449             W3            geographical terms (‘Pole Hill’, ‘Dish Sea’)
     432             Q1.2          paper documents and writing (‘writing’, ‘written’, ‘notes’)
     372             N3.7          measurement (‘length’, ‘height’, ‘distance’, ‘levels’, ‘1000ft’)
     318             L1            life and living things (‘live’)
     310             A10           indicating actions (‘pointing’, ‘indicating’, ‘display’)
     306             X4.2          mental objects (‘systems’, ‘approach’, ‘mode’, ‘tactical’, ‘procedure’)
     290             A4.1          kinds, groups (‘sector’, ‘sectors’)

  10. Conclusions • Method of comparing corpora using frequency profiling • Discovery of key items • Human verification of hypotheses • Applications in study of social differentiation in the use of English vocabulary, profiling of learner English and IE in SE domain
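
As a rough sketch of the frequency-profiling method summarised above, the following Python builds a frequency list for each corpus, scores every item with the log-likelihood formula from slide 6, and sorts the items so that the most distinctive ones come first. The names `log_likelihood` and `key_items` and the two toy token lists are invented for illustration; they are not taken from the presentation or from Wmatrix, and real use would start from tokenised (or tagged) corpora.

```python
import math
from collections import Counter


def log_likelihood(a, b, c, d):
    """LL for an item seen a times in corpus 1 (size c) and b times in corpus 2 (size d)."""
    e1 = c * (a + b) / (c + d)
    e2 = d * (a + b) / (c + d)
    total = 0.0
    for observed, expected in ((a, e1), (b, e2)):
        if observed > 0:  # assumption: zero counts contribute 0
            total += observed * math.log(observed / expected)
    return 2 * total


def key_items(tokens1, tokens2):
    """Return (LL, item, freq1, freq2) tuples, highest LL first."""
    freq1, freq2 = Counter(tokens1), Counter(tokens2)
    size1, size2 = len(tokens1), len(tokens2)
    rows = []
    for item in set(freq1) | set(freq2):
        a, b = freq1[item], freq2[item]  # Counter returns 0 for unseen items
        rows.append((log_likelihood(a, b, size1, size2), item, a, b))
    # Sort by descending LL, breaking ties alphabetically for stable output.
    rows.sort(key=lambda row: (-row[0], row[1]))
    return rows


# Toy data, invented for illustration only.
sample = "the controller moves the strip to the rack".split()
reference = "the cat sat on the mat and the dog sat too".split()

for ll, item, a, b in key_items(sample, reference)[:5]:
    print(f"{ll:8.3f}  {item:<12} {a:>3} {b:>3}")
```

The same function works at the annotation level if the token lists hold part-of-speech or semantic tags instead of word forms. The ranked list is only a starting point: as the conclusions note, the key items still need human verification.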
