Comparing Corpora using Frequency Profiling Paul Rayson and Roger Garside UCREL research group Computing Department Lancaster University, UK. www.comp.lancs.ac.uk/ucrel/
Comparing Corpora • Brown versus LOB (Hofland & Johansson, 1982) • Comparison at word form or annotation level • Information retrieval and extraction applications
Two main types • Type 1: sample corpus v. larger ‘standard’ normative corpus • Type 2: two (roughly) equal-sized corpora
Main issues of concern • representativeness (balance) • homogeneity within the corpora • comparability of the corpora • reliability of statistical tests
Statistics • Chi-squared: unreliable when expected frequencies are small • Mann-Whitney (Kilgarriff 1996) • Log-likelihood (Dunning 1993)
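Method • For each word: O1 = a and O2 = b are its observed frequencies in corpus 1 and corpus 2; N1 = c and N2 = d are the total numbers of words in corpus 1 and corpus 2 • Expected values: E1 = c*(a+b) / (c+d) and E2 = d*(a+b) / (c+d) • Log-likelihood (natural log): LL = 2*((a*log(a/E1)) + (b*log(b/E2)))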
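The calculation on the Method slide can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names `log_likelihood` and `frequency_profile` are hypothetical, tokenisation is assumed to have been done already, and a word that is absent from one corpus is assumed to contribute zero to the sum (the usual convention, since x*log(x) tends to 0 as x tends to 0).

```python
import math
from collections import Counter


def log_likelihood(a, b, c, d):
    """LL for one word: a, b = observed frequencies in corpus 1 and 2;
    c, d = total number of words in corpus 1 and 2."""
    e1 = c * (a + b) / (c + d)  # expected frequency in corpus 1
    e2 = d * (a + b) / (c + d)  # expected frequency in corpus 2
    total = 0.0
    # Assumption: a zero observed frequency contributes nothing to the sum.
    if a > 0:
        total += a * math.log(a / e1)
    if b > 0:
        total += b * math.log(b / e2)
    return 2 * total


def frequency_profile(tokens1, tokens2):
    """Rank every word in either corpus by LL, largest first."""
    f1, f2 = Counter(tokens1), Counter(tokens2)
    c, d = len(tokens1), len(tokens2)
    scores = {w: log_likelihood(f1[w], f2[w], c, d) for w in set(f1) | set(f2)}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Toy corpora, invented purely for illustration.
    sample = "the controller moved the strip".split()
    reference = "the cat sat on the mat by the door".split()
    for word, ll in frequency_profile(sample, reference)[:3]:
        print(f"{word}\t{ll:.2f}")
```

Words at the top of the ranked list are those whose relative frequencies differ most between the two corpora. The score alone does not say in which corpus the word is overused, so a real profile would also report the relative frequencies.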
Application (REVERE) • Systems engineering application • User interview transcripts, standards documents, user manuals • POS tagged with CLAWS • Semantic analysis • Wmatrix retrieval tool • Frequency profiling and KWIC
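A key word in context (KWIC) display of the kind mentioned on this slide can be illustrated with a short Python sketch. This is not how Wmatrix implements it: the function name `kwic`, the `width` parameter, and the case-insensitive exact-match rule are all assumptions made for the example.

```python
def kwic(tokens, keyword, width=3):
    """Return (left context, match, right context) for each hit of keyword.

    Hypothetical helper: matching is case-insensitive on whole tokens,
    and `width` is the number of tokens shown on each side.
    """
    hits = []
    for i, token in enumerate(tokens):
        if token.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            hits.append((left, token, right))
    return hits


if __name__ == "__main__":
    # Invented sentence, used only to show the output layout.
    text = "the controller moved the strip to the rack".split()
    for left, match, right in kwic(text, "strip", width=2):
        print(f"{left:>20}  {match}  {right}")
```

Aligning the keyword in a central column, as the print loop does, is what lets an analyst scan many concordance lines quickly and check a hypothesis suggested by the frequency profile.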
Air traffic control • Ethnographic studies at ATC centre • Verbatim transcripts of observations and interviews with controllers • Unstructured reports • 103 pages
Key semantic categories
Log-likelihood  Semantic tag  Word sense (and examples from the text)
3366            S7.1          power, organising (‘controller’, ‘chief’)
2578            M5            flying (‘plane’, ‘flight’, ‘airport’)
988             O2            general objects (‘strip’, ‘holder’, ‘rack’)
643             O3            electrical equipment (‘radar’, ‘blip’)
535             Y1            science and technology (‘PH’)
449             W3            geographical terms (‘Pole Hill’, ‘Dish Sea’)
432             Q1.2          paper documents and writing (‘writing’, ‘written’, ‘notes’)
372             N3.7          measurement (‘length’, ‘height’, ‘distance’, ‘levels’, ‘1000ft’)
318             L1            life and living things (‘live’)
310             A10           indicating actions (‘pointing’, ‘indicating’, ‘display’)
306             X4.2          mental objects (‘systems’, ‘approach’, ‘mode’, ‘tactical’, ‘procedure’)
290             A4.1          kinds, groups (‘sector’, ‘sectors’)
Conclusions • Method of comparing corpora using frequency profiling • Discovery of key items • Human verification of hypotheses • Applications in the study of social differentiation in the use of English vocabulary, profiling of learner English, and information extraction (IE) in the systems engineering (SE) domain