Corpus-Informed Teaching and Research 1. Ken Lau.
Warm-Up Discussion • Work in pairs. Which of the following groups does not make a natural partnership in English? How can you find out the answer? • situations arise • difficulties arise • problems arise • suggestions arise • disputes arise • questions arise
What is a corpus? • Simply put, a corpus is a collection of texts in an electronic database. There are several characteristics / features of corpora which are worth thinking about: • Not all corpora which can be used for linguistic analysis or research were originally built for those purposes • Electronic corpora can consist of whole texts or collections of whole texts
What is a corpus? • Texts in a corpus are (now) in a computer-readable format • Corpora are often assembled to be representative of some language or text type; authentic texts are thus collected • Corpora may be compiled for specific purposes, which in turn affect the design, size, and nature of the individual corpus. In this case, the texts are NOT supposed to be collected randomly but they are to be collected in a principled way.
Intuition vs evidence/corpus-based approach • As L2 speakers, we may come across situations in which we have to decide which form or usage of a grammatical construction is more idiomatic in the target language. For example, in the past, if we needed to determine whether “suggestions arise” in the warm-up task is correct, we might have relied on our intuition. However, with the use of corpora (containing authentic texts), our decision becomes evidence-based and more accurately reflects actual language use.
Key Terms in CL • Representativeness • Mean and Standard Deviation (S.D.) • Raw Frequencies • Normalising Frequencies • Mutual Information • Other measures of collocation • Keyword
Representativeness • A key issue in any statistical analysis is whether a sample, or subset, of any population, or larger group, will accurately represent the variables or characteristic features associated with the population as a whole. • To apply this to linguistics, if we are going to make claims that a linguistic feature (the variable) is or is not characteristic of the language as a whole (the population), then we need to be convinced that its incidence in the texts that make up our corpus (the sample) accords with its incidence in the language more broadly. In short the sample we have needs to be representative of the population as a whole.
Representativeness • The larger the better/more reliable (if statistical analyses are the major part of your research, >1M words are needed) • Try to mirror the range and proportion of texts produced in everyday life. • The challenge: is it possible to achieve this ideal goal? (Consider, for example, what kinds of texts are needed if you want spoken data of daily conversation? Any foreseeable problems in data collection?)
Representativeness • Balance • British National Corpus (BNC) is considered a balanced corpus • ~ 100 million words; 90% written; 10% spoken • Written texts • Selected using three criteria: domain, time and medium • Domain: content type (subject field) • Time: period of the text production • Medium: types of text publication e.g. books, periodicals, etc. • Spoken texts • Selected using two criteria: demographic and context-governed • Demographic: informal encounters recorded by 124 volunteer respondents selected by age group, sex, social class and geographical region • Context-governed: formal encounters such as meetings, lectures and radio broadcasts recorded in four broad context categories (Education, business, institution, leisure)
Mean and Standard Deviation • Mean • Total number of occurrences of the specific feature in question / Total number of texts in the corpus • Standard Deviation (S.D.) • The actual number of occurrences of the feature in any given text might vary considerably from the mean. Consider, for example, three texts containing 70, 120 and 200 hedging devices (e.g. seems, appears, may, could), so that the mean is 130. Only the second text has a number of hedging devices close to the mean. It is therefore useful to have a measure of how far a variable is likely to deviate from the mean, i.e. the S.D. • A small S.D. tells us that on average the variation from the mean is quite low, although there might of course be a few exceptional examples that vary quite widely from the mean. In the above example, the S.D. is about 53.5, which shows quite a high degree of variation from the mean in the individual texts.
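The hedging example above can be checked with a few lines of Python. This is only a minimal sketch using the standard library; the variable names are illustrative. Note that the figure of about 53.5 corresponds to the population S.D. (dividing by the number of texts); the sample S.D. (dividing by one less) would come out higher, at roughly 65.6.

```python
from statistics import mean, pstdev

# Number of hedging devices in each of the three texts from the slide
hedges_per_text = [70, 120, 200]

avg = mean(hedges_per_text)    # arithmetic mean across the texts
sd = pstdev(hedges_per_text)   # population standard deviation

print(f"mean = {avg}, S.D. = {sd:.1f}")
```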
Mean and Standard Deviation per 1,000 words
Mean and Standard Deviation: Some Observations • We expect around 137.4 nouns to occur per 1,000 words in conversation. If an individual conversational text varies by up to one S.D. (that is, +/- 15.6 occurrences from the mean), then that is very much expected. If, however, an individual conversation deviates greatly from this band of frequencies (e.g. by 6 or 7 times the S.D.), then we can be relatively assured in our claim that it is unlike other texts in terms of the number of nouns. • The figures for nouns show that the stylistic range of writing is greater than that of speech, which accounts for the higher degree of variation in the number of nouns found in the written registers. • Academic prose has a mean of 2.1 and an S.D. of 2.1 for conditional clauses, indicating that it would be entirely reasonable to find a stretch of 1,000 words containing no conditional clauses at all. • There are a lot more passives in academic prose, which highlights the impersonal nature of the texts.
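One way to make "how many S.D.s away from the mean" concrete is the short sketch below. The mean and S.D. values are those quoted on the slide; the observed counts (150 and 240 nouns per 1,000 words) and the function name are hypothetical, chosen purely for illustration.

```python
def deviations_from_mean(observed: float, mean: float, sd: float) -> float:
    """Return how many standard deviations `observed` lies from `mean`."""
    return (observed - mean) / sd

# Figures quoted on the slide (per 1,000 words)
NOUN_MEAN, NOUN_SD = 137.4, 15.6   # nouns in conversation
COND_MEAN, COND_SD = 2.1, 2.1      # conditional clauses in academic prose

# Hypothetical conversational texts
typical = deviations_from_mean(150.0, NOUN_MEAN, NOUN_SD)
unusual = deviations_from_mean(240.0, NOUN_MEAN, NOUN_SD)

# A 1,000-word stretch of academic prose with no conditional clauses
no_conditionals = deviations_from_mean(0.0, COND_MEAN, COND_SD)

print(f"150 nouns: {typical:+.2f} S.D.")
print(f"240 nouns: {unusual:+.2f} S.D.")
print(f"0 conditionals: {no_conditionals:+.2f} S.D.")
```

The first text falls within one S.D. of the mean, the second lies more than six S.D.s above it, and a stretch with no conditional clauses is exactly one S.D. below the mean, which is why it counts as unremarkable.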
Raw Frequency • The number of times a word (or other feature) occurs in a corpus.
Raw Frequency • Personal nature (with the high occurrences of I) • It’s related to presentation • Related to cognitive activities (think) and physical activities (make) • Adherence to certain rules/patterns (should)
Normalising Frequencies • They are used when comparing two data sets of unequal size. • They tell us the number of occurrences that we can expect, per thousand, or sometimes per million words
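As a rough illustration of the calculation, a normalised frequency is the raw frequency divided by the corpus size and multiplied by the chosen base (1,000 or 1,000,000). The sketch below assumes two hypothetical corpora of unequal size; the counts, corpus sizes and function name are invented for the example.

```python
def normalise(raw_frequency: int, corpus_size: int, per: int = 1000) -> float:
    """Convert a raw frequency into occurrences per `per` words."""
    return raw_frequency * per / corpus_size

# Hypothetical data: the same word counted in two corpora of unequal size
small_corpus = normalise(50, 200_000, per=1_000_000)      # 50 hits in 200,000 words
large_corpus = normalise(180, 1_000_000, per=1_000_000)   # 180 hits in 1,000,000 words

print(f"small corpus: {small_corpus:.0f} per million words")
print(f"large corpus: {large_corpus:.0f} per million words")
```

Although the raw count is higher in the larger corpus, the word is relatively more frequent in the smaller one once both figures are expressed per million words.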
Normalising Frequencies 651 412 350 278 210 204 172 158 144 127
Mutual Information (MI) • Provides information on how strongly individual words collocate with others • It is generally accepted that an MI score higher than 3 suggests a strong bond between the search term and its collocate.
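For readers curious about what lies behind an MI score, the sketch below uses the basic pointwise formulation, MI = log2((f(x,y) × N) / (f(x) × f(y))), where f(x,y) is the number of co-occurrences, f(x) and f(y) are the frequencies of the two words, and N is the corpus size. This is an assumption for illustration only: online corpus interfaces may use a variant (for example, one that adjusts for the collocation span), so the scores they report can differ. All counts below are hypothetical.

```python
import math

def mutual_information(f_xy: int, f_x: int, f_y: int, corpus_size: int) -> float:
    """Basic pointwise MI: log2 of observed vs. expected co-occurrence."""
    return math.log2((f_xy * corpus_size) / (f_x * f_y))

# Hypothetical counts: node word and collocate each occur 100 times
# in a 10,000-word corpus and co-occur 8 times.
score = mutual_information(f_xy=8, f_x=100, f_y=100, corpus_size=10_000)

print(f"MI = {score:.2f}")
print("strong bond" if score > 3 else "no strong bond by the MI > 3 rule of thumb")
```

In this invented case the pair co-occurs eight times more often than chance would predict, giving an MI of exactly 3, which sits right on the threshold mentioned above rather than above it.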
Mutual Information • What can you tell from the MI scores of the collocates of “reinforced” and “strengthened”? Check the MI scores by following the procedure below: • Go to http://corpus.byu.edu/bnc • Select “List” • Type “reinforced” in the Search box • Leave the Collocates blank (with *) [Keep the span of words 4 on each side] • In the sorting field, choose “relevance” • Click search • Repeat the same steps with the word “strengthened”
Mutual Information • You may also use the “Compare” function to obtain information about collocation. Follow the steps below and compare the collocates of “Male” and “Female” • Go to http://corpus.byu.edu/bnc • Select “Compare” • In the search box, input “Male” and “Female” • Leave the Collocates blank (with *) [Keep the span of words 4 on each side] • Click Search
“Feminist vs Chauvinist” over time • Now use the Time Magazine Corpus (1923-2006) (http://corpus.byu.edu/time/). Search for the terms “feminist*” and “chauvinist*”. What can you say about the changes in their frequencies since the 1920s?
Keyword • Those expressions that have a significantly higher or lower frequency of occurrence in a text or set of texts than we should expect, given the frequency of occurrence of those expressions in a larger corpus used as a point of reference. • To determine whether a word is considered a keyword, the concept of log-likelihood is important. You do not need to worry about the calculations behind it; instead, simply use the calculator created by Paul Rayson of Lancaster University: • http://ucrel.lancs.ac.uk/llwizard.html
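For those who would like to see roughly what such a calculator does, here is a minimal sketch of the commonly used two-corpus log-likelihood calculation: expected frequencies are derived from the combined corpora, and the score is 2 × Σ observed × ln(observed / expected). The function name and the counts are hypothetical, and the online calculator remains the tool to use for the actual task. A score above 3.84 corresponds to p < 0.05 (one degree of freedom).

```python
import math

def log_likelihood(freq_study: int, size_study: int,
                   freq_ref: int, size_ref: int) -> float:
    """Log-likelihood of a word's frequency in a study vs. a reference corpus."""
    total_freq = freq_study + freq_ref
    total_size = size_study + size_ref
    # Expected frequencies if the word were spread evenly over both corpora
    expected_study = size_study * total_freq / total_size
    expected_ref = size_ref * total_freq / total_size
    ll = 0.0
    for observed, expected in ((freq_study, expected_study), (freq_ref, expected_ref)):
        if observed > 0:  # a zero count contributes nothing to the sum
            ll += observed * math.log(observed / expected)
    return 2 * ll

# Hypothetical counts: 100 hits in a 10,000-word study corpus
# versus 10 hits in a 10,000-word reference corpus.
ll = log_likelihood(100, 10_000, 10, 10_000)
overused = (100 / 10_000) > (10 / 10_000)  # compare relative frequencies

print(f"LL = {ll:.2f}, {'overused' if overused else 'underused'} in the study corpus")
```

The score itself says only that the difference is unlikely to be due to chance; whether the word is overused or underused is read off by comparing the relative frequencies in the two corpora.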
Keyword: Mortgage • Now try to see if the term “mortgage” is overused or underused in the Hong Kong Financial Services Corpus compiled by the Hong Kong Polytechnic University (reference corpus: newspaper subcorpora of the BNC) • Follow the procedure below: • Go to http://rcpce.engl.polyu.edu.hk/HKFSC/ • Enter the word “mortgage” in the search box • Note the size of the corpus and then click search • Record the number of instances of “mortgage”
Keyword: Mortgage • Next, obtain the corresponding figures for the reference corpus (newspaper subcorpora of the BNC): • Go to http://corpus.byu.edu/bnc • Select “Chart” • Enter the word “mortgage” in the search box and click search • Record the number of instances of “mortgage” in the newspaper subcorpora and the size of the subcorpora • Enter all the information collected here: http://ucrel.lancs.ac.uk/llwizard.html • Write down the results below
Useful Online Corpora • Profession-Specific Corpora • http://rcpce.engl.polyu.edu.hk/ • British Academic Written English Corpus • http://www.coventry.ac.uk/research/research-directory/art-design/british-academic-written-english-corpus-bawe/ • British Academic Spoken English Corpus • http://www.coventry.ac.uk/research/research-directory/art-design/british-academic-spoken-english-corpus-base/ • Michigan Corpus of Academic Spoken English (MICASE) • http://quod.lib.umich.edu/m/micase/?type=revise