LELA 30922 Lecture 2: Corpus-based research in Linguistics. See esp. Meyer pp. 11-29
What is corpus linguistics? • Not a branch of linguistics, like socio~, psycho~, … • Not a theory of linguistics • A set of tools and methods (and a philosophy) to support linguistic investigation across all branches of the subject
Reminder • Assessment for this course is to use a corpus (or corpora) to investigate something • This lecture may give you some ideas of the kind of thing you can do
Applications of corpus linguistics • Lexicology • Grammatical studies • Study of language variation • Historical linguistics • Contrastive analysis and translation theory • Study of language acquisition (psycholinguistics) • Language teaching
Lexicology • Study of behaviour of individual words • Particularly useful for dictionary construction (lexicography) • Can identify more and less frequently occurring words • More interesting is HOW words are used • Syntax • Meaning
Lexicology • Most frequent words are function words (the, of, and, to, a are the 5 most frequent words in LOB) • If corpus is small, it can only give an indicative “snapshot” of word usage • LOB (1m words): hundreds of words occur fewer than 10 times
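A frequency list of the kind described on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the procedure used for LOB: the tokeniser is a crude regular expression, the sample text is invented, and the function names `word_frequencies` and `rare_types` are hypothetical.

```python
import re
from collections import Counter

def word_frequencies(text):
    """Count lower-cased word tokens; the regex tokeniser is deliberately crude."""
    tokens = re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())
    return Counter(tokens)

def rare_types(freqs, threshold=10):
    """Return the word types that occur fewer than `threshold` times."""
    return sorted(w for w, n in freqs.items() if n < threshold)

# Invented toy text standing in for a real corpus such as LOB.
sample = "The cat sat on the mat and the dog sat on the cat"
freqs = word_frequencies(sample)

print(freqs.most_common(3))            # most frequent types first
print(rare_types(freqs, threshold=2))  # types that occur only once
```

On a real corpus the same two calls would give the top of the frequency list (dominated by function words) and the long tail of rare types.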
Lexicography • For dictionary construction, need bigger corpus • “Monitor” corpus, constantly updated and added to • Traditional lexicography: collection of “slips” by experts • OED took 50 years and includes 5m citations, sorted and edited manually • Same idea, but more systematic • Dictionary as descriptive rather than pre- (or pro-) scriptive
Lexicography • Collins COBUILD • Birmingham corpus (20m words, 1980s) • Bank of English corpus (415m words in Oct 2000) • 70m words of transcripts of BBC broadcasts • Used as basis of BBC English dictionary • Cambridge Language Survey • Longman’s corpus of American English, and use of BNC for (BrE) dictionary
Lexicography: how do corpora help? • Concordancing • Lists occurrences of word in context • Identify syntactic use of word • Identify range of meanings • Identify relative frequency of different uses/meanings • Collocation • What words occur together? • Compare distribution of close synonyms • Dictionaries can be subjective • Can be interesting to compare meanings/uses given by dictionaries with actual usage in corpora http://www.collins.co.uk/corpus/CorpusSearch.aspx
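Concordancing, as described above, amounts to listing each occurrence of a word with a few words of context on either side (a "keyword in context", or KWIC, display). The sketch below assumes a plain-text corpus and whitespace-like tokenisation; the function name `kwic` and the sample sentences are invented for illustration.

```python
import re

def kwic(text, target, width=3):
    """Return (left context, keyword, right context) for each hit of `target`."""
    tokens = re.findall(r"\w+", text.lower())
    lines = []
    for i, tok in enumerate(tokens):
        if tok == target:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, tok, right))
    return lines

# Invented sample text; a real study would read a corpus file instead.
sample = "The dog barked at the other dog. A dog is a loyal animal."
for left, keyword, right in kwic(sample, "dog"):
    # Right-align the left context so the keywords line up in a column.
    print(f"{left:>20}  {keyword}  {right}")
```

Sorting the returned lines by their right or left context is a common next step, since it groups recurring patterns of use together.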
Collocation listing (figure not reproduced): target word = dog; significance measure = t-score
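The t-score used to rank collocates can be sketched as follows. This uses one common formulation, t = (O - E) / sqrt(O), where O is the observed number of co-occurrences within a window around the node word and E is the number expected by chance. The way E is estimated here (node frequency × collocate frequency × window size / corpus size) is an assumption; it is not necessarily the calculation used by the Collins tool. The function name `t_scores` and the toy token list are invented.

```python
import math
from collections import Counter

def t_scores(tokens, node, window=2):
    """Score each word seen within `window` tokens of `node` by t = (O - E) / sqrt(O)."""
    n = len(tokens)
    freq = Counter(tokens)
    observed = Counter()
    for i, tok in enumerate(tokens):
        if tok == node:
            start, end = max(0, i - window), min(n, i + window + 1)
            for j in range(start, end):
                if j != i:
                    observed[tokens[j]] += 1
    scores = {}
    for collocate, o in observed.items():
        # Expected co-occurrences if the two words were distributed independently.
        expected = freq[node] * freq[collocate] * 2 * window / n
        scores[collocate] = (o - expected) / math.sqrt(o)
    return scores

# Invented toy data; real t-scores need a corpus of millions of words.
tokens = "the dog barked the dog ran a cat sat".split()
for word, score in sorted(t_scores(tokens, "dog", window=1).items(),
                          key=lambda item: item[1], reverse=True):
    print(f"{word:10} {score:.3f}")
```

Because the t-score divides by the square root of O, it tends to favour frequent collocates, which is one reason tools often offer it alongside other measures.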
Grammatical studies • Study of a particular grammatical construction • Restrictions on form, meaning or context • Overall frequency (eg relative to alternative constructions) • Use in different registers (eg narrative vs argumentative) or modes (eg written vs spoken)
Examples of grammatical studies • Appositives • eg George Bush, US president or US president George Bush • See C. F. Meyer, “Can you really study language variation in linguistic corpora?”, American Speech 79.4 (2004), 339-355 • Genuine titles, “pseudotitles”, descriptives • Junichiro Koizumi, the Japanese prime minister • Gerald Ford, former president of the USA • Osama bin Laden, America’s no.1 enemy • Looked at how appositives (esp. pseudotitles) are used differently in newspaper reports from different countries, and how descriptives become pseudotitles
Examples of grammatical studies • Clefts and pseudoclefts • It’s linguistics that interests me most. • What interests me most is linguistics. • Linguistics interests me most. • Infinitival complement clauses • I hope to go ~ I hope that I can go • I’m happy to go ~ I’m happy that I can go • … the proposal to go ~ the proposal that I go • Simple past vs perfective verb forms • Use of modals can~may, shall~will • Use of passive, and means of (or reasons for) avoiding it • especially in translation
Grammatical studies • Most try to investigate the factors that determine choice of one construction over another • Lexical • Grammatical • Stylistic • etc
Grammatical studies • Corpus needs to be sufficiently marked up and tools need to be available for examples to be extracted • Corpus may need to be sufficiently large to get good number of examples • If comparing registers/subject domains/modes, corpus needs to reflect these
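Where a corpus has no grammatical mark-up, a rough first pass at extracting competing constructions can be made with regular expressions over raw text. The sketch below counts two complement types after the verb hope (the "I hope to go ~ I hope that I can go" alternation). The patterns are illustrative assumptions and will miss cases, for example that-clauses where that is omitted. The names `PATTERNS` and `count_variants` are hypothetical.

```python
import re
from collections import Counter

# Crude surface patterns; a tagged or parsed corpus would allow far more precise queries.
PATTERNS = {
    "to-infinitive": re.compile(r"\bhop(?:e|es|ed|ing)\s+to\s+\w+", re.IGNORECASE),
    "that-clause": re.compile(r"\bhop(?:e|es|ed|ing)\s+that\b", re.IGNORECASE),
}

def count_variants(text):
    """Count how often each complement pattern occurs in `text`."""
    return Counter({name: len(p.findall(text)) for name, p in PATTERNS.items()})

# Invented sample sentences.
sample = "I hope to go. She hoped that he would come. We are hoping to leave soon."
print(count_variants(sample))
```

Running the same counts separately on sub-corpora (for example spoken vs written) gives the relative frequencies needed to compare registers or modes.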
Study of language variation • Both lexical and grammatical studies often contrast usage by mode, domain, register etc. • Sociolinguists often interested in other aspects, eg sex, age, social class of author or audience; historical linguists interested in change over time • Recent corpora (eg BNC) have included this information in header mark-up • Simple examples • lovely used more by females than males • What does cool mean?
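A comparison such as "lovely by sex of speaker" needs frequencies normalised by the amount of text in each group, since the groups are rarely the same size. The sketch below assumes each text has already been paired with its header metadata in a simple dictionary. The field names (`sex`, `text`), the function name `relative_frequency` and the three records are invented and say nothing about real usage.

```python
import re
from collections import defaultdict

def relative_frequency(records, target, group_key, per=1_000_000):
    """Occurrences of `target` per `per` words, broken down by a metadata field."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for rec in records:
        tokens = re.findall(r"\w+", rec["text"].lower())
        group = rec[group_key]
        totals[group] += len(tokens)
        hits[group] += tokens.count(target)
    return {g: hits[g] * per / totals[g] for g in totals if totals[g]}

# Invented records; in practice the metadata would come from the corpus header mark-up.
records = [
    {"sex": "F", "text": "what a lovely day"},
    {"sex": "M", "text": "what a nice day"},
    {"sex": "F", "text": "lovely to see you"},
]
print(relative_frequency(records, "lovely", "sex", per=1000))
```

The same function works for any header field (age band, social class, mode) by changing `group_key`.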
Genre classification • Are there lexical and grammatical factors that can help us to classify text genres? • Biber used statistical measures to identify stylistic factors that co-occurred, and could therefore be definitional of text types and genres • Eg conjuncts like therefore, nevertheless and use of passive together indicate more formal style • Factor analysis • choose a range of features to measure, see which ones are correlated • does not (necessarily) predetermine analysis (except obviously you have to decide what features might be significant)
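The first step described above, measuring a range of features and seeing which are correlated, can be sketched with a plain Pearson correlation. This is only the correlation step, not a full factor analysis, and it is not Biber's actual procedure. The feature rates below are invented numbers for four hypothetical texts, and `pearson` is a hand-written helper.

```python
import math
from itertools import combinations

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists of numbers."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

# Invented rates per 1,000 words, one value per hypothetical text.
features = {
    "conjuncts": [1.0, 2.0, 6.0, 7.0],
    "passives": [3.0, 5.0, 12.0, 14.0],
    "contractions": [20.0, 18.0, 4.0, 2.0],
}

for a, b in combinations(features, 2):
    print(f"{a} ~ {b}: r = {pearson(features[a], features[b]):.2f}")
```

Features that correlate strongly and positively (here, by construction, conjuncts and passives) are candidates for loading on the same factor; a factor analysis then summarises many such correlations at once.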
Historical linguistics • Similar things can be done with historical texts, though (obviously) these are more limited in terms of genre • Also, diachronic studies can compare texts from different periods (again as long as you compare like for like as much as possible) • Topics: • Change in lexical meaning/usage • Change/emergence of grammatical constructions
Example of historical study • Nevalainen in J. Engl. Ling. (2000) used Corpus of Early English Correspondence (U. Helsinki) to track sex roles in linguistic innovation • Popular theory that females are more innovative, and males follow trends • She analysed sex-of-author differences in three linguistic changes between 16th and 20th century: • Replacement of ye by you in subject position • Replacement of 3rd-person verb suffix -th by -s • Reduction in use of multiple negatives and use of any and ever instead
Contrastive analysis, translation theory • Parallel corpora • texts + their translations • preferably “aligned” • Comparable corpora • Texts in different languages but of a similar nature • What parallels are there in genre characteristics?
Use of parallel corpora • Aligned corpus allows search for word or phrase and its translation • How is it translated? • Is it translated consistently? • Of interest in studies of “translationese” • Translated text too influenced by original • Certain constructions more prevalent in translation than in native text • Evidence of “explicitation” • Translation is often more explicit than original • Sometimes, explanation added for foreign reader • But often, just a reflection of the translator’s effort (eg replacement of pronoun by more explicit referent) • Also can be used as a tool for translators
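Searching an aligned corpus for a word and its translations can be sketched as below, assuming the corpus is already sentence-aligned and held as (source, target) pairs. The three English-French pairs are invented, and the function names `find_aligned` and `translation_rate` are hypothetical. Real alignment formats and word-level alignment are more involved.

```python
import re

def tokenize(sentence):
    """Lower-case a sentence and split it into word tokens."""
    return re.findall(r"\w+", sentence.lower())

def find_aligned(pairs, word):
    """Return the aligned sentence pairs whose source side contains `word`."""
    return [(src, tgt) for src, tgt in pairs if word in tokenize(src)]

def translation_rate(pairs, word, candidate):
    """Share of hits for `word` whose target sentence contains `candidate`."""
    hits = find_aligned(pairs, word)
    if not hits:
        return 0.0
    matched = sum(1 for _, tgt in hits if candidate in tokenize(tgt))
    return matched / len(hits)

# Invented sentence-aligned pairs (English source, French translation).
pairs = [
    ("The dog is sleeping.", "Le chien dort."),
    ("I like the dog.", "J'aime le chien."),
    ("The cat is sleeping.", "Le chat dort."),
]

for src, tgt in find_aligned(pairs, "dog"):
    print(f"{src}  |  {tgt}")
print(translation_rate(pairs, "dog", "chien"))
```

A rate well below 1.0 for the expected equivalent is a prompt to read the remaining hits and see how else the word was rendered, which is the "is it translated consistently?" question on the slide.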
Language acquisition • First-language acquisition • CHILDES database (Child Language Data Exchange System) http://childes.psy.cmu.edu/ • Transcriptions of conversations with (and between) young children • Includes software to help extract data • Second-language acquisition • Learner corpora, notably ICLE • http://www.fltr.ucl.ac.be/FLTR/GERM/ETAN/CECL/Cecl-Projects/Icle/icle.htm