Harnessing Corpora for real and virtual ELT purposes
IFELT
Belinda Maia
FLUP
10/11.2003
What is a corpus?
• CORPUS - 13c: from Latin corpus 'body' (plural: corpora)
• A body of texts, utterances or other specimens considered more or less representative of a language, stored as an electronic database
• A corpus may store many millions of running words
• A corpus can be tagged to identify and classify words and other formations
• A corpus can be searched using concordancing programmes
An example of concordancing (from the BNC)
A0R 2231 Maybe with twists of bacon.
A35 256 This substantial, 15-minute orchestral movement was inspired by three paintings of Innocent X by Francis Bacon, themselves based on Velasquez.
A6N 1311 They could cook vegetables and meat simply, deal with eggs and bacon and porridge, and they were able to bake and housekeep, learning as they went along.
AAX 286 Sir Richard Body, MP Hirohito, shy god who liked bacon & eggs.
ABB 67 Remembering bacon and ham, the versatility of the pig can be stretched to pies, sandwiches and ham, egg and chips.
ABB 236 The Smoked Trout & Parma Ham Mousse (see p18) is merely decorated with slices of the ham and the Carbonnade of Beef is enriched by using diced ham instead of bacon.
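The kind of listing shown above can be approximated with a few lines of Python. This is only a minimal sketch of a keyword-in-context (KWIC) search, not the software behind the BNC; the function name `kwic`, the `width` parameter and the short sample text (adapted from the lines above) are assumptions made for illustration.

```python
import re

def kwic(text, keyword, width=30):
    """Return keyword-in-context lines for each match of `keyword` in `text`."""
    lines = []
    # Whole-word, case-insensitive match, so "Bacon" and "bacon" both count.
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    for m in pattern.finditer(text):
        left = text[max(0, m.start() - width):m.start()]
        right = text[m.end():m.end() + width]
        # Right-align the left context so the keyword forms a central column.
        lines.append(f"{left:>{width}} [{m.group()}] {right}")
    return lines

# Hypothetical mini-corpus, adapted from the concordance lines above.
sample = (
    "Maybe with twists of bacon. "
    "They could cook vegetables and meat simply, deal with eggs and bacon and porridge. "
    "Hirohito, shy god who liked bacon & eggs."
)

for line in kwic(sample, "bacon"):
    print(line)
```

Because the match ignores case, a search for "bacon" also returns the painter Francis Bacon, just as in the BNC example; a real concordancer would normally let the user choose.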
Tagging
• Example - courtesy Catherine Ball at: http://www.georgetown.edu/faculty/ballc/corpora/tutorial2.html#RTFToC16
A01 2 ^ *'_*' stop_VB electing_VBG life_NN peers_NNS **'_**' ._.
A01 3 ^ by_IN Trevor_NP Williams_NP ._.
A01 4 ^ a_AT move_NN to_TO stop_VB \0Mr_NPT Gaitskell_NP from_IN
A01 4 nominating_VBG any_DTI more_AP labour_NN
A01 5 life_NN peers_NNS is_BEZ to_TO be_BE made_VBN at_IN a_AT meeting_NN
A01 5 of_IN labour_NN \0MPs_NPTS tomorrow_NR ._.
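Tagged text in this word_TAG layout is easy to process automatically. The sketch below assumes that every token has the shape word_TAG and that anything else (the line reference, the "^" sentence marker) can be skipped; the function name `parse_tagged` is hypothetical and not part of any corpus toolkit.

```python
def parse_tagged(line):
    """Split a 'word_TAG word_TAG ...' string into (word, tag) pairs."""
    pairs = []
    for token in line.split():
        # Split on the LAST underscore so the tag is always the final part.
        word, sep, tag = token.rpartition("_")
        if sep:  # skip tokens without a word_TAG shape, e.g. the "^" marker
            pairs.append((word, tag))
    return pairs

# Tokens taken from the tagged example above.
pairs = parse_tagged("^ by_IN Trevor_NP Williams_NP ._.")
print(pairs)

# Once the tags are separated, words can be selected by class,
# here every token whose tag starts with "NN" (the noun tags in the example).
nouns = [w for w, t in parse_tagged("stop_VB electing_VBG life_NN peers_NNS")
         if t.startswith("NN")]
print(nouns)
```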
Types of Corpora
• Monolingual corpora - in which the texts are all in the same language
• Parallel and/or aligned corpora - in which originals and translations are aligned so that both texts appear on the screen together and you can see how the translator has translated the original
• Comparable corpora - in which a selection of original texts has been made in two or more languages dealing with the same subject or genre
Types of Corpora
• Specialized corpora - texts on specialized subjects for the extraction of terminology and complementary explanatory material - definitions, explanations etc.
• Concurrent corpora - texts taken from newspapers on the same subject on approximately the same dates
• 'Do-it-yourself' or 'disposable' corpora - small specialized corpora for the purpose of teaching translation or language
Corpora and Lexicography
• COBUILD = Collins Publishers + University of Birmingham - 1980s
• Corpora work that revolutionised lexicography
• TODAY - All serious lexicography uses corpora - e.g.
• Oxford English Dictionary http://www.oed.com/
• Academia das Ciências de Lisboa
Corpora & Grammar
• The Longman Grammars of English (Quirk, Greenbaum, Svartvik, Leech and others)
• Based on corpora - the classical corpora now available on CD-ROM through ICAME
• http://www.hd.uib.no/icame.html
• BIBER, D., S. JOHANSSON, G. LEECH, S. CONRAD & E. FINEGAN. 1999. Longman Grammar of Spoken and Written English. Harlow: Pearson Education Ltd.
The corpora debate
• The bigger the corpus, the better
• The carefully chosen 'representative' corpora
• Chomsky > the average educated speaker was a better source
• Big corpora are not necessarily representative - e.g. the Hansard corpus
• Any selection of texts - is a selection
Yet
• Very large corpora exist and are very useful
• Much research work nowadays is done with small selected corpora for studying:
• different registers
• special subjects
Using official corpora - EN
• British National Corpus at: http://sara.natcorp.ox.ac.uk/lookup.html - 50 examples of any word or expression for free on-line
• CD-ROM of 100 million words available
• The COBUILD project http://titania.cobuild.collins.co.uk/form.html
• 40 examples on-line
Using official corpora - PT
• AC/DC, CetemPúblico - Portuguese monolingual corpora
• COMPARA - aligned English/Portuguese corpus
• All at http://www.linguateca.pt
Language Learning/Teaching and corpora
• How can a language teacher use corpora?
• Why should a language learner need to know about corpora?
• What can be learnt?
How can a language teacher use corpora?
• The teacher can:
• find an enormous amount of material for use in class and for exercises
• check on real usage and compare it to the textbooks used
• BUT:
• The teacher must be aware that corpora sometimes prove the textbook wrong!
What can be learnt?
• Corpora as reference material for:
• Lexical work
• Syntactic study
• Textual analysis
• Observing language 'in action'
• Learning about a wide variety of areas
The student
• Can be trained to search autonomously for information of all kinds:
• Finding texts that supply real knowledge
• Finding texts that serve as models for style and register
• Finding correct collocations of individual words
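Finding collocations, the last point above, can be sketched as counting the words that occur near a chosen node word. This is a simplified illustration under stated assumptions: the function name `collocates`, the two-word `window` and the tokenisation (lower-case alphabetic strings only) are choices made here, and the sample is adapted from the BNC lines quoted earlier.

```python
import re
from collections import Counter

def collocates(text, node, window=2):
    """Count words occurring within `window` tokens either side of `node`."""
    # Crude tokenisation: lower-case runs of letters; punctuation is dropped,
    # so the window can cross sentence boundaries.
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == node:
            neighbours = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
            counts.update(neighbours)
    return counts

# Hypothetical mini-corpus, adapted from the concordance lines shown earlier.
sample = ("deal with eggs and bacon and porridge. "
          "shy god who liked bacon & eggs. "
          "diced ham instead of bacon.")

counts = collocates(sample, "bacon")
print(counts.most_common(3))
```

On a corpus this small the counts mean little; with a few hundred texts the same count starts to show which words habitually keep company with the node word.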
Do-it-yourself corpora
• Suggestion:
• Train students to make and use their own corpora by:
• Collecting texts from the Internet
• Using the 'Find' function in Word
• Broadening their vocabulary
Useful sites
• Catherine N. Ball: Tutorial: Concordances and Corpora at: http://www.georgetown.edu/faculty/ballc/corpora/tutorial.html
• Tim Johns's Data-driven learning at: http://web.bham.ac.uk/johnstf/
Useful sites
• Concordance the whole Web at: http://www.webcorp.org.uk/
• And, of course, Google at: http://www.google.com