English Corpora and Language Learning Tamás Váradi varadi@nytud.hu
Outline
• What is a Corpus?
• Compiling a corpus
• First generation of corpora: BROWN, LOB
• The Age of Mega Corpora
• British National Corpus
• International Corpus of English
• International Corpus of Learner English
• The Web as a corpus?
• Availability
Corpora?
(1) A collection of texts, especially if complete and self-contained: the corpus of Anglo-Saxon verse
(2) In linguistics and lexicography, a body of texts, utterances or other specimens considered more or less representative of a language and usually stored as an electronic database
(The Oxford Companion to the English Language, 1992)
A collection of naturally occurring language text chosen to characterize a state or variety of a language
(John Sinclair, Corpus, Concordance, Collocation, OUP 1991)
The pre-electronic era
• Huge, painstaking manual effort
• Covering a closed body of texts
    • Bible Concordance
    • Shakespeare Concordance
• Attempt to capture the whole language
Compiling a corpus
• Aim
    • provide solid empirical evidence about language
• Design
    • geographical and chronological bounds
    • speakers, genres
    • defined by future use
• Representative corpora?
• Annotation
• Output
Corpus Linguistics: the early phase
• Early Sixties
• BROWN Corpus: 500 texts of 2,000 words each (about 1 m words in total)
• LOB corpus: the British counterpart
• Classic reference works
• Part-of-speech tagged
Survey of English Usage
• A major undertaking at UCL led by Sidney Greenbaum
• 1 m word compilation
• very careful annotation
• 500,000 words of spoken material
• LONDON-LUND Corpus
Structure of SEU
LOB corpus: a sample
A01 2 ^ *'_*' stop_VB electing_VBG life_NN peers_NNS **'_**' ._.
A01 3 ^ by_IN Trevor_NP Williams_NP ._.
A01 4 ^ a_AT move_NN to_TO stop_VB \0Mr_NPT Gaitskell_NP from_IN
A01 4 nominating_VBG any_DTI more_AP labour_NN
A01 5 life_NN peers_NNS is_BEZ to_TO be_BE made_VBN at_IN a_AT meeting_NN
A01 5 of_IN labour_NN \0MPs_NPTS tomorrow_NR ._.
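The word_TAG layout of the sample above can be read mechanically. What follows is a minimal Python sketch, not part of any LOB distribution. It assumes, from the sample alone, that each line starts with a text ID and a line number, that "^" marks the start of a sentence, and that the tag is whatever follows the last underscore. The function name `parse_lob_line` is hypothetical.

```python
def parse_lob_line(line):
    """Split one LOB-style line into (text_id, line_no, [(word, tag), ...]).

    Assumed layout (inferred from the sample, not from official docs):
    <text id> <line number> [^] word_TAG word_TAG ...
    """
    text_id, line_no, *items = line.split()
    tokens = []
    for item in items:
        if item == "^":  # assumed sentence-start marker, not a token
            continue
        # Split on the LAST underscore so punctuation such as "._." works.
        word, _, tag = item.rpartition("_")
        tokens.append((word, tag))
    return text_id, int(line_no), tokens


# One line copied from the sample above (raw string keeps the backslash).
sample = r"A01 4 ^ a_AT move_NN to_TO stop_VB \0Mr_NPT Gaitskell_NP from_IN"
text_id, line_no, tokens = parse_lob_line(sample)

for word, tag in tokens:
    print(f"{word}\t{tag}")
```

Splitting on the last underscore rather than the first is a deliberate choice: it keeps tokens such as `._.` intact, where the word itself is a punctuation mark.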
Concordance output
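A concordance lists every occurrence of a search word with a fixed amount of context on either side (keyword in context, KWIC). The sketch below is an illustration only and not the tool used to produce the slide. The function name `kwic`, the `width` parameter and the bracketed output layout are assumptions, and the text is the LOB sample sentence with its tags removed.

```python
import re


def kwic(text, keyword, width=30):
    """Return one keyword-in-context line per whole-word match of keyword."""
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    lines = []
    for m in pattern.finditer(text):
        left = text[max(0, m.start() - width):m.start()]
        right = text[m.end():m.end() + width]
        # Pad both sides so the keyword lines up in a single column.
        lines.append(f"{left:>{width}} [{m.group(0)}] {right:<{width}}")
    return lines


# LOB sample sentence from text A01, tags stripped.
text = ("a move to stop Mr Gaitskell from nominating any more labour "
        "life peers is to be made at a meeting of labour MPs tomorrow .")

concordance = kwic(text, "labour", width=20)
for line in concordance:
    print(line)
```

Aligning the keyword in a fixed column is what makes recurring left-hand and right-hand neighbours easy to spot by eye.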
The age of Mega Corpora
• COBUILD
    • John Sinclair at University of Birmingham
    • originally 20 m words
    • now the Bank of English, with over 300 m words
• the more the better
• no fixed size: the idea of a Monitor corpus
British National Corpus
• A major undertaking in the mid-nineties
• Birmingham, Lancaster – OUP, Longman, Chambers
• 100 m words carefully compiled
• 10 m words of spoken data!
• up-to-date standard SGML encoding
• still the paradigm example of a reference corpus
Accessing the BNC
BNC-Baby
Searching LOB/BROWN
International Corpus of English
• A network of corpora covering regional varieties of English
• Project organized by UCL, London
• Each containing c. 1 m words
• GB, Hong Kong, Australia, East Africa; more in preparation
ICE-HK
ICE-GB: sociolinguistic variation
ICE-GB: syntactic annotation
Treebanks
• Geoffrey Sampson
    • Meticulously hand-crafted syntactic annotation
    • SUSANNE
    • CHRISTINE
    • LUCY
• Penn Treebank
    • University of Pennsylvania
    • Massive amounts of automatically annotated data intended for natural language processing work
International Corpus of Learner English
• International Centre of English Corpus Linguistics, Catholic University of Louvain, led by Sylviane Granger
• collection of essays
• student profiles
• Hungarian-English in preparation
Susanne Corpus
• Aims of the Scheme
    • comprehensive: covering all features of surface and logical English grammar that are definite enough to be susceptible of formal annotation, and including all phenomena that occur in practice in modern English
    • explicit: if two researchers at separate sites are given the same sample of English and asked to annotate it according to the SUSANNE standards, their annotations should be identical
    • nonpartisan: where aspects of grammar are the subject of theoretical controversy, the SUSANNE scheme aims to embody a neutral analysis which rival theoreticians can interpret in their own preferred terms
The Web as a corpus
• Why sample when you can access the whole?
• Huge and ever changing
• The ultimate in authenticity?
• Not necessarily …
The Webcorp project
http://devoted.to/corpora