Current trends in corpus linguistics
Sinclair (1991 :171) • A corpus is a collection of naturally-occurring language text, chosen to characterize a state or variety of a language. In modern computational linguistics, a corpus typically contains many millions of words: this is because it is recognised that the creativity of natural language leads to such immense variety of expression that it is difficult to isolate the recurrent patterns that are the clues to the lexical structure of the language.
EAGLES (Expert Advisory Group on Language Engineering Standards) • A corpus is a collection of pieces of language that are selected and ordered according to explicit linguistic criteria in order to be used as a sample of the language. • Note that the non-committal word 'pieces' is used above, and not 'texts'. This is because of the question of sampling techniques used. If samples are to be all the same size, then they cannot all be texts. Most of them will be fragments of texts, arbitrarily detached from their contexts.
A computer corpus is a corpus which is encoded in a standardised and homogenous way for […] retrieval tasks. Its constituent pieces of language are documented as to their origins and provenance.
The Text Encoding Initiative (TEI) • An international standard that helps libraries, museums, publishers, and individual scholars represent all kinds of literary and linguistic texts for online research and teaching, using a strict encoding scheme • Its main aim is the reusability of corpora • http://www.tei-c.org/
The International Computer Archive of Modern and Medieval English (ICAME) • An international organization of linguists and information scientists working with English machine-readable texts.
Corpus design • In order to draw conclusions that are significant, one has to adhere to clearly defined rules in the composition of a corpus. • If there is a selection bias, the conclusions will not be valid. • Sinclair (1991:13) even argues that the job could be "outsourced" to social scientists.
Spoken vs. Written • Most corpora are short on data that reflect spoken use of the language • EAGLES guidelines warn against the use of material that is not “gathered from the genuine communications of people going about their normal business. […] For example, some television shows deliberately put participants into artificial and indeed bizarre conditions and induce extremely odd responses. Casual conversation is expected to be impromptu but it can be rehearsed by one or more parties.”
The birth of corpus linguistics • Corpus linguistics is linked with the advent of the computer. • Computational Analysis of Present-Day American English (Kucera and Francis 1967). • The Brown Corpus (compiled in the early 1960s) was a carefully compiled selection of current American English, totalling about a million words drawn from a wide variety of sources. • Kucera and Francis subjected it to a variety of computational analyses. Their book combines elements of linguistics, psychology, statistics, and sociology.
Corpus linguistics in the UK • The British National Corpus (100 million words of modern British English, 10% spoken). • It has inspired various works, notably Sinclair (1990). • It is searchable through the website Phrases in English.
Corpus linguistics in France • The FRANTEXT database was created in the 1960s and is maintained by the INALF. • It contains texts that range from the Renaissance period to modern French. • The corpus is made up of about 80% literary works and 20% technical or scientific writing. • It served as a basis for the "Trésor de la langue française informatisé": http://atilf.atilf.fr/tlf.htm • Base lexicale du français (Binon, Verlinde): http://ilt.kuleuven.be/blf/
Fillmore's description of the two approaches in "'Corpus linguistics' or 'Computer-aided armchair linguistics'" (1992) • The corpus linguist: "He has all the primary facts that he needs, in the form of a corpus of approximately one zillion running words, and he sees his job as that of deriving secondary facts from his primary facts. At the moment, he is busy determining the relative frequencies of the eleven parts of speech as the first word of a sentence versus the second word of a sentence."
The "armchair" (introspective) linguist: "He sits in a deep soft armchair, with his eyes closed and his hands clasped behind his head. Once in a while he opens his eyes, sits up abruptly shouting, ‘Wow, what a neat fact!’, grabs his pencil, and writes something down… having come still no closer to knowing what language is really like."
Chomsky’s opinion about corpus linguistics (1958 conference) • “Any natural corpus will be skewed. Some sentences won’t occur because they are obvious, others because they are false, still others because they are impolite. The corpus, if natural, will be so wildly skewed that the description [based upon it] would be no more than a mere list.”
Chomsky criticized corpus data as being only a small sample of a potentially infinite population. • This criticism can be applied not just to CL but to any form of scientific investigation which is based on sampling. • Chomsky’s criticism was based on the fact that corpora were relatively small when he started airing those views.
Chomsky on corpus linguistics (2004 interview) • “Corpus linguistics doesn't mean anything. It’s like saying […] suppose physics and chemistry decide that instead of relying on experiments, what they’re going to do is take videotapes of things happening in the world and they’ll collect huge videotapes of everything that’s happening and from that maybe they’ll come up with some generalizations or insights.”
Performance may be flawed or ungrammatical, due to lapses of attention or memory or to other psychological factors, and consequently cannot be taken at face value. The 'raw data' has to be 'idealised'.
Chomsky (1965, p. 4) admitted the similarity between the competence-performance distinction and the Saussurian langue-parole distinction; • but to him, whereas langue is merely a "systematic inventory of items," competence refers to the conception of "a system of generative processes." • The motivation for the distinction stems from the observation that the grammaticality of individuals' speech fluctuates, and from giving this observation its proper theoretical significance (the speech of individuals does not directly reflect their grammatical knowledge).
A mature speaker's knowledge of his language does not fluctuate from moment to moment as does the grammaticality of his utterances. • Consequently, the linguist's task in building a grammar of his native language becomes, in effect, one of describing the speaker's "permanent knowledge" of his language, or his linguistic competence. • It is then left for the psychologist to describe how the interfering effects that manifest themselves during speaking interact with the speaker's linguistic competence to produce the grammatically impaired utterances that are typical in everyday situations.
Corpus-based linguistics • The essential characteristics of corpus-based analysis according to Biber (1998:4): • it is empirical, analysing the actual patterns of use in natural texts; • it utilizes a large and principled collection of natural texts, known as a "corpus", as the basis for analysis; • it makes extensive use of computers for analysis, using both automatic and interactive techniques; • it depends on both quantitative and qualitative techniques.
Tagging • The first editions of the TEI Guidelines used the Standard Generalized Markup Language (SGML) • The most recent edition can also be expressed in the Extensible Markup Language (XML)
An example • <pb n='474'/> • <div1 type="chapter" n='38'> • <p>Reader, I married him. A quiet wedding we had: he and I, the parson and clerk, were alone present. When we got back from church, I went into the kitchen of the manor-house, where Mary was cooking the dinner, and John cleaning the knives, and I said —</p> • <p><q>Mary, I have been married to Mr Rochester this morning.</q> The housekeeper and her husband were of that decent, phlegmatic order of people,[…]; but Mary, bending again over the roast, said only — • </p> <p><q>Have you, miss? Well, for sure!</q></p>
A TEI document at the textual level consists of the following elements: • <front>: contains any prefatory matter (headers, title page, prefaces, dedications, etc.) found before the start of a text proper. • <group>: contains a number of unitary texts or groups of texts. • <body>: contains the whole body of a single unitary text, excluding any front or back matter. • <back>: contains any appendixes, etc., following the main part of a text.
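One practical benefit of this kind of markup is that the encoded units can be retrieved mechanically. The following is a minimal sketch, not part of the TEI tooling itself: it uses Python's standard xml.etree.ElementTree on a shortened fragment modelled on the chapter example, with a wrapping <body> and a closing </div1> added here so that the fragment is well-formed. The function names quoted_speech and paragraph_count are illustrative only.

```python
import xml.etree.ElementTree as ET

# Shortened fragment modelled on the chapter example; <body> and </div1>
# are added here (an assumption) so that the sample is well-formed XML.
SAMPLE = """<body>
<div1 type="chapter" n="38">
<p>Reader, I married him.</p>
<p><q>Mary, I have been married to Mr Rochester this morning.</q></p>
<p><q>Have you, miss? Well, for sure!</q></p>
</div1>
</body>"""


def quoted_speech(xml_text):
    """Return the text of every <q> element, in document order."""
    root = ET.fromstring(xml_text)
    return ["".join(q.itertext()).strip() for q in root.iter("q")]


def paragraph_count(xml_text):
    """Count the <p> elements anywhere in the fragment."""
    root = ET.fromstring(xml_text)
    return sum(1 for _ in root.iter("p"))


if __name__ == "__main__":
    for line in quoted_speech(SAMPLE):
        print(line)
    print(paragraph_count(SAMPLE), "paragraphs")
```

The same approach would extend to <front>, <body> and <back>, since each is an ordinary element that can be selected by name.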
Part-of-speech tagging • The man who was mixing it fell into the cement he was mixing. • the/DT man/NN who/WP was/VBD mixing/VBG it/PRP fell/VBD into/IN the/DT cement/NN he/PRP was/VBD mixing/VBG • The horse raced past the barn fell. • the/DT horse/NN raced/VBD past/JJ the/DT barn/NN fell/VBD
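Tagged text in the word/TAG format shown above is easy to process automatically. The sketch below is illustrative only: it splits such a line into (word, tag) pairs and counts the tags. The tags are reproduced exactly as they appear on the slide, without any attempt to correct them, and the function name parse_tagged is an assumption of this example.

```python
from collections import Counter


def parse_tagged(line):
    """Split 'word/TAG word/TAG ...' into a list of (word, tag) pairs."""
    pairs = []
    for token in line.split():
        # rpartition splits on the LAST slash, so a slash inside a word is kept
        word, _, tag = token.rpartition("/")
        pairs.append((word, tag))
    return pairs


tagged = "the/DT horse/NN raced/VBD past/JJ the/DT barn/NN fell/VBD"
pairs = parse_tagged(tagged)

# Frequency of each tag in the sentence
tag_counts = Counter(tag for _, tag in pairs)

if __name__ == "__main__":
    print(pairs)
    print(tag_counts.most_common())
```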
Expression of syntactic dependencies via square brackets • [S [NP [NP [Det the][N man]][S [NP who][VP was mixing it]]] [VP [V fell] [PP [P into][NP [NP [Det the][N cement]][S he was mixing]]]]].
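A bracketed analysis of this kind can be read back into a tree structure by a short recursive routine. The following sketch assumes the notation used above (an opening bracket, a category label, then words or further brackets); the representation as (label, children) tuples and the function names are choices made for this example only.

```python
import re


def parse_brackets(text):
    """Parse '[LABEL item item ...]' into nested (label, children) tuples.

    Bare words are kept as plain strings inside the children list.
    """
    tokens = re.findall(r"\[|\]|[^\s\[\]]+", text)
    pos = 0

    def node():
        nonlocal pos
        pos += 1                      # skip the opening '['
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != "]":
            if tokens[pos] == "[":
                children.append(node())
            else:
                children.append(tokens[pos])
                pos += 1
        pos += 1                      # skip the closing ']'
        return (label, children)

    return node()


def leaves(tree):
    """Return the words of the tree from left to right."""
    _, children = tree
    words = []
    for child in children:
        words.extend(leaves(child) if isinstance(child, tuple) else [child])
    return words


SENTENCE = ("[S [NP [NP [Det the][N man]][S [NP who][VP was mixing it]]] "
            "[VP [V fell] [PP [P into][NP [NP [Det the][N cement]][S he was mixing]]]]]")

if __name__ == "__main__":
    tree = parse_brackets(SENTENCE)
    print(tree[0], [child[0] for child in tree[1]])
    print(" ".join(leaves(tree)))
```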
Semantic tagging • It is still in its infancy, but some promising applications using word sense disambiguation have been tested on easy cases (e.g. pen in English). • WordNet is a thesaurus-like database that groups various word senses in synsets. It is available in the major European languages.
Study of lexical co-occurrence • This is done through the use of concordancing software, which provides a KWIC (Key-Word in Context) display. • Such software also provides a wide range of statistical information about the corpus and the collocates of any given word.
Example of a KWIC display
1. ent été piratés en 2005, hors piratages numériques via Internet, selon l'OCDE. 20080 ...
2. ient été piratés en 2005, hors piratage numériques via Internet, selon des chiffres publié ...
3. La Loi sur la confiance dans l'économie numérique (LCEN) ne prévoit pas une responsabilit ...
4. alisation de la technologie de synthèse numérique d'horloges de référence multiples (MRCG ...
5. MRCG) de Motorola. « Cette technologie numérique permet de s'affranchir des limites des ...
6. ue d'information, notamment d'appareils numériques multifonctions réseau (MFP) et d'imprim ...
7. tifs aux offres de logiciels d'imagerie numérique de Peerless, ainsi que tous les brevets ...
8. fabrique et commercialise des copieurs numériques couleur et noir et blanc, des appareils ...
9. puissant et des dernières technologies numériques réseau, Kyocera Mita soutient les entre ...
10. des marchés de l'imagerie documentaire numérique, comprenant notamment les fabricants de ...
11. couleur et monochromes, et d'appareils numériques. Afin de traiter les textes numériques ...
12. numériques. Afin de traiter les textes numériques et les graphiques, les produits d'image ...
13. s, les produits d'imagerie documentaire numériques se basent sur un logiciel d'imagerie et ...
14. lorsque cet objet est un enregistrement numérique, les États membres peuvent prévoir que ...
15. à disposition du demandeur, sous format numérique, sur un ou plusieurs sites publics acce ...
16. (faux médicaments et moyens de stockage numérique faciles à copier). L'octroi du ma ...
17. (faux médicaments et moyens de stockage numérique faciles à copier). L'octroi du ma ...
18. l Rights Management (gestion des droits numériques, euphémisme pour "protection contre la ...
19. l Rights Management (gestion des droits numériques, euphémisme pour "protection contre la ...
20. (faux médicaments et moyens de stockage numérique faciles à copier). L'octroi du ma ...
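A display of this kind can be produced with very little code. The sketch below is not the concordancer used for the slides; it is a minimal illustration that finds every word beginning with the keyword and prints a fixed number of characters of context on each side, so that the keyword lines up in a column. The function name kwic and the width parameter are assumptions of this example, and the sample text is pieced together from lines 12 and 13 of the display.

```python
import re


def kwic(text, keyword, width=30):
    """Return one concordance line per hit, with the keyword aligned in a column.

    Any word that starts with `keyword` counts as a hit (so 'numérique'
    also finds 'numériques'). `width` is the context size in characters.
    """
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\w*", re.IGNORECASE)
    lines = []
    for match in pattern.finditer(text):
        left = text[max(0, match.start() - width):match.start()]
        right = text[match.end():match.end() + width]
        # Right-align the left context so every keyword starts at the same column
        lines.append(f"{left:>{width}} {match.group(0)} {right}")
    return lines


SAMPLE = ("Afin de traiter les textes numériques et les graphiques, "
          "les produits d'imagerie documentaire numériques se basent "
          "sur un logiciel d'imagerie")

if __name__ == "__main__":
    for line in kwic(SAMPLE, "numérique"):
        print(line)
```

Cutting the context by characters rather than by words is what produces the truncated fragments visible at the edges of the display above.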
Collocates for « droits » in a French corpus containing texts about intellectual property 59 respect 28 certains 44 voisins 214 propriété 31 aspects 14 tous 37 fondamentaux 4 concernés 23 protection 5 les 13 d'auteur 4 location 15 titulaires 5 différents 13 réservés 4 protégés 15 Charte 4 nouveaux 8 incorporels 4 — 8 protéger 4 lesdits 8 nationaux 3 page 6 Application 3 ayants 6 protégés 3 6 Inc 1 ces 5 exclusifs 3 brevet 5 respecte 1 II 5 visés 3 libertés 5 Français 1 section 5 d’auteur 2 sinon
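Collocate counts of this kind are obtained by looking at the words that occur within a fixed span around each occurrence of the node word. The following is a minimal sketch under simplifying assumptions: the tokenizer is a plain regular expression, the span is symmetrical, and the short sample text is made up for this example from words that appear in the list above, so its counts have nothing to do with the real figures.

```python
import re
from collections import Counter


def collocates(text, node, window=2):
    """Count the words found within `window` tokens on either side of `node`."""
    # Simplified tokenizer: keeps forms such as "d'auteur" as one token
    tokens = re.findall(r"[\w'’-]+", text.lower())
    counts = Counter()
    for i, token in enumerate(tokens):
        if token == node:
            left = tokens[max(0, i - window):i]
            right = tokens[i + 1:i + 1 + window]
            counts.update(left + right)
    return counts


# Made-up sample text, for illustration only
SAMPLE = ("les droits d'auteur et les droits voisins ; "
          "le respect des droits de propriété")

if __name__ == "__main__":
    for word, freq in collocates(SAMPLE, "droits").most_common():
        print(freq, word)
```

Real concordancers usually go further and rank collocates with an association measure rather than raw frequency, which is why grammatical words such as "les" do not necessarily dominate their lists.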
Terminological extraction • TE is one of the fastest developing applications in the field of natural language processing (NLP), along with computer-assisted translation (CAT). • It is based on the automatic identification of typical terminological syntactic patterns (e.g. ADJ N or N N in English). • Terminological extraction produces a list of “candidate terms” from which the noise must be sifted.
An example of N-ADJ patterns drawn from the same corpus • 532 propriété intellectuelle • 61 parlement européen • 57 propriété industrielle • 49 santé publique • 38 sanctions pénales
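Pattern-based extraction of this kind amounts to sliding a window over part-of-speech-tagged text and keeping the word sequences whose tags match the pattern. The sketch below assumes that the text has already been tagged; the simplified tag names (N, ADJ, DET, CONJ, V, PREP), the hand-tagged sample and the function name extract_candidates are all assumptions of this example rather than the output of a real tagger.

```python
from collections import Counter


def extract_candidates(tagged_tokens, pattern=("N", "ADJ")):
    """Count the word sequences whose tags match `pattern` exactly."""
    size = len(pattern)
    counts = Counter()
    for i in range(len(tagged_tokens) - size + 1):
        window = tagged_tokens[i:i + size]
        if tuple(tag for _, tag in window) == tuple(pattern):
            counts[" ".join(word for word, _ in window)] += 1
    return counts


# Hand-tagged sample with a simplified, hypothetical tagset
TAGGED = [
    ("la", "DET"), ("propriété", "N"), ("intellectuelle", "ADJ"),
    ("et", "CONJ"), ("la", "DET"), ("propriété", "N"), ("industrielle", "ADJ"),
    ("relèvent", "V"), ("de", "PREP"), ("la", "DET"),
    ("propriété", "N"), ("intellectuelle", "ADJ"),
]

if __name__ == "__main__":
    for term, freq in extract_candidates(TAGGED).most_common():
        print(freq, term)
```

The output is only a list of candidate terms: any N-ADJ sequence is counted, whether or not it is a genuine term, which is the noise that has to be sifted afterwards.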
"Intelligent" automatic term extraction needs to focus on word sense disambiguation to reduce the amount of noise. • The frequency criterion cannot be applied too systematically if the extraction process is meant to be comprehensive (many terms occur only once in a given corpus).
Learner corpora • Learner corpora are compiled from texts written by non-native students in a given foreign language. • Study of such corpora allows language teachers to focus on the most frequent grammar mistakes that are typical of a particular language pair, and on any over- or under-used syntactic patterns or lexical items. • The major learner corpus project is the International Corpus of Learner English headed by Sylviane Granger (Université de Louvain-la-Neuve, Belgium).
Bilingual corpora • There are two kinds of bilingual corpora: • Translation corpora, which consist of translated texts that are generally aligned at sentence level (they may involve more than two languages). • Comparable corpora, in which both halves have a common subject matter but are not mutual translations. • The appellation "parallel corpus" is considered ambiguous, as it may be used to refer to either kind of corpus.
Using the Web as a corpus • The Web does not fit most linguists' definitions of a corpus. • Sinclair (1991: 171): "A corpus is a collection of naturally-occurring language text, chosen to characterize a state or variety of a language." • Biber (1998: 4): "a large and principled collection of natural texts"
The Web may be viewed as a very large corpus, which is constantly being updated, and cannot possibly be annotated. • If it is to be used as a sample for linguistic exploration, questions must be raised about what exactly it is representative of. • It is probably biased as regards several social categories (age, gender, social class) and is consequently not representative of general usage. • Furthermore, an undefined percentage of its contents (probably high in English) is posted by non-native speakers.