Corpus-assisted discourse analysis

BBI3210 DR AFIDA MOHAMAD ALI Corpus-assisted discourse analysis

What is a corpus? • In linguistics, a corpus (plural corpora) is a large and structured set of texts (now usually electronically stored, processed and analysed). A corpus may contain texts in a single language (monolingual corpus) or text data in multiple languages (multilingual corpus). Multilingual corpora that have been specially formatted for side-by-side comparison are called aligned parallel corpora. (Webster’s Online Dictionary) • A corpus is a collection of naturally-occurring language text, chosen to characterize a state or variety of a language. (Sinclair, Corpus, Concordance, Collocation, 1991:171)

Benefits of computer analysis of corpora

TYPES OF CORPORA • spoken vs. written • monolingual vs. bi/multilingual • parallel vs. comparable corpora (translation corpora) • general language purpose vs. specialised language purpose • diachronic vs. synchronic • plain text vs. annotated (tagged) text

Spoken Corpora • aim at representing spoken language • London-Lund Corpus (LLC) • Lancaster/IBM Spoken English Corpus (SEC) • Cambridge and Nottingham Corpus of Discourse in English (CANCODE) • Santa Barbara Corpus of Spoken American English (SBCSAE) • Wellington Corpus of Spoken New Zealand English (WSC)

Written Corpora • aim at representing written language • BROWN Corpus (written texts, AE, 1961) • LOB Corpus (comparable to BROWN Corpus, BE, early 1960s) • FROWN Corpus (AE, early 1990s) • FLOB Corpus (BE, early 1990s)

Multilingual Corpora • aim at representing several, at least two, different languages, often with the same text types (for contrastive analyses) • Parallel corpora (source texts plus translations): Canadian Hansard • Comparable corpora (monolingual subcorpora designed using the same sampling techniques): Aarhus corpus of contract law • Multilingual • Bilingual

Multilingual Corpora • Important resources for translation and contrastive studies. Multilingual corpora… • …give new insight into the languages compared • …can be used to study language-specific and universal features • …illuminate differences between source texts and translations • …can be used for a number of practical applications, in lexicography, language teaching, translation, etc.

Parallel Corpora • Bilingual vs. multilingual • Unidirectional (from La to Lb or from Lb to La alone) vs. bidirectional (from La to Lb and from Lb to La) vs. multidirectional (from La to Lb, Lc, etc.)

Comparable corpora • A corpus containing components that are collected using the same sampling techniques and have similar balance and representativeness, e.g. the same proportions of texts of the same genres in the same domains, in a range of different languages, from the same sampling period.

For the latest comprehensive website on corpora and corpus tools, go to http://www.uow.edu.au/~dlee/CBLLinks.htm

Comparable vs. parallel corpora • The sampling frame is essential for comparable corpora but not for parallel corpora, because the texts in a parallel corpus are exact translations of each other.

General Corpora • Broadest type of corpus: very large (more than 10 million words) and containing a variety of language, so that findings from it may be somewhat generalized. • Although no corpus will ever represent all possible language, generalized corpora seek to give users as much of a whole picture of a language as possible. • Used to analyse patterns of language use as a whole.

Examples: • British National Corpus (BNC, 100,106,008 words) • The American National Corpus • ICE – regional corpus • COCA (The Corpus of Contemporary American English) • These large, generalized corpora contain written texts (newspaper and magazine articles, works of fiction and nonfiction, writing from scholarly journals) and spoken transcripts (informal conversations, government proceedings and business meetings).

If generalizations about language as a whole are to be drawn, a large general corpus should be consulted.

Specialized Corpora • Compiled to describe language use in a specific variety, register or genre. • Contains texts of a certain type and aims to be representative of the language of this type. • It can be large or small and is often created to answer very specific questions. • MICASE (1,700,000 words of English spoken in the academic domain) • Contains only spoken language from a university setting

• CHILDES Corpus – contains language used by children • MICUSP (Michigan Corpus of Upper-level Student Papers) – a collection of papers from a range of university disciplines • Medical corpus – contains language used by nurses and hospital staff • Guangzhou Petroleum English Corpus (411,612 words of written English from the petrochemical domain) • HKUST Computer Science Corpus (1,000,000 words of written English sampled from undergraduate textbooks in computer science)

• CPSA (Corpus of Professional Spoken American English) • Specialized corpora are often used in ESP settings • The AWL (Academic Word List) was generated from a specialized corpus of academic texts

Diachronic Corpora • Also known as historical corpora. • Texts date to different periods in time. • Ideal for studying language change and history. • Brown/Frown • LOB/FLOB • Helsinki Diachronic Corpus of English Texts (8th-18th century) • ARCHER Corpus – A Representative Corpus of Historical English Registers (BE and AE, 1650-1990).

Synchronic Corpora • Useful to compare varieties of English. • Texts all date to the same period. • Brown and LOB • Frown and FLOB • International Corpus of English (ICE) (texts produced after 1989) • BNC

Learner/developmental Corpora • Specialized corpus that contains written texts and/or spoken transcripts of language used by students who are currently acquiring the language. • Aim at representing the language as produced by learners of this language. • Learner corpora are often tagged and can be examined, e.g., to see common errors students make.

Learner corpora (LC): L2 acquisition / developmental corpora (DC): L1 acquired by children • International Corpus of Learner English – ICLE (LC): a generalized corpus; contains essays written by English language learners with 14 different native languages. • Standard Speaking Test Corpus (SST): more specialized; e.g., comprises oral interview tests of Japanese learners.

Other examples: • CHILDES (DC) • Cambridge Learner Corpus (LC) • Targeted instruction can be developed for general language teaching or for specific language groups depending on the type of learner corpus.

Pedagogic Corpora • A corpus that contains language used in classroom settings. It can include academic textbooks, transcripts of classroom interactions, or any other written text or spoken transcript that learners encounter in an educational setting.

Uses of Corpora • Lexicography / terminology • Linguistics / computational linguistics • Dictionaries & grammars (Collins Cobuild English Dictionary for Advanced Learners; Longman Grammar of Spoken and Written English) • Critical Discourse Analysis - Study texts in social context - Analyze texts to show underlying ideological meanings and assumptions - Analyze texts to show how other meanings and ways of talking could have been used, and therefore the ideological implications of the ways that things were stated • Literary studies • Translation practice and theory • Language teaching / learning - ESL teaching - LSP teaching (exemplar texts)

Lexicography and corpora: issues such as • How common are different words? • How common are the different senses for a given word across registers? • Do words have systematic associations with other words? • Do words have systematic associations with particular registers or dialects?

Linguistics and Corpora • Research on empirical linguistics • Study language use in various aspects – Verify linguistic theory, e.g. the explanation of definite descriptions – Lexical studies, e.g. the near-synonyms ‘little’ and ‘small’ – Sociolinguistics: compare the language produced by different social groups (e.g. male/female) – Cultural studies, e.g. differences found in two comparable corpora (British/American)

Language Teaching and Corpus-based approach • Corpus-based: use the corpus as a resource • Knowledge: – Know more about English: answer specific questions about certain words, phrases and structures. – Know where the problems are: error analysis on a learner corpus. – Know what should be taught: word frequency, comparing native and learner corpora.

Language Teaching and Corpus-based approach • References: – Create better reference works: dictionaries, grammar books, textbooks. – Verify certain hypotheses about language: find supporting examples or counterexamples. – Use a native corpus as a reference: see whether a form is possible and which alternative is more natural.

Language Teaching and Corpus-based approach • Corpus-based: use the corpus as a resource • Syllabus design: – Native corpora => what is actually used – Learner corpora => what the problems are – Find out which aspects should be given priority – Lexical syllabus = focus on frequency of occurrence – How many words should students know? What are they? – Should students know 90% or 95% of the words?
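The "how many words" question can be made concrete with a coverage count. The sketch below is only an illustration under stated assumptions: the function name `words_needed` and the toy token list are hypothetical, and coverage is measured over running words (tokens) in whatever corpus is supplied.

```python
from collections import Counter

def words_needed(tokens, target=0.95):
    """Return how many of the most frequent word types are needed
    to cover at least `target` (a fraction) of all running words."""
    counts = Counter(tokens)
    total = len(tokens)
    covered = 0
    # Walk down the frequency list, accumulating tokens covered so far.
    for rank, (_, n) in enumerate(counts.most_common(), start=1):
        covered += n
        if covered / total >= target:
            return rank
    return len(counts)

# Toy data (hypothetical): 10 running words, 3 word types.
toy = ["the"] * 6 + ["of"] * 3 + ["corpus"]
print(words_needed(toy, target=0.90))
print(words_needed(toy, target=0.95))
```

On a real native-speaker corpus the same loop gives a rough size for a frequency-based lexical syllabus at each coverage target.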

Language Teaching and Corpus-driven approach • “In a corpus-driven approach the commitment of the linguist is to the integrity of the data as a whole, and descriptions aim to be comprehensive with respect to corpus evidence. The corpus, therefore, is seen as more than a repository of examples to back pre-existing theories or a probabilistic extension to an already well defined system. […] Examples are normally taken verbatim, in other words they are not adjusted in any way to fit the predefined categories of the analyst; recurrent patterns and frequency distributions are expected to form the basic evidence for linguistic categories; the absence of a pattern is considered potentially meaningful.” (Tognini-Bonelli, Corpus linguistics at work, 2001:84)

Language Teaching and Corpus-driven approach • Corpus-driven – provides a new paradigm of teaching/learning – students as researchers – data-driven learning – learn how to use concordances + corpora – extract generalizations from data – Is it possible?

Why use a corpus? • Intuition alone is not enough – Is “starting” always replaceable by “beginning”? – Is it only “time” that is “immemorial”? – “think of” vs. “think about” • Native speaker intuition is unreliable – provides no information on frequency of occurrence – “head” => body part: is this the most used sense? • Helps answer questions of usage easily – More than one character is/are – worth to do / worth doing – toward / towards • Is “sheer” a synonym of “pure”, “complete”, “utter” and “absolute”?

Text vs. Corpus (Tognini-Bonelli 2001: 3)

Text vs. Corpus From time to time there is also the need for high quality information to support particular initiatives, such as the (successful) application for accreditation. Some progress has been made in recording data on the Polytechnic 's rooms and buildings, and on the teaching space requirements of individual courses. These data are analysed, along with the database on course details and students ' course and module registrations, using the methodology in DES Design Note 44. Ad hoc reports are an essential part of any system that aspires not merely to process data routinely but to permit management information to be creamed off the top.

TYPES OF ANALYSIS • Word frequency • Concordance • Collocation • Key word • Dispersion plots

Frequency • Frequency counts can be given as raw counts or as percentages. • Frequency analysis allows: – comparison between different words in a corpus – grammatical forms in a corpus to be ascertained – a word list to be created: a list of all of the words in a corpus along with their frequencies and the percentage contribution that each word makes towards the corpus.
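A word list of this kind can be sketched in a few lines. This is a minimal illustration, not a replacement for a corpus tool: the function name `word_list`, the simple regular-expression tokenizer and the sample sentence are all assumptions made for the example.

```python
import re
from collections import Counter

def word_list(text):
    """Return (word, raw count, percentage of all tokens) rows,
    most frequent first."""
    # Assumed tokenization: lowercase runs of letters and apostrophes.
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    total = len(tokens)
    return [(w, n, 100 * n / total) for w, n in counts.most_common()]

sample = "the cat sat on the mat and the dog sat too"
for word, n, pct in word_list(sample)[:3]:
    print(f"{word:<6}{n:>3}{pct:>7.1f}%")
```

Real corpus software makes more careful tokenization decisions (hyphens, clitics, numbers), and those decisions change the counts.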

Concordance • A concordance is simply a list of all of the occurrences of a particular search term in a corpus, presented within the context that they occur in, usually a few words to the left and right of the search term. A concordance is also sometimes referred to as key word in context (KWIC). Here “key word” simply means the word that is currently under examination, and that can be any word that takes the interest of the researcher.
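A bare-bones KWIC display might look like the sketch below. The function name `kwic`, the `width` parameter (number of context words on each side) and the sample sentence are assumptions for illustration only.

```python
import re

def kwic(text, term, width=3):
    """Return (left context, hit, right context) for every occurrence
    of `term`, with `width` words of context on each side."""
    tokens = re.findall(r"\w+(?:'\w+)?", text)
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() == term.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, tok, right))
    return lines

sample = ("Some progress has been made in recording data on rooms and "
          "buildings. These data are analysed using the methodology.")
for left, hit, right in kwic(sample, "data", width=2):
    # Right-align the left context so the search term lines up in a column.
    print(f"{left:>20}  {hit}  {right}")
```

Aligning the search term in one column is what lets the reader scan the left and right contexts for recurring patterns.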

Collocation • All words co-occur with each other to some degree. • However, when a word regularly appears near another word, and the relationship is statistically significant in some way, then such co-occurring words are referred to as collocates, and the phenomenon of certain words frequently occurring next to or near each other is collocation.
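The slides do not say which statistic is meant, so the sketch below assumes one common choice, a mutual information (MI) score: the log of observed over expected co-occurrence within a window. The function name `collocates`, the `span` parameter and the toy sentence are hypothetical, and corpus tools differ in how they compute the expected value.

```python
import math
from collections import Counter

def collocates(tokens, node, span=2):
    """Return (word, co-occurrence count, MI score) for words found
    within `span` tokens either side of `node`, highest MI first."""
    freq = Counter(tokens)
    n = len(tokens)
    near = Counter()
    for i, tok in enumerate(tokens):
        if tok == node:
            near.update(tokens[max(0, i - span):i])       # left window
            near.update(tokens[i + 1:i + 1 + span])       # right window
    rows = []
    for word, observed in near.items():
        # Assumed expectation: chance co-occurrence given both frequencies
        # and a window of 2 * span positions around each node occurrence.
        expected = freq[node] * freq[word] * 2 * span / n
        rows.append((word, observed, math.log2(observed / expected)))
    return sorted(rows, key=lambda r: r[2], reverse=True)

toy = "strong tea and strong coffee and powerful computers and strong tea".split()
for word, observed, mi in collocates(toy, "strong", span=1):
    print(f"{word:<8}{observed:>3}{mi:>7.2f}")
```

MI tends to favour rare words, so analysts usually also set a minimum co-occurrence frequency or compare it with another measure.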

KEYNESS • The notion of keyness derives from keywords. • Keywords are words which are significantly more frequent in one corpus than another (Hunston 2002). • They are words, either unique or specific, which are found more frequently in a specialised corpus than in a general reference corpus. • These words can be one of the defining characteristics of the specialized corpus.
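"Significantly more frequent" needs a test statistic, which the slides do not name. The sketch below assumes the log-likelihood statistic, a common choice for keyword analysis, with 3.84 (the chi-square critical value for p < 0.05 at one degree of freedom) as an illustrative cut-off. The function names and the toy corpora are hypothetical.

```python
import math
from collections import Counter

def log_likelihood(a, b, c, d):
    """Log-likelihood for a word occurring a times in a study corpus of
    c tokens and b times in a reference corpus of d tokens."""
    e1 = c * (a + b) / (c + d)   # expected count in the study corpus
    e2 = d * (a + b) / (c + d)   # expected count in the reference corpus
    ll = 0.0
    if a:
        ll += a * math.log(a / e1)
    if b:
        ll += b * math.log(b / e2)
    return 2 * ll

def keywords(study_tokens, reference_tokens, threshold=3.84):
    """Return (word, LL) for words relatively more frequent in the study
    corpus, with LL at or above `threshold`, highest LL first."""
    study, ref = Counter(study_tokens), Counter(reference_tokens)
    c, d = len(study_tokens), len(reference_tokens)
    rows = []
    for word, a in study.items():
        b = ref[word]
        if a / c > b / d:        # keep positive keywords only
            ll = log_likelihood(a, b, c, d)
            if ll >= threshold:
                rows.append((word, ll))
    return sorted(rows, key=lambda r: r[1], reverse=True)

# Toy corpora (hypothetical), 100 tokens each.
study = ["corpus"] * 20 + ["the"] * 80
reference = ["corpus"] * 5 + ["the"] * 95
print(keywords(study, reference))
```

With real data the reference corpus should be much larger than the specialised corpus, and the cut-off is usually set more strictly than in this toy example.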

Dispersion • The rate of occurrence of a word or phrase across a particular file or corpus. • A dispersion plot enables us to determine visually whether a term is evenly spread throughout a text or occurs as a central theme in one or more parts of the text.
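A text-only version of a dispersion plot can be sketched by cutting the text into equal segments and marking the segments that contain the term. The function names `dispersion` and `plot`, the `bins` parameter and the toy token list are assumptions for illustration; graphical tools draw one tick per hit instead.

```python
def dispersion(tokens, term, bins=10):
    """Count occurrences of `term` in each of `bins` equal-sized
    segments of the token sequence."""
    counts = [0] * bins
    for i, tok in enumerate(tokens):
        if tok == term:
            counts[i * bins // len(tokens)] += 1
    return counts

def plot(counts):
    """Render a one-line strip: '|' for a segment with hits, '.' without."""
    return "".join("|" if n else "." for n in counts)

# Toy text (hypothetical): 100 filler tokens with three hits.
toy = ["x"] * 100
for position in (3, 5, 95):
    toy[position] = "war"

print(plot(dispersion(toy, "war")))
```

A strip with marks bunched at one end suggests a locally central theme, while marks spread along the whole strip suggest even dispersion.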
