This research outlines the challenges in compiling and analyzing ESP corpora, including the balance of size, representativeness, and the role of context in interpretation. It discusses the units for linguistic analysis and the debate between top-down and bottom-up approaches.
Researching ESP Corpora: Issues in compilation and analysis
Lynne Flowerdew
Outline Compilation • Size • Representativeness • Balance Analysis and interpretation • Units for linguistic analysis • Top-down vs. bottom-up analysis • Role of context in interpretation of corpus data
Compilation Size Commonly held view − the larger the better “…a corpus should be as large as possible and keep on growing” (Sinclair 1991: 18) “…it is important to have a substantial corpus if you want to make claims based on statistical frequency” (Bowker & Pearson 2002: 48)
Compilation But size of corpus • Highly dependent on the phenomenon one is investigating (de Haan 1992) • The lower the frequency of the feature under investigation, the larger the corpus needed (McEnery & Wilson 2001: 154) • Smaller corpora can be used for investigating more common features of language (Biber 1990) • Different picture for ESP corpora (see Flowerdew 2004; Hunston 2009; Koester 2010 for pointers on building small, specialised corpora)
Compilation General vs. ESP corpora (Sinclair 2005: 16)
Compilation Representativeness • Specialised corpora do not exhibit as much internal variation as general corpora • The greater the variation in the corpus texts, the more samples and the larger the corpus required to ensure representativeness (Meyer 2002) • “We should always bear in mind that the assumption of representativeness must be regarded as an act of faith, as at present we have no means of ensuring it, or even evaluating it objectively” (Tognini-Bonelli 2001: 57)
Compilation Corpus of EIA (Environmental Impact Assessment) reports • 60 reports, approx. 225,000 words • Selected on the basis that they represent 23 different environmental consulting companies • Impossible to select an equal number of reports from each of the companies; “convenience sampling” (Meyer 2002) • The larger the company, the more reports catalogued in the library; distribution seen as reflecting the size and importance of the company
Compilation Corpus of EIA reports Balancedness • Balanced corpus would consist of the same amount of text from each of the 23 companies • If EIA reports from different companies were of different lengths then balancing the corpus in terms of number of texts would lead to an imbalance in terms of number of tokens
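The tension between balancing by number of texts and balancing by number of tokens can be illustrated with a minimal Python sketch. The company names and report texts below are hypothetical toy data, not taken from the EIA corpus, and whitespace splitting is only a rough stand-in for real tokenisation.

```python
from collections import Counter

# Hypothetical toy data: (company, report text). Not drawn from the EIA corpus.
reports = [
    ("Company A", "noise impacts will be mitigated by barriers along the road"),
    ("Company B", "no impact"),
    ("Company C", "dust levels were monitored daily"),
]

def balance_summary(docs):
    """Return per-company counts of texts and of whitespace-separated tokens."""
    texts = Counter()
    tokens = Counter()
    for company, text in docs:
        texts[company] += 1
        tokens[company] += len(text.split())
    return texts, tokens

texts, tokens = balance_summary(reports)
for company in texts:
    # Each company contributes one text, but very different token counts.
    print(company, texts[company], tokens[company])
```

Here every company contributes exactly one text, yet the token counts differ widely, which is the imbalance the slide describes.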
Compilation Pragmatic considerations • Size balanced against level of delicacy of investigation (Kennedy 1998) • My investigation is primarily qualitative (phraseologies of keywords for the Problem-Solution pattern) • Investigation is of key vocabulary items, so 225,000 words deemed sufficient
Analysis Units for linguistic analysis • Frequency (Kennedy 1998) • Keywords (Bondi 2001; Flowerdew 2008; Mudraya 2006; Nelson 2006) • Lexical bundles (Biber et al. 2004; Hyland 2007, 2008) • Corpus set up lexically rather than grammatically (Halliday 2004)
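As a rough illustration of one of these units, lexical bundles can be approximated as word sequences that recur across several texts. The sketch below is a simplification under stated assumptions: the function name `lexical_bundles`, the thresholds `min_freq` and `min_texts`, and the sample sentences are all hypothetical, and published studies use much larger corpora and normalised frequency cut-offs.

```python
from collections import Counter, defaultdict

def lexical_bundles(texts, n=4, min_freq=2, min_texts=2):
    """Return n-grams that meet illustrative frequency and range thresholds."""
    freq = Counter()            # total occurrences of each n-gram
    spread = defaultdict(set)   # indices of the texts each n-gram occurs in
    for i, text in enumerate(texts):
        words = text.lower().split()
        for j in range(len(words) - n + 1):
            gram = tuple(words[j:j + n])
            freq[gram] += 1
            spread[gram].add(i)
    return {
        " ".join(gram): count
        for gram, count in freq.items()
        if count >= min_freq and len(spread[gram]) >= min_texts
    }

# Hypothetical sample sentences, loosely modelled on EIA-style wording.
sample = [
    "in order to reduce the noise impacts",
    "measures are needed in order to reduce dust",
]
bundles = lexical_bundles(sample, n=4)
print(bundles)
```

Requiring a bundle to occur in more than one text (the range criterion) guards against sequences that are frequent only because of a single writer's habits.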
Analysis Comparison of ESP corpora with Coxhead’s AWL (Academic Word List) • Disciplinary differences: Hyland & Tse 2007; Chen & Ge 2007; Martinez et al. 2009 • Common core of academic vocabulary (AWL): Paquot 2010; Simpson-Vlach & Ellis 2010
Analysis Corpus analysis driven by type of software • WMatrix (Rayson 2008) • Classifies vocabulary into semantic fields (Ali Mohamed 2007) • ConcGram (Greaves 2009) • Finds sets of words that co-occur (e.g. AB; A*B), allowing up to 12 slots for constituency variation • Searches for positional variation (e.g. AB; BA) • Only a few studies (Cheng 2009; Durrant 2009; Milizia & Spinzi 2009; Warren 2011)
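The two kinds of variation mentioned for ConcGram can be sketched in a few lines of Python. This is not ConcGram's algorithm or interface; it is only a simplified illustration, and the function name `cooccurrences`, the `max_gap` parameter, and the sample sentence are assumptions made for the example.

```python
def cooccurrences(tokens, a, b, max_gap=4):
    """Find spans where a and b co-occur, in either order,
    with at most max_gap intervening words (illustrative only)."""
    hits = []
    for i, tok in enumerate(tokens):
        if tok not in (a, b):
            continue
        other = b if tok == a else a
        # Look rightwards only, so each co-occurrence is reported once.
        for j in range(i + 1, min(i + max_gap + 2, len(tokens))):
            if tokens[j] == other:
                hits.append(" ".join(tokens[i:j + 1]))
                break
    return hits

# Hypothetical sentence combining the two orderings.
sentence = "the scheme will create a noise problem and the problem of noise remains"
hits = cooccurrences(sentence.split(), "noise", "problem", max_gap=2)
print(hits)
```

Allowing intervening words corresponds to constituency variation (AB; A*B), while matching either order corresponds to positional variation (AB; BA).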
Analysis My corpus of EIA reports • WordSmith Tools for keyword extraction (Scott 1999) • Then manually classified lexico-grammar of keywords into causal / non-causal categories • The export scheme will create a noise problem • In order to alleviate the problem of noise… • Severe traffic noise problems already exist in.. • WMatrix for automatic identification of causal categories & ConcGram for positional variation (e.g. problem of noise)
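Keyword extraction compares a word's frequency in the study corpus with its frequency in a reference corpus. The sketch below computes log-likelihood, one statistic commonly used for keyness; it is not presented as the exact procedure applied in WordSmith Tools for this study, and the function name and all counts in the usage line are hypothetical apart from the 225,000-word corpus size given earlier.

```python
import math

def log_likelihood(freq_study, size_study, freq_ref, size_ref):
    """Log-likelihood keyness of one word across two corpora."""
    total = size_study + size_ref
    combined = freq_study + freq_ref
    # Expected frequencies if the word were equally common in both corpora.
    expected_study = size_study * combined / total
    expected_ref = size_ref * combined / total
    ll = 0.0
    for observed, expected in ((freq_study, expected_study),
                               (freq_ref, expected_ref)):
        if observed > 0:  # a zero count contributes nothing to the sum
            ll += observed * math.log(observed / expected)
    return 2 * ll

# Hypothetical counts: a word occurring 50 times in a 225,000-word study
# corpus and 10 times in a 1,000,000-word reference corpus.
score = log_likelihood(50, 225_000, 10, 1_000_000)
print(round(score, 2))
```

A higher score indicates a larger difference in relative frequency between the two corpora; words with equal relative frequency score zero.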
Analysis • Top-down vs. bottom-up In the ‘top-down’ approach, the functional components of a genre are determined first and then all texts in a corpus are analysed in terms of these components. In contrast, textual components emerge from the corpus analysis in the ‘bottom-up’ approach, and the discourse organization of individual texts is then analysed in terms of linguistically-defined textual categories. (Biber, Connor & Upton 2007a: 11)
Analysis Bottom-up starting point • Phraseology of preposition “in” in cancer research articles (Gledhill 2000) • Politeness strategies in two moves in job application letters (Upton & Connor 2001) • Verb-noun collocations in 4 moves in law cases (Bhatia et al. 2004) • Phraseology of “research” in moves in PhD literature reviews (J. Flowerdew & Forest 2010)
Analysis Top-down starting point • Kanoksilapatham’s (2007) corpus study of biochemistry research articles; first developed analytical framework through identifying moves • In reality, many studies toggle between the two (Charles 2006) • Different starting points yield different results (Biber, Connor & Upton 2007b)
Analysis Corpus of EIA reports • Devised a coding system to account for 3 different levels of text • Macrostructure (Intro., Body, Concl.) • Problem – Solution elements • Discourse-based moves (e.g. <obj>; <need>) • Different phraseologies for different sections • …to assess in detail the environmental impacts of …<obj> • ..in order to reduce potential noise impacts. <prso>
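Once move tags are in place, phraseologies can be grouped by section. The sketch below assumes, purely for illustration, that each tag follows the stretch of text it labels, as in the examples on this slide; the actual annotation format of the corpus is not specified, and the pattern, function name, and sample string (adapted from the slide's examples) are hypothetical.

```python
import re
from collections import defaultdict

# Assumed format: a stretch of text followed by its tag, e.g. "... <obj>".
TAG_PATTERN = re.compile(r"([^<>]+)<(\w+)>")

def segments_by_tag(coded_text):
    """Group text segments under the tag that follows them."""
    grouped = defaultdict(list)
    for segment, tag in TAG_PATTERN.findall(coded_text):
        grouped[tag].append(segment.strip())
    return dict(grouped)

coded = ("to assess in detail the environmental impacts <obj> "
         "in order to reduce potential noise impacts. <prso>")
grouped = segments_by_tag(coded)
for tag, segments in grouped.items():
    print(tag, segments)
```

Grouping segments this way makes it straightforward to run separate concordance or frequency queries on each move.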
Role of Context in Interpretation Genre perspective • Goal-driven communicative event associated with particular discourse communities and disciplines • Handford (2010a) asks “how can we relate the specific instance (such as text, discourse move or lexico-grammatical item) to the wider social context within which it occurs …” • Is it possible to interpret the corpus data as a reflection of the context, or conversely, is it possible to rely on contextual features for interpretation of the corpus data? (Flowerdew 2011)
Interpretation • Stubbs (2001a, 2001b) argues that the conventional view, that the meanings of context-sensitive pragmatic markers are usually inferred by speaker / hearer, may be overstated; large-scale corpus studies show that pragmatic meanings can be conventionally encoded in linguistic form • Tognini-Bonelli (2004) considers it possible to “read off” the discursive practices of a discourse community from recurring multiple concordance lines.
Interpretation Corpus of EIA reports • The problems associated with continued pollution… • Health hazards associated with proximity to high tension power lines… • It is expected that there will be no significant residual impacts… • Works at the tunnel portal will create a noise problem…
Interpretation Discursive practices vs. strategies (Handford 2010b) • Discursive practices: signify recurrent patterns of linguistic behaviour and “tie the communication to the wider social context” • Strategies: “merely describe what the individual is trying to achieve within the particular speech event” • Widdowson (2004: 60) points out that it is difficult to assign pragmatic significance to phraseologies in one particular text.
Interpretation • Interpret data related to strategies with reference not only to other co-textual features but also to external contextual information. • Ethnographic perspective sometimes needed for interpretation of context-dependent, pragmatically oriented features • Widdowson (2000: 60) remarks that corpus-based methods focus on the text as product and ‘cannot account for complex interplay of linguistic and contextual factors whereby discourse is enacted’.
Conclusion • No “tailor-made” corpora for teaching (Leech 2008); no “perfect” corpora for research • Corpus linguistic techniques are one of several approaches (cf. the ethnographic dimension) • Corpora are now being used in other applied linguistics areas: textlinguistics, genre analysis, CDA, sociolinguistics, SLA (Flowerdew in press, 2011a, b)