Using corpora in translation studies
What is a corpus?*
A corpus is defined in terms of
• form
• purpose
The word corpus is used to describe a collection of examples of language collected for linguistic study. It can also describe collections of texts stored and accessed electronically (Hunston 2002). Corpus planning and design is functional to some linguistic purpose. It is on this basis that texts are selected and stored, so that they can be studied quantitatively and qualitatively.
*Ref. text: Hunston, S. Corpora in Applied Linguistics, 2002
What are corpora used for?
• Corpora are often used for language teaching and learning. They give information about how a language works.
• They also help calculate the relative frequency of different features.
• Exploring corpora can help students to observe nuances of usage and to make comparisons between languages.
• Corpora are also used to investigate cultural attitudes expressed through language.
• NB: a corpus will not give information about whether something is possible or not, only whether it is frequent or not!
Using corpora in translation
• Corpora are also used in translation.
• Comparable corpora make it possible to compare the use of apparent equivalents.
• Parallel corpora show how words and phrases have been translated in the past.
• General corpora can be used to establish norms of frequency and usage.
What can a corpus do?
• Corpus access software is used to rearrange the information which has been stored so that observations of various kinds can be made.
• It is not the corpus which gives new information about language. It is the software which gives new perspectives on what is already familiar.
• Software packages process data showing:
• frequency
• phraseology
• collocation
Frequency
• Corpus processing allows words to be compared by means of frequency lists.
• Quite obviously, grammar words are more frequent than lexical words. That explains why they are found at the top of the list.
• Frequency lists can be useful for identifying differences between corpora. But comparisons can be made only if the corpora are comparable, i.e. if their length is approximately the same.
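The idea of a frequency list can be sketched in a few lines of Python. This is only an illustration, not the behaviour of any particular corpus package: the `tokenize` and `frequency_list` helpers and the sample sentence are invented here, and the tokenization rule is deliberately crude. Alongside the raw count, the sketch reports each word's share of all tokens, which is one simple way of expressing relative frequency.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into word tokens (a deliberately simple rule)."""
    return re.findall(r"[a-z']+", text.lower())

def frequency_list(text, top=5):
    """Return the `top` most frequent words with raw count and share of all tokens."""
    tokens = tokenize(text)
    counts = Counter(tokens)
    total = len(tokens)
    return [(word, n, n / total) for word, n in counts.most_common(top)]

# Invented sample text, used only to show the shape of the output.
sample = "The cat sat on the mat and the dog sat on the rug."
for word, n, share in frequency_list(sample, top=3):
    print(f"{word:<5} {n:>2} {share:.1%}")
```

Even in this tiny sample the grammar word "the" heads the list, as the slide predicts.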
Concordance
• The most frequent way to access a corpus is through a concordancing program.
• Concordance lines bring together instances of use of words or phrases, so that regularities in use can be observed.
• Concordances also help to understand how nouns or adjectives are used.
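A concordancer can be approximated as a keyword-in-context (KWIC) search. The sketch below is a minimal illustration under stated assumptions: the `concordance` function, its `width` parameter and the sample sentences are all invented for this example, and the context window simply counts words, ignoring sentence boundaries.

```python
import re

def concordance(text, keyword, width=3):
    """Return (left context, node word, right context) for each hit of `keyword`."""
    tokens = re.findall(r"\w+", text)
    lines = []
    for i, token in enumerate(tokens):
        if token.lower() == keyword.lower():
            # Take up to `width` words on either side; sentence breaks are ignored.
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, token, right))
    return lines

# Invented sample text for illustration.
sample = ("She shed tears at the news. The report shed light on the case. "
          "Snakes shed skin every year.")
for left, node, right in concordance(sample, "shed", width=2):
    print(f"{left:>12} [{node}] {right}")
```

Aligning the node word in a column, as the `print` call does, is what lets regularities to its left and right stand out.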
Collocation
• Collocation is the tendency of words to co-occur.
• The collocates of a given word are those words which often occur in conjunction with it.
• Collocation can indicate pairs of lexical items, or the association between a lexical word and its frequent grammatical environment. In the latter case, the term used is colligation.
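As a rough sketch, collocates can be found by counting the words that fall within a fixed span of the node word. The `collocates` function, the `span` parameter and the sample text are assumptions made for this illustration. The sketch uses raw co-occurrence counts only; corpus software normally goes further and applies a statistical measure, which is not attempted here.

```python
import re
from collections import Counter

def collocates(text, node, span=2):
    """Count the words found within `span` tokens of each occurrence of `node`."""
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter()
    for i, token in enumerate(tokens):
        if token == node:
            # Words to the left and right of the node, excluding the node itself.
            window = tokens[max(0, i - span):i] + tokens[i + 1:i + 1 + span]
            counts.update(window)
    return counts

# Invented sample text for illustration.
sample = ("They shed light on the problem. The study shed light on usage. "
          "He shed tears. She shed tears of joy.")
print(collocates(sample, "shed", span=1).most_common(3))
```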
Types of corpora
• A corpus is designed for a particular purpose. Consequently, the type of corpus depends on its purpose:
• Specialized corpus
• General corpus
• Comparable corpora
• Parallel corpora
• Learner corpus
• Historical or diachronic corpus
• Monitor corpus
• Specialized corpus: a corpus of texts of a particular type (editorials, academic articles, lectures, essays, etc.). Specialized corpora reflect the type of language a researcher wants to explore. You may also restrict the corpus to a time frame, to a social setting, or to a given topic.
• General corpus: a corpus of texts of many types, of written or spoken language, or of both. A general corpus is usually much larger than a specialized corpus. Since it can be used to produce reference materials, it is sometimes called a reference corpus.
• Comparable corpora: two or more corpora in different languages, or in different varieties of a language. They are designed to contain the same proportion of text types (e.g. newspaper texts, essays, novels, conversations, etc.). They can be used by translators and learners to identify differences and equivalences in each language.
• Parallel corpora: two or more corpora in different languages, containing translated texts, or texts produced simultaneously in two or more languages (e.g. EU texts). They can be used by translators and learners to find potential equivalents in each language, and to investigate differences between languages.
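Looking up potential equivalents in a parallel corpus amounts to searching one side of a set of aligned sentence pairs and reading off the other side. The sketch below assumes the corpus is already aligned sentence by sentence; the `PAIRS` data and the `find_translations` helper are invented for illustration, and a real parallel corpus would be loaded from aligned files.

```python
import re

# Invented English-Italian sentence pairs, for illustration only.
PAIRS = [
    ("The report sheds light on the issue.", "Il rapporto getta luce sulla questione."),
    ("She shed tears of joy.", "Ha sparso lacrime di gioia."),
    ("The meeting ends early.", "La riunione termina presto."),
]

def find_translations(pairs, word_forms):
    """Return the aligned pairs whose source sentence contains any of `word_forms`."""
    wanted = {form.lower() for form in word_forms}
    hits = []
    for source, target in pairs:
        tokens = set(re.findall(r"\w+", source.lower()))
        if tokens & wanted:
            hits.append((source, target))
    return hits

for source, target in find_translations(PAIRS, {"shed", "sheds"}):
    print(source, "=>", target)
```

The search takes a set of word-forms rather than a single string, so that different forms of the same verb are found together.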
• Learner corpus: a collection of texts produced by learners of a language. It is used to identify differences among learners, the frequency and type of mistakes, etc.
• Historical or diachronic corpus: a corpus of texts from different periods of time. It helps to trace the development of a language over time.
• Monitor corpus: a corpus used to track current changes in a language. It rapidly increases in size, since it is added to annually, monthly, daily, etc. The proportion of text types has to remain constant, so that each year is comparable with every other.
• The use of corpora is not limited to identifying, quantifying and analyzing keywords. Concordance lines offer many instances of use of words or phrases, so that the user can observe regularities in use by means of several examples of the same word or phrase in its natural context.
• Calculating collocation means finding the statistical tendency of words to co-occur. Collocations also highlight some metaphorical uses. A good example is the collocations of the word shed with light, tears, blood, pounds, confidence, hair, skin, labour. In these contexts shed is a verb, and its Italian equivalent varies with the collocate.
• Shed light: fare/gettare luce
• Shed tears: spargere lacrime
• Shed blood: spargere/versare sangue
• Shed pounds: perdere chili/peso
• Shed skin: perdere/mutare la pelle (fare la muda)
• Shed confidence: ispirare fiducia
• Shed hair: perdere il pelo
• Shed labour: disfarsi della manodopera (licenziare)
Key terms
• Type
• Token
• Hapax
• Lemma
• Word-form
• Tagging
• Parsing
• Annotate
• Tokens: the term is used to indicate the words which are counted in a corpus or in a given text.
• But many of these words occur more than once. So, if we count each repeated item once only, the total number changes. A given text, for instance, may have 250 tokens but 194 types (articles, repeated nouns, etc. are counted once only). Hapax legomena, or hapaxes, are those words which occur only once.
• We may also have words which occur in two (or more) different forms: friend and friends, for instance. These are two word-forms which belong to the same lemma. The same is true of go, goes, going, went, gone: five word-forms which belong to the same lemma, go. This implies that when a lemma is used as a keyword, all its different word-forms have to be looked for.
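The distinction between tokens, types, hapaxes and lemmas can be made concrete with a short count. This is a sketch under stated assumptions: the `corpus_stats` function, the small hand-written `LEMMAS` table and the sample sentence are invented here, and real tools rely on a full lemmatizer rather than a lookup table.

```python
import re
from collections import Counter

# Hand-written lemma table for illustration; it covers only the sample below.
LEMMAS = {"goes": "go", "going": "go", "went": "go", "gone": "go", "friends": "friend"}

def corpus_stats(text):
    """Count tokens, types and hapaxes, and group word-forms under their lemma."""
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(tokens)  # one entry per type
    hapaxes = sorted(word for word, n in counts.items() if n == 1)
    lemma_counts = Counter(LEMMAS.get(token, token) for token in tokens)
    return {
        "tokens": len(tokens),
        "types": len(counts),
        "hapaxes": hapaxes,
        "lemmas": lemma_counts,
    }

# Invented sample text for illustration.
sample = "My friend goes home. My friends went home and I go too."
stats = corpus_stats(sample)
print(stats["tokens"], "tokens,", stats["types"], "types")
print("hapaxes:", stats["hapaxes"])
print("lemma go:", stats["lemmas"]["go"], "| lemma friend:", stats["lemmas"]["friend"])
```

Searching by lemma here means that goes, went and go are counted together, which is the point made above about looking for every word-form.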
• Usually word-forms are considered to belong to the same lemma when they belong to the same word-class (verb, noun, adjective, etc.).
• Tagging usually refers to the addition of a code to each word in a corpus, to indicate the part of speech. Automatic tagging is possible, but not fully accurate. Tagging is useful when you want to look at different word categories. For instance, the noun work can be considered separately from the verb.
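Once a corpus is tagged, separating the noun work from the verb work becomes a simple filter. The sketch below does not perform tagging itself: it assumes a text that has already been tagged by hand in a `word_TAG` format, and the tag labels, the `TAGGED` sample and both helper functions are invented for illustration rather than taken from any real tagset or tool.

```python
# Hand-tagged sample in a word_TAG format (an assumed convention for this sketch).
TAGGED = "I_PRON work_VERB at_PREP home_NOUN but_CONJ the_DET work_NOUN is_VERB hard_ADJ"

def parse_tagged(text):
    """Split 'word_TAG' items into (word, tag) pairs."""
    return [tuple(item.rsplit("_", 1)) for item in text.split()]

def find_by_tag(pairs, word, tag):
    """Return the positions where `word` occurs with the given part-of-speech tag."""
    return [i for i, (w, t) in enumerate(pairs) if w.lower() == word and t == tag]

pairs = parse_tagged(TAGGED)
print("noun work at positions:", find_by_tag(pairs, "work", "NOUN"))
print("verb work at positions:", find_by_tag(pairs, "work", "VERB"))
```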
• Corpus parsing is the analysis of a text into its constituents, for instance clauses and groups. This allows you to analyse the different structures in a corpus.
• Just like tagging, parsing can be done automatically, though the output is not very accurate. Manual editing is often necessary.