Using corpora in translation studies
What is a corpus?*
A corpus is defined in terms of
• form
• purpose
The word corpus is used to describe a collection of examples of language collected for linguistic study. It can also describe collections of texts stored and accessed electronically (Hunston 2002). Corpus planning and design serve a specific linguistic purpose. It is on this basis that texts are selected and stored, so that they can be studied quantitatively and qualitatively.
*Ref. text: Hunston, S. Corpora in Applied Linguistics, 2002.
What are corpora used for?
• Corpora are often used for language teaching and learning. They give information about how a language works.
• They also help calculate the relative frequency of different features.
• Exploring corpora can help students to observe nuances of usage and to make comparisons between languages.
• Corpora are also used to investigate cultural attitudes expressed through language.
• NB: a corpus will not give information about whether something is possible or not, only whether it is frequent or not!
Using corpora in translation
• Corpora are also used in translation.
• Comparable corpora make it possible to compare the use of apparent equivalents.
• Parallel corpora make it possible to see how words and phrases have been translated in the past.
• General corpora can be used to establish norms of frequency and usage.
What can a corpus do?
• Corpus access software is used to rearrange the information which has been stored so that observations of various kinds can be made.
• It is not the corpus which gives new information about language. It is the software which gives new perspectives on what is already familiar.
• Software packages process data showing:
• frequency
• phraseology
• collocation
Frequency
• Corpus processing allows comparisons of words in terms of frequency lists.
• Quite obviously, grammar words are more frequent than lexical words. That explains why they are found at the top of the list.
• Frequency lists can be useful for identifying differences between corpora. But comparisons can be made only if the corpora are comparable, i.e. if their length is approximately the same.
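As a minimal sketch of a frequency list, the Python below counts the words of a short text and also reports each count per 1,000 tokens, one common way of expressing relative frequency. The tokenizer, the function names and the sample sentence are assumptions made for illustration; they do not come from any particular corpus package.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into word tokens (a deliberately naive tokenizer)."""
    return re.findall(r"[a-z']+", text.lower())

def frequency_list(text):
    """Return (word, raw count, count per 1,000 tokens), most frequent first."""
    tokens = tokenize(text)
    counts = Counter(tokens)
    total = len(tokens)
    return [(word, n, 1000 * n / total) for word, n in counts.most_common()]

# Hypothetical miniature "corpus", invented for illustration only.
sample = "The cat sat on the mat. The dog sat on the rug near the cat."
freq = frequency_list(sample)

# Print the three most frequent words with raw and relative frequencies.
for word, raw, per_thousand in freq[:3]:
    print(f"{word:<5} {raw:>3} {per_thousand:8.1f}")
```

Even in this tiny sample the grammar word "the" heads the list, as the slide predicts.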
Concordance
• The most frequent way to access a corpus is through a concordancing program.
• Concordance lines bring together instances of use of words or phrases, so that regularities in use can be observed.
• Concordances also help to understand how nouns or adjectives are used.
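What a concordancing program does can be sketched in a few lines of Python. The function below is an illustrative keyword-in-context (KWIC) routine, not the method of any specific tool; its name, the `width` parameter and the sample text are assumptions.

```python
import re

def concordance(text, keyword, width=30):
    """Return keyword-in-context (KWIC) lines with the keyword aligned in the centre."""
    lines = []
    # Match the keyword as a whole word, ignoring case.
    pattern = re.compile(rf"\b{re.escape(keyword)}\b", re.IGNORECASE)
    for match in pattern.finditer(text):
        left = text[max(0, match.start() - width):match.start()]
        right = text[match.end():match.end() + width]
        # Right-align the left context so that every keyword starts in the same column.
        lines.append(f"{left:>{width}} [{match.group()}] {right:<{width}}")
    return lines

# Hypothetical sample text, invented for illustration only.
sample = ("Snakes shed their skin. The report shed light on the problem. "
          "She shed tears of joy.")
kwic = concordance(sample, "shed")
for line in kwic:
    print(line)
```

Aligning the keyword in one column is what lets the eye pick out recurring words to its left and right.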
Collocation
• Collocation is the tendency of words to co-occur.
• The collocates of a given word are those words which often occur in conjunction with it.
• Collocation can indicate pairs of lexical items, or the association between a lexical word and its frequent grammatical environment. In the latter case, the term used is colligation.
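A rough way to find the collocates of a word is to count what appears within a few tokens of it. The sketch below does only that: it uses raw co-occurrence counts rather than a statistical association measure, and the function name, the `span` parameter and the sample text are assumptions for illustration.

```python
import re
from collections import Counter

def collocates(text, node, span=2):
    """Count the words occurring within `span` tokens to the left or right of `node`."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter()
    for i, token in enumerate(tokens):
        if token == node:
            # Note: this simple window ignores sentence boundaries.
            window = tokens[max(0, i - span):i] + tokens[i + 1:i + 1 + span]
            counts.update(window)
    return counts

# Hypothetical sample text, invented for illustration only.
sample = ("She shed tears at the news. The study shed light on the issue. "
          "He shed tears again. New data shed light on the cause.")
result = collocates(sample, "shed")
print(result.most_common(4))
```

In the sample, "tears" and "light" are lexical collocates of "shed", while the recurring "on" hints at the grammatical environment that the slide calls colligation.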
Types of corpora
• A corpus is designed for a particular purpose. Consequently, the type of corpus depends on its purpose:
• Specialized corpus
• General corpus
• Comparable corpora
• Parallel corpora
• Learner corpus
• Historical or diachronic corpus
• Monitor corpus
• Specialized corpus: a corpus of texts of a particular type (editorials, academic articles, lectures, essays, etc.). Specialized corpora reflect the type of language a researcher wants to explore. You may also restrict the corpus to a time frame, to a social setting, or to a given topic.
• General corpus: a corpus of texts of many types, of written or spoken language, or of both. A general corpus is usually much larger than a specialized corpus. Since it can be used to produce reference materials, it is sometimes called a reference corpus.
• Comparable corpora: two or more corpora in different languages, or in different varieties of a language. They are designed to contain the same proportion of text types (e.g. newspaper texts, essays, novels, conversations, etc.). They can be used by translators and learners to identify differences and equivalences in each language.
• Parallel corpora: two or more corpora in different languages, containing translated texts, or texts produced simultaneously in two or more languages (e.g. EU texts). They can be used by translators and learners to find potential equivalents in each language, and to investigate differences between languages.
• Learner corpus: a collection of texts produced by learners of a language. It is used to identify differences among learners, the frequency and type of mistakes, etc.
• Historical or diachronic corpus: a corpus of texts from different periods of time. It helps to trace the development of a language over time.
• Monitor corpus: a corpus used to track current changes in a language. It rapidly increases in size, since it is added to annually, monthly, daily, etc. The proportion of text types has to remain constant, so that each year is comparable with every other.
The use of corpora is not limited to identifying, quantifying and analyzing keywords. Concordance lines offer many instances of use of words or phrases, so that the user can observe regularities in use by means of several examples of the same word or phrase in its natural context.
• Calculating collocation means finding the statistical tendency of words to co-occur, and collocations also highlight some metaphorical uses. A good example is the collocations of the word shed with light, tears, blood, pounds, confidence, hair, skin, labour. In these contexts shed is a verb. As such, its Italian equivalent may vary, depending on the collocate.
• shed light: fare/gettare luce
• shed tears: spargere lacrime
• shed blood: spargere/versare sangue
• shed pounds: perdere chili/peso
• shed skin: perdere/mutare la pelle (fare la muda)
• shed confidence: ispirare fiducia
• shed hair: perdere il pelo
• shed labour: disfarsi della manodopera (licenziare)
Key terms
• Type
• Token
• Hapax
• Lemma
• Word-form
• Tagging
• Parsing
• Annotate
• Tokens: the term is used to indicate the words which are counted in a corpus or in a given text.
• But many of these words occur more than once. So, if we count each repeated item once only, the total number changes. In a given text, for instance, we have 250 tokens but 194 types (articles, repeated nouns, etc. are counted once only). Hapax legomena, or hapaxes, are those words which occur only once.
• We may also have words which occur in two (or more) different forms: friend and friends, for instance. These are two word-forms which belong to the same lemma. The same is true of go, goes, going, went, gone: five word-forms which belong to the same lemma, go. This implies that when using the lemma as a keyword, all its different word-forms have to be looked for.
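The distinctions among tokens, types, hapaxes and lemmas can be made concrete with a short Python sketch. The lemma table below is a hypothetical hand-made lookup covering only the slide's own examples (go, friend); the function name and the sample sentence are likewise assumptions for illustration.

```python
import re
from collections import Counter

# Hypothetical hand-made lemma table; a real project would use a full lemmatizer.
LEMMAS = {"goes": "go", "going": "go", "went": "go", "gone": "go",
          "friends": "friend"}

def corpus_stats(text):
    """Count tokens and types, list hapaxes, and group word-forms under their lemma."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)  # one entry per type
    hapaxes = sorted(word for word, n in counts.items() if n == 1)
    # Word-forms missing from the table are treated as their own lemma.
    lemma_counts = Counter(LEMMAS.get(token, token) for token in tokens)
    return {"tokens": len(tokens), "types": len(counts),
            "hapaxes": hapaxes, "lemma_counts": lemma_counts}

# Hypothetical sample text, invented for illustration only.
sample = "My friend goes home. My friends went home. A friend has gone."
stats = corpus_stats(sample)
print(stats["tokens"], stats["types"])
print(stats["hapaxes"])
print(stats["lemma_counts"]["go"], stats["lemma_counts"]["friend"])
```

Searching for the lemma go in this sample has to pick up goes, went and gone, none of which would be found by searching for the string "go" alone.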
• Usually word-forms are considered to belong to the same lemma when they belong to the same word class (verb, noun, adjective, etc.).
• Tagging usually refers to the addition of a code to each word in a corpus, to indicate the part of speech. Automatic tagging is possible, but not fully accurate. Tagging is useful when you want to look at different word categories. For instance, the noun work can be considered separately from the verb.
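To illustrate why tags are useful, the sketch below reads a small text that has already been tagged and separates the noun work from the verb. The word_TAG notation, the tag names and the sample sentences are assumptions made for illustration; real tagged corpora use their own tagsets and formats.

```python
from collections import Counter

# Hypothetical pre-tagged text in word_TAG format; the tag names are illustrative.
tagged_text = ("They_PRON work_VERB hard_ADV . Her_PRON work_NOUN is_VERB "
               "good_ADJ . We_PRON work_VERB here_ADV .")

def parse_tagged(text):
    """Split word_TAG items into (lowercased word, tag) pairs."""
    pairs = []
    for item in text.split():
        word, sep, tag = item.rpartition("_")
        if sep:  # skip items without a tag, such as bare punctuation
            pairs.append((word.lower(), tag))
    return pairs

def count_by_tag(pairs, word):
    """Count how often `word` occurs under each part-of-speech tag."""
    return Counter(tag for w, tag in pairs if w == word)

pairs = parse_tagged(tagged_text)
work_tags = count_by_tag(pairs, "work")
print(work_tags)
```

Without the tags, all three occurrences of work would be counted together; with them, the noun and the verb can be studied separately.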
• Corpus parsing is the analysis of a text's constituents, for instance clauses and groups. This allows you to analyse the different structures in a corpus.
• Just like tagging, parsing can be done automatically, though the output is not very accurate. Manual editing is often necessary.