ENG 626 CORPUS APPROACHES TO LANGUAGE STUDIES category-based methodology Bambang Kaswanti Purwo bkaswanti@atmajaya.ac.id
Adolphs, Svenja (2006) Ch. 2
▪ text corpora traditionally consist of text only
▪ to make a text usable and reusable for a wider research community, additional information is added to the text
▪ the information takes the form of “mark-up” and “annotation”
three types of information added to the corpus text
▪ mark-up
▪ annotation
▪ metadata
» mark-up
information about typography and layout (textual features)
▪ speaker codes in a transcript of spoken data
▪ codes to mark headings or new paragraphs in written data (using “angle brackets”)
various mark-up systems currently in use
▪ SGML (Standard Generalized Mark-Up Language)
▪ the related XML (Extensible Mark-Up Language)
a pause is marked by a particular code in angle brackets
▪ an opening tag with an identifier symbol (number or letter)
▪ and a closing tag
e.g. <$E> and <\$E> on either side of the word
<p_> begin paragraph
<p/> end paragraph
<h_> begin headline
<h/> end headline
<h|> one word headline
<quote_> begin quotation
<quote/> end quotation
<quote|> one word quotation
<tf_> begin typeface change
<tf/> end typeface change
<tf|> one word typeface change
<*_> <*/> begin/end unusable character
<foreign_> begin foreign words
<foreign/> end foreign words
<foreign|> one foreign word
<$1> That’s right. Er why do you think Twelve’s treated differently? <$E> pause <\$E> Why do you think+ <$4> Erm.
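A minimal sketch of how the angle-bracket codes in the turn above might be separated from the transcribed words. It assumes that speaker codes have the form <$1>, <$4>, etc. and that events are wrapped in <$E> … <\$E>; the function name and the regular expressions are illustrative only, not part of any corpus toolkit.

```python
import re

TURN = ("<$1> That’s right. Er why do you think Twelve’s treated differently? "
        "<$E> pause <\\$E> Why do you think+ <$4> Erm.")

# Event codes: opening <$E>, a description, closing <\$E> (assumed format).
EVENT = re.compile(r"<\$E>\s*(.*?)\s*<\\\$E>")
# Speaker codes: <$ followed by digits, e.g. <$1> (assumed format).
SPEAKER = re.compile(r"<\$(\d+)>")


def split_turns(transcript):
    """Return a list of (speaker, text, events) tuples, one per speaker turn."""
    turns = []
    # With one capturing group, re.split yields
    # [text before first code, speaker, chunk, speaker, chunk, ...].
    parts = SPEAKER.split(transcript)
    for speaker, chunk in zip(parts[1::2], parts[2::2]):
        events = EVENT.findall(chunk)
        # Remove the event mark-up, then normalise whitespace.
        text = " ".join(EVENT.sub(" ", chunk).split())
        turns.append((speaker, text, events))
    return turns


for speaker, text, events in split_turns(TURN):
    print(speaker, "|", text, "|", events)
```

Keeping the events in a separate list, rather than discarding them, means the pause can still be searched for after the words have been cleaned.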
» annotation
analytical information, added
▪ automatically by a software program
▪ in a semi-automated manner
▪ manually
represented with the use of codes
annotation is a superordinate term for tagging and parsing (Hunston 2002, Ch 1)
tagging: addition of a code to each word in a corpus, indicating the part of speech
▪ tags are codes added to each word in a text to identify which parts of speech (POS) individual words represent
▪ programs for automated POS annotation are now widely available and highly accurate
the next slide is an example of a spoken turn in the CANCODE corpus
<$1> That’s right. Er why do you think Twelve’s treated differently? <$E> pause <\$E> Why do you think+ <$4> Erm.
findings with respect to Part of Speech (POS) tagging (Hunston 2002)
▪ NNSs use more DETs, pronouns, and adverbs, and fewer conjunctions, prepositions, and nouns
▪ NSs use more complex and abstract nouns
▪ New Scientist corpus of the Bank of English
▫ work [v] 926 times per million words
▫ work [n] 654 times per million words
▪ spoken corpus
▫ work [v] 1,060 times per million words
▫ work [n] 572 times per million words [much less in comparison to the written corpus]
▪ [written] collocates of the noun point to differences in meaning: their work, the work of [most significant, in the possessives], meaning ‘scientific discovery’; the next most frequent meaning is ‘to describe what non-human entities (e.g. bacteria) do’
▪ [spoken] significant collocates (loads of work, sort of work, at work); most frequent meaning ‘job’
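The per-million figures for work quoted above can be compared directly. The sketch below uses only those four figures to compute the share of noun uses in each corpus; the dictionary layout and the function name are assumptions made for illustration.

```python
# Frequencies per million words for "work", as quoted in the slide.
WORK_PER_MILLION = {
    "written (New Scientist)": {"v": 926, "n": 654},
    "spoken": {"v": 1060, "n": 572},
}


def noun_share(verb_freq, noun_freq):
    """Proportion of all tagged uses of the word that are nouns."""
    return noun_freq / (verb_freq + noun_freq)


for corpus, freq in WORK_PER_MILLION.items():
    share = noun_share(freq["v"], freq["n"])
    print(f"{corpus}: noun share {share:.1%}")
```

Because both corpora are already normalised to a million words, the shares are comparable even though the corpora differ in size.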
And [Cand] the [Dthe] security [Nsg] guard [Nsg] was [VFpastBe] walking [VPpres] about [T] checking [VPpres] everything [Pind] was [VFpastBe] okay [Jbas] and [Cand] and [Cand] then...
Key: [Jbas] adjective, base; [Nsg] noun, singular; [Cand] conjunction, coordinating; [Dthe] definite article; [VFpastBe] verb, finite, past; [VPpres] verb, participle, present; [Pind] pronoun, indefinite.
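A tagged turn like the one above can be read into word/tag pairs and the tags counted. This is a minimal sketch, assuming every tag is a bracketed code that follows its word after a single space; untagged material such as the final "then..." is simply skipped.

```python
import re
from collections import Counter

TAGGED = ("And [Cand] the [Dthe] security [Nsg] guard [Nsg] was [VFpastBe] "
          "walking [VPpres] about [T] checking [VPpres] everything [Pind] "
          "was [VFpastBe] okay [Jbas] and [Cand] and [Cand] then...")

# A word (no spaces), one space, then a tag code in square brackets.
PAIR = re.compile(r"(\S+) \[(\w+)\]")


def parse_tagged(text):
    """Return a list of (word, tag) pairs found in a bracket-tagged string."""
    return PAIR.findall(text)


pairs = parse_tagged(TAGGED)
tag_counts = Counter(tag for _, tag in pairs)

for tag, count in tag_counts.most_common():
    print(tag, count)
```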
parsing: analysis of text into constituents (clauses, groups)
▪ a parsed corpus can be used to count with great accuracy the number of different structures in a corpus
▪ parsing can be done automatically, but the resulting output is often not very accurate
» metadata
▪ information used in the representation of electronic texts
▪ metadata is ‘data about data’: information about the content, source, quality, and other characteristics of a particular text
▪ the data is useful when the corpus is shared and reused by the community
▪ metadata can be put in a separate database (encoded through a mark-up (meta)language) and can be further extended by other users of the same data
next slide: example of information about the source of the text
<encodingDesc>
  <projectDesc>
    <p>Texts were collected to illustrate the full range of twentieth-century spoken and written Swedish, written by native Swedish authors.</p>
  </projectDesc>
  <samplingDecl>
    <p>Sample of 2000 words taken from the beginning of the text.</p>
  </samplingDecl>
  <editorialDecl>
    <interpretation>
      <p>Errors in transcription controlled by using the SUC spell checker, v.2.4</p>
    </interpretation>
  </editorialDecl>
</encodingDesc>
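Because the header above is XML, its fields can be read by a program when the corpus is shared. The sketch below uses Python's standard xml.etree.ElementTree; it assumes the opening tag of the editorial section is spelled editorialDecl so that it matches its closing tag, and the helper name section_text is made up for illustration.

```python
import xml.etree.ElementTree as ET

HEADER = """<encodingDesc>
  <projectDesc>
    <p>Texts were collected to illustrate the full range of twentieth-century
    spoken and written Swedish, written by native Swedish authors.</p>
  </projectDesc>
  <samplingDecl>
    <p>Sample of 2000 words taken from the beginning of the text.</p>
  </samplingDecl>
  <editorialDecl>
    <interpretation>
      <p>Errors in transcription controlled by using the SUC spell checker, v.2.4</p>
    </interpretation>
  </editorialDecl>
</encodingDesc>"""

root = ET.fromstring(HEADER)


def section_text(header_root, tag):
    """Return the whitespace-normalised text of a direct child section, or None."""
    node = header_root.find(tag)
    if node is None:
        return None
    return " ".join("".join(node.itertext()).split())


print(section_text(root, "samplingDecl"))
print(section_text(root, "editorialDecl"))
```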
Hunston Ch. 4
▪ using annotations to explore a corpus is referred to as a “category-based” methodology
▪ the parts of a corpus (the words, phonological units, clauses, etc.) are placed into categories
▪ the categories are used as the basis for corpus searches and statistical manipulation
» tagging: allocating a part of speech (POS) label to each word in a corpus
the tag can carry general or specific information
▪ verb
▪ present participle of a verb
▪ present participle of the verb be
▪ present participle as auxiliary, e.g. being [considered]
DEAL: word forms and frequencies
▪ singular common noun (deal) 115
▪ proper noun (Deal) 1
▪ plural common noun (deals) 5
▪ base form of the verb (deal) 66
▪ present participle (dealing) 51
▪ 3rd ps sg form of the verb (deals) 20
▪ past tense of the verb (dealt) 14
▪ past participle (dealt) 17
not all forms of a lemma behave in the same way (although it cannot be proven that they always behave differently)
▪ the most frequent collocates of LIGHT [v] in the Bank of English: cigarette, came, fire, candle, candles
▪ LIGHT [n] has the collocates red, green, bright, traffic, flashing
▪ LIGHT [adj] collocates with dark, brown, blue, touch, very
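The DEAL counts listed above show what a tagged corpus adds: the same spelling (deal, deals, dealt) can be counted separately by word class. A minimal sketch using only those counts follows; grouping the proper noun Deal under N is an assumption made for illustration.

```python
from collections import Counter

# (word form, tag description, word class, frequency) as given in the slide.
DEAL_FORMS = [
    ("deal", "singular common noun", "N", 115),
    ("Deal", "proper noun", "N", 1),  # grouped under N: an assumption
    ("deals", "plural common noun", "N", 5),
    ("deal", "base form of the verb", "V", 66),
    ("dealing", "present participle", "V", 51),
    ("deals", "3rd person singular of the verb", "V", 20),
    ("dealt", "past tense of the verb", "V", 14),
    ("dealt", "past participle", "V", 17),
]


def totals_by_class(forms):
    """Sum the frequencies for each word class."""
    totals = Counter()
    for _form, _label, word_class, count in forms:
        totals[word_class] += count
    return totals


def totals_by_form(forms):
    """Sum the frequencies for each spelling, ignoring the tags."""
    totals = Counter()
    for form, _label, _word_class, count in forms:
        totals[form] += count
    return totals


by_class = totals_by_class(DEAL_FORMS)
by_form = totals_by_form(DEAL_FORMS)
print(dict(by_class))
print(dict(by_form))
```

Without tags only the by-form totals are available, so the 25 instances of deals could not be split into 5 nouns and 20 verbs.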
▪ total occurrences of word classes in a particular corpus can be counted (Table 4.1)
▪ nouns are most common in news and academic prose, least common in conversation
▪ verbs and adverbs are common in conversation
▪ in conversation, speakers use more pronouns than nouns
▪ in news and academic prose: more nouns than pronouns
▪ because nouns are often used with DET and PREP, registers with a high frequency of nouns also have a high frequency of DET and PREP
▪ because AUX and particles co-occur with verbs, conversation, where verbs are frequent, also has a high frequency of AUX and particles
▪ Dutch, Finnish, and French speakers writing English all use fewer of the following tag sequences than NSs
▫ prep-article-noun (in the morning)
▫ article-noun-prep (a debate on)
▫ noun-prep-noun (part of speech)
▫ noun-prep-article (concern for the)
▪ NNS writers do not use prepositions in a “native-like” way
▪ NNS writers use fewer of the lengthy NPs that are essential to formal, particularly academic, writing in English
• corpus tagging needs to be done automatically; the labor of adding tags by hand would outweigh the advantages of having them
• taggers tend to work on a mixture of two principles: rules governing word class, and probability
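Tag sequences such as prep-article-noun can be counted once a text is tagged. The sketch below counts three-tag sequences in a short, invented tagged sample; the sample, the tag labels (ART, N, PREP), and the function name are all hypothetical.

```python
from collections import Counter


def tag_trigrams(tagged_words):
    """Count every sequence of three consecutive tags."""
    tags = [tag for _word, tag in tagged_words]
    return Counter(zip(tags, tags[1:], tags[2:]))


# Hypothetical tagged sample built from the phrases on the slide.
SAMPLE = [
    ("a", "ART"), ("debate", "N"), ("on", "PREP"),
    ("the", "ART"), ("budget", "N"), ("in", "PREP"),
    ("the", "ART"), ("morning", "N"),
]

counts = tag_trigrams(SAMPLE)
for sequence, count in counts.most_common():
    print("-".join(sequence), count)
```

Running the same count over an NS corpus and an NNS corpus, and normalising by corpus size, would give the kind of comparison reported above.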
• rule: if light follows the DET a, it may be a noun or an adjective; it is unlikely to be a verb
• when applying the rules fails to identify the word class, many taggers use probability, based on the overall frequency of the word and word class
e.g. a program fails to identify an instance of deal: N or V? deal occurs more frequently as [n] than as [v], so the tagger chooses N
• automatic taggers are usually claimed to have an accuracy rate of over 90% (but the tagger may still be wrong)
» parsing
analyzing the sentences in a corpus into their constituent parts
• the parser identifies the boundaries of sentences, clauses, and phrases
• the parser assigns labels to the parts identified: adv clause, nominal clause, relative clause, adj phrase, prep phrase
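The two tagging principles described above (a rule about word class, then probability) can be sketched in a few lines. This is a toy illustration, not how any real tagger is implemented: the lexicon is hypothetical, the counts for deal are the singular-noun and base-verb figures from the DEAL list, and the counts for light are invented.

```python
# Hypothetical lexicon: possible word classes for each word, with frequencies.
LEXICON = {
    "deal": {"N": 115, "V": 66},              # figures from the DEAL list
    "light": {"V": 60, "N": 50, "ADJ": 30},   # invented counts
}
DETERMINERS = {"a", "an", "the"}


def tag_word(word, previous_word=None):
    """Pick a word class by rule first, then by frequency."""
    candidates = LEXICON[word.lower()]  # assumes the word is in the lexicon
    # Rule: directly after a determiner, a verb reading is unlikely.
    if previous_word is not None and previous_word.lower() in DETERMINERS:
        without_verb = {c: n for c, n in candidates.items() if c != "V"}
        # Keep the original candidates if the rule would leave nothing.
        candidates = without_verb or candidates
    # Probability: choose the most frequent remaining word class.
    return max(candidates, key=candidates.get)


print(tag_word("deal"))         # no rule applies, frequency decides
print(tag_word("light"))        # frequency alone
print(tag_word("light", "a"))   # rule removes the verb reading first
```

The last two calls show why the rule matters: with these invented counts, frequency alone would label light a verb even after a.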
• The victim’s friends told the police that Krueger drove into the quarry
▪ the whole is identified as a sentence
▪ the victim’s friends – a noun phrase, within which the victim’s is identified as genitive
▪ told the police that Krueger drove into the quarry – a verb phrase
▪ the police – a noun phrase
▪ that Krueger drove into the quarry – a nominal clause
▪ Krueger – a noun phrase
▪ drove into the quarry – a verb phrase
▪ into the quarry – a preposition phrase
▪ the quarry – a noun phrase (dependent on the prep)
the same word sequence can need different analyses:
(1) Don’t sell harmful dairy products.
(2) How harmful dairy products are!
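The analysis of the Krueger sentence above can be stored as a nested structure, which is what makes structures countable in a parsed corpus. A minimal sketch follows, assuming a tree of tuples whose first element is the label; the label names (S, NP, GEN, VP, NCL, PP) are chosen for illustration.

```python
from collections import Counter

# Each tuple is (label, child, child, ...); plain strings are words.
TREE = (
    "S",
    ("NP", ("GEN", "The victim's"), "friends"),
    ("VP", "told",
        ("NP", "the police"),
        ("NCL", "that",
            ("NP", "Krueger"),
            ("VP", "drove",
                ("PP", "into", ("NP", "the quarry"))))),
)


def count_labels(node, counts=None):
    """Count how often each constituent label occurs in the tree."""
    if counts is None:
        counts = Counter()
    if isinstance(node, tuple):
        counts[node[0]] += 1
        for child in node[1:]:
            count_labels(child, counts)
    return counts


def words(node):
    """Return the words of a constituent in their original order."""
    if isinstance(node, str):
        return [node]
    collected = []
    for child in node[1:]:
        collected.extend(words(child))
    return collected


label_counts = count_labels(TREE)
print(dict(label_counts))
print(" ".join(words(TREE)))
```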
▪ it is difficult for a parser to do the analysis completely accurately; parsed corpora need to be edited by hand to achieve a greater degree of accuracy
▪ parsed corpora are the basis for much of the statistical work that has been done on different registers
▪ Biber et al. (1998) examine the use of BEGIN and START in two small sub-corpora from the Longman-Lancaster Corpus: fiction and academic prose
different ways of identifying the verb: intransitive, transitive [+ an NP], + a to-clause, + an -ing clause?
▪ START [intransitive] 64% in academic prose
▪ BEGIN + a to-clause 72% in fiction; + an -ing clause only 4%
▪ the difference is explainable with reference to the function of the uses in the register concerned
▪ [in academic prose] intransitive START is frequent, indicating the start of a process (which is frequent in this text type) e.g. Blood loss started about the eighth day of infection …
▪ [in fiction] BEGIN is followed by a to-clause to describe the start of an action e.g. I began to move instinctively to my right
▪ or a reaction to events e.g. I began to feel uneasy …
▪ see Table 4.2
▫ some verbs (+ motion) move, walk, fall, run are used with both complementation patterns
▫ some verbs of thinking and feeling (feel, think, wonder) are used in the to-clause only
▪ one other use of a parsed corpus: to teach grammatical analysis to students (McEnery et al. 1997)
▪ McEnery et al.’s study: students who have practiced analysis with a computer, using a parsed corpus, do better than equivalent students who have been taught by a human being
▪ the absence of a human judge might do much to reduce the level of anxiety often associated with learning how to do grammatical analysis