
  1. ENG 626 CORPUS APPROACHES TO LANGUAGE STUDIES category-based methodology Bambang Kaswanti Purwo bkaswanti@atmajaya.ac.id

  2. Adolphs, Svenja (2006) Ch. 2
  Text corpora – traditionally – consist of text only. To make a text usable and reusable for a wider research community → additional information is added to the text. This information takes the form of “mark-up” and “annotation”.
  Three types of information are added to corpus texts:
  ▪ mark-up
  ▪ annotation
  ▪ metadata
  » mark-up: information about typography and layout (textual features), e.g.
  ▪ speaker codes in a transcript of spoken data
  ▪ codes to mark headings or new paragraphs in written data (using “angle brackets”)

  3. Various mark-up systems are currently in use:
  ▪ SGML (Standard Generalized Mark-Up Language)
  ▪ the related XML (Extensible Mark-Up Language)
  A pause is marked by a particular code in angle brackets:
  ▪ an opening tag with an identifier symbol (a number or letter)
  ▪ and a closing tag, <$E> and <\$E>, on either side of the word “pause”
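Where angle-bracket mark-up of this kind needs to be located or removed before analysis, a simple pattern match is usually enough. A minimal sketch in Python (the transcript string and the regular expression are illustrative assumptions, not part of the SGML/XML standards mentioned above):

import re

# A CANCODE-style spoken turn: a speaker code and a pause marked with angle-bracket tags
turn = ("<$1> That's right. Er why do you think Twelve's treated differently? "
        "<$E> pause <\\$E> Why do you think+")

# Find every angle-bracket code (speaker codes, opening and closing pause tags)
print(re.findall(r"<\\?\$\w+>", turn))

# Strip the codes to leave plain text for, e.g., word counts
# (the word "pause" between the tags is left in place in this simple sketch)
plain = re.sub(r"<\\?\$\w+>", "", turn)
print(" ".join(plain.split()))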

  4. <p_> begin paragraph   <p/> end paragraph
  <h_> begin headline   <h/> end headline   <h|> one-word headline
  <quote_> begin quotation   <quote/> end quotation   <quote|> one-word quotation
  <tf_> begin typeface change   <tf/> end typeface change   <tf|> one-word typeface change
  <*_> <*/> begin/end unusable character
  <foreign_> begin foreign words   <foreign/> end foreign words   <foreign|> one foreign word
  <$1> That’s right. Er why do you think Twelve’s treated differently? <$E> pause <\$E> Why do you think+
  <$4> Erm.

  5. » annotation: analytical information added
  ▪ automatically by a software program
  ▪ in a semi-automated manner
  ▪ manually
  and represented with the use of codes. Annotation is a superordinate term for tagging and parsing (Hunston 2002, Ch. 1).
  Tagging: the addition of a code to each word in a corpus, including the part of speech.
  ▪ tags are codes added to each word in a text to identify which part of speech (POS) each word represents
  ▪ programs for automated POS annotation are now widely available and highly accurate
  The next slide shows an example of a spoken turn in the CANCODE corpus.

  6. <$1> That’s right. Er why do you think Twelve’s treated differently? <$E> pause <\$E> Why do you think+
  <$4> Erm.

  7. Findings with respect to Part of Speech (POS) tagging (Hunston 2002):
  ▪ NNSs use more determiners, pronouns, and adverbs, and fewer conjunctions, prepositions, and nouns
  ▪ NSs use more complex and abstract nouns
  ▪ New Scientist corpus of the Bank of English
  ▫ work [v] 926 times per million words
  ▫ work [n] 654 times
  ▪ spoken corpus
  ▫ work [v] 1,060 times per million words
  ▫ work [n] 572 times [much less than in the written corpus]
  ▪ [written] collocations of the noun point to differences in meaning: their work, the work of [most significant, in the possessives] ‘scientific discovery’; the next most frequent meaning: ‘to describe what non-human entities (e.g. bacteria) do’
  ▪ [spoken] significant collocates (loads of work, sort of work, at work); most frequent meaning ‘job’

  8. And [Cand] the [Dthe] security [Nsg] guard [Nsg] was [VFpastBe] walking [VPpres] about [T] checking [VPpres] everything [Pind] was [VFpastBe] okay [Jbas] and [Cand] and [Cand] then...
  Key: [Jbas] adjective, base; [Cand] conjunction, coordinating; [Dthe] definite article; [Nsg] noun, singular; [VFpastBe] verb, finite, past; [VPpres] verb, present participle; [Pind] pronoun, indefinite.
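For comparison, a minimal sketch of automated POS tagging, assuming Python with NLTK (NLTK uses the Penn Treebank tagset rather than the CANCODE-style codes above, and the resource names passed to nltk.download can vary between NLTK versions):

import nltk

# One-off model downloads; exact resource names can vary between NLTK versions
nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)

sentence = "And the security guard was walking about checking everything was okay"

# Tokenise the sentence, then assign a part-of-speech code to each word
tokens = nltk.word_tokenize(sentence)
print(nltk.pos_tag(tokens))
# e.g. [('And', 'CC'), ('the', 'DT'), ('security', 'NN'), ('guard', 'NN'),
#       ('was', 'VBD'), ('walking', 'VBG'), ...]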

  9. Parsing: analysis of text into constituents (clauses, groups)
  ▪ a parsed corpus can be used to count with great accuracy the number of different structures in a corpus
  ▪ parsing can be done automatically, but the resulting output is often not very accurate
  » metadata
  ▪ information used in the representation of electronic texts
  ▪ metadata is ‘data about data’: information about the content, source, quality, and other characteristics of a particular text
  ▪ this data is useful when the corpus is shared and reused by the community
  ▪ metadata can be put in a separate database (encoded through a mark-up (meta)language) and can be further extended by other users of the same data
  Next slide: an example of information about the source of the text.

  10. <encodingDesc>
    <projectDesc>
      <p>Texts were collected to illustrate the full range of twentieth-century spoken and written Swedish, written by native Swedish authors.</p>
    </projectDesc>
    <samplingDecl>
      <p>Sample of 2000 words taken from the beginning of the text.</p>
    </samplingDecl>
    <editorialDecl>
      <interpretation>
        <p>Errors in transcription controlled by using the SUC spell checker, v.2.4</p>
      </interpretation>
    </editorialDecl>
  </encodingDesc>
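A minimal sketch, assuming only Python's standard library, of how a TEI-style encoding description like the one above might be read programmatically so the metadata can be stored or queried separately from the text (the header string is abridged from the example):

import xml.etree.ElementTree as ET

# Abridged version of the encoding description shown above
header = """
<encodingDesc>
  <projectDesc><p>Texts were collected to illustrate the full range of
    twentieth-century spoken and written Swedish.</p></projectDesc>
  <samplingDecl><p>Sample of 2000 words taken from the beginning of the text.</p></samplingDecl>
</encodingDesc>
"""

root = ET.fromstring(header.strip())
# Walk the declarations and print each one's prose description
for decl in root:
    prose = " ".join("".join(decl.find("p").itertext()).split())
    print(decl.tag, "->", prose)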

  11. Hunston Ch. 4
  ▪ using annotation to explore a corpus is referred to as a “category-based” methodology
  ▪ the parts of a corpus (the words, phonological units, clauses, etc.) are placed into categories
  ▪ the categories are used as the basis for corpus searches and statistical manipulation
  » tagging: allocating a part of speech (POS) label to each word in a corpus
  The tag can be chosen to carry general or specific information:
  ▪ verb
  ▪ present participle of a verb
  ▪ present participle of the verb be
  ▪ present participle as auxiliary, e.g. being [considered]

  12. The lemma DEAL:
  ▪ singular common noun (deal) 115
  ▪ proper noun (Deal) 1
  ▪ plural common noun (deals) 5
  ▪ base form of the verb (deal) 66
  ▪ present participle (dealing) 51
  ▪ 3rd person singular form of the verb (deals) 20
  ▪ past tense of the verb (dealt) 14
  ▪ past participle (dealt) 17
  → not all forms of a lemma behave in the same way (although it cannot be proven that they always behave differently)
  ▪ the most frequent collocates of LIGHT [v] in the Bank of English: cigarette, came, fire, candle, candles
  ▪ LIGHT [n] has the collocates red, green, bright, traffic, flashing
  ▪ LIGHT [adj] collocates with dark, brown, blue, touch, very
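A minimal sketch of counting the forms of a lemma separately by word-class, assuming Python with NLTK and its Brown corpus as a stand-in (the resulting figures will of course differ from the Bank of English counts above):

from collections import Counter
import nltk

nltk.download("brown", quiet=True)
from nltk.corpus import brown

# Count each (form, tag) pair for the forms of the lemma DEAL
forms = {"deal", "deals", "dealing", "dealt"}
counts = Counter(
    (word.lower(), tag)
    for word, tag in brown.tagged_words()
    if word.lower() in forms
)

for (form, tag), n in counts.most_common():
    print(f"{form:8s} {tag:8s} {n}")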

  13. ▪ total occurrences of word-classes in a particular corpus can be counted (Table 4.1)
  ▪ nouns are most common in news and academic prose, least common in conversation
  ▪ verbs and adverbs are common in conversation
  ▪ in conversation speakers use more pronouns than nouns
  ▪ in news and academic prose: more nouns than pronouns
  ▪ because nouns are often used with determiners and prepositions, a high frequency of nouns → also a high frequency of determiners and prepositions
  ▪ because auxiliaries and particles co-occur with verbs, in conversation verbs are high in frequency, and so are auxiliaries and particles
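A minimal sketch of comparing word-class totals across registers, assuming Python with NLTK's Brown corpus, whose 'news', 'learned', and 'fiction' categories loosely stand in for the registers discussed above (the counts are not those of Table 4.1):

from collections import Counter
import nltk

nltk.download("brown", quiet=True)
nltk.download("universal_tagset", quiet=True)
from nltk.corpus import brown

def tag_profile(category):
    """Relative frequency of the five most common word-classes in one Brown category."""
    tags = [tag for _, tag in brown.tagged_words(categories=category, tagset="universal")]
    total = len(tags)
    return {tag: round(n / total, 3) for tag, n in Counter(tags).most_common(5)}

# 'news' and 'learned' loosely correspond to news and academic prose
for category in ["news", "learned", "fiction"]:
    print(category, tag_profile(category))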

  14. ▪ Dutch, Finnish, and French speakers writing English all use fewer of the following tag sequences than NSs do:
  prep-article-noun (in the morning)
  article-noun-prep (a debate on)
  noun-prep-noun (part of speech)
  noun-prep-article (concern for the)
  → NNS writers do not use prepositions in a “native-like” way; NNS writers use fewer of the lengthy noun phrases that are essential to formal, particularly academic, writing in English
  • corpus tagging needs to be done automatically; the labor of adding tags by hand would outweigh the advantages of having them
  • taggers tend to work on a mixture of two principles: rules governing word-class, and probability
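A minimal sketch of counting such tag sequences, assuming Python with NLTK's Brown corpus and its simplified 'universal' tagset, in which ADP, DET, and NOUN stand in for prep, article, and noun:

from collections import Counter
import nltk

nltk.download("brown", quiet=True)
nltk.download("universal_tagset", quiet=True)
from nltk.corpus import brown

# The tag sequences of interest, using universal tags (ADP=prep, DET=article, NOUN=noun)
patterns = {
    ("ADP", "DET", "NOUN"): "prep-article-noun  (in the morning)",
    ("DET", "NOUN", "ADP"): "article-noun-prep  (a debate on)",
    ("NOUN", "ADP", "NOUN"): "noun-prep-noun     (part of speech)",
    ("NOUN", "ADP", "DET"): "noun-prep-article  (concern for the)",
}

tags = [tag for _, tag in brown.tagged_words(categories="news", tagset="universal")]
trigrams = Counter(zip(tags, tags[1:], tags[2:]))

for trigram, label in patterns.items():
    print(f"{label:35s} {trigrams[trigram]}")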

  15. • if light follows the determiner a, it may be a noun or an adjective; it is unlikely to be a verb
  • when applying the rules fails to identify the word-class, many taggers use probability, based on the overall frequency of the word and word-class; e.g. if a program fails to identify an instance of deal (noun or verb?), deal more frequently occurs as [n] than as [v] → tag it as a noun
  • automatic taggers are usually claimed to have an accuracy rate of over 90% (but the tagger may be wrong)
  » parsing: analyzing the sentences in a corpus into their constituent parts
  • the parser identifies boundaries of sentences, clauses, and phrases
  • the parser assigns labels to the parts identified: adverbial clause, nominal clause, relative clause, adjective phrase, prepositional phrase
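A minimal sketch of the probability principle, assuming Python with NLTK: a unigram tagger assigns each known word its most frequent tag in the training data and backs off to a default word-class for unseen words:

import nltk

nltk.download("brown", quiet=True)
nltk.download("universal_tagset", quiet=True)
from nltk.corpus import brown

train = brown.tagged_sents(categories="news", tagset="universal")

# Back off to the overall most likely word-class when a word has not been seen
default = nltk.DefaultTagger("NOUN")
unigram = nltk.UnigramTagger(train, backoff=default)

# Each word gets its most frequent tag in the training data; this is often
# right (deal is more often a noun than a verb) but it can be wrong in context
print(unigram.tag("They deal with a light touch".split()))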

  16. • The victim’s friends told the police that Krueger drove into the quarry
  ▪ the whole is identified as a sentence
  ▪ the victim’s friends – a noun phrase, within which the victim’s is identified as a genitive
  ▪ told the police that Krueger drove into the quarry – a verb phrase
  ▪ the police – a noun phrase
  ▪ that Krueger drove into the quarry – a nominal clause
  ▪ Krueger – a noun phrase
  ▪ drove into the quarry – a verb phrase
  ▪ into the quarry – a prepositional phrase
  ▪ the quarry – a noun phrase (dependent on the preposition)
  (1) Don’t sell harmful dairy products. (2) How harmful dairy products are!
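A minimal sketch of representing such a constituent analysis, assuming Python with NLTK; the bracketing below simply transcribes the labels listed above and is not the output of an automatic parser:

import nltk

# Hand-written bracketing of the example sentence, following the labels above
parse = nltk.Tree.fromstring("""
(S
  (NP (GEN The victim's) friends)
  (VP told
      (NP the police)
      (SBAR that
        (S (NP Krueger)
           (VP drove (PP into (NP the quarry)))))))
""")

parse.pretty_print()                       # draw the constituent tree as ASCII art
for np in parse.subtrees(lambda t: t.label() == "NP"):
    print("noun phrase:", " ".join(np.leaves()))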

  17. ▪ it is difficult for a parser to do the analysis completely accurately → parsed corpora are edited by hand to achieve a greater degree of accuracy
  ▪ parsed corpora are the basis for much of the statistical work that has been done on different registers
  ▪ Biber et al. (1998) examine the use of BEGIN and START in two small sub-corpora from the Longman-Lancaster Corpus: fiction and academic prose
  The verbs are classified by complementation: intransitive, transitive [+ NP], + a to-clause, + an -ing clause
  ▪ START [intransitive]: 64% in academic prose
  ▪ BEGIN + a to-clause: 72% in fiction; + -ing clause only 4%
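A minimal sketch of counting complementation patterns of BEGIN from POS-tagged text, assuming Python with NLTK's Brown corpus as a rough stand-in for the Longman-Lancaster sub-corpora; a real replication would use a parsed corpus, as the slide notes:

from collections import Counter
import nltk

nltk.download("brown", quiet=True)
nltk.download("universal_tagset", quiet=True)
from nltk.corpus import brown

BEGIN = {"begin", "begins", "began", "beginning", "begun"}

def complementation(category):
    """Rough counts of what immediately follows forms of BEGIN in one Brown category."""
    counts = Counter()
    for sent in brown.tagged_sents(categories=category, tagset="universal"):
        for i, (word, _) in enumerate(sent[:-1]):
            if word.lower() in BEGIN:
                nxt_word, nxt_tag = sent[i + 1]
                if nxt_word.lower() == "to":
                    counts["+ to-clause"] += 1
                elif nxt_tag == "VERB" and nxt_word.lower().endswith("ing"):
                    counts["+ -ing clause"] += 1
                else:
                    counts["other"] += 1
    return counts

# 'fiction' and 'learned' loosely stand in for fiction and academic prose
for category in ["fiction", "learned"]:
    print(category, dict(complementation(category)))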

  18. ▪ the difference is explainable with reference to the function of these uses in the register concerned
  ▪ [in academic prose] intransitive START is frequent → indicating the start of a process (which is frequent in this text type), e.g. Blood loss started about the eighth day of infection …
  ▪ [in fiction] BEGIN followed by a to-clause → describes the start of an action, e.g. I began to move instinctively to my right
  ▪ or → a reaction to events, e.g. I began to feel uneasy …
  ▪ see Table 4.2 → some verbs of motion (move, walk, fall, run) are used with a range of complementation patterns → some verbs of thinking and feeling (feel, think, wonder) are used with a to-clause only

  19. ▪ one other use of a parsed corpus: to teach grammatical analysis to students (McEnery et al. 1997)
  ▪ McEnery et al.’s study: students who have practiced analysis with a computer, using a parsed corpus, do better than equivalent students who have been taught by a human being
  ▪ the absence of a human judge might do much to reduce the level of anxiety often associated with learning how to do grammatical analysis
