CORPUS ANNOTATION

michi
Presentation Transcript


  1. CORPUS ANNOTATION • Extralinguistic annotation (‘mark-up’) and linguistic annotation (‘tagging’, ‘parsing’, etc.) • Why is mark-up essential in corpus building? • What is TEI? • What are the advantages and the disadvantages of (linguistic) annotation? • What are the main methods of corpus annotation? What are the benefits and drawbacks of each one? • What are the main uses of tagging? • Other kinds of annotation

  2. CORPUS ANNOTATION McEnery, T., R. Xiao and Y. Tono (2006), "Corpus mark-up" and "Corpus annotation", in Corpus-Based Language Studies: An Advanced Resource Book. London: Routledge, 22-45.

  3. Extralinguistic annotation (‘mark-up’) vs linguistic annotation (‘tagging’, ‘parsing’, etc.) • Extralinguistic annotation (‘mark-up’) • A system of standard codes inserted into an electronic document to provide information about the text. • Kinds of information provided by mark-up: • Internal organization of text: sections, paragraphs, sentences… • External (‘contextual’) information: source of text, authors, age, gender, textual category, number of speakers, etc. • Linguistic annotation (‘tagging’, ‘parsing’, etc.) • A system of standard codes inserted into an electronic document to provide linguistic information found in the text.

  4. Examples of Extralinguistic annotation (‘mark-up’)
     <title> How we won the open: the caddies' stories. Sample containing about 36083 words from a book (domain: leisure) </title><!-- ASA-->
     <title> Harlow Women's Institute committee meeting. Sample containing about 246 words speech recorded in public context</title>
     <title> The Scotsman: Arts section. Sample containing about 48246 words from a periodical (domain: arts) </title>
     <title>32 conversations recorded by `Frank' (PS09E) between 21 and 28 February 1992 with 9 interlocutors, totalling 3193 s-units, 20607 words, and 3 hours 22 minutes 23 seconds of recordings.</title>
     <title>[Leaflets advertising goods and products]. Sample containing about 23409 words of miscellanea (domain: commerce)</title>
     <person age="Ag0" dialect="XLO" xml:id="PS5A1" role="self" sex="m" soc="C2">
       <name>Terry</name>
       <age>14</age>
       <occupation>student</occupation>
       <dialect>London</dialect>
     </person>
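Because mark-up of this kind is machine-readable, the contextual information about a speaker can be pulled out automatically. The following is a minimal Python sketch using only the standard library; the function name `speaker_metadata` and the dictionary keys `id`, `sex` and `social_class` are illustrative choices, not part of the BNC or TEI.

```python
import xml.etree.ElementTree as ET

# The <person> element from the slide above, copied as-is.
PERSON_XML = """
<person age="Ag0" dialect="XLO" xml:id="PS5A1" role="self" sex="m" soc="C2">
  <name>Terry</name>
  <age>14</age>
  <occupation>student</occupation>
  <dialect>London</dialect>
</person>
"""

# The xml: prefix is predefined in XML; ElementTree exposes it under this namespace.
XML_NS = "{http://www.w3.org/XML/1998/namespace}"


def speaker_metadata(xml_text):
    """Return the speaker's contextual information as a plain dictionary."""
    person = ET.fromstring(xml_text.strip())
    info = {
        "id": person.get(XML_NS + "id"),   # hypothetical key names
        "sex": person.get("sex"),
        "social_class": person.get("soc"),
    }
    # Child elements (name, age, occupation, dialect) hold the readable values.
    for child in person:
        info[child.tag] = (child.text or "").strip()
    return info


if __name__ == "__main__":
    print(speaker_metadata(PERSON_XML))
```

With metadata in this form, texts can be grouped by age, sex or dialect before any linguistic query is run.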

  5. Examples of linguistic annotation (‘tagging’, ‘parsing’, etc.) Word-class tagging in the BNC: apparently we eat more chocolate than any other country. <w c5="AV0" hw="apparently" pos="ADV">apparently </w> <w c5="PNP" hw="we" pos="PRON">we </w> <w c5="VVB" hw="eat" pos="VERB">eat </w> <w c5="DT0" hw="more" pos="ADJ">more </w> <w c5="NN1" hw="chocolate" pos="SUBST">chocolate </w> <w c5="CJS" hw="than" pos="CONJ">than </w> <w c5="DT0" hw="any" pos="ADJ">any </w> <w c5="AJ0" hw="other" pos="ADJ">other </w> <w c5="NN1" hw="country" pos="SUBST">country</w> <c c5="PUN">.</c>
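To see what the tags make available, the annotated sentence can be read into (word, C5 tag, headword, simplified POS) tuples. This is a sketch under one stated assumption: the fragment is wrapped in an `<s>` element here only so that it parses as a single XML document; the function name `read_tokens` is hypothetical.

```python
import xml.etree.ElementTree as ET

# The tagged sentence from the slide above.
TAGGED = (
    '<w c5="AV0" hw="apparently" pos="ADV">apparently </w>'
    '<w c5="PNP" hw="we" pos="PRON">we </w>'
    '<w c5="VVB" hw="eat" pos="VERB">eat </w>'
    '<w c5="DT0" hw="more" pos="ADJ">more </w>'
    '<w c5="NN1" hw="chocolate" pos="SUBST">chocolate </w>'
    '<w c5="CJS" hw="than" pos="CONJ">than </w>'
    '<w c5="DT0" hw="any" pos="ADJ">any </w>'
    '<w c5="AJ0" hw="other" pos="ADJ">other </w>'
    '<w c5="NN1" hw="country" pos="SUBST">country</w>'
    '<c c5="PUN">.</c>'
)


def read_tokens(fragment):
    """Return one (form, c5, headword, pos) tuple per <w> or <c> element."""
    # Assumption: wrap the fragment in a root element so it is well-formed.
    root = ET.fromstring("<s>" + fragment + "</s>")
    return [
        ((el.text or "").strip(), el.get("c5"), el.get("hw"), el.get("pos"))
        for el in root
    ]


if __name__ == "__main__":
    for token in read_tokens(TAGGED):
        print(token)
```

Punctuation (`<c>`) carries only a `c5` attribute in this sample, so its headword and POS come back as `None`.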

  6. Examples of linguistic annotation (‘tagging’, ‘parsing’, etc.) Syntactic/grammatical/formal tagging in ICE-GB:

  7. Examples of linguistic annotation (‘tagging’, ‘parsing’, etc.) Syntactic/grammatical/formal tagging in ICE-GB:

  8. Why is mark-up essential in corpus building? • ‘Contextualization’ of texts in a corpus: “Contextual information is needed to restore the context and to enable us to relate the specimen [i.e. the text] to its original habitat [i.e. its context]” • ‘Enrichment’ of raw data with textual information (e.g. sentence boundaries) and extra-textual information (source, author, age, sex): “Mark-up adds value to a corpus and allows for a broader range of research questions to be addressed as a result.” • ‘Editing’ and ‘transcribing’: omissions (graphics, tables), foreign words, turn-taking, interruptions, overlaps, laughter

  9. What is TEI? • Mark-up schemes: COCOA, DCMI, OLAC, IMDI, CES, TEI (Text Encoding Initiative) • Aim of TEI: “to facilitate data exchange by standardizing the mark-up or encoding of information stored in electronic form.” • Example of use of TEI mark-up system: BNC BNC Header

  10. Linguistic annotation (tagging, parsing, etc.) • Linguistic information encoded within the corpus itself. • Like corpus mark-up, annotation adds value to a corpus: “Annotation is a crucial contribution to the benefit a corpus brings, since it enriches the corpus as a source of linguistic information for future research and development” (Leech, 1997, p.2) • As opposed to mark-up (which is ‘objective’), annotation is ‘interpretive’, i.e. implies a previous linguistic analysis or interpretation of text

  11. ADVANTAGES of (linguistic) annotation • Annotation facilitates the extraction of information from a corpus, e.g.: left, light, play; N, V; OD, OI, PP, Rel Cl • Speed of data extraction • Reliability • Reusability • Multifunctionality • Explicitness • Reference resource

  12. ‘left’ in the BNC

  13. DISADVANTAGES of (linguistic) annotation • Annotation ‘clutters’ corpora: “However much annotation is added to a text, it is important for the researcher to be able to see the plain text, uncluttered by annotational labels. The basic patterning of the words alone must be observable at all times.” (Hunston, 2002:94) <w c5="AV0" hw="apparently" pos="ADV">apparently </w> <w c5="PNP" hw="we" pos="PRON">we </w> <w c5="VVB" hw="eat" pos="VERB">eat </w> <w c5="DT0" hw="more" pos="ADJ">more </w> <w c5="NN1" hw="chocolate" pos="SUBST">chocolate </w> <w c5="CJS" hw="than" pos="CONJ">than </w> <w c5="DT0" hw="any" pos="ADJ">any </w> <w c5="AJ0" hw="other" pos="ADJ">other </w> <w c5="NN1" hw="country" pos="SUBST">country</w> <c c5="PUN">.</c> …apparently we eat more chocolate than any other country.
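The clutter problem is usually handled in software: the annotation stays in the file, and a viewer hides it on request. The sketch below recovers the plain sentence from the annotated one and gives a rough measure of how much longer the annotated form is. The `<s>` wrapper and the names `plain_text` and `markup_overhead` are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# The annotated sentence from the slide, without whitespace between elements.
ANNOTATED = (
    '<w c5="AV0" hw="apparently" pos="ADV">apparently </w>'
    '<w c5="PNP" hw="we" pos="PRON">we </w>'
    '<w c5="VVB" hw="eat" pos="VERB">eat </w>'
    '<w c5="DT0" hw="more" pos="ADJ">more </w>'
    '<w c5="NN1" hw="chocolate" pos="SUBST">chocolate </w>'
    '<w c5="CJS" hw="than" pos="CONJ">than </w>'
    '<w c5="DT0" hw="any" pos="ADJ">any </w>'
    '<w c5="AJ0" hw="other" pos="ADJ">other </w>'
    '<w c5="NN1" hw="country" pos="SUBST">country</w>'
    '<c c5="PUN">.</c>'
)


def plain_text(fragment):
    """Drop every tag and keep only the running text."""
    root = ET.fromstring("<s>" + fragment + "</s>")  # wrapper is an assumption
    return "".join(root.itertext())


def markup_overhead(fragment):
    """Ratio of annotated length to plain-text length, in characters."""
    return len(fragment) / len(plain_text(fragment))


if __name__ == "__main__":
    print(plain_text(ANNOTATED))
    print(round(markup_overhead(ANNOTATED), 1))
```

Since the plain text can always be regenerated this way, the clutter objection applies to how a corpus is displayed rather than to how it is stored.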

  14. DISADVANTAGES of (linguistic) annotation? • Annotation imposes a linguistic analysis upon a corpus user: “Annotation should serve the needs of the corpus user, not determine the direction the investigation must take” (Hunston) • E.g. the Indirect Object (OI) in ICE: We gave THEM [OI] some food vs We gave some food TO THEM [A] • ‘Dimonotransitive’ (dimontr): Dimonotransitive verbs are complemented by an Indirect Object only. They include show, ask, assure, grant, inform, promise, reassure, and tell. • When I asked her, she burst into tears: V(dimontr,past) • I’ll tell you tomorrow: V(dimontr,infin) • Show me: V(dimontr,imp)

  15. DISADVANTAGES of (linguistic) annotation? • (Un)Reliability of annotation: accuracy / consistency

  16. The press swung heavily to the left • Center for Sprogteknologi (University of Copenhagen) (http://cst.dk/online/pos_tagger/uk/index.html) the/DT press/NN swung/VBD heavily/RB to/TO the/DT left/VBN • CLAWS tagger (http://ucrel.lancs.ac.uk/claws/trial.html) The_AT0 press_NN1 swung_VVD heavily_AV0 to_PRP the_AT0 left_AJ0 • Stanford parser (http://nlp.stanford.edu:8080/parser/) The/DT press/NN swung/VBD heavily/RB to/TO the/DT left/NN
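The three outputs above can be lined up programmatically to locate where the taggers disagree. The sketch below parses both the `word/TAG` and `word_TAG` notations. Note that CLAWS uses the C5 tagset while the other two outputs use Penn Treebank-style tags, so only those two are compared tag for tag. The dictionary keys and function names are illustrative.

```python
import re

# Tagger outputs copied from the slide above.
TAGGER_OUTPUT = {
    "CST": "the/DT press/NN swung/VBD heavily/RB to/TO the/DT left/VBN",
    "CLAWS": "The_AT0 press_NN1 swung_VVD heavily_AV0 to_PRP the_AT0 left_AJ0",
    "Stanford": "The/DT press/NN swung/VBD heavily/RB to/TO the/DT left/NN",
}

# A token is a word, then "/" or "_", then the tag.
TOKEN = re.compile(r"^(.+)[/_]([^/_]+)$")


def parse_tagged(line):
    """Split a tagged line into (lower-cased word, tag) pairs."""
    pairs = []
    for token in line.split():
        match = TOKEN.match(token)
        if match is None:
            raise ValueError("not a tagged token: " + token)
        pairs.append((match.group(1).lower(), match.group(2)))
    return pairs


def tags_at(outputs, index):
    """Tag assigned by each tagger to the token at the given position."""
    return {name: parse_tagged(line)[index][1] for name, line in outputs.items()}


def disagreements(outputs, taggers):
    """Words on which the named taggers (same tagset) assign different tags."""
    parsed = [parse_tagged(outputs[name]) for name in taggers]
    result = []
    for column in zip(*parsed):
        tags = {tag for _, tag in column}
        if len(tags) > 1:
            result.append((column[0][0], sorted(tags)))
    return result


if __name__ == "__main__":
    print(tags_at(TAGGER_OUTPUT, -1))
    print(disagreements(TAGGER_OUTPUT, ["CST", "Stanford"]))
```

For the final word, the three systems give a past participle (VBN), an adjective (AJ0) and a noun (NN); in this sentence 'left' is a noun, which is the accuracy problem the slide illustrates.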

  17. Methods of corpus annotation • AUTOMATIC • COMPUTER-ASSISTED (‘SEMI-AUTOMATIC’) • MANUAL

  18. Types of corpus annotation • Phonological: syllable boundaries, prosodic features (stress, tone, pitch) • Morphological: prefixes, suffixes, stems • Lexico-grammatical (‘tagging’): part of speech (N, V), grammatical features (Sing, Pl, Past), lemma • Syntactic (‘parsing’): phrases, clauses, syntactic functions • Semantic: semantic field • Textual-Discoursal: anaphoric relations, theme/rheme, given/new information • Pragmatic: speech acts • Stylistic • Etc.

  19. Types of corpus annotation • Tagging (POS tags): Annotation at UCREL; CLAWS (the tagger used for the BNC, TIME, BYU American Corpus, etc.); Tagging in BNC • Parsing: Annotation in the ICE-GB

  20. ‘What are the main uses of tagging?’ • Disambiguation and comparison of distribution/frequency/collocations of homographs: eg: left, light, play, deal • Distribution/Frequency of Word-Classes • Collocation of items with Word-classes (rather than with other individual items). • Sequences of word-classes • Etc.
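Several of these uses reduce to simple counts over (word, tag) pairs. The sketch below, using the BNC example sentence with its C5 tags as toy data, counts word-class frequencies and word-class sequences and retrieves all words carrying a given tag. The function names are hypothetical, and a real study would run the same counts over a whole corpus rather than one sentence.

```python
from collections import Counter

# (word, C5 tag) pairs from the BNC example sentence in slide 5.
TAGGED = [
    ("apparently", "AV0"), ("we", "PNP"), ("eat", "VVB"), ("more", "DT0"),
    ("chocolate", "NN1"), ("than", "CJS"), ("any", "DT0"), ("other", "AJ0"),
    ("country", "NN1"), (".", "PUN"),
]


def tag_frequencies(tokens):
    """Distribution of word classes."""
    return Counter(tag for _, tag in tokens)


def tag_bigrams(tokens):
    """Frequencies of two-tag sequences of word classes."""
    tags = [tag for _, tag in tokens]
    return Counter(zip(tags, tags[1:]))


def words_with_tag(tokens, wanted):
    """All word forms carrying a given tag, in text order."""
    return [word for word, tag in tokens if tag == wanted]


if __name__ == "__main__":
    print(tag_frequencies(TAGGED).most_common(3))
    print(tag_bigrams(TAGGED)[("DT0", "NN1")])
    print(words_with_tag(TAGGED, "NN1"))
```

Filtering on a (word, tag) pair instead of the tag alone would separate homographs such as 'left' as a noun from 'left' as a verb form.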
