
LELA 30922 Lecture 5



Presentation Transcript


  1. LELA 30922 Lecture 5 Corpus annotation and SGML See esp. R Garside, G Leech & A McEnery (eds) Corpus Annotation, London (1997) Longman, ch. 1 “Introduction” by G Leech; something similar available at http://llc.oxfordjournals.org/cgi/reprint/8/4/275.pdf CM Sperberg-McQueen and L Burnard (eds) Guidelines for Electronic Text Encoding and Interchange, ch. 2 “A Gentle Introduction to SGML”, available at http://www-sul.stanford.edu/tools/tutorials/html2.0/gentle.html

  2. Annotation • The difference between a corpus and a “mere collection of texts” lies mainly in the value added by annotation • This includes generic information about the text, usually stored in a “header” • But, more significantly, annotations within the text itself

  3. Why annotate? • Adds information • Reflects some analysis of text • Inasmuch as this may reflect commitment to some theoretical approach, this can be a barrier sometimes (but see later) • Increases usefulness/reusability of text • Multi-functionality • May make corpus usable for something not originally foreseen by its compilers

  4. Golden rules of annotation • Recoverability • It should always be possible to ignore the annotation and reconstruct the corpus in its raw form • Extricability • Correspondingly, annotations should be easily accessible so they can be stored separately if necessary (“Before and after” versions) • Transparency: documentation • Purpose and meaning of annotations • How (eg manually or automatically), where and by whom annotations were done • If automatic, information about the programs used • Quality indication • Annotations almost inevitably include some errors or inconsistencies • To what extent have annotations been checked? • What is the measured accuracy rate, and against what benchmark?
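The recoverability and extricability rules can be sketched in code. Below is a minimal Python illustration, assuming a simple inline tag scheme like the examples later in the lecture; the function names `strip_annotation` and `extract_tags` are hypothetical, not part of any standard tool:

```python
import re

def strip_annotation(annotated: str) -> str:
    """Recoverability: drop every <...> tag to reconstruct the raw text.
    Assumes no literal '<' occurs in the text itself."""
    return re.sub(r"<[^>]+>", "", annotated)

def extract_tags(annotated: str) -> list[str]:
    """Extricability: pull the annotations out so they can be stored
    separately from the text (a "before and after" pair)."""
    return re.findall(r"<[^>]+>", annotated)

sample = "<w NN2>men</w> <w VVD>retained</w>"
print(strip_annotation(sample))  # -> men retained
print(extract_tags(sample))      # -> ['<w NN2>', '</w>', '<w VVD>', '</w>']
```

Keeping the two views convertible in both directions is exactly what makes the annotated corpus reusable by researchers who want only the raw text.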

  5. Theory-neutrality • Schools of thought • Annotations may reflect a particular theoretical approach, and this should be acknowledged • Consensus • corpus annotations which are more (rather than less) theory-neutral will be more widely used • given the amount of work involved, it pays to be aware of the descriptive traditions of the relevant field • Standards • There are very few absolute standards, but some schemes can become de facto standards through widespread use • For example, BNC designers were aware of the likely side effects of any decisions (regarding annotation) that they took

  6. Types of annotation • Plain corpus: the text in its raw state, as plain text with no mark-up • Corpus marked up for formatting attributes e.g. page breaks, paragraphs, font sizes • Corpus annotated with identifying information, such as title, author, genre, register, edition date • Corpus annotated with linguistic information • Corpus annotated with additional interpretive information, eg error analysis in a learner corpus

  7. Levels of linguistic annotation • Paragraph and sentence-boundary disambiguation • The naive full stop + space + capital heuristic is unreliable for real texts • May also involve distinguishing titles/headings from running text • Tokenization: identification of lexical units • multi-word units, cliticised words (eg can’t) • Lemmatisation: identification of lemmas (or lexemes) • Makes variants of a lexeme accessible for more generic searches • May involve some disambiguation (eg rose)
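The unreliability of the naive sentence-boundary rule is easy to demonstrate. A sketch in Python (the name `naive_split` is invented for illustration); abbreviations such as “Dr.” defeat the full stop + space + capital heuristic:

```python
import re

def naive_split(text: str) -> list[str]:
    # Split wherever a full stop is followed by whitespace and a capital.
    return re.split(r"(?<=\.)\s+(?=[A-Z])", text)

# Two sentences, but the abbreviation "Dr." triggers a spurious split:
print(naive_split("Dr. Smith arrived. He sat down."))
# -> ['Dr.', 'Smith arrived.', 'He sat down.']
```

Real sentence splitters therefore need abbreviation lists or trained models, which is why boundary disambiguation counts as annotation rather than trivial preprocessing.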

  8. Levels of linguistic annotation • POS tagging (grammatical tagging) • assigning to each lexical unit a code indicating its part of speech • most basic type of linguistic corpus annotation and forms an essential foundation for further forms of analysis • Parsing (treebanking) • Identification of syntactic relationships between words • Semantic tagging • Marking of word senses (sense resolution) • Marking of semantic relationships eg agent, patient • Marking with semantic categories eg human, animate
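A toy illustration of POS tagging in Python, using a tiny lexicon plus suffix heuristics and C5-style tags like those in the BNC examples later in the lecture. Everything here (`toy_tag`, the lexicon, the heuristics) is a hypothetical sketch; real taggers use trained probabilistic models:

```python
def toy_tag(tokens: list[str]) -> list[tuple[str, str]]:
    """Assign a rough C5-style tag to each token: tiny lexicon first,
    then suffix heuristics, defaulting to singular noun (NN1)."""
    lexicon = {"the": "AT0", "a": "AT0", "and": "CJC", "is": "VBZ"}
    tagged = []
    for tok in tokens:
        low = tok.lower()
        if low in lexicon:
            tag = lexicon[low]
        elif low.endswith("ed"):
            tag = "VVD"  # past-tense lexical verb
        elif low.endswith("ly"):
            tag = "AV0"  # adverb
        elif low.endswith("s"):
            tag = "NN2"  # plural noun
        else:
            tag = "NN1"  # singular noun (default)
        tagged.append((tok, tag))
    return tagged

print(toy_tag(["the", "guests", "snored", "soundly"]))
# -> [('the', 'AT0'), ('guests', 'NN2'), ('snored', 'VVD'), ('soundly', 'AV0')]
```

Even this crude sketch shows why POS tagging is the foundation for further analysis: parsing and semantic tagging both start from the word-class decisions made here.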

  9. Levels of linguistic annotation • Discourse annotation • especially for transcribed speech • Identifying discourse function of text eg apology, greeting • or other pragmatic aspects, eg politeness level • Anaphoric annotation • Identification of pronoun reference • and other anaphoric links (eg different references to the same entity) • Phonetic transcription (only in spoken language corpora) • Indication of details of pronunciation not otherwise reflected in transcription eg weak forms • Explicit indication of accent/dialect features eg vowel qualities, allophonic variation • Prosodic annotation (only in spoken language corpora) • Suprasegmental information, eg stress, intonation, rhythm

  10. Some examples PROSODIC ANNOTATION, LONDON-LUND CORPUS: well ^very nice of you to ((come and)) _spare the !t\/ime and # ^come and !t\alk # - ^tell me a’bout the - !pr\oblems# And ^incidentally# . ^I [@:] ^do ^do t\ell me# ^anything you ‘want about the :college in ”!g\eneral Source: Leech chapter in Garside et al. 1997

  11. EXAMPLE OF PART-OF-SPEECH TAGGING, LOB CORPUS: hospitality_NN is_BEZ an_AT excellent_JJ virtue_NN ,_, but_CC not_XNOT when_WRB the_ATI guests_NNS have_HV to_TO sleep_VB in_IN rows_NNS in_IN the_ATI cellar_NN !_! the_ATI lovers_NNS ,_, whose_WP$ chief_JJB scene_NN was_BEDZ cut_VBN at_IN the_ATI last_AP moment_NN ,_, had_HVD comparatively_RB little_AP to_TO sing_VB '_' he_PP3A stole_VBD my_PP$ wallet_NN !_! '_' roared_VBD Rollinson_NP ._. EXAMPLE OF SKELETON PARSING, FROM THE SPOKEN ENGLISH CORPUS: [S[N Nemo_NP1 ,_, [N the_AT killer_NN1 whale_NN1 N] ,_, [Fr[N who_PNQS N][V 'd_VHD grown_VVN [J too_RG big_JJ [P for_IF [N his_APP$ pool_NN1 [P on_II [N Clacton_NP1 Pier_NNL1 N]P]N]P]J]V]Fr]N] ,_, [V has_VHZ arrived_VVN safely_RR [P at_II [N his_APP$ new_JJ home_NN1 [P in_II [N Windsor_NP1 [ safari_NN1 park_NNL1 ]N]P]N]P]V] ._. S] Source: http://ucrel.lancs.ac.uk/annotation.html

  12. ANAPHORIC ANNOTATION OF AP NEWSWIRE S.1 The state Supreme Court has refused to release Rahway State Prison inmate James Scott on bail. S.2 The fighter is serving 30-40 years for a 1975 armed robbery conviction. S.3 Scott had asked for freedom while he waits for an appeal decision. S.4 Meanwhile, his promoter, Murad Muhammed, said Wednesday he netted only $15,250 for Scott's nationally televised light heavyweight fight against ranking contender Yaqui Lopez last Saturday. S.5 The fight, in which Scott won a unanimous decision over Lopez, grossed $135,000 for Muhammed's firm, Triangle Productions of Newark, he said. S.1 (0) The state Supreme Court has refused to release {1 [2 Rahway State Prison 2] inmate 1}} (1 James Scott 1) on bail . S.2 (1 The fighter 1) is serving 30-40 years for a 1975 armed robbery conviction . S.3 (1 Scott 1) had asked for freedom while <1 he waits for an appeal decision . S.4 Meanwhile , [3 <1 his promoter 3] , {{3 Murad Muhammed 3} , said Wednesday <3 he netted only $15,250 for (4 [1 Scott 1] 's nationally televised light heavyweight fight against {5 ranking contender 5}} (5 Yaqui Lopez 5) last Saturday 4) . S.5 (4 The fight , in which [1 Scott 1] won a unanimous decision over (5 Lopez 5) 4) , grossed $135,000 for [6 [3 Muhammed 3] 's firm 6], {{6 Triangle Productions of Newark 6} , <3 he said . Source: http://ucrel.lancs.ac.uk/annotation.html

  13. SGML • Although none of the examples just shown use it, for all but the simplest of mark-up schemes, SGML is widely recommended and used • SGML = Standard Generalized Markup Language • Actually suitable for all sorts of things, including web pages (HTML is SGML-conformant)

  14. What is a mark-up language? • Mark-up historically referred to printer’s marks on a manuscript to indicate typesetting requirements • Now covers all sorts of codes inserted into electronic texts to govern formatting, printing, or to carry other information • Mark-up, or (synonymously) encoding, is defined as any means of making explicit an interpretation of a text • By “mark-up language” we mean a set of mark-up conventions used together for encoding texts. A mark-up language must specify • what mark-up is allowed • what mark-up is required • how mark-up is to be distinguished from text • what the mark-up means • SGML provides the means for doing the first three • Separate documentation/software is required for the last • eg (1) the difference between identifying something as <emph> and how that appears in print; (2) why something may or may not be tagged as a “relative clause”

  15. Rules of SGML • SGML allows us to define • Elements • Specific features of elements • Hierarchical/structural relations between elements • These are specified in a “document type definition” (DTD) • The DTD allows software to be written to • Help annotators annotate consistently • Explore marked-up documents

  16. Elements in SGML • Have a (unique) name • Semantics of name are application dependent • up to designer to choose appropriate name, but nothing automatically follows from the choice of any particular name • Each element must be explicitly marked or tagged in some way • Most usual is with <element> and </element> pairs, called start- and end-tags • Much SGML-compliant software seems to allow start-only tags • &element; (esp. useful for single words or characters) • _tag suffix

  17. Attributes • Elements can have named attributes with associated values • When defined, values can be identified as • #REQUIRED: must be specified • #IMPLIED: optional • #CURRENT: inferred to be the same as the last specified value for that attribute • Values can be from a predefined list, or can be of a general type (string, integer, etc)

  18. DTD (Document type definition) • Helps to impose uniformity over the corpus • Defines the (expected or to-be-imposed) structure of the document • For each element, defines • How it appears (whether end tags are required) • What its substructure is, ie what elements, how many of them, whether compulsory or not

  19. Example of DTD <!ELEMENT anthology - - (poem+)> <!ELEMENT poem - - (title?, stanza+ | couplet+)> <!ELEMENT title - O (#PCDATA) > <!ELEMENT stanza - O (line+) > <!ELEMENT couplet - O (cline, cline) > <!ELEMENT (line | cline) O O (#PCDATA) > • Start and end tags necessary (-) or optional (O) • Anthology consists of 1 or more poems • Poem has an optional title, then 1 or more stanzas or 1 or more couplets • Title consists of “parsed character data”, ie normal text • Stanza has one or more lines, couplet has two lines • Both lines and clines have the same definition: normal text
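Content models like these can be checked mechanically, which is how SGML software helps annotators stay consistent. A hypothetical Python sketch that encodes each model as a regular expression over child-element names (`MODELS` and `valid` are illustrative, not part of any SGML tool):

```python
import re

# The DTD's content models, re-expressed as regexes over child-element
# names (an illustrative encoding: children are space-terminated).
MODELS = {
    "anthology": r"(poem )+",
    "poem": r"(title )?((stanza )+|(couplet )+)",
    "stanza": r"(line )+",
    "couplet": r"cline cline ",
}

def valid(element: str, children: list[str]) -> bool:
    """Check that a sequence of child elements satisfies the model."""
    seq = "".join(c + " " for c in children)
    return re.fullmatch(MODELS[element], seq) is not None

print(valid("poem", ["title", "stanza", "stanza"]))  # -> True
print(valid("poem", ["stanza", "couplet"]))          # -> False (mixed models)
```

The regex analogy is apt because SGML content models are themselves regular expressions over element names: `?`, `+`, `,` and `|` in the DTD correspond directly to optionality, repetition, sequence and alternation.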

  20. Attributes <!ATTLIST poem id ID #IMPLIED status (draft | revised | published) draft > • DTD defines the attributes expected/required for each element • A poem has an id and a status • Value of id is any identifier, and is optional • Status is one of three values, default draft

  21. <anthology> <poem id=12 status=revised> <title>It’s a grand old team</title> <stanza> <line>It’s a grand old team to play for <line>It’s a grand old team to support <line>And if you know your history <line>It’s enough to make your heart go Whoooooah </stanza> </poem> <poem id=13> ... </poem> </anthology>

  22. Mark-up exemplified RAW TEXT: Two men retained their marbles, and as luck would have it they're both roughie-toughie types as well as military scientists - a cross between Albert Einstein and Action Man! TOKENIZED TEXT: <w orth=CAP>Two</w> <w>men</w> <w>retained</w> <w>their</w> <w>marbles<c PUN>,</c> <w>and</w> <w>as</w> <w>luck</w> <w>would</w> <w>have</w> <w>it</w> <w>they</w><w>'re</w> <w>both</w> <w>roughie-toughie</w> <w>types</w> <w>as</w> <w>well</w> <w>as</w> <w>military</w> <w>scientists <c PUN>&mdash;</c></w> <w>a</w> <w>cross</w> <w>between</w> <w orth=CAP>Albert</w> <w orth=CAP>Einstein</w> <w>and</w> <w orth=CAP>Action</w> <w orth=CAP>Man<c PUN>!</c>
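Tokenized mark-up of this kind can be generated automatically. A hypothetical Python sketch in the spirit of the example above (`markup_tokens` is an invented name; unlike the BNC-style sample, it simply space-separates punctuation tokens and omits clitic handling for forms like they’re):

```python
import re

def markup_tokens(text: str) -> str:
    """Wrap word tokens in <w> and punctuation in <c PUN>, flagging
    word-initial capitals with orth=CAP."""
    out = []
    # Tokens: alphabetic runs (allowing internal hyphens) or single
    # punctuation characters.
    for tok in re.findall(r"[A-Za-z]+(?:-[A-Za-z]+)*|[^\sA-Za-z]", text):
        if tok[0].isalpha():
            attr = " orth=CAP" if tok[0].isupper() else ""
            out.append(f"<w{attr}>{tok}</w>")
        else:
            out.append(f"<c PUN>{tok}</c>")
    return " ".join(out)

print(markup_tokens("Two men, as luck would have it!"))
```

A real tokenizer would also have to split clitics (they’re into they + ’re, as in the example above) and map punctuation to entities such as &mdash;, which this sketch omits.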

  23. LEMMATIZED TEXT: <w orth=CAP>Two</w> <w lem=man>men</w> <w lem=retain>retained</w> <w>their</w> <w lem=marble>marbles<c PUN>,</c> <w>and</w> <w>as</w> <w>luck</w> <w>would</w> <w>have</w> <w>it</w> <w>they</w><w lem=be>'re</w> <w>both</w> <w>roughie-toughie</w> <w lem=type>types</w> <w>as</w> <w>well</w> <w>as</w> <w>military</w> <w lem=scientist>scientists</w> <c PUN>&mdash;</c> <w>a</w> <w>cross</w> <w>between</w> <w orth=CAP>Albert</w> <w orth=CAP>Einstein</w> <w>and</w> <w orth=CAP>Action</w> <w orth=CAP>Man</w><c PUN>!</c>

  24. POS TAGGED TEXT: <w orth=CAP CRD>Two</w> <w NN2 lem=man>men</w> <w VVD lem=retain>retained</w> <w DPS>their</w> <w NN2 lem=marble>marbles</w><c PUN>,</c> <w CJC>and</w> <w CJS>as</w> <w NN1-VVB>luck</w> <w VM0>would</w> <w VHI>have</w> <w PNP>it</w> <w PNP>they</w><w VBB lem=be>'re</w> <w AV0>both</w> <w AJ0>roughie-toughie</w> <w NN2>types</w> <w AV0>as</w> <w AV0>well</w> <w CJS>as</w> <w AJ0>military</w> <w NN2>scientists</w> <c PUN>&mdash</c> <w AT0>a</w> <w NN1>cross</w> <w PRP>between</w> <w NP0>Albert</w> <w NP0>Einstein</w> <w CJC>and</w> <w NN1>Action</w> <w NN1-NP0>Man<c PUN>!</c>

  25. POS TAGGED TEXT with idioms and named entities: <w orth=CAP CRD>Two</w> <w NN2 lem=man>men</w> <phrase type=idiom><w VVD lem=retain>retained</w> <w DPS>their</w> <w NN2 lem=marble>marbles</w></phrase><c PUN>,</c> <w CJC>and</w> <phrase type=idiom><w CJS>as</w> <w NN1-VVB>luck</w> <w VM0>would</w> <w VHI>have</w> <w PNP>it</w></phrase> <w PNP>they</w><w VBB lem=be>'re</w> <w AV0>both</w> <w AJ0>roughie-toughie</w> <w NN2>types</w> <phrase type=compound pos=CJS><w AV0>as</w> <w AV0>well</w> <w CJS>as</w></phrase> <phrase type=compound pos=NN2><w AJ0>military</w> <w NN2>scientists</w></phrase> <c PUN>&mdash</c> <w AT0>a</w> <w NN1>cross</w> <w PRP>between</w> <phrase type=compound pos=NP0><w NP0>Albert</w> <w NP0>Einstein</w></phrase> <w CJC>and</w> <phrase type=compound pos=NP0><w NN1>Action</w> <w NN1-NP0>Man</phrase><c PUN>!</c>
