
20th February, 2014

Session 4: Annotation http://tinyurl.com/669o4zt. 20th February, 2014. Corpus Linguistics 2014. More than the Text: Annotation - what, why, how? Ylva Berglund Prytz and Martin Wynne IT Services http://tinyurl.com/669o4zt. You can only find what is in the corpus.


Presentation Transcript


  1. Session 4: Annotation http://tinyurl.com/669o4zt 20th February, 2014

  2. Corpus Linguistics 2014 More than the Text: Annotation - what, why, how? Ylva Berglund Prytz and Martin Wynne IT Services http://tinyurl.com/669o4zt

  3. You can only find what is in the corpus, so unless something is a feature of the text, or someone has added it, you cannot easily find it.

  4. What can you find automatically? (If not, what do you need to annotate?) • All instances of ‘work’ • All instances of ‘WORK’ (lemma) • All instances of ‘work’ as a verb • All instances of ‘work’ in fiction • All instances of ‘work’ spoken by women • All instances of ‘work’ at end of clause • All instances of ‘work’ in jokes
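
As a rough illustration of the first three questions on this slide, the sketch below searches a tiny made-up sample in word_TAG format. The sample sentences, the tags (loosely Brown-style) and the LEMMAS lookup table are all assumptions for illustration, not part of the course material. Plain string matching needs no annotation, but finding 'work' as a verb needs part-of-speech tags, and finding the lemma WORK needs lemma information.

```python
# Hypothetical sample in word_TAG format; the tags loosely follow the Brown style.
tagged = (
    "I_PPSS work_VB here_RB ._. "
    "The_AT work_NN is_BEZ hard_JJ ._. "
    "She_PPS works_VBZ late_RB ._."
)

# Split each token on its LAST underscore into a (word, tag) pair.
tokens = [tuple(tok.rsplit("_", 1)) for tok in tagged.split()]

# 1. All instances of the form 'work': no annotation needed.
form_hits = [w for w, t in tokens if w.lower() == "work"]

# 2. All instances of 'work' as a verb: needs part-of-speech tags.
verb_hits = [w for w, t in tokens if w.lower() == "work" and t.startswith("VB")]

# 3. All instances of the lemma WORK: needs lemma information.
#    Here a hand-made lookup table stands in for real lemma annotation.
LEMMAS = {"work": "work", "works": "work", "worked": "work", "working": "work"}
lemma_hits = [w for w, t in tokens if LEMMAS.get(w.lower()) == "work"]

print(form_hits)
print(verb_hits)
print(lemma_hits)
```

The remaining questions (fiction, speaker sex, clause position, jokes) would need metadata or further annotation layers that this sample does not carry.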

  5. Annotation: what is it? The practice of adding interpretative linguistic information to a corpus. It can be useful to differentiate: • Metadata: information about the text • Structural markup: information about the text structure • Linguistic annotation: information about the linguistic categories identified in the text (But the terms are not always used in this way, or consistently at all...)

  6. Metadata in the British National Corpus An example text from the BNC: CA2.xml

  7. Types of linguistic annotation • Morphosyntactic / wordclass / part-of-speech (POS) • Syntactic (e.g. phrase, clause, mood...) • Semantic • Pragmatic • Discourse • Phonetic • Phonological • …

  8. Hands-on Exercise 'Borrow': Exercise 1.5 'More search features' Exploring further (optional): look at collocates of commit and deed again, this time including inflected forms

  9. Potential problems with annotation It can: • be incorrect • be inconsistent • follow the ‘wrong’ theory • have the 'wrong' level of granularity • use the 'wrong' tag-set • introduce subjective interpretations

  10. How do you annotate? • What are you marking up? (POS, lemma, clause?) • How are you annotating? (manually/automatically?) • With which tag-set? (CLAWS, Penn Treebank?) • Format of annotation? (HTML, XML, CHAT?) • Whose linguistic analysis? (mine, or a more established, standard and 'consensus-backed' way of doing it?) • How are you going to use the annotations for your analysis? • How are your annotations going to be shared with other researchers?

  11. Good practice in annotation • Annotations should be separable • Detailed and explicit documentation should be provided • Annotation practices should be linguistically consensual • Annotation should observe standards (Leech 2005)

  12. Example: non-standard A01 0010 The Fulton County Grand Jury said Friday an investigation A01 0020 of Atlanta's recent primary election produced "no evidence" that A01 0030 any irregularities took place. The jury further said in term-end A01 0040 presentments that the City Executive Committee, which had over-all A01 0050 charge of the election, "deserves the praise and thanks of the A01 0060 City of Atlanta" for the manner in which the election was conducted.

  13. Example: non-standard |SA01:1 the_AT Fulton_NP County_NN Grand_JJ Jury_NN said_VBD Friday_NR an_AT investigation_NN of_IN Atlanta's_NP$ recent_JJ primary_NN election_NN produced_VBD no_AT evidence_NN that_CS any_DTI irregularities_NNS took_VBD place_NN ._. |SA01:2 the_AT jury_NN further_RBR said_VBD in_IN term-end_NN presentments_NNS that_CS the_AT City_NN Executive_JJ Committee_NN ,_, which_WDT had_HVD over-all_JJ charge_NN of_IN the_AT election_NN ,_, deserves_VBZ the_AT praise_NN and_CC thanks_NNS of_IN the_AT City_NN of_IN Atlanta_NP for_IN the_AT manner_NN in_IN which_WDT the_AT election_NN was_BEDZ conducted_VBN ._.
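
A format like the one on this slide can still be processed, but only with a purpose-written reader. The following is a minimal sketch, assuming that each sentence starts with a reference code and that the last underscore in each token separates the word from its tag; the function name parse_tagged_sentence is hypothetical.

```python
def parse_tagged_sentence(line):
    """Split one sentence in word_TAG format into its reference and (word, tag) pairs."""
    ref, *items = line.split()
    # rsplit on the LAST underscore, so tokens such as '._.' or 'Atlanta's_NP$' work.
    pairs = [tuple(item.rsplit("_", 1)) for item in items]
    return ref, pairs


# The start of the first sentence on the slide.
line = "|SA01:1 the_AT Fulton_NP County_NN Grand_JJ Jury_NN said_VBD Friday_NR"
ref, pairs = parse_tagged_sentence(line)

# Example query: all words tagged NN in this excerpt.
nouns = [word for word, tag in pairs if tag == "NN"]

print(ref)
print(pairs)
print(nouns)
```

Every project that uses its own format needs its own reader of this kind, which is one practical argument for the standard formats shown on the following slides.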

  14. Example: standard? <text> <file id=A01> <p> <s c="0000003 002" n=00001> <w AT>The <w NP1>Fulton <w NN1>County <w JJ>Grand <w NN1>Jury <w VVD>said <w NPD1>Friday <w AT1>an <w NN1>investigation <w IO>of <w NP1>Atlanta<w GE>'s <w JJ>recent <w JJ>primary <w NN1>election <w VVD>produced <quote> <w AT>no <w NN1>evidence </quote> <w CST>that <w DD>any <w NN2>irregularities <w VVD>took <w NN1>place<c YSTP>.

  15. Example: standard <body> <pb n="85"/> <div1 n="2" type="u"> <head><s n="1"><w type="CRD" lemma="1937">1937</w></s> </head> <pb n="87"/> <div2 n="1" type="u"> <p><s n="2"><w type="NP0" lemma="joe">Joe </w><w type="CJC" lemma="and">and </w><w type="NP0" lemma="harry">Harry </w><w type="VVD" lemma="stand">stood </w><w type="PRP" lemma="on">on </w><w type="AT0" lemma="the">the </w><w type="NN1" lemma="platform">platform </w><w type="NN1" lemma="side">side </w><w type="PRP" lemma="by">by </w><w type="NN1" lemma="side">side</w><c type="PUN">.</c></s>
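
Because this example is well-formed XML, generic software can read it without a custom parser. The sketch below uses Python's standard xml.etree.ElementTree module on the sentence from the slide, cut down to the <s> element so that the fragment is complete (the slide's excerpt leaves the enclosing elements unclosed). The variable names are illustrative.

```python
import xml.etree.ElementTree as ET

# The <s> element from the slide, on its own so that it is well-formed.
SENTENCE = (
    '<s n="2">'
    '<w type="NP0" lemma="joe">Joe </w>'
    '<w type="CJC" lemma="and">and </w>'
    '<w type="NP0" lemma="harry">Harry </w>'
    '<w type="VVD" lemma="stand">stood </w>'
    '<w type="PRP" lemma="on">on </w>'
    '<w type="AT0" lemma="the">the </w>'
    '<w type="NN1" lemma="platform">platform </w>'
    '<w type="NN1" lemma="side">side </w>'
    '<w type="PRP" lemma="by">by </w>'
    '<w type="NN1" lemma="side">side</w>'
    '<c type="PUN">.</c>'
    '</s>'
)

s = ET.fromstring(SENTENCE)

# One (form, part-of-speech, lemma) triple per <w> element.
rows = [(w.text.strip(), w.get("type"), w.get("lemma")) for w in s.iter("w")]

# Lemma search: every form annotated with the lemma 'stand'.
stand_forms = [form for form, pos, lemma in rows if lemma == "stand"]

# The annotation is separable: dropping the tags gives back the plain text.
plain_text = "".join(s.itertext())

print(stand_forms)
print(plain_text)
```

The last step shows the 'separable' principle from the good-practice slide in action: the original text can be recovered from the annotated version.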

  16. Annotation standards? Use of standards can help to ensure successful: • interpretation, • interchange, • preservation, • incorporation into other resources, • processing by generic software. It is also a way of resolving tricky encoding decisions, of justifying and documenting your decisions, and of not reinventing the wheel every time.

  17. Some online taggers Example sentence: “John was very offended by her remarks” • Free CLAWS WWW trial service (http://ucrel.lancs.ac.uk/claws/trial.html) C5: John_NP0 was_VBD very_AV0 offended_AJ0 by_PRP her_DPS remarks_NN2 ._. C7: John_NP1 was_VBDZ very_RG offended_JJ by_II her_APPGE remarks_NN2 ._. • Cognitive Computation Group, University of Illinois at Urbana-Champaign (http://l2r.cs.uiuc.edu/~cogcomp/eoh/posdemo.html) (NNP John) (VBD was) (RB very) (VBN offended) (IN by) (PP$ her) (NNS remarks) (. .) • CST's Part-Of-Speech tagger (http://www.cst.dk/online/pos_tagger/uk/) John/NNP was/VBD very/RB offended/VBN by/IN her/PRP$ remarks/NNS ./. • Infogistics tTAG (http://www.infogistics.com/posdemo.htm) ([ John_NNP ]) <: was_VBD :> very_RB offended_VBN by_IN ([ her_PRP$ remarks_NNS ])._.
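
The outputs on this slide differ both in format (underscore versus slash) and in analysis. As a hedged sketch, the code below reads the CLAWS C5 line and the CST line from the slide into the same (word, tag) structure and puts the tags side by side; the helper name split_tagged is hypothetical.

```python
def split_tagged(text, sep):
    """Turn 'word<sep>TAG' tokens into (word, tag) pairs, splitting on the last separator."""
    return [tuple(tok.rsplit(sep, 1)) for tok in text.split()]


# Tagger outputs copied from the slide.
c5 = split_tagged("John_NP0 was_VBD very_AV0 offended_AJ0 by_PRP her_DPS remarks_NN2 ._.", "_")
cst = split_tagged("John/NNP was/VBD very/RB offended/VBN by/IN her/PRP$ remarks/NNS ./.", "/")

# Both taggers tokenise this sentence in the same way, so the pairs can be aligned with zip.
side_by_side = [(w, t1, t2) for (w, t1), (_, t2) in zip(c5, cst)]

for word, c5_tag, cst_tag in side_by_side:
    print(f"{word:10} {c5_tag:6} {cst_tag}")
```

Once aligned, the difference in analysis is easy to see: in these outputs 'offended' is tagged AJ0 (adjective) in C5 but VBN (past participle) by the CST tagger.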

  18. Next week: Creating a corpus Register via IT Services webpage: http://courses.it.ox.ac.uk/detail/OTA6 Reading tip: Developing Linguistic Corpora: a Guide to Good Practice http://www.ahds.ac.uk/litlangling/creating/guides/linguistic-corpora/
