Warm-up exercise. “Are we going to work together on this?” Assign word class /part-of-speech to each of the words in the sentence above. We will compare and discuss later. Corpus Linguistics (4). More than the text: Annotation - what, why, how? Ylva Berglund Prytz and Martin Wynne OUCS
Warm-up exercise “Are we going to work together on this?” Assign word class /part-of-speech to each of the words in the sentence above. We will compare and discuss later...
Corpus Linguistics (4) More than the text: Annotation - what, why, how? Ylva Berglund Prytz and Martin Wynne OUCS http://tinyurl.com/669o4zt
You can only find what is in the corpus so unless someone has included it you cannot (easily) find it.
What can you find automatically? (If not, what do you need to annotate?) • All instances of ‘work’ • All instances of ‘WORK’ (lemma) • All instances of ‘work’ as a verb • All instances of ‘work’ in fiction • All instances of ‘work’ spoken by women • All instances of ‘work’ at end of clause • All instances of ‘work’ in jokes
Annotation: what? 1. The practice of adding interpretative linguistic information to a corpus. 2. The result of (1). It is useful to differentiate: • Metadata - information about the text • Structural markup - information about the text structure • Linguistic annotation - information about the linguistic categories identified in the text (but the terms are not always used in this way, or consistently at all...)
Linguistic annotation: what? • Morphosyntactic / wordclass / part-of-speech (POS) • Lexical • Syntactic • Semantic • Pragmatic • Discourse • Phonetic • Phonological • …
Warm-up exercise “Are we going to work together on this?” Assign word class /part-of-speech to each of the words in the sentence above. Search BNC
Annotation – What? Problems with annotation. It can: • be incorrect • be inconsistent • follow the ‘wrong’ theory • have the wrong level of granularity • use the ‘wrong’ tag-set • introduce subjective interpretations
Annotation: less than the text? “Annotation of a text is a procedure which loses information. There is no point in arguing that the information is in the computer's memory somewhere - annotation is the substitution of a general category for a specific item, and with respect to that area of the classification, the item has lost its uniqueness.” (John Sinclair, personal communication, 2001)
Annotation - Why? Benefits of annotation • It enables certain types of search and analysis, especially beyond the word form (e.g. “search for all inflected forms of cause as a verb”) • It can be the foundation for further automatic analysis of a corpus (e.g. POS tags can be used for parsing) • It preserves the analysis, enabling replicability of research and reusability of the corpus
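The search quoted on this slide (“all inflected forms of cause as a verb”) can be sketched as follows. This is a minimal illustration, assuming a corpus that already carries lemma and word-class annotation; the `Token` type, the simplified labels (`VERB`, `NOUN`, ...) and the sample sentences are invented for the example and do not come from any real tag-set or corpus.

```python
from typing import NamedTuple


class Token(NamedTuple):
    form: str   # the word as it appears in the text
    lemma: str  # base form
    pos: str    # simplified word class (hypothetical labels, not a real tag-set)


# A tiny hand-annotated sample; sentences and labels are invented for illustration.
tokens = [
    Token("Stress", "stress", "NOUN"),
    Token("causes", "cause", "VERB"),
    Token("illness", "illness", "NOUN"),
    Token(".", ".", "PUNCT"),
    Token("The", "the", "DET"),
    Token("cause", "cause", "NOUN"),
    Token("was", "be", "VERB"),
    Token("unknown", "unknown", "ADJ"),
    Token(".", ".", "PUNCT"),
    Token("It", "it", "PRON"),
    Token("caused", "cause", "VERB"),
    Token("delays", "delay", "NOUN"),
    Token(".", ".", "PUNCT"),
]


def find(tokens, lemma, pos):
    """Return the word forms whose lemma and word class both match."""
    return [t.form for t in tokens if t.lemma == lemma and t.pos == pos]


verb_hits = find(tokens, "cause", "VERB")
noun_hits = find(tokens, "cause", "NOUN")
print(verb_hits)  # ['causes', 'caused']
print(noun_hits)  # ['cause']
```

A plain search for the string “cause” would return the noun and miss nothing else useful; only the lemma and word-class layers make it possible to keep “causes” and “caused” while excluding the noun.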
Annotation – How? Before you can annotate, decide: • What are you marking up? (POS, lemma, clause?) • How are you annotating? (manually/automatically?) • With which tag-set? (CLAWS, Penn Treebank?) • Format of annotation? (HTML, XML, Chat?) • Whose linguistic analysis? (mine, or a de facto standard?) • How are your annotations going to be shared with other researchers?
Same words, different tags
... <w PNP>it <w VBZ>is <w AJ0>true <w CJT>that <w PNP>he <w VBD>was ... (BNC ABU:1683)
it_PP3 is_BEZ true_JJ that_CS he_PP3A was_BEDZ .... (LOB G 28:95)
• format of the tags: <w > (BNC) / _ (LOB)
• it = PNP / PP3 (third person singular pronoun)
• is = VBZ / BEZ (third person present tense form of BE)
• that = CJT (the subordinating conjunction ‘that’) / CS (subordinating conjunction)
• he = PNP (personal pronoun) / PP3A (personal pronoun, 3rd pers sing nom (he, she))
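To compare the two tagging formats side by side, both can be read into the same (word, tag) structure. The sketch below assumes only the two sample lines shown on this slide (with the ellipses and source references removed); the function names `parse_bnc` and `parse_lob` are made up for the illustration and are not part of any corpus toolkit.

```python
import re

# The two sample lines from the slide, without ellipses and references.
bnc_line = "<w PNP>it <w VBZ>is <w AJ0>true <w CJT>that <w PNP>he <w VBD>was"
lob_line = "it_PP3 is_BEZ true_JJ that_CS he_PP3A was_BEDZ"


def parse_bnc(line):
    """Read '<w TAG>word' pairs and return them as (word, tag) tuples."""
    return [(word, tag) for tag, word in re.findall(r"<w (\S+?)>([^\s<]+)", line)]


def parse_lob(line):
    """Read 'word_TAG' pairs, splitting each token on its last underscore."""
    return [tuple(token.rsplit("_", 1)) for token in line.split()]


bnc_pairs = parse_bnc(bnc_line)
lob_pairs = parse_lob(lob_line)

# Print the same words with both tags next to each other.
for (word, bnc_tag), (_, lob_tag) in zip(bnc_pairs, lob_pairs):
    print(f"{word:5} {bnc_tag:4} {lob_tag}")
```

Once both lines are in the same structure, the differences in tag-set (PNP vs PP3, VBZ vs BEZ) are separated from the differences in format.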
Annotation – How? How do you get the information in? • Manually • Automatically • Semi-automatically • Automatically based on manually annotated ‘training corpus’
Annotation – How? A simple tagger
1. Choose POS from list
2. When >1, check • frequency • surrounding words • n-grams • probability • rules • combination of the above
3. Choose best match
(4. Note uncertainty)
Wordlist: cat = noun, dog = noun, sing = verb, the = article, will = verb, will = noun
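The steps above can be sketched in a few lines. This is a toy illustration only, not how CLAWS or any real tagger works: the lexicon is the wordlist from the slide, the order of candidates for “will” stands in for frequency (an assumption), and the single context rule (a word directly after an article is a noun) is invented for the example.

```python
# Wordlist from the slide; each word maps to its candidate word classes.
# Candidate order for "will" stands in for frequency (an assumption).
LEXICON = {
    "cat": ["noun"],
    "dog": ["noun"],
    "sing": ["verb"],
    "the": ["article"],
    "will": ["verb", "noun"],
}


def tag(words):
    """Tag each word; return (word, tag, uncertain) triples."""
    tagged = []
    for word in words:
        candidates = LEXICON.get(word.lower())
        if candidates is None:
            # Step 1 fails: the word is not in the list at all.
            tagged.append((word, "unknown", True))
            continue
        choice, uncertain = candidates[0], False
        if len(candidates) > 1:
            # Step 2: more than one candidate, so look at the surrounding words.
            previous_tag = tagged[-1][1] if tagged else None
            if previous_tag == "article" and "noun" in candidates:
                # Illustrative context rule: article + ambiguous word -> noun.
                choice = "noun"
            else:
                # Step 4: no rule applies, keep the first candidate and flag it.
                uncertain = True
        # Step 3: record the best match.
        tagged.append((word, choice, uncertain))
    return tagged


print(tag("the dog will sing".split()))
print(tag("the will".split()))
```

In the first sentence “will” is tagged as a verb but flagged as uncertain, because nothing in the context decides between the two candidates; in the second, the preceding article resolves it to a noun.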
Annotation – How? Sample taggers • CLAWS http://ucrel.lancs.ac.uk/claws/trial.html (part of Wmatrix corpus analysis and comparison tool http://ucrel.lancs.ac.uk/wmatrix/ ) • CST http://cst.dk/online/pos_tagger/uk/
Annotation: how? • Annotations should be separable • Detailed and explicit documentation should be provided • Annotation practices should be linguistically consensual • Annotation should observe standards (Leech 2005)
Example: non-standard
A01 0010 The Fulton County Grand Jury said Friday an investigation
A01 0020 of Atlanta's recent primary election produced "no evidence" that
A01 0030 any irregularities took place. The jury further said in term-end
A01 0040 presentments that the City Executive Committee, which had over-all
A01 0050 charge of the election, "deserves the praise and thanks of the
A01 0060 City of Atlanta" for the manner in which the election was conducted.
Example: non-standard
|SA01:1 the_AT Fulton_NP County_NN Grand_JJ Jury_NN said_VBD Friday_NR an_AT investigation_NN of_IN Atlanta's_NP$ recent_JJ primary_NN election_NN produced_VBD no_AT evidence_NN that_CS any_DTI irregularities_NNS took_VBD place_NN ._.
|SA01:2 the_AT jury_NN further_RBR said_VBD in_IN term-end_NN presentments_NNS that_CS the_AT City_NN Executive_JJ Committee_NN ,_, which_WDT had_HVD over-all_JJ charge_NN of_IN the_AT election_NN ,_, deserves_VBZ the_AT praise_NN and_CC thanks_NNS of_IN the_AT City_NN of_IN Atlanta_NP for_IN the_AT manner_NN in_IN which_WDT the_AT election_NN was_BEDZ conducted_VBN ._.
Example: standard? <text> <file id=A01> <p> <s c="0000003 002" n=00001> <w AT>The <w NP1>Fulton <w NN1>County <w JJ>Grand <w NN1>Jury <w VVD>said <w NPD1>Friday <w AT1>an <w NN1>investigation <w IO>of <w NP1>Atlanta<w GE>'s <w JJ>recent <w JJ>primary <w NN1>election <w VVD>produced <quote> <w AT>no <w NN1>evidence </quote> <w CST>that <w DD>any <w NN2>irregularities <w VVD>took <w NN1>place<c YSTP>.
Example: standard <body> <pb n="85"/> <div1 n="2" type="u"> <head><s n="1"><w type="CRD" lemma="1937">1937</w></s> </head> <pb n="87"/> <div2 n="1" type="u"> <p><s n="2"><w type="NP0" lemma="joe">Joe </w><w type="CJC" lemma="and">and </w><w type="NP0" lemma="harry">Harry </w><w type="VVD" lemma="stand">stood </w><w type="PRP" lemma="on">on </w><w type="AT0" lemma="the">the </w><w type="NN1" lemma="platform">platform </w><w type="NN1" lemma="side">side </w><w type="PRP" lemma="by">by </w><w type="NN1" lemma="side">side</w><c type="PUN">.</c></s>
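One practical benefit of the XML format above is that generic software can read it. The sketch below uses Python's standard `xml.etree.ElementTree` on a shortened, self-closed copy of the sentence from the slide (the original fragment is cut off before its closing tags, so the sample here is an adaptation, not the corpus file itself).

```python
import xml.etree.ElementTree as ET

# Shortened, well-formed copy of the sentence on the slide (adapted for illustration).
sample = (
    '<s n="2">'
    '<w type="NP0" lemma="joe">Joe </w>'
    '<w type="CJC" lemma="and">and </w>'
    '<w type="NP0" lemma="harry">Harry </w>'
    '<w type="VVD" lemma="stand">stood </w>'
    '<w type="PRP" lemma="on">on </w>'
    '<w type="AT0" lemma="the">the </w>'
    '<w type="NN1" lemma="platform">platform</w>'
    '<c type="PUN">.</c>'
    "</s>"
)

sentence = ET.fromstring(sample)

# Collect (word form, POS tag, lemma) for every <w> element.
rows = [(w.text.strip(), w.get("type"), w.get("lemma")) for w in sentence.iter("w")]

# Example query: lemmas of all words whose tag starts with "V" (verbs in this tag-set).
verb_lemmas = [lemma for _, pos, lemma in rows if pos.startswith("V")]

for row in rows:
    print(row)
print(verb_lemmas)  # ['stand']
```

No corpus-specific parser is needed here: the element and attribute names (`w`, `type`, `lemma`) are the only knowledge the code has about the annotation scheme.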
Annotation – How? Annotation standards? Use of standards can help to ensure successful: • interpretation, • interchange, • preservation, • incorporation into other resources, • processing by generic software. It is also a way of resolving tricky encoding decisions, and of justifying and documenting your decisions.
More annotation • Not only POS with tags in text • “Stand-off” annotation • Linguistic, non-linguistic, structural, interpretative, several layers, etc • Always remember: • What is being annotated (basis for analysis)? • How to get it in (adding annotation)? • How to get it out (retrieving/using annotation)?
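“Stand-off” annotation keeps the text untouched and stores the annotation separately, pointing back into the text by position. The sketch below is one minimal way to picture this, using the warm-up sentence; the layer names, the labels and the dictionary layout are assumptions made for the illustration, not a standard stand-off format.

```python
text = "Are we going to work together on this?"


def span(text, fragment):
    """Return (start, end) character offsets of the first match of fragment."""
    start = text.index(fragment)
    return start, start + len(fragment)


# Annotations live outside the text and refer to it by character offsets.
# Layers and labels are illustrative only.
annotations = [
    {"layer": "pos", "span": span(text, "work"), "label": "verb"},
    {"layer": "pos", "span": span(text, "this"), "label": "pronoun"},
    {"layer": "pragmatic", "span": (0, len(text)), "label": "question"},
]


def retrieve(text, annotations, layer):
    """Get the annotated stretches of text and their labels for one layer."""
    results = []
    for annotation in annotations:
        if annotation["layer"] == layer:
            start, end = annotation["span"]
            results.append((text[start:end], annotation["label"]))
    return results


print(retrieve(text, annotations, "pos"))
print(retrieve(text, annotations, "pragmatic"))
```

Because the text itself is never modified, several layers (here word class and a pragmatic label) can overlap freely, and a layer can be dropped or replaced without touching the others.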
Tip of the Week • UAM CorpusTool (annotate text with your own scheme) http://www.wagsoft.com/CorpusTool/ • Bookmarks for Corpus-based Linguists http://tiny.cc/corpora
Next week: Creating a corpus • Register via OUCS webpage http://www.oucs.ox.ac.uk/itlp/courses/detail/OTA5 • Reading tip: Developing Linguistic Corpora: a Guide to Good Practice http://ota.ox.ac.uk/documents/creating/dlc/