
Annotation of corpora



  1. Annotation of corpora • A. Part-of-speech tagging • B. Syntactic annotation • C. Semantic annotation • D. Discourse annotation • E. Pragmatic annotation

  2. Annotation of corpora • perfectly plain: e.g. produced by scanning; carries no information about the text (usually not even the edition) • marked up for formatting attributes, e.g. page breaks, paragraphs, font sizes, italics, etc. • annotated with identifying information, e.g. edition, date, author, genre, register, etc. • annotated for part of speech, syntactic structure, discourse information, etc.

  3. A. Part-of-speech tagging • LOB sample with POS tagging:
A01 2 ^ *'_*' stop_VB electing_VBG life_NN peers_NNS **'_**' ._.
A01 3 ^ by_IN Trevor_NP Williams_NP ._.
A01 4 ^ a_AT move_NN to_TO stop_VB \0Mr_NPT Gaitskell_NP from_IN
A01 4 nominating_VBG any_DTI more_AP labour_NN
A01 5 life_NN peers_NNS is_BEZ to_TO be_BE made_VBN at_IN a_AT meeting_NN
A01 5 of_IN labour_NN \0MPs_NPTS tomorrow_NR ._.
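For anyone processing such files, the word_TAG notation splits apart easily. A minimal Python sketch, where the line layout (text id, line number, then tagged tokens, with ^ marking sentence start) is inferred from the sample above:

```python
line = "A01 3 ^ by_IN Trevor_NP Williams_NP ._."

text_id, line_no, *tokens = line.split()
# keep word_TAG tokens; '^' (sentence start) has no underscore and is skipped
pairs = [tuple(tok.rsplit("_", 1)) for tok in tokens if "_" in tok]
print(pairs)  # -> [('by', 'IN'), ('Trevor', 'NP'), ('Williams', 'NP'), ('.', '.')]
```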

  4. A. Part-of-speech tagging • Main steps: • Divide the text into word tokens (tokenization) • Select a set of tags • Apply the tag set to the tokens • Tokenization: is the orthographic word the right morpho-syntactic unit? • multiwords, e.g., in spite of: label as in_PREP31 spite_PREP32 of_PREP33 • mergers, e.g., clitics as in hasn’t, je t’aime, vendetelo: label as vendete_VERB lo_PRON • compounds, e.g., tag set: label as tagset_NOUN or tag_NOUN set_NOUN?
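The tokenization decisions above can be made concrete in code. Below is a minimal Python sketch, with an invented three-word multiword list and a simple English n't clitic rule; both are illustrative assumptions, not any corpus project's actual resources.

```python
import re

MULTIWORDS = {("in", "spite", "of"): "PREP"}   # hypothetical mini-lexicon

def tokenize(text):
    """Split on whitespace/punctuation, then repair multiwords and clitics."""
    tokens = re.findall(r"\w+(?:'\w+)?|[^\w\s]", text)
    out, i = [], 0
    while i < len(tokens):
        # multiwords: emit "ditto tags" PREP31, PREP32, PREP33
        tri = tuple(t.lower() for t in tokens[i:i + 3])
        if tri in MULTIWORDS:
            out += [f"{t}_{MULTIWORDS[tri]}3{k}"
                    for k, t in enumerate(tokens[i:i + 3], 1)]
            i += 3
            continue
        # mergers: split English n't clitics, e.g. hasn't -> has + n't
        m = re.match(r"(\w+)(n't)$", tokens[i])
        out += [m.group(1), m.group(2)] if m else [tokens[i]]
        i += 1
    return out

print(tokenize("In spite of the delay, the committee hasn't stopped."))
# -> ['In_PREP31', 'spite_PREP32', 'of_PREP33', 'the', 'delay', ',',
#     'the', 'committee', 'has', "n't", 'stopped', '.']
```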

  5. A. Part-of-speech tagging • Choice of tag set • a sophisticated, linguistically well-grounded set of tags… • BUT: not automatically applicable without loss of accuracy • example: come can be present plural indicative, imperative, or subjunctive; the Lancaster corpus distinguishes these from the to-infinitive, while the LOB and Brown corpora don’t

  6. A. Part-of-speech tagging • tag = word class • label = alphanumeric characters • examples:
word class             possible labels
preposition            preposition, prep, IN
singular proper noun   NOUN:prop:sing, N-p-sg, NP1
• logically organized (taxonomy), e.g., in Lancaster, BNC, C7 • presentation: horizontal or vertical

  7. A. Part-of-speech tagging • encoding of tags • TEI (SGML), e.g., BNC:
<w AV0>Even <w AT0>the <w AJ0>old <w NN2>women <w VVB>manage <c PUN>,
<w AV0>just <w CJS>as <w PNP>they <w VVB>’re <w VVG>passing <w PNP>you <c PUN>.
(Garside et al., 1997)
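A minimal Python sketch of reading this simplified markup back into (word, tag) pairs; the regex is an assumption fitted to the stream shown here, not a real SGML/TEI parser.

```python
import re

SAMPLE = ("<w AV0>Even <w AT0>the <w AJ0>old <w NN2>women <w VVB>manage "
          "<c PUN>, <w AV0>just <w CJS>as <w PNP>they <w VVB>'re "
          "<w VVG>passing <w PNP>you <c PUN>.")

def read_tagged(stream):
    # each token is introduced by <w TAG> (word) or <c TAG> (punctuation)
    return [(word.strip(), tag)
            for tag, word in re.findall(r"<[wc] (\w+)>([^<]+)", stream)]

for word, tag in read_tagged(SAMPLE):
    print(f"{word}\t{tag}")   # vertical presentation (cf. slide 6)
```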

  8. A. Part-of-speech tagging • Applying tags to words • a tagging scheme should include a procedure for how to assign tags to words (both for humans and machines) • need a lexicon: it says which tags are assignable to which words • again: ambiguity is a problem
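A minimal sketch of lexicon lookup in Python; the tiny lexicon and the "take the first candidate" fallback are illustrative assumptions — real taggers disambiguate among the candidates using context (rules or statistics).

```python
LEXICON = {
    "stop":  ["VB", "NN"],     # ambiguous: verb or noun
    "life":  ["NN"],
    "peers": ["NNS", "VBZ"],   # ambiguous: plural noun or 3sg verb
}

def tag(tokens):
    for w in tokens:
        candidates = LEXICON.get(w.lower(), ["UNK"])  # unknown-word fallback
        # naive disambiguation: take the first candidate; a real tagger
        # would choose among the candidates by looking at the context
        yield f"{w}_{candidates[0]}"

print(" ".join(tag(["Stop", "electing", "life", "peers"])))
# -> Stop_VB electing_UNK life_NN peers_NNS
```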

  9. B. Syntactic annotation • syntactic annotation = parsed corpora • purposes: • training automatic parsers (computational linguistics, e.g. probabilistic parsers - inductive training through extraction of frequency counts) • extracting information (linguistics, e.g., building a lexicon, investigating subcategorization frames, collocations or other linguistic phenomena, describing sublanguages)

  10. B. Syntactic annotation • a parsing scheme needs (cf. POS tagging): • a list of symbols • definitions of the symbols • a description of how to apply the symbols to text • syntactically annotated corpora are called treebanks • examples of treebanks: Penn Treebank, Nijmegen Treebank, Susanne Corpus, Helsinki Constraint Grammar (ENGCG), Lancaster/IBM SEC treebank

  11. B. Syntactic annotation • Parsing • the (automatic) analysis of texts (sentences) in terms of syntactic categories • [tree diagram: phrase-structure parse of “Pierre Vinken, 61 years old, will join the board as an executive director Nov 29”, with S, NP, VP, PP, and ADJP nodes]

  12. B. Syntactic annotation • Penn Treebank • skeleton parsing: partial parse, leaving out the “hard” things (such as PP-attachment) • phrase structure model (Garside et al., 1997, p. 42):
((S (NP (NP Pierre Vinken) , (ADJP (NP 61 years) old ,))
    will
    (VP join (NP the board) (PP as (NP a nonexecutive director)) (NP Nov 29)))
 .)
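A minimal sketch of loading such a bracketed parse with NLTK's Tree class (assumes the nltk package is installed); the bracketing is the slide's, minus the empty outer bracket pair, and the NP extraction at the end illustrates one "extracting information" use from slide 9.

```python
from nltk import Tree

parse = Tree.fromstring(
    "(S (NP (NP Pierre Vinken) , (ADJP (NP 61 years) old ,)) will"
    " (VP join (NP the board) (PP as (NP a nonexecutive director))"
    " (NP Nov 29)))"
)

parse.pretty_print()   # draw the tree as ASCII art
for np in parse.subtrees(lambda t: t.label() == "NP"):
    print(" ".join(np.leaves()))   # every noun phrase in the parse
```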

  13. B. Syntactic annotation • Penn Treebank • available through the LDC • size: 3,300,000 words (Feb 97) • Brown corpus, Wall Street Journal • in the current phase: • add function labels (Subj, Obj, etc.) • add null constituents or traces (e.g., It’s easy [t] to eat) • add indices for coreference (e.g., Mary[i] saw herself[i] in the mirror) • mark discontinuous constituents • add semantic roles (Agent, Goal, etc.) • may get too complex for large-scale reliable analysis…

  14. B. Syntactic annotation • Susanne Corpus • part of the Brown corpus, 128,000 words • result of manual analysis • parsing scheme specified in great detail • available from Oxford Text Archive: • sable.ox.ac.uk/ota (http) • ota.ox.ac.uk/pub/ota/public (ftp)

  15. A./B. Demo • TIGER • NEGRA

  16. C. Semantic annotation • problem (1): more than one way of referring to a concept, e.g., • text analysis: the choice of expression may reflect ideologies in the text or relationships between participants in conversation, for example, in doctor-patient interaction: abdomen --- tummy • information retrieval: a historian of fashion seeks information about trousers: trousers --- slacks, shorts, leggings, breeches --> cf. RECALL in IR

  17. C. Semantic annotation • problem (2): one single word can refer to different concepts, e.g., • information retrieval: a historian of fashion wants to know about boots: boot may refer to a shoe, booting a computer, a kick, or the boot of a car --> cf. PRECISION in IR • so: • need to identify related words (problem 1) • need to identify the different senses of a word (problem 2)
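The two IR measures can be made concrete with a toy query result (all data invented): recall is the share of relevant items found, precision the share of found items that are relevant.

```python
relevant  = {"trousers", "slacks", "shorts", "leggings", "breeches"}
retrieved = {"trousers", "slacks", "boot", "bootstrap"}   # invented query result

hits = relevant & retrieved
recall    = len(hits) / len(relevant)    # problem 1: missed synonyms lower recall
precision = len(hits) / len(retrieved)   # problem 2: wrong senses lower precision

print(f"recall = {recall:.2f}, precision = {precision:.2f}")
# -> recall = 0.40, precision = 0.50
```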

  18. C. Semantic annotation • labeling words according to semantic field (word senses) so that you can • … extract all the related words by querying on the semantic field • … extract only those instances of ambiguous words with the specific senses you want by querying on the combination of word and semantic field
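A minimal sketch of both queries over a toy corpus of (word, semantic field) pairs; the field labels CLOTHING, COMPUTING, VEHICLE are invented for illustration.

```python
corpus = [
    ("trousers", "CLOTHING"), ("boot", "CLOTHING"), ("boot", "COMPUTING"),
    ("slacks", "CLOTHING"),   ("boot", "VEHICLE"),
]

# query 1: everything in a semantic field (finds related words you didn't name)
clothing = [w for w, field in corpus if field == "CLOTHING"]

# query 2: an ambiguous word restricted to one sense (word + field)
boot_as_shoe = [(w, f) for w, f in corpus if w == "boot" and f == "CLOTHING"]

print(clothing)       # -> ['trousers', 'boot', 'slacks']
print(boot_as_shoe)   # -> [('boot', 'CLOTHING')]
```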

  19. C. Semantic annotation • semantic fields: sense relations and other kinds of relations (e.g., part-of, related-to, etc.) • annotation (cf. PoS tagging): • definition of the tagging scheme (labels and their meanings) • guidelines for applying the tagging scheme • in semantics this is not as easy and straightforward as for PoS tagging! • requirements: • should make linguistic/psycholinguistic sense • should be able to account for the vocabulary in the corpus exhaustively • should be suitable for texts from different periods and registers (comprehensiveness) • should preferably have a hierarchical structure

  20. C. Semantic annotation • multiple membership, e.g., deepened belongs both to color and to change/remain • multiword units, e.g., stubbed out: encoded as two separate words, but belonging together • one recent ambitious attempt at a taxonomy of such semantic relations (sense relations, thesaurus-type relations, semantic fields, etc.): WordNet at www.cogsci.princeton.edu/~wn/ • you can try it online: www.cogsci.princeton.edu/~wn/online/
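WordNet can also be queried programmatically, e.g. through NLTK's interface (assumes nltk plus its WordNet data, installed via nltk.download('wordnet')); picking the footwear sense by index is an assumption about WordNet's sense ordering.

```python
from nltk.corpus import wordnet as wn

for synset in wn.synsets("boot"):   # problem 2: one word, many senses
    print(synset.name(), "-", synset.definition())

shoe = wn.synsets("boot")[0]   # assumed to be the footwear sense
print(shoe.hypernyms())        # "kind of" relation, e.g. footwear
print(shoe.hyponyms())         # more specific kinds of boot
```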

  21. C. Semantic annotation • How to do it? • manually • computer-assisted (need at least a computer-readable lexicon and a disambiguation process - similar to PoS tagging) • fully automatic (not really feasible): • semantic analysis is even harder than syntactic parsing • no integrated ‘parse’ of meaning possible at the present time

  22. D. Discourse annotation • discourse features: what are they? • typically: cohesion and coherence • coherence: what makes a text hang together in terms of content • cohesion: the means of making a text hang together • reference, substitution, ellipsis, conjunctive relations (cause, result, effect, etc.), thematic development (Halliday & Hasan, 1976)

  23. D. Discourse annotation • example: anaphoric relations in the IBM/Lancaster corpus (UCREL) • the aim is to build up something like an ‘anaphoric treebank’ • what are anaphoric relations? • links between a proform and an antecedent • example: The married couple said that they were happy with their lot. (they and their link back to the antecedent the married couple)

  24. D. Discourse annotation • anaphoric annotation in UCREL: the categories used are based on Halliday & Hasan (1976) • example of annotation: (1 Feodor Baumenk 1), a former Nazi death camp guard, has asked the U.S. Supreme Court to allow <REF=1 him to retain <REF=1 his American citizenship. (2 The Hartford Courant 2) said… • symbols: (1), (2)… = antecedent; < = anaphoric (> = cataphoric); REF = central pronoun
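A minimal Python sketch of recovering the antecedent-anaphor links from this notation; the regexes are assumptions fitted to the simplified example above, not the full UCREL scheme.

```python
import re

TEXT = ("(1 Feodor Baumenk 1), a former Nazi death camp guard, has asked "
        "the U.S. Supreme Court to allow <REF=1 him to retain <REF=1 his "
        "American citizenship. (2 The Hartford Courant 2) said...")

antecedents = dict(re.findall(r"\((\d+) (.+?) \1\)", TEXT))  # (N ... N)
anaphors = re.findall(r"<REF=(\d+) (\w+)", TEXT)             # <REF=N word

for index, proform in anaphors:
    print(f"{proform} -> {antecedents[index]}")
# -> him -> Feodor Baumenk
# -> his -> Feodor Baumenk
```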

  25. D. Discourse annotation • few corpora annotated for discourse features… • how to do it? • manually • computer-assisted: either interactive hand annotation using some kind of specialized editor, or automatic annotation with the possibility of hand correction or disambiguation • a tool supporting annotation of anaphora: XANADU in Lancaster

  26. E. Pragmatic annotation • anything beyond sentences and discourse: contexts of situation and culture • examples of things people look at in pragmatics • carry-on signals in conversation (e.g., Stenstroem, 1987): which functions do carry-on signals such as “well”, “you know” etc. have in conversation? • speech acts (e.g., Stiles, 1992): speech act types in conversation, e.g., in doctor-patient interactions:
PATIENT: I have the headaches to the point that I have to vomit (D)
DOCTOR: Mm-hm (K)
PATIENT: Then I have to go to bed and I sleep for a while (E)
D = Disclosure, K = Acknowledgment, E = Edification
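Once utterances carry such codes, tallies per speaker and act type come cheaply; a minimal sketch, with the transcript lines copied from the slide and the (X) code position assumed to be utterance-final.

```python
import re
from collections import Counter

TRANSCRIPT = [
    "PATIENT: I have the headaches to the point that I have to vomit (D)",
    "DOCTOR: Mm-hm (K)",
    "PATIENT: Then I have to go to bed and I sleep for a while (E)",
]

counts = Counter()
for line in TRANSCRIPT:
    speaker, utterance = line.split(":", 1)
    code = re.search(r"\(([A-Z])\)\s*$", utterance).group(1)
    counts[speaker, code] += 1

print(counts)
# -> Counter({('PATIENT', 'D'): 1, ('DOCTOR', 'K'): 1, ('PATIENT', 'E'): 1})
```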

  27. E. Pragmatic annotation • how to do it? • manually • computer-assisted: ? • fully automatic: not feasible • you have to use your imagination! • Stenstroem example: can be done with a concordance program, because it’s essentially word-based (see the KWIC sketch below) • Stiles example: would probably have to be done manually (then use a concordance program on the annotated texts?)
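For word-based signals like the Stenstroem example, a concordancer is a few lines of Python; this KWIC sketch (invented sample sentence) prints each hit with its left and right context.

```python
def kwic(tokens, keyword, width=4):
    """Key Word In Context: yield each hit with `width` words of context."""
    for i, tok in enumerate(tokens):
        if tok.lower() == keyword:
            left  = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            yield f"{left:>28} [{tok}] {right}"

sample = "well I think it went well you know considering the weather".split()
for line in kwic(sample, "well"):
    print(line)
```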

  28. Higher-level annotation: tools • Tools that support specialized analysis, such as • specialized editors, e.g., Xanadu for anaphoric relations • tools specialized in terms of linguistic models, • e.g., Sys-Tools for Systemic Functional Grammar (minerva.ling.mq.edu.au/) (http://cirrus.dai.ed.ac.uk:8000/Coder/index.html) • e.g., RSTTools for rhetorical relations analysis (www.dai.ed.ac.uk/daidb/people/homes/micko/RSTTool/index.html) • Tools that support various kinds of analysis (but not quite everything you might want to do): • TATOE (www.darmstadt.gmd.de/~rostek/tatoe.htm)

  29. References
• Fellbaum, C. (ed.), 1998. WordNet: An Electronic Lexical Database. Cambridge, MA: MIT Press.
• Garside, R., G. Leech & A. McEnery (eds.), 1997. Corpus Annotation: Linguistic Information from Computer Text Corpora. London: Longman.
• Halliday, M.A.K. & R. Hasan, 1976. Cohesion in English. London: Longman.
• Mindt, 1991. Syntactic evidence for semantic distinctions in English. In Aijmer & Altenberg (eds.), English Corpus Linguistics: Studies in Honour of Jan Svartvik. London: Longman.
• Stenstroem, 1987. Carry-on signals in English conversation. In Meijs (ed.), Corpus Linguistics and Beyond. Amsterdam: Rodopi.
• Stiles, 1992. Describing Talk: A Taxonomy of Verbal Response Modes. Beverly Hills: Sage.
