
Using Corpora for Language Research



  1. Using Corpora for Language Research. COGS 523, Lecture 3: Corpus Annotation. Bilge Say

  2. Related Readings • Course Pack: Meyer (2002) Ch 4; Sampson and McCarthy (2005) Ch 39; Garside (1997) Chs 4, 5, 16 • Optional: McEnery et al. (2006): A3, A4, A8, A9 • For your reference, the rest of Garside et al. (1997) is relatively old but still useful. Slides with tagged text are adapted from McEnery and Wilson (2001) or McEnery et al. (2006), except the TEI encodings (see http://www.tei-c.org/Support/Learn/)

  3. Mark-up and Annotation • Corpus mark-up: a system of codes inserted into a document stored in electronic form to provide information about the text itself and to govern formatting, e.g. the Text Encoding Initiative (TEI) • Corpus annotation: the addition of interpretive, linguistic information to an electronic corpus of spoken and/or written data • The two terms are sometimes used interchangeably • Underlying conflict: utility of annotations vs ease of annotation
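  To make the distinction concrete, here is a minimal, hypothetical XML sketch (the element and attribute names are illustrative, not drawn from any particular standard): the outer element is structural mark-up recording what the text is, while the pos attributes are linguistic annotation adding an interpretive layer.

    <!-- structural mark-up: records document organization -->
    <p>
      <!-- linguistic annotation: interpretive part-of-speech labels -->
      <w pos="DET">The</w> <w pos="NOUN">cat</w> <w pos="VERB">sat</w>
    </p>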

  4. Other Issues • Standards vs Guidelines • Manual vs Automatic Annotation • Documentation • Evaluation of Annotation Schemes • See LREC conferences ...

  5. Maxims in Annotation of Text Corpora (Leech, 1993) • Annotation should be removable, so the raw corpus can be recovered • Annotations should be extractable from the text • End-user guidelines should be available • The annotation mode and annotator information should be made clear • Reliability measures should be available • Annotation schemes: theory-neutral or widely agreed upon? • No scheme has an a priori claim to be a standard

  6. Cross-Linguistic Annotation Standards • Reusability and shareability • Ease and efficiency in building a corpus • Cross-linguistic comparability • Examples: TEI, CES; EAGLES (Expert Advisory Group on Language Engineering Standards)

  7. Problems with Standardization • Applicability of standards to existing or ongoing corpus research • Acceptability of standards to the general linguistic community • Task dependency of corpora • Applicability to a wide range of languages

  8. Documentation of Markup/Annotation Guidelines • What should be specified in an annotation guidelines document? • Levels and layers of annotation • The set of annotation devices used and their meanings • Conventions for applying such devices, supplemented with examples or a reference corpus • Granularity of annotation • The disambiguation process applied (if any) • Measurable quality of annotation (accuracy rate, consistency rate, extent of manual checking) • Any incompleteness, known errors, etc.

  9. Markup • A.k.a. structural annotation • Different conventions for line breaks, sections, lists, etc. exist. What does that imply? • Character sets (e.g. Unicode) and language codes (e.g. ISO 639-3) • Textual information • COCOA references: <A Charles Dickens> • Standard Generalized Markup Language (SGML) • Hypertext Markup Language (HTML) • Extensible Markup Language (XML)

  10. XML • Three characteristics of XML distinguish it from other markup languages: • its emphasis on descriptive rather than procedural markup; • its notion of documents as instances of a document type; • its independence of any one hardware or software system.
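  A two-line illustration of the first point (a hypothetical fragment, shown only for contrast): procedural markup prescribes a rendering action, while descriptive markup says what the text is and leaves rendering to a stylesheet.

    <!-- procedural: tells a processor to italicize -->
    <i>Hamlet</i>

    <!-- descriptive: identifies the text as a title; rendering is decided elsewhere -->
    <title>Hamlet</title>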

  11. Text Encoding Initiative (TEI) • Objective: the development of an interchange language for textual data • Started in 1987 • Original P3 documentation: 1400 pages • Currently in P5, with extensive web support (see Links) • TEI Lite: simplified by a factor of 3 • Moved from SGML to XML • Flexible tagset • Document Type Definitions (DTDs: the rules for a particular markup language, i.e. its elements, attributes, entities), more flexible and optional • XSL (Extensible Stylesheet Language) • Simpler and better syntax • Corpus Encoding Standard (CES) and XCES: an attempt to specialize XML for corpora (not currently fully compliant with TEI P5, but with many commonalities) (see Links)
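  Since DTDs come up above, the fragment below is a minimal, invented DTD showing the three kinds of rule mentioned: element declarations, an attribute declaration, and an entity declaration. The document type (poem) and its content are made up for illustration.

    <!DOCTYPE poem [
      <!ELEMENT poem (line+)>                 <!-- a poem is one or more lines -->
      <!ELEMENT line (#PCDATA)>               <!-- a line contains plain text -->
      <!ATTLIST poem author CDATA #IMPLIED>   <!-- optional author attribute -->
      <!ENTITY auml "&#228;">                 <!-- entity for the character ä -->
    ]>
    <poem author="Anonymous">
      <line>An example line with an &auml; in it</line>
    </poem>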

  12. TEI • Alternative customizations: • tei_bare: TEI Absolutely Bare • teilite: TEI Lite • tei_corpus: TEI for Linguistic Corpora • tei_ms: TEI for Manuscript Description • tei_drama: TEI with Drama • tei_speech: TEI for Speech Representation

  13. An example of a feature system declaration (FSD)

    <fs id=vvd type=word-form>
      <f name=verb-class><sym value=verb>
      <f name=base><sym value=verb>
      <f name=verb-form><sym value=lexical>
      <f name=verb-class><sym value=past>
    </fs>

  14. Examples of SGML tags

    <Q>...</Q>     encloses a question
    <EX>...</EX>   encloses an expansion of an abbreviation in the original manuscript
    <LB>           indicates a line break
    <FRN>...</FRN> encloses words in another language; Lang="LA" indicates Latin
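  A short invented fragment showing how such tags might be applied in running text (the sentence itself is made up for illustration):

    <Q>Where are you going?</Q> she asked.<LB>
    The Rev<EX>erend</EX> answered in Latin: <FRN Lang="LA">quo vadis</FRN>.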

  15. Example of XML: a breakfast food menu
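  The slide's image is not preserved; below is a version of the widely circulated breakfast-menu example often used to introduce XML (the dish names, prices, and calorie counts are illustrative):

    <breakfast_menu>
      <food>
        <name>Belgian Waffles</name>
        <price>$5.95</price>
        <description>Two waffles with plenty of real maple syrup</description>
        <calories>650</calories>
      </food>
      <food>
        <name>Strawberry Belgian Waffles</name>
        <price>$7.95</price>
        <description>Light Belgian waffles covered with strawberries</description>
        <calories>900</calories>
      </food>
    </breakfast_menu>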

  16. TEI P5 structure • The TEI encoding scheme consists of a number of modules, each of which declares particular XML elements and their attributes. (from the TEI Guidelines) • Modules: core, header, textstructure, corpus ...

  17. TEI for Language Corpora: text descriptions • channel (primary channel) describes the medium or channel by which a text is delivered or experienced. For a written text, this might be print, manuscript, e-mail, etc.; for a spoken one, radio, telephone, face-to-face, etc. mode specifies the mode of this channel with respect to speech and writing. • constitution describes the internal composition of a text or text sample, for example as fragmentary, complete, etc. type specifies how the text was constituted. • derivation describes the nature and extent of the originality of this text. type categorizes the derivation of the text. • domain (domain of use) describes the most important social context in which the text was realized or for which it is intended, for example private vs. public, education, religion, etc. type categorizes the domain of use. • factuality describes the extent to which the text may be regarded as imaginative or non-imaginative, that is, as describing a fictional or a non-fictional world. type categorizes the factuality of the text. ... (from the Guidelines)

  18. TEI Elements • Elements: • Major structuring elements: text, body, front, back ... • Paragraph-level elements: citation, speaker ... • Lists, tables, figures • Phrase-level elements: date, emph, foreign • Bibliographical elements: author, publisher • Others: file description, revision description • Attributes: <div type="chapter" n="1"> ... </div> • Entities: &auml; for ä
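  A small invented fragment tying these pieces together; the chapter content is made up, but the element and attribute usage follows the TEI conventions named above:

    <div type="chapter" n="1">
      <p>On <date when="1923-09-12">12 September 1923</date> K&auml;the wrote
      that the news was <emph>not</emph> to be repeated.</p>
    </div>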

  19. A Text Description Example

    <textDesc n="Informal domestic conversation">
      <channel mode="s">informal face-to-face conversation</channel>
      <constitution type="single">each text represents a continuously
        recorded interaction among the specified participants</constitution>
      <derivation type="original"/>
      <domain type="domestic">plans for coming week, local affairs</domain>
      <factuality type="mixed">mostly factual, some jokes</factuality>
      <interaction type="complete" active="plural" passive="many"/>
      <preparedness type="spontaneous"/>
      <purpose type="entertain" degree="high"/>
      <purpose type="inform" degree="medium"/>
    </textDesc>

  20. A Sample Participant Description

    <person sex="2" age="mid">
      <birth when="1950-01-12">
        <date>12 Jan 1950</date>
        <name type="place">Shropshire, UK</name>
      </birth>
      <langKnowledge tags="en fr">
        <langKnown level="first" tag="en">English</langKnown>
        <langKnown tag="fr">French</langKnown>
      </langKnowledge>
      <residence>Long-term resident of Hull</residence>
      <education>University postgraduate</education>
      <occupation>Unknown</occupation>
      <socecStatus scheme="#pep" code="#b2"/>
    </person>

  21. Example of TEI Header from University of Michigan Library
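  The header image from the slide is not preserved. As a stand-in, here is a minimal, generic teiHeader sketch (not the Michigan document; the titles and statements are invented) showing the required fileDesc skeleton:

    <teiHeader>
      <fileDesc>
        <titleStmt>
          <title>A Sample Electronic Text</title>
          <author>Anonymous</author>
        </titleStmt>
        <publicationStmt>
          <p>Distributed for teaching purposes only.</p>
        </publicationStmt>
        <sourceDesc>
          <p>Transcribed from a printed original.</p>
        </sourceDesc>
      </fileDesc>
    </teiHeader>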

  22. Adopting XML-based linguistic annotation • Technical difficulties and human perceptual difficulties • Not conformant to how linguistic knowledge is expressed in many layers of linguistic annotation (e.g. XML's strict tree structure handles overlapping or crossing annotations awkwardly) ...

  23. Types of Annotation • Morphosyntactic: part-of-speech tagging; partial or full parse • Semantic: word senses, thematic roles • Discourse: information structure, anaphoric relations, discourse relations • Prosodic (e.g. intonation) • Pragmatic (e.g. speech acts) • Problem understanding (see the Message Understanding (MUC) and Document Understanding (DUC) Conferences)

  24. POS Tagging • Obligatory attributes or values: major word categories • Recommended attributes or values: type, gender, case • Optional: semantic classes, language-specific information, derivational morphology
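  A hypothetical annotated token illustrating the three tiers (the attribute names are invented for illustration; real tagsets encode this information in compact tags instead):

    <!-- obligatory: pos; recommended: type, gender, case; optional: sem -->
    <w pos="noun" type="common" gender="fem" case="nom" sem="animal">Katze</w>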

  25. Tagsets • Issues: conciseness, ease of interpretation, analysability, ease of disambiguation, linguistic quality vs computational tractability; trade-offs ... • Size of tagsets: English 30-200; Spanish 475; Turkish: 6000 distinct morphological feature combinations for 250,000 words • What to do with multiwords: in spite of (ditto tags, e.g. in_II31 spite_II32 of_II33), mergers (clitics, e.g. hasn't), compounds (eye strain vs eyestrain)

  26. Tagging Accuracy • Amount of training data available • The size of the tagset • Differences between the training data or dictionary and the real corpus • Unknown words • Recall and precision • 2-6% error rate for English, i.e. roughly 94-98% of tokens tagged correctly


  28. Example of part-of-speech tagging from the LOB corpus (C1 tagset)

    P05 32 ^ Joanna_NP stubbed_VBD out_RP her_PP$ cigarette_NN with_IN
    P05 32 unnecessary_JJ fierceness_NN ._.
    P05 33 ^ her_PP$ lovely_JJ eyes_NNS were_BED defiant_JJ above_IN
    P05 33 cheeks_NNS whose_WP$ colour_NN had_HVD deepened_VBN
    P05 34 at_IN Noreen's_NP$ remark_NN ._.

  29. Example of part-of-speech tagging from the Spoken English Corpus (C7 tagset)

    ^ For_IF the_AT members_NN2 of_IO this_DD1 university_NN1 this_DD1
    character_NN1 enshrines_VVZ a_AT1 victorious_JJ principle_NN1 ;_; and_CC
    the_AT fruits_NN2 of_IO that_DD1 victory_NN1 can_VM immediately_RR be_VBI
    seen_VVN in_II the_AT international_JJ community_NNJ of_IO scholars_NN2
    that_CST has_VHZ gathered_VVN here_RL today_RT ._.

  30. Example of part-of-speech tagging from the British National Corpus (C5 tagset in TEI-conformant layout)

    Perdita&NN1-NP0; ,&PUN; covering&VVG; the&AT0; bottom&NN1; of&PRF;
    the&AT0; lorries&NN2; with&PRP; straw&NN1; to&TO0; protect&VVI;
    &bquo;&PUQ; I&PNP; 'll&VM0; polish&VVI; your&DPS; boots&NN2; ,&PUN;
    &equo;&PUQ; he&PNP; offered&VVD; .&PUN;

  31. Example from the CLAWS system

    0000117 040 I     03 PPIS1
    0000117 050 do    03 VD0
    0000117 051 n't   03 XX
    0000117 060 think 99 VVI

  32. Syntactic Annotation • More problematic than POS tagging. Can you guess why? • Proposed levels: • Bracketing of segments • Labelling of segments • Marking of dependency relations, e.g. complements • Indicating functional labels, e.g. subject, object • Extra: ellipsis, traces ...

  33. Treebanks • Penn Treebank: the initiator • Treebanks for Swedish, Danish, German, Dutch, French, Turkish, Czech, Spanish, Basque, Russian, Chinese, Portuguese, Italian ... • Sizes: 700 to 90,000 sentences • Automated and manual annotation • Grammar formalisms: context-free grammar trees, dependency, LFG, HPSG, CCG

  34. Example of full parsing from the Lancaster-Leeds treebank

    [S [Ncs another_DT new_JJ style_NN feature_NN Ncs]
       [Vzb is_BEZ Vzb]
       [Ns the_ATI [NN/JJ& wine-glass_NN [JJ+ or_CC flared_JJ JJ+] NN/JJ&] heel_NN ,_,
           [Fr [Nq which_WDT Nq] ... Fr] Ns] ._. S]

  35. Example of skeleton parsing from the Spoken English Corpus

    [S& [P For_IF [N the_AT members_NN2 [P of_IO [N this_DD1 university_NN1 N]P]N]P]
        [N this_DD1 character_NN1 N]
        [V enshrines_VVZ [N a_AT1 victorious_JJ principle_NN1 N]V]S&] ._.

  36. From the Penn Treebank

    ((S (NP-SBJ-1 (NP Yields)
                  (PP on (NP money-market mutual funds)))
        (VP continued
            (S (NP-SBJ *-1)
               (VP to (VP slide)))
            (PP-LOC amid
                (NP signs
                    (SBAR that
                        (S (NP-SBJ portfolio managers)
                           (VP expect
                               (NP (NP further declines)
                                   (PP-LOC in (NP interest rates)))))))))))

  37. Tiger Treebank – A German treebank

    <n id="n1_500" cat="S">
      <edge href="#id(w1)"/>
      <edge href="#id(w2)"/>
    </n>
    <w id="w1" word="the"/>
    <w id="w2" word="boy"/>

  38. Semantic Annotation • Makes sense in linguistic or psycholinguistic terms • Applicable to the whole corpus • Flexible, with the right level of granularity • Hierarchical structure (?) • Conforming to standards (Schmidt, 1988)

  39. Other Issues • Semantic annotation is harder to produce • Can be computer-assisted if appropriate interfaces to lexical resources are developed • General frequency information can help in disambiguation

  40. Example of semantic text analysis, based upon Wilson (1996)

    And      00000000
    the      00000000
    soldiers 23241000
    platted  21072000
    a        00000000
    crown    21110400
    of       00000000
    thorns   13010000
    and      00000000
    put      21072000
    it       00000000
    on       00000000
    his      00000000
    head     21030000

    Key:
    00000000 Low content word
    13010000 Plant life in general
    21030000 Body and body parts
    21072000 Object-oriented physical activity
    21110321 Men's clothing: outer clothing
    21110400 Headgear
    23241000 War and conflict: general
    31241100 Color

  41. Example of anaphoric annotation from the Lancaster Anaphoric Treebank

    A039 1 v
    (1 [N Local_JJ atheists_NN2 N] 1)
    [V want_VV0
       (2 [N the_AT (9 Charlotte_NP1 9) Police_NN2 Department_NNJ N] 2)
       [Ti to_TO get_VV0 rid_VVN of_IO
          [N (3 <REF=2 its_APP$ chaplain_NN1 3) ,_,
             [N {{3 the_AT Rev._NNSB1 Dennis_NP1 Whitaker_NP1 3} ,_, 38_MC N]N]Ti]V] ._.

  42. Example codes for prosodic annotation

    #   end of tone group
    ^   onset
    /   rising nuclear tone
    \   falling nuclear tone
    /\  rise-fall nuclear tone
    _   level nuclear tone
    []  enclose partial words and phonetic symbols

    Also represented: unintelligible speech, background noise, overlapping speech (conventions exist), and names changed for privacy.

  43. Lecture 4 • Using corpora with other resources and corpus query tools (general); corpus/treebank quality control • Readings: Buchholz and Green (2006); Miller and Fellbaum (2007); Sampson and McCarthy Ch 29 • Due: project proposals
