LIN 3098 Corpus Linguistics – Lecture 4
Albert Gatt
In this lecture
• Levels of annotation
• Corpus typology
  • classification based on type and levels of annotation
  • multilingual corpora
LIN 3098 -- Corpus Linguistics
Part 1: Levels of corpus annotation (cont'd)
Levels of linguistic annotation
• part-of-speech (word-level)
• lemmatisation (word-level)
• parsing (phrase & sentence-level)
• semantics (multi-level)
  • semantic relationships between words and phrases
  • semantic features of words
• discourse features (supra-sentence level)
• phonetic transcription
• prosody
Lemmatisation
• Groups morphological variants of a word under the head word:
  • mexa (walk)
  • imxejt (I walked)
  • imxejna (we walked)
  • nimxu (we walk)
  • ...
• Together, these form a lemma.
• Increasingly common these days.
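As a rough illustration of the grouping described above, the sketch below maps each form to its head word with a hand-written lookup table. The names `FORM_TO_LEMMA` and `group_by_lemma` are hypothetical and introduced only for this example; a real lemmatiser would rely on morphological rules or a full lexicon rather than a four-entry dictionary.

```python
from collections import defaultdict

# Hypothetical lookup table: each inflected form maps to its head word.
# The forms are the ones listed on the slide; the table is illustrative only.
FORM_TO_LEMMA = {
    "mexa": "mexa",
    "imxejt": "mexa",
    "imxejna": "mexa",
    "nimxu": "mexa",
}

def group_by_lemma(tokens):
    """Group tokens under their lemma; unknown forms are treated as their own lemma."""
    groups = defaultdict(list)
    for token in tokens:
        form = token.lower()
        lemma = FORM_TO_LEMMA.get(form, form)
        groups[lemma].append(token)
    return dict(groups)

groups = group_by_lemma(["imxejt", "nimxu", "imxejna", "mexa"])
print(groups)
```

Searching a lemmatised corpus for the head word then retrieves every listed variant in one query.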
Lemmatisation example: the SUSANNE corpus
• Format: reference + tag + word + lemma
  A05:0030.33 - VVDv said say
  • A05:0030.33: corpus file:sentence.word
  • VVDv: POS tag
  • said: actual word
  • say: head word (lemma)
• Every word in the corpus is on a separate line.
• Extremely useful for lexicography.
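A line in this format can be split into named fields with a few lines of code. The sketch below is a minimal reader for the single example line shown on the slide. It assumes whitespace-separated fields, and it treats the second field ("-"), which the slide does not explain, as an opaque status value. The names `SusanneToken` and `parse_line` are hypothetical.

```python
from typing import NamedTuple

class SusanneToken(NamedTuple):
    reference: str  # corpus file:sentence.word, e.g. "A05:0030.33"
    status: str     # "-" in the slide's example; meaning not given on the slide
    tag: str        # POS tag
    word: str       # actual word as it appears in the text
    lemma: str      # head word

def parse_line(line: str) -> SusanneToken:
    """Split one corpus line into its first five whitespace-separated fields."""
    fields = line.split()
    if len(fields) < 5:
        raise ValueError(f"expected at least 5 fields, got {len(fields)}: {line!r}")
    return SusanneToken(*fields[:5])

token = parse_line("A05:0030.33 - VVDv said say")
print(token.word, "->", token.lemma)
```

Because every word sits on its own line, collecting all forms of a lemma amounts to filtering lines on the `lemma` field.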
Automatic morphological analysis
• For some languages, there are reasonably good lemmatisers/morphological analysers.
• Examples for English:
  • morpha: built at the University of Sussex
  • EngTwol: commercial, by LingSoft
EngTwol output
• undeniable:
  "undeniable" <DER:ble> A ABS
  • <DER:ble>: derived with the -ble suffix
  • A: adjective
  • ABS: absolute form
• This is a rule-based analyser. Others use corpus-derived statistical patterns.
Semantic annotation I: Two types
• markup of semantic relations (e.g. predicate-argument structure)
  • currently used in parsed corpora, to mark up function-argument structures etc.
• markup of features of word meaning (mainly, word senses)
  • has origins in content analysis, which aims to arrive at conclusions about how prominent particular concepts are
  • now used in a lot of work on word sense disambiguation
Example of type 1 semantic markup (Penn Treebank)
(S (NP-SBJ-1 Chris) (VP wants (S (NP-SBJ *-1) (VP to (VP throw (NP the ball))))))
• Predicate-argument structure: wants(Chris, throw(Chris, ball))
• The empty embedded subject (*-1) is linked to NP subject no. 1.
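The link between the empty subject and its antecedent can be followed mechanically. The sketch below assumes the standard Penn Treebank notation, in which the indexed subject is labelled NP-SBJ-1 and the empty subject is written *-1. It parses the bracketed tree into nested lists and replaces each trace with the words of its antecedent. It does not build the full predicate-argument structure. The function names are hypothetical.

```python
import re

TREE = "(S (NP-SBJ-1 Chris) (VP wants (S (NP-SBJ *-1) (VP to (VP throw (NP the ball))))))"

def parse_tree(text):
    """Parse a bracketed tree into nested lists of the form [label, child, ...]."""
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    pos = 0

    def read():
        nonlocal pos
        if tokens[pos] == "(":
            node = [tokens[pos + 1]]  # the label follows the opening bracket
            pos += 2
            while tokens[pos] != ")":
                node.append(read())
            pos += 1  # consume the closing bracket
            return node
        leaf = tokens[pos]
        pos += 1
        return leaf

    return read()

def leaves(node):
    """Return the terminal strings under a node, left to right."""
    if isinstance(node, str):
        return [node]
    return [word for child in node[1:] for word in leaves(child)]

def index_antecedents(node, table):
    """Record the words of every node whose label ends in a numeric index."""
    if isinstance(node, str):
        return
    match = re.search(r"-(\d+)$", node[0])
    if match:
        table[match.group(1)] = leaves(node)
    for child in node[1:]:
        index_antecedents(child, table)

def resolve(node, table):
    """Return the sentence with each trace *-N replaced by its antecedent's words."""
    words = []
    for leaf in leaves(node):
        trace = re.fullmatch(r"\*-(\d+)", leaf)
        words.extend(table[trace.group(1)] if trace else [leaf])
    return words

tree = parse_tree(TREE)
table = {}
index_antecedents(tree, table)
words = resolve(tree, table)
print(" ".join(words))
```

Filling in the trace makes explicit that Chris is the subject of both "wants" and "throw", which is what the predicate-argument structure on the slide encodes.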
Semantic markup type 2: lexical features
• Most common type: word-sense tagged corpora
• Main idea: disambiguate a word in context by tagging its sense
• Often uses WordNet (Miller et al 1993)
  • WordNet is a lexical taxonomy which represents lexical relations among a large number of words,
  • including hyponymy (IS-A) relations etc.
  • For each entry, all the (supposed) senses of the word are given.
• Main use: identify senses of words in context, mark them up with a pointer to a WordNet sense.
WordNet senses: Move (noun)
(377) move -- (the act of deciding to do something; "he didn't make a move to help"; "his first move was to hire a lawyer")
(70) move, relocation -- (the act of changing your residence or place of business; "they say that three moves equal one fire")
(57) motion, movement, move, motility -- (a change of position that does not entail a change of location; "the reflex motion of his eyebrows revealed his surprise"; "movement is a sign of life"; "an impatient move of his hand"; "gastrointestinal motility")
(30) motion, movement, move -- (the act of changing location from one place to another; "police controlled the motion of the crowd"; "the movement of people from the farms to the cities"; "his move put him directly in my path")
(5) move -- ((game) a player's turn to take some action permitted by the rules of the game)
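One simple way to use a listing like this for sense tagging is to pick the sense with the highest count. The sketch below assumes that the bracketed numbers are the sense frequency counts reported by WordNet, which the slide does not state explicitly. The glosses are shortened, and the names `NOUN_MOVE`, `most_frequent_sense` and `sense_share` are hypothetical.

```python
# (count, shortened gloss) pairs copied from the listing for the noun "move".
NOUN_MOVE = [
    (377, "the act of deciding to do something"),
    (70, "the act of changing your residence or place of business"),
    (57, "a change of position that does not entail a change of location"),
    (30, "the act of changing location from one place to another"),
    (5, "a player's turn to take some action permitted by the rules of the game"),
]

def most_frequent_sense(senses):
    """Return the gloss of the sense with the highest count."""
    return max(senses, key=lambda sense: sense[0])[1]

def sense_share(senses):
    """Return each gloss with its share of the total count."""
    total = sum(count for count, _ in senses)
    return [(gloss, count / total) for count, gloss in senses]

total_count = sum(count for count, _ in NOUN_MOVE)
baseline = most_frequent_sense(NOUN_MOVE)
print(baseline)
for gloss, share in sense_share(NOUN_MOVE):
    print(f"{share:.2f}  {gloss}")
```

Always choosing the top sense ignores context entirely, so it serves only as a baseline against which context-sensitive disambiguation can be compared.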
WordNet senses: Move (verb)
(130) travel, go, move, locomote -- (change location; move, travel, or proceed; "How fast does your new car go?"; "We travelled from Rome to Naples by bus"; "The policemen went from door to door looking for the suspect"; "The soldiers moved towards the city in an attempt to take it before night fell")
(60) move, displace -- (cause to move, both in a concrete and in an abstract sense; "Move those boxes into the corner, please"; "I'm moving my money to another bank"; "The director moved more responsibilities onto his new assistant")
(52) move -- (move so as to change position, perform a nontranslational motion; "He moved his hand slightly to the right")
(20) move -- (change residence, affiliation, or place of employment; "We moved from Idaho to Nebraska"; "The basketball player moved from one team to another")
Check it out!
• WordNet is freely available for download: http://wordnet.princeton.edu/
Word sense annotation: other uses
• tagging words with their semantic field (Wilson 1996)
  • plant life
  • men’s clothing
  • …
• tagging words with their “emotional” content (Campbell & Pennebaker 2002), based on a dictionary:
  • social processes
  • negative emotions
• This approach underlies Pennebaker’s Linguistic Inquiry and Word Count (LIWC) system, which:
  • analyses a text and comes up with a profile of its personal/emotional content
  • relates this to some features of its author (gender, age…)
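The dictionary-based idea can be sketched in a few lines: count how many tokens of a text fall into each category and report the proportions. This is not the LIWC implementation. The category names come from the slide, but the word lists and the names `CATEGORY_WORDS` and `profile` are invented for illustration.

```python
from collections import Counter

# Hypothetical mini-dictionary: category names from the slide, word lists invented.
CATEGORY_WORDS = {
    "social processes": {"friend", "talk", "family", "we"},
    "negative emotions": {"sad", "angry", "hate", "afraid"},
}

def profile(text):
    """Return, for each category, the proportion of tokens that belong to it."""
    tokens = [raw.strip(".,;:!?\"'").lower() for raw in text.split()]
    tokens = [token for token in tokens if token]
    counts = Counter()
    for token in tokens:
        for category, words in CATEGORY_WORDS.items():
            if token in words:
                counts[category] += 1
    if not tokens:
        return {category: 0.0 for category in CATEGORY_WORDS}
    return {category: counts[category] / len(tokens) for category in CATEGORY_WORDS}

result = profile("We talk to a friend when we are sad.")
print(result)
```

Comparing such profiles across texts is the step that lets the approach relate language use to features of the author.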
Discourse annotation
• Most common: text-level units such as paragraphs
• Less common: anaphoric NPs and reference (cf. example from lecture 3)
• Even less common:
  • annotation of words which function as discourse cues (Stenstrom 1984): apology (sorry), hedges (sort of), etc.
  • annotation of rhetorical structure
Discourse: Annotating rhetorical structure (I)
• Rhetorical Structure Theory (Mann and Thompson 1988):
  • views text as made up of “discourse units”
  • units stand in various rhetorical relations, which reflect their role in constructing an argument, a narrative, etc.
• CONCESSION/CONTRAST relation:
  • [Although Mr. Freeman is retiring,] [he will continue to work as a consultant for American Express on a project basis].
  • Second unit is the main one (nucleus)
  • First unit (satellite) “concedes” that what the main unit is saying is contradicted by another fact.
• A recent corpus (Marcu et al 2003) is annotated with this information.
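One way to picture this kind of annotation is as a small record that names the relation and assigns each unit its role. The sketch below is a hypothetical representation written for this example. It is not the format used in the annotated corpus, and the class name `Relation` is an assumption.

```python
from dataclasses import dataclass

@dataclass
class Relation:
    name: str       # rhetorical relation, e.g. "CONCESSION"
    nucleus: str    # the main discourse unit
    satellite: str  # the supporting discourse unit

    def text(self):
        """Rebuild the sentence, assuming the satellite precedes the nucleus."""
        return f"{self.satellite} {self.nucleus}"

relation = Relation(
    name="CONCESSION",
    nucleus="he will continue to work as a consultant for American Express on a project basis.",
    satellite="Although Mr. Freeman is retiring,",
)
print(relation.name, "| nucleus:", relation.nucleus)
```

Keeping the nucleus and satellite as separate fields makes it possible to query a corpus for, say, all units that serve as satellites of a concession.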
Phonetic transcription
• Not many phonetically transcribed corpora exist.
• The MARSEC corpus is one of the best known. It is a version of the Lancaster/IBM Spoken English Corpus.
• There are, however, several databases of transcribed speech, mostly used for statistical speech technology applications (e.g. text-to-speech synthesis).
Annotating suprasegmentals
• Aim: capture suprasegmental features such as stress, intonation and pauses in speech.
• Some transcription systems exist:
  • ToBI (American)
  • Tonic Stress Marker (TSM; British)
• These define ways of annotating suprasegmentals such as the start/end of a tone group, simultaneous speech, rise-fall tone, falling tone, etc.
Problem-oriented tagging
• If you’re interested in a particular problem, and no corpus exists, build your own!
• Many corpora define problem-specific annotation schemes.
Example: the TUNA Corpus
• Problem: How do people refer to objects using definite NPs?
• Main interest: visual properties (colour, size etc.)
• Focus: semantics of definite NPs, i.e. what people choose to include in their description.
• Method:
  • experiment to get people to describe objects, distinguishing them from other objects in the same visual “scene”
  • annotation of descriptions based on semantics
TUNA Corpus: description
Annotated description: Red sofa, bigger version.
<DESCRIPTION NUM="SINGULAR">
<ATTRIBUTE NAME="colour" VALUE="red"> red </ATTRIBUTE>
<ATTRIBUTE NAME="type" VALUE="sofa"> sofa</ATTRIBUTE>
<ATTRIBUTE NAME="size" VALUE="large"> bigger version </ATTRIBUTE>
</DESCRIPTION>
• Features of the corpus:
  • represents the “target” referent
  • also represents the “distractors” (from which the target must be distinguished)
  • semantically transparent: annotation goes beyond language
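Because the annotation is XML, the semantic content of a description can be separated from its wording with a standard parser. The sketch below uses Python's `xml.etree.ElementTree` on the fragment shown on the slide. The function names `semantic_content` and `surface_words` are hypothetical, and the full corpus files may contain further elements not handled here.

```python
import xml.etree.ElementTree as ET

DESCRIPTION_XML = """
<DESCRIPTION NUM="SINGULAR">
  <ATTRIBUTE NAME="colour" VALUE="red"> red </ATTRIBUTE>
  <ATTRIBUTE NAME="type" VALUE="sofa"> sofa</ATTRIBUTE>
  <ATTRIBUTE NAME="size" VALUE="large"> bigger version </ATTRIBUTE>
</DESCRIPTION>
"""

def semantic_content(xml_text):
    """Map each annotated attribute name to its normalised value."""
    root = ET.fromstring(xml_text.strip())
    return {attr.get("NAME"): attr.get("VALUE") for attr in root.iter("ATTRIBUTE")}

def surface_words(xml_text):
    """Return the words the speaker actually used for each attribute."""
    root = ET.fromstring(xml_text.strip())
    return [(attr.text or "").strip() for attr in root.iter("ATTRIBUTE")]

content = semantic_content(DESCRIPTION_XML)
wording = surface_words(DESCRIPTION_XML)
print(content)
print(wording)
```

The size attribute shows why the annotation is called semantically transparent: the speaker said "bigger version", but the annotation records the value "large".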
Part 2: Multilingual corpora
Why multilingual corpora?
• comparative studies
  • syntax
  • morphology
  • …
• the cornerstone of most research in machine translation nowadays
  • most MT systems are statistical, trained on large repositories of parallel (e.g. English-Chinese) text.
Parallel corpora
• Represent a text in its original language (L1), with a translation in another language (L2)
  • long history: medieval polyglot bibles were among the first “parallel” corpora
• Alignment:
  • Many parallel corpora align L1 and L2 at sentence level, sometimes also at word level…
  • Sentence-level alignment can be achieved automatically with very high accuracy!
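The slide does not say how automatic sentence alignment works. One well-known family of methods relies on the observation that long sentences tend to be translated by long sentences. The sketch below is a much simplified version of that idea, not the method of any particular tool. It uses dynamic programming to choose among 1-1, 1-2 and 2-1 groupings so as to minimise the total difference in character length. The function name `align` and the example sentences are invented for illustration.

```python
def align(src, tgt):
    """Align two sentence lists using 1-1, 1-2 and 2-1 groupings.

    The cost of a grouping is the absolute difference in character length
    between its source side and its target side.
    """
    inf = float("inf")
    n, m = len(src), len(tgt)
    moves = [(1, 1), (1, 2), (2, 1)]
    # cost[i][j] is the best cost of aligning src[:i] with tgt[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0
    for i in range(n + 1):
        for j in range(m + 1):
            if cost[i][j] == inf:
                continue
            for di, dj in moves:
                ni, nj = i + di, j + dj
                if ni > n or nj > m:
                    continue
                src_len = sum(len(s) for s in src[i:ni])
                tgt_len = sum(len(t) for t in tgt[j:nj])
                candidate = cost[i][j] + abs(src_len - tgt_len)
                if candidate < cost[ni][nj]:
                    cost[ni][nj] = candidate
                    back[ni][nj] = (i, j)
    if cost[n][m] == inf:
        raise ValueError("no alignment exists with the allowed groupings")
    # Walk the back-pointers from the end to recover the groupings in order.
    pairs = []
    i, j = n, m
    while (i, j) != (0, 0):
        pi, pj = back[i][j]
        pairs.append((src[pi:i], tgt[pj:j]))
        i, j = pi, pj
    return pairs[::-1]

# Invented example: the second English sentence is split in two in the translation.
english = [
    "Sophie was on her way home.",
    "She walked slowly because it was raining and she was tired.",
]
german = [
    "Sofie war auf dem Heimweg.",
    "Sie ging langsam.",
    "Es regnete und sie war müde.",
]
pairs = align(english, german)
for source_group, target_group in pairs:
    print(source_group, "<->", target_group)
```

Published length-based aligners replace the raw length difference with a probabilistic cost and allow further groupings such as insertions and deletions.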
Example: SMULTRON corpus
• Developed and released in 2007-8
• Relatively small
• Aligned texts in English, Swedish and German
  • e.g. Sophie’s World is one of the texts
• Annotated with syntax, POS, morphology
• Comes with a tool to view parallel syntactic trees.
SMULTRON example: English (Sophie’s World)
<s id="s3">
<terminals>
<t id="s3_1" word="Sophie" pos="NNP" morph="--"/>
<t id="s3_2" word="Amundsen" pos="NNP" morph="--"/>
<t id="s3_3" word="was" pos="VBD" morph="--"/>
<t id="s3_4" word="on" pos="IN" morph="--"/>
<t id="s3_5" word="her" pos="PRP$" morph="--"/>
<t id="s3_6" word="way" pos="NN" morph="--"/>
<t id="s3_7" word="home" pos="RB" morph="--"/>
<t id="s3_8" word="from" pos="IN" morph="--"/>
<t id="s3_9" word="school" pos="NN" morph="--"/>
<t id="s3_10" word="." pos="." morph="--"/>
</terminals>
</s>
This shows terminal nodes only. The corpus also represents syntactic non-terminals (NP, VP etc.).
SMULTRON: Same sentence in German
<s id="3">
<terminals>
<t id="s3_1" word="Sofie" pos="NE" morph="FEM" lemma="Sofie" />
<t id="s3_2" word="Amundsen" pos="NE" morph="--" lemma="Amundsen" />
<t id="s3_3" word="war" pos="VAFIN" morph="--" lemma="sein"/>
<t id="s3_4" word="auf" pos="APPR" morph="--" lemma="auf" />
<t id="s3_5" word="dem" pos="ART" morph="--" lemma="der" />
<t id="s3_6" word="Heimweg" pos="NN" morph="MASK" lemma="Heimweg" />
<t id="s3_7" word="von" pos="APPR" morph="--" lemma="von" />
<t id="s3_8" word="der" pos="ART" morph="--" lemma="die" />
<t id="s3_9" word="Schule" pos="NN" morph="FEM" lemma="Schul~e" />
<t id="s3_10" word="." pos="$." morph="--" lemma="--" />
</terminals>
</s>
Note: richer morphology, representation of lemmas, …
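Since each terminal carries its lemma as an attribute, word-to-lemma pairs can be read straight off the markup. The sketch below parses the German terminals shown above with Python's `xml.etree.ElementTree`. It assumes well-formed XML with straight quotes, and it treats a lemma of "--" as "no lemma recorded", which is an interpretation rather than something the slide states. The function name `word_lemma_pairs` is hypothetical.

```python
import xml.etree.ElementTree as ET

GERMAN_XML = """
<s id="3">
  <terminals>
    <t id="s3_1" word="Sofie" pos="NE" morph="FEM" lemma="Sofie" />
    <t id="s3_2" word="Amundsen" pos="NE" morph="--" lemma="Amundsen" />
    <t id="s3_3" word="war" pos="VAFIN" morph="--" lemma="sein" />
    <t id="s3_4" word="auf" pos="APPR" morph="--" lemma="auf" />
    <t id="s3_5" word="dem" pos="ART" morph="--" lemma="der" />
    <t id="s3_6" word="Heimweg" pos="NN" morph="MASK" lemma="Heimweg" />
    <t id="s3_7" word="von" pos="APPR" morph="--" lemma="von" />
    <t id="s3_8" word="der" pos="ART" morph="--" lemma="die" />
    <t id="s3_9" word="Schule" pos="NN" morph="FEM" lemma="Schul~e" />
    <t id="s3_10" word="." pos="$." morph="--" lemma="--" />
  </terminals>
</s>
"""

def word_lemma_pairs(xml_text):
    """Return (word, lemma) pairs, skipping terminals whose lemma is '--'."""
    root = ET.fromstring(xml_text.strip())
    pairs = []
    for terminal in root.iter("t"):
        lemma = (terminal.get("lemma") or "--").strip()
        if lemma != "--":
            pairs.append((terminal.get("word"), lemma))
    return pairs

lemma_pairs = word_lemma_pairs(GERMAN_XML)
for word, lemma in lemma_pairs:
    print(f"{word:10} {lemma}")
```

The pair ("war", "sein") illustrates the lemmatisation level from Part 1 at work inside a parallel corpus.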
Translation corpora
• Not parallel.
• Contain different texts in two or more languages, of the same genre (more often called comparable corpora in the literature).
• Example: the PAROLE corpus is a translation corpus for EU languages.
Why translation corpora?
• Parallel corpora, by definition, contain translated text (L2)
  • can give rise to errors
  • artificiality and translation quality can be an issue
  • e.g. McEnery & Wilson report a study on an English-Polish corpus in which the Polish text reads “like a translation”
  • the problem can be overcome if the texts used are professionally translated
• Translation corpora have texts in two or more languages, “in the original”.
  • Data is more natural.
Summary
• We have now concluded our initial incursion into:
  • corpus construction
  • corpus annotation
  • corpus typology
• Next up:
  • using corpora for linguistic research