Foundations of Language Science and Technology - Corpus Linguistics - Silvia Hansen-Schirra
Outline
• Why corpora, why interpreted corpora
• Many types of annotation
 - linguistic annotation
 - non-linguistic annotation
• New developments
Why corpora?
• Linguistics: linguistic theory
• Engineering: language technology applications
• Cognition: models of human language processing
Empirical linguistics
Research draws on a DB of relevant data:
• introspective data
• experimental psycholinguistic data
• corpus data
Engineering motivation
• information extraction
• question-answering
• statistical machine translation
• parser training and evaluation
=> increased need for deeply annotated corpora
Cognitive motivation
• experience-oriented frequency-based models
• models of gradient grammaticality
• metrics of complexity
Resource description metadata
language: Spanish, English, German
sublanguage/register: regional dialect, sociolect, vernacular, professional jargon, toddler speech
text sort(s): newspaper articles, wire news, political speech, control commands
subject domain: stock rates, flight reservations
type of producers: professional journalist, student, radiologist
mode of production: spoken, written, signed, morsed
medium of production: pencil, PC with MS Word, dictaphone
conditions of production: spontaneous, carefully composed, produced under time pressure
transmission encoding: raw ASCII code, HTML, digitized phone signal, Unicode
medium of transmission: telephone, WWW, CB radio
storage encoding: raw ASCII code, HTML, AIFF
medium of storage: DAT tape, CD-ROM, hard disk
mode of presentation: spoken, written, signed
medium of presentation: newspaper, radio, book, TV show, theater performance
type of intended recipients: newspaper reader, booking agent, theater audience
number of intended recipients: point-to-point, multicast, broadcast
synchronicity of discourse: synchronous dialogue, asynchronous
direction: one-way, two-way
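As an illustration of how such metadata might be used, the following minimal sketch stores a handful of the dimensions above as a record and filters a small catalogue by them. The class name, the field names, the `select` helper and the three catalogue entries are all hypothetical; only the attribute values are taken from the list above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ResourceMetadata:
    """A few of the metadata dimensions from the slide (hypothetical field names)."""
    language: str
    register: str
    text_sort: str
    mode_of_production: str
    storage_encoding: str


def select(resources, **criteria):
    """Return the resources whose fields equal every given criterion."""
    return [
        r for r in resources
        if all(getattr(r, name) == value for name, value in criteria.items())
    ]


# Invented catalogue entries, built only from values named on the slide.
CATALOGUE = [
    ResourceMetadata("German", "professional jargon", "newspaper articles", "written", "HTML"),
    ResourceMetadata("English", "vernacular", "political speech", "spoken", "AIFF"),
    ResourceMetadata("Spanish", "regional dialect", "wire news", "written", "raw ASCII code"),
]

written = select(CATALOGUE, mode_of_production="written")
spoken_english = select(CATALOGUE, language="English", mode_of_production="spoken")
```

Keeping the metadata separate from the text itself means a corpus user can pick a subcorpus (for example, only written material) before running any linguistic query.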
Linguistic annotation
• part-of-speech tags
• word sense information
• morphosyntactic features of words
• constituent structures for phrases or sentences
• coreference markers
• dependency structures
• predicate-argument structures
• reference identifications for term phrases
• information structures within sentences
• intonation contours
• speech acts
• discourse relations - discourse structures
Other annotations
• judgements of native speakers on the acceptability or appropriateness of the utterance
• information on speaker(s)
• information on hearer(s) or intended audience
• information on the utterance situation (time, place, circumstances)
• information on the published source
• typographic information
• layout and document structure
• textual transcriptions of spoken utterances
• transcription of pauses
• error tagging
Raw vs. linguistically interpreted corpora
search term: word=form
 ...play a significant part in determining growth and form.
 ...each molecule can form four hydrogen bonds...
vs.
search term: word=form & pos=N
 ...play a significant part in determining growth and form.
search term: word=form & pos=V
 ...each molecule can form four hydrogen bonds...
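The contrast between the two kinds of query can be sketched in a few lines of Python. This is only a toy illustration, not the query engine used for the slide: the corpus is the two example sentences, the `search` function is hypothetical, and the tag "X" for every token other than nouns and verbs is a placeholder assumption.

```python
# Each token is a (word, pos) pair. N and V follow the slide's query syntax;
# "X" is a placeholder for all other parts of speech (an assumption).
SENTENCES = [
    [("play", "V"), ("a", "X"), ("significant", "X"), ("part", "N"), ("in", "X"),
     ("determining", "V"), ("growth", "N"), ("and", "X"), ("form", "N")],
    [("each", "X"), ("molecule", "N"), ("can", "X"), ("form", "V"),
     ("four", "X"), ("hydrogen", "N"), ("bonds", "N")],
]


def search(sentences, word, pos=None):
    """Return sentences containing `word`, optionally restricted to tag `pos`."""
    hits = []
    for sentence in sentences:
        if any(w == word and (pos is None or p == pos) for w, p in sentence):
            hits.append(" ".join(w for w, _ in sentence))
    return hits


raw_hits = search(SENTENCES, "form")            # word=form
noun_hits = search(SENTENCES, "form", pos="N")  # word=form & pos=N
verb_hits = search(SENTENCES, "form", pos="V")  # word=form & pos=V
```

On the raw corpus the query cannot separate the noun from the verb reading; with part-of-speech tags each reading is retrieved on its own.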
Raw vs. linguistically interpreted corpora
search term: is *ed
 Alpha interferon is produced by white blood cells...
search term: were *ed
 In the late 1970s interferons were hailed as "wonder drugs"...
vs.
search term: pos=VB {0,1} pos=VVN
 Gamma is not induced by viruses at all...
 So interferons could be described as the antibiotics of the virus...
 Only two of these have yet been identified...
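A tag-based passive query of this kind could be approximated as follows. The sketch assumes that `{0,1}` means "at most one intervening token" and that the corpus uses tags in the style of the BNC C5 tagset (VBZ, VBI, VBN for forms of "be", VVN for past participles); both points are assumptions, as are the function name and the hand-tagged sentences.

```python
# Hand-tagged examples; the individual tags are assumed for illustration.
NOT_INDUCED = [("Gamma", "NP0"), ("is", "VBZ"), ("not", "XX0"),
               ("induced", "VVN"), ("by", "PRP"), ("viruses", "NN2")]
BE_DESCRIBED = [("interferons", "NN2"), ("could", "VM0"),
                ("be", "VBI"), ("described", "VVN")]
BEEN_IDENTIFIED = [("have", "VHB"), ("yet", "AV0"),
                   ("been", "VBN"), ("identified", "VVN")]
ACTIVE = [("molecule", "NN1"), ("can", "VM0"), ("form", "VVI"), ("bonds", "NN2")]


def find_passives(tagged, max_gap=1):
    """Find a form of 'be' followed, after at most `max_gap` tokens, by a VVN."""
    hits = []
    for i, (_, pos) in enumerate(tagged):
        if not pos.startswith("VB"):
            continue
        for gap in range(max_gap + 1):
            j = i + 1 + gap
            if j < len(tagged) and tagged[j][1] == "VVN":
                hits.append(" ".join(word for word, _ in tagged[i:j + 1]))
                break
    return hits
```

Unlike the string patterns `is *ed` and `were *ed`, the tag pattern does not depend on a particular form of "be" or on the participle ending in "-ed", and it tolerates an intervening word such as "not".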
Syntactically annotated corpora: treebanks • German treebank project: TiGer Treebank • English reference treebank: Penn Treebank • Treebank + semantic information: Prague Dependency Bank
TiGer Treebank
Example sentence: Im nächsten Jahr will die Regierung ihre Reformpläne umsetzen .

Annotation on word level: part-of-speech, morphology, lemmata
 word          POS       morphology        lemma
 Im            APPRART   Dat               in
 nächsten      ADJA      Sup.Dat.Sg.Neut   nahe
 Jahr          NN        Dat.Pl.Neut       Jahr
 will          VMFIN     3.Sg.Pres.Ind     wollen
 die           ART       Nom.Sg.Fem        die
 Regierung     NN        Nom.Sg.Fem        Regierung
 ihre          PPOSAT    Acc.Pl.Masc       ihr
 Reformpläne   NN        Acc.Pl.Masc       Plan
 umsetzen      VVINF     Inf               umsetzen
 .             $.

Node labels: phrase categories (S, VP, PP, NP)
Edge labels: syntactic functions (HD, SB, OC, MO, OA, AC, NK)
 S
  HD: will
  SB: NP (NK die, NK Regierung)
  OC: VP
   MO: PP (AC Im, NK nächsten, NK Jahr)
   OA: NP (NK ihre, NK Reformpläne)
   HD: umsetzen
Crossing branches for discontinuous constituency types: the VP (Im nächsten Jahr ... ihre Reformpläne umsetzen) is interrupted by "will die Regierung".
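The idea of crossing branches can be made concrete with a small data structure. The sketch below is not the TiGer storage format; it encodes the tree suggested by the node and edge labels on the slide (an assumption about how the labels attach) and tests whether a constituent covers a contiguous stretch of the sentence. The `Node` class and both helper functions are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    label: str                 # phrase category (S, VP, ...) or word form
    position: int = -1         # surface position for words, -1 for phrases
    children: list = field(default_factory=list)  # (edge_label, Node) pairs


def terminals(node):
    """Sorted surface positions of all words dominated by `node`."""
    if not node.children:
        return [node.position]
    positions = []
    for _, child in node.children:
        positions.extend(terminals(child))
    return sorted(positions)


def is_discontinuous(node):
    """True if the words under `node` do not form one unbroken span."""
    pos = terminals(node)
    return pos[-1] - pos[0] + 1 != len(pos)


# Assumed attachment of the slide's labels; punctuation is left out.
pp = Node("PP", children=[("AC", Node("Im", 0)), ("NK", Node("nächsten", 1)),
                          ("NK", Node("Jahr", 2))])
np_sb = Node("NP", children=[("NK", Node("die", 4)), ("NK", Node("Regierung", 5))])
np_oa = Node("NP", children=[("NK", Node("ihre", 6)), ("NK", Node("Reformpläne", 7))])
vp = Node("VP", children=[("MO", pp), ("OA", np_oa), ("HD", Node("umsetzen", 8))])
s = Node("S", children=[("HD", Node("will", 3)), ("SB", np_sb), ("OC", vp)])
```

Under these assumptions `is_discontinuous(vp)` is true, because the VP covers positions 0-2 and 6-8 but not the finite verb and subject in between, while `is_discontinuous(s)` is false.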
Penn Treebank
 ( (S
     (NP-SBJ
       (NP (NNP Pierre) (NNP Vinken) )
       (, ,)
       (ADJP
         (NP (CD 61) (NNS years) )
         (JJ old) )
       (, ,) )
     (VP (MD will)
       (VP (VB join)
         (NP (DT the) (NN board) )
         (PP-CLR (IN as)
           (NP (DT a) (JJ nonexecutive) (NN director) ))
         (NP-TMP (NNP Nov.) (CD 29) )))
     (. .) ))
• annotation on word level: part-of-speech (NNP, CD, MD, VB, ...)
• phrase categories (S, NP, ADJP, VP, PP)
• syntactic functions (-SBJ, -CLR, -TMP)
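The three annotation layers can be read off the bracketed string mechanically. The following is a minimal sketch of a reader for this bracketing, written from scratch for illustration (the function names are hypothetical and it handles only what this one example needs); it separates part-of-speech tags, phrase categories and function tags.

```python
import re

TREE = """( (S (NP-SBJ (NP (NNP Pierre) (NNP Vinken) ) (, ,)
  (ADJP (NP (CD 61) (NNS years) ) (JJ old) ) (, ,) )
  (VP (MD will) (VP (VB join) (NP (DT the) (NN board) )
  (PP-CLR (IN as) (NP (DT a) (JJ nonexecutive) (NN director) ))
  (NP-TMP (NNP Nov.) (CD 29) ))) (. .) ))"""


def parse(text):
    """Parse a bracketed tree into nested (label, children) tuples."""
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    pos = 0

    def read():
        nonlocal pos
        pos += 1                               # consume "("
        label = ""
        if tokens[pos] not in ("(", ")"):      # the outer wrapper has no label
            label = tokens[pos]
            pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(read())
            else:
                children.append(tokens[pos])   # a word
                pos += 1
        pos += 1                               # consume ")"
        return (label, children)

    return read()


def walk(tree):
    """Yield every (label, children) node in pre-order."""
    yield tree
    for child in tree[1]:
        if isinstance(child, tuple):
            yield from walk(child)


def summarize(tree):
    """Split the annotation into POS tags, phrase categories and function tags."""
    pos_tags, phrases, functions = [], [], []
    for label, children in walk(tree):
        if not label:
            continue
        if all(isinstance(c, str) for c in children):
            pos_tags.append((children[0], label))      # word-level node
        else:
            category, _, function = label.partition("-")
            phrases.append(category)
            if function:
                functions.append(label)
    return pos_tags, phrases, functions


pos_tags, phrases, functions = summarize(parse(TREE))
```

In this encoding a syntactic function is a suffix on the phrase label (NP-SBJ), whereas TiGer puts the function on the edge between two nodes.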
Prague Dependency Bank
[Dependency tree figure; node text as extracted:] chce wants Sb investovat to-invest Obj ACT.VOL.T Kdo who Sb ACT.T ste hundred Obj RESTR.F do to AuxP automobilu car Adv DIR.F korun crowns Atr PAT.F
Annotation layers highlighted in the figure:
• annotation on word level: lemmata, morphology
• syntactic functions
• dependency structure
• semantic information on constituent roles, theme/rheme, etc.
New developments
• historical dimension (e.g., Corpus of the History of German Language)
• multilayer stand-off linguistic markup
• multimodal markup/interpretation
• new types of treebanks:
 - constituent-structure (CS) treebanks with dependency links (NEGRA, TIGER)
 - machine-annotated corpora for statistical training (e.g., Redwoods Treebank)
 - Dependency (Tree)Banks (Prague, PARC)
 - Grammatical Relation (Tree)Banks (Briscoe & Carroll)