Syntactic annotation in CGN: lessons learned and to be learned
Ineke Schuurman
Centre for Computational Linguistics, Katholieke Universiteit Leuven
Paris
This talk ...
• Why CGN: Spoken Dutch Corpus?
• At that time …
• Other layers
  • Orthographic transcription
  • PoS tagging
• Syntactic annotation
  • Dependencies and categories
• Spoken language
  • “standard” language, disfluencies
• LASSY/SoNaR: Written Dutch Corpus
• What to take into account when planning a ‘spoken treebank’
Why CGN?
• Dutch Language Union: Dutch/Flemish organization taking care of the common language
• 1997-8: report on the state of the art with respect to Language & Speech Technology
• 1998: Spoken Dutch Corpus
  • 5 years
  • 2/3 Netherlands, 1/3 Flanders
  • balanced
  • 1000 hours, approximately 10M words
  • 1M words with syntactic annotation
• For both research purposes and services (EU) / industry
At that time
• This talk: focus on textual aspects!
• No taggers or parsers that could be reused
• Existing grammars cover(ed) the northern variant of Dutch
• No ‘formal’ grammar
► start from scratch
Other layers
• Relevant for syntax:
  • Orthographic transcription
  • PoS tagging
• All layers in parallel, but per fragment: layer A finished before the start of layer B (except for errors)
• Reason: time
• But: this gave us the opportunity to express wishes/needs with respect to the other layers
• Example: handling of specific types of words
Transcription and PoS: an example
Specific types of words
• *v words in another language (not ‘adopted’ in Dutch)
• *a not fully realized words (gaan probe instead of gaan proberen)
• *x words that could not be (fully) understood (also xxx, ggg)
• *u mispronounced words (ploberen instead of proberen, om-uh-dat*u instead of omdat)
• *d dialectal words
• One or more words?
  • zo’n vs zo ‘n (such a): one token!
  • But hebde*d (lit. have you) realized as hebt*d de*d: two tokens
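The marker scheme above can be illustrated with a short Python sketch. It assumes that a marker is always a trailing `*` plus one of the five letters listed on the slide; the names `MARKERS`, `split_marker` and `describe` are hypothetical and not part of any CGN tool.

```python
import re

# Marker meanings as listed on the slide; the dictionary itself is our own.
MARKERS = {
    "v": "foreign word",
    "a": "not fully realized word",
    "x": "not (fully) understood word",
    "u": "mispronounced word",
    "d": "dialectal word",
}

# Assumption: a token optionally ends in '*' followed by one marker letter.
TOKEN_RE = re.compile(r"^(?P<form>.+?)(?:\*(?P<mark>[vaxud]))?$")


def split_marker(token):
    """Split a transcribed token into (word form, marker letter or None)."""
    m = TOKEN_RE.match(token)
    return m.group("form"), m.group("mark")


def describe(token):
    """Return the word form together with a readable description of its marker."""
    form, mark = split_marker(token)
    return form, MARKERS.get(mark, "regular word")


for tok in ["hebt*d", "de*d", "om-uh-dat*u", "proberen"]:
    print(describe(tok))
```

Tokens without a marker simply come back as regular words, so the same function can be run over a whole transcription.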
Syntactic analysis: goal CGN
• Annotation in a theory-neutral format, in order to be useful for as many people as possible
  • Categories: NP, PP, …
  • Functions/dependencies: subject, object1, …
• As automatic as possible
• Tool from the NEGRA corpus: Annotate
  • for German
  • same desiderata as CGN (contrary to the Dutch AMAZON parser)
Annotate
• Developed for the NEGRA project (Saarbrücken)
• Oliver Plaehn, Thorsten Brants
• Semi-automatic annotation
• Works with tagger and parser
• Suggests structures
• Combined with Cascaded Markov Models (Brants)
• Bootstrapping approach possible
Annotate screen
Annotate ‘correction’ format
Annotate export format
Principles of syntactic annotation
• Structures as flat as possible
  • Only a new level when there is a new head
  • No branching when just one node is involved
• No duplication of functions (1 SU, 1 OBJ1, …)
• In principle just non-branching heads
• Allowed:
  • multiple branching
  • crossing dependencies
• Input: simplified PoS
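The "no duplication of functions" principle lends itself to an automatic consistency check. The sketch below is only an illustration: it assumes a node is represented by the list of function labels on its outgoing edges, and it treats only SU and OBJ1 (the two labels named on the slide) as unique. The names `UNIQUE_FUNCTIONS` and `duplicated_functions` are hypothetical.

```python
from collections import Counter

# Only the two functions named on the slide; the real list is longer ("…").
UNIQUE_FUNCTIONS = {"SU", "OBJ1"}


def duplicated_functions(edge_labels):
    """Return the unique-by-principle functions that occur more than once under one node."""
    counts = Counter(edge_labels)
    return sorted(f for f, n in counts.items() if f in UNIQUE_FUNCTIONS and n > 1)


# A node with one subject, one head, one object and two modifiers is fine.
print(duplicated_functions(["SU", "HD", "OBJ1", "MOD", "MOD"]))
# A node with two subjects violates the principle.
print(duplicated_functions(["SU", "HD", "SU"]))
```

A check like this can only flag candidates; whether a flagged node is really an error still has to be decided by a corrector.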
Fewer PoS tags: simplified PoS
• PoS: over 300 tags
  • Over 100 for pronouns
  • Not problematic at all, often unique token/tag combinations
• Not all details are necessary for SA
• Example full tagset
  • T501a VNW(pers,pron,nomin,vol,1,ev) ik (I)
  • T501o VNW(pers,pron,nomin,vol,3,ev,masc) hij (he)
• Example simplified tagset
  • VNW1 VNW(pers,pron) personal pronoun
• In graph: both T501a and VNW1
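The relation between the full and the simplified tag in the example can be sketched in Python. This assumes, purely for illustration, that simplification means keeping the first two features of a tag. That reproduces the VNW example above, but the actual CGN mapping may well be a lookup table with different rules per word class. The function name `simplify_tag` and its `keep` parameter are hypothetical.

```python
import re

# A tag looks like POS(feature,feature,...), e.g. VNW(pers,pron,nomin,vol,1,ev).
TAG_RE = re.compile(r"^(?P<pos>[A-Z]+)\((?P<feats>[^)]*)\)$")


def simplify_tag(full_tag, keep=2):
    """Keep only the first `keep` features of a tag (an assumption, see lead-in)."""
    m = TAG_RE.match(full_tag)
    if m is None:
        return full_tag  # leave anything unexpected untouched
    feats = [f for f in m.group("feats").split(",") if f]
    return f"{m.group('pos')}({','.join(feats[:keep])})"


print(simplify_tag("VNW(pers,pron,nomin,vol,1,ev)"))
print(simplify_tag("VNW(pers,pron,nomin,vol,3,ev,masc)"))
```

Both full pronoun tags from the slide collapse to the same simplified tag, which is the reduction the syntactic annotation needs.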
Syntactic simplifications
• Other simplifications
• Obj2: indirect object (dative)
  • meewerkend voorwerp: Ik geef hem een boek / een boek aan hem (I give him a book)
  • belanghebbend voorwerp: Ik koop hem een boek / een boek voor hem (I buy him a book)
• Bepaling van gesteldheid (~predicative complement)
  • Hij verft de deur blauw (he paints the door blue)
  • Hij vindt het boek leuk (he likes the book)
  • Hij nam het boek lachend aan (laughing, he accepted the book)
Results
• Even then: Annotate did most NPs and PPs very well, but often failed on the more complex parts
• In some sense surprising, as the results for German were much better
• However: in that case written language was involved. Training for spoken language is much harder!
Details CGN corpus
• Balanced corpus:
  • Types of documents (next slide)
  • Speaker characteristics
    • Sex
    • Age
    • Geographic region
    • Socio-economic class
    • Level of education
• 2/3 Netherlands, 1/3 Belgium (Flanders)
• Participants were asked to speak standard language (in case they agreed beforehand to participate in CGN)
Details CGN corpus
► many types of documents
• Read-aloud written: literature read aloud (library for the blind)
• Written to be spoken:
  • News broadcasts
  • Lectures
• Spoken (spontaneous)
  • Interviews
  • Phone calls
  • Debates
  • Spontaneous conversations with x people (over lunch etc.)
Variation
• To some extent differences in written language, many more in spoken variants, esp. in spontaneous speech
• Separable verbs
  • NL dat ze hem op wilde bellen (that she wanted to call him)
  • VL dat ze hem wilde opbellen
• Different choice of auxiliaries
  • NL Ze is het komen brengen (she came and brought it)
  • VL Ze heeft het komen brengen
• Different words for the same concept, same words for different concepts
  • pompbak - gootsteen (sink), namiddag (afternoon - late afternoon)
• Grammars/dictionaries: mostly the northern written variant
Disfluencies
• Partially realized words
  • hilari*a instead of hilarisch (EN hilarious)
  • Analyzed as if realized
***
• Ik doe West- en Oost-Vlaanderen
  • I’ll take care of West- and Oost-Vlaanderen
  • Short for: West-Vlaanderen en Oost-Vlaanderen
  • Analyzed completely regularly as a conjunction (CONJ)
Disfluencies
• When too little of a token is realized, such a token is ignored
  • awel genen TV meer en genen boe*a gene voetbal meer .
  • EN: So no more tv and no more football
Example of disfluency (repetition)
Disfluencies
• Mixed repetition/correction
• Ze was bijna hileri*a hilari*a
  • She was almost hilarious
  • hileri*a is corrected as hilari*a, only the corrected form is included in the analysis
• Die verd*a die vervl*a die krankzinnige hond
  • That damn*, that cursed*, that crazy dog
  • Only the last 3 words (that crazy dog) are included in the graph
Disfluencies
• Wrong pronunciation
  • Dat is een serieus plobleem*u
  • Dat is een serieus probleem
  • That’s a serious problem
• Analysed as if the ‘correct’ word was involved
Words in a foreign language
• In spoken and written language: words in another language, and not found in a Dutch dictionary:
  • umbrella*v, plus*v de*v temps*v, à la carte
  • not: rendez-vous, cinema, cognac (in Dutch dictionaries)
• Single words: just like their Dutch counterpart
• Strings: only ‘top’ label presented
• Sentences: not analyzed
Pro and con markings
• Markings (*a, etc.) have proven to be useful for PoS and SA.
• But: they should have been removed afterwards, i.e. all information should have been contained in the tags, and the orthographic level should contain only orthography
• Problem: other groups wanted them at the orthographic level for speech recognition purposes
• Solution: add a field without markings
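The proposed solution, an extra field without markings, can be sketched as follows. This is a minimal illustration in Python, assuming whitespace-separated tokens and the five marker letters introduced earlier in the talk; the field names `marked` and `unmarked` and the function name are hypothetical, not the names used in the CGN data.

```python
import re

# Assumption: a marking is a trailing '*' plus one of the letters v, a, x, u, d.
_MARK = re.compile(r"\*[vaxud]$")


def add_unmarked_field(utterance):
    """Keep the marked transcription and add a copy with the markings stripped."""
    tokens = utterance.split()
    clean = [_MARK.sub("", tok) for tok in tokens]
    return {"marked": utterance, "unmarked": " ".join(clean)}


record = add_unmarked_field("Dat is een serieus plobleem*u")
print(record["marked"])
print(record["unmarked"])
```

Note that stripping a marking does not repair the word itself: the mispronounced form stays as it was transcribed, only the `*u` is gone.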
Syntactic annotation: lacking and superfluous words
• There are no ‘ungrammatical’ sentences, all sentences are to be analyzed!
• Lacking elements: just accept it
• Superfluous elements: just accept it
• BUT there are some exceptions:
  • repetition
  • ‘accidental’ sentences
Not analyzed parts
• Sometimes parts of a ‘sentence’ are ‘ignored’:
  • Repairs: Ik zie hem morg*a overmorgen (I’ll see him the day after tomorrow)
  • Repetitions: Hij is in in vergadering (He has a meeting)
• Or not connected: ‘accidental’ sentences/units
  • Ik heb nooit ik ben lerares (I have never I am a teacher)
• Uh-insertion (hesitation marker)
  • Ze heeft uh zeven dochters (She has seven daughters)
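Two of the cases above, hesitation markers and immediate repetitions, are mechanical enough to sketch in Python. The sketch assumes that the hesitation marker is the literal token `uh` and that a repetition is a token identical to the one kept just before it. Repairs and ‘accidental’ units are deliberately left out, since deciding what counts as too little of a word or as an unconnected part needs a human annotator. The function name `analyzable_tokens` is hypothetical.

```python
def analyzable_tokens(tokens, hesitation="uh"):
    """Drop hesitation markers and immediate repetitions before syntactic analysis."""
    kept = []
    for tok in tokens:
        if tok == hesitation:
            continue  # uh-insertion: not part of the analysis
        if kept and kept[-1] == tok:
            continue  # immediate repetition: keep only one occurrence
        kept.append(tok)
    return kept


print(analyzable_tokens("Ze heeft uh zeven dochters".split()))
print(analyzable_tokens("Hij is in in vergadering".split()))
```

A filter this simple would also drop legitimate doubled words, so in practice it could at most pre-mark candidates for the annotators.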
Examples
• More of the same

Asyndetic conjunction

Discourse phenomena
• Some examples of ‘discourse’ within a sentence

Accidental unit
• ‘Accidental’ unit, discourse
• parts not connected

Syntactic annotation: sentence vs discourse

Atypical ‘sentences’
• Often: discourse
Complicating factors
• No punctuation apart from full stop, question mark, ellipsis
• ‘Wrong order’ of sentences when several people are talking at the same time!
  ► Tricky with respect to coreference, temporal reasoning etc.
• Spelling: incorrect (but correct with another meaning)
  • U zij de glorie (Thine be the glory)
  • U zei de glorie (‘zei’ meaning ‘said’)
  • Ik zal haar eraan houden (houden aan: to keep a promise)
  • Ik zal haar er aanhouden (aanhouden: to arrest)
  ► context, recordings
Written corpus: Lassy/SoNaR
• STEVIN programme (Flemish/Dutch, 2004-2011)
• D-Coi / LASSY / (SoNaR)
  • 1M SA written text, manually corrected, plus 1.500M SA automatically
• ALPINO parser (Groningen)
• Largely inspired by CGN, based on HPSG
• Some differences
  • Mentioning of ‘hidden’ subjects, objects
  • Hij heeft een boek gekocht (He has bought a book)
Alpino
• Alpino grammar: HPSG-based
• ‘Constructional’ approach:
  • rich lexical representations
  • many detailed, construction-specific lexical rules (+/- 600)
• Grammar-based parsing very efficient, esp. when combined with specific rules
• Large lexicon (100.000+ entries, 200.000+ NEs)
  • Stored as perfect hash finite automaton (Daciuk)
• Crucial: integrated tagger (=/= CGN tagger!)
• Left-corner parser
Alpino (as is) and CGN
• Parsing the CGN corpus with Alpino
  • very bad results
  • reason might be: it uses a ‘wrong’ grammar, inadequate lexicon, etc.
• As we wanted both CGN and Lassy to be searchable using the same tools, CGN was ‘translated’ into the Lassy format. There are, however, still differences in the way a few phenomena are handled.
Lassy vs CGN
• Subjects/direct objects with respect to infinitives and participles
• Partitives (one of them said …): in CGN a separate label PART, in Lassy a combination of HD and MOD
• LASSY: head always lexically anchored
• In LASSY an SBAR complement always gets the VC label, in CGN either OBJ1 or VC
• …
• Analyses not fully identical, but 99% is!
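One of the differences above, the label of SBAR complements, can serve as an example of what ‘translating’ CGN into the Lassy format involves. The Python sketch below covers only that single rule and is not the actual conversion procedure. The tree layout (dictionaries with `cat`, `rel` and `children`), the root labels in the example and the function name `to_lassy_labels` are all hypothetical.

```python
def to_lassy_labels(node):
    """Return a copy of the tree in which SBAR complements labelled OBJ1 get VC."""
    rel = node["rel"]
    if node.get("cat") == "SBAR" and rel == "OBJ1":
        rel = "VC"  # Lassy: SBAR complement always VC
    children = [to_lassy_labels(child) for child in node.get("children", [])]
    return {**node, "rel": rel, "children": children}


# Illustrative CGN-style tree: a clause with a subject NP and an SBAR object.
cgn_tree = {
    "cat": "SMAIN", "rel": "--", "children": [
        {"cat": "NP", "rel": "SU", "children": []},
        {"cat": "SBAR", "rel": "OBJ1", "children": []},
    ],
}
lassy_tree = to_lassy_labels(cgn_tree)
print([child["rel"] for child in lassy_tree["children"]])
```

The function builds a new tree rather than changing the input, so the original CGN analysis stays available for comparison.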
Syntactic annotation: Lassy
Syntactic annotation: CGN
To be taken into account
• In general:
  • Take care of IPR
  • Be prepared to consult other layers
  • Use a flexible bug reporting system
• “Spoken language”: grammar/system should be very flexible
• Alignment may be very time consuming
• Be aware that, as far as consistency is concerned, the most important cases are not the really hard ones, but rather those the correctors do not realize are problematic (because in those cases they do not consult others)
• GOOD LUCK!