Experiences from the Spoken Dutch Corpus. Nelleke Oostdijk. Spoken Dutch Corpus – Corpus Gesproken Nederlands (CGN). Result of a joint Dutch-Flemish project (1 June 1998 until 1 Dec. 2003) Funded by the Dutch and Flemish governments and research foundations (NWO, FWO, AWI/EWI)
Experiences from the Spoken Dutch Corpus Nelleke Oostdijk
Spoken Dutch Corpus – Corpus Gesproken Nederlands (CGN) • Result of a joint Dutch-Flemish project (1 June 1998 until 1 Dec. 2003) • Funded by the Dutch and Flemish governments and research foundations (NWO, FWO, AWI/EWI) • Total budget: 4.6 million euro • Intended to constitute a language database of contemporary standard Dutch as spoken by adults in the Netherlands and Flanders • Intended use: • for research in various areas, e.g. linguistics, language and speech technology, business communication, language education • (indirectly) for business (SMEs) and teaching • culturally and historically, as a record of Dutch as spoken in the Low Countries around the year 2000
Outline • The Spoken Dutch Corpus: design and compilation • Experiences at the time • Reflections in retrospect While looking back on ambitions and experiences then (1998-2003), what would/might we do differently? • In view of developments since the corpus came into being • Considering the actual use of the corpus • In the light of user feedback • … • OpenCGN • Onwards ...
Ambition A Spoken Dutch Corpus that would be • Comparable in size to e.g. the spoken part of the British National Corpus • A plausible sample of standard Dutch as spoken in the Netherlands and Flanders • Enriched with transcriptions and annotations that were theory-neutral to the extent possible • State-of-the-art: transcriptions, annotations and file formats were to conform to national and international standards and guidelines or ‘best practices’
Corpus size targeted: 10 million words • Full corpus: 1,000 hours of recordings (~10M words) + orthographic transcription + POS tagging, lemmatization and lexicon link-up • “Core” corpus: 100 hours, transcription and annotation as for full corpus + broad phonetic transcription + syntactic annotation + manually verified word-signal alignment + (25 hours) prosodic annotation
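The planned sizes imply a rough speaking rate, which can be used to sanity-check the word counts quoted elsewhere for the core corpus and the prosodically annotated part. The following is a minimal sketch; the words-per-hour rate is derived from the two planned figures and is not stated in the slides.

```python
# Planned figures taken from the slide.
FULL_HOURS = 1_000
FULL_WORDS = 10_000_000

# Derived average rate (an assumption: speech density is treated as uniform).
WORDS_PER_HOUR = FULL_WORDS // FULL_HOURS


def estimated_words(hours: int) -> int:
    """Estimate the number of words in a given number of recorded hours."""
    return hours * WORDS_PER_HOUR


core_words = estimated_words(100)    # "core" corpus
prosody_words = estimated_words(25)  # prosodically annotated part

print(WORDS_PER_HOUR, core_words, prosody_words)
```

Under this assumption the core corpus comes out at about 1 million words and the prosodic subset at about 250,000 words.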
Prospective users: conflicting interests Prospective user groups/stakeholders included • Users from the fields of descriptive and theoretical linguistics and communication studies, incl. phoneticians, syntacticians, discourse specialists, … • Users from the field of computational linguistics • Users from the field of speech recognition These groups had conflicting ideas about • The content of the corpus, e.g. • Audio quality • Continuous speech vs words/phrases • Spoken vs read aloud • Situational contexts • The transcriptions and annotations
Corpus design considerations To be made in the light of the prospective users, the intended use as well as budgetary and other constraints (e.g. duration of the project, availability of qualified personnel, availability of suitable tools, ethical standards, …) As regards • Corpus and subcorpus sizes • Composition • Sampling • Formats • Metadata • Audio quality • Transcriptions and annotations
Corpus design Overall structure is defined on the basis of parameters that distinguish specific communicative / situational settings: • Number of speakers: monologues, dialogues, multilogues • Public (with or targeted at audience) or private • Medium: radio, television • Degree of preparedness • Direct (‘face-to-face’) vs distanced (telephone)
Corpus design: composition (1) (number of words)
dialogue / multilogue 8,110,000
  private 6,635,000
    spontaneous 6,635,000
      ‘direct’ 3,460,000
        conversations (face-to-face) 3,000,000
        interviews 460,000
      ‘distanced’ 3,175,000
        telephone conversations 3,000,000
        business transactions 175,000
  public 1,475,000
    broadcast 750,000
      more or less prepared 750,000
        interviews and discussions 750,000
    non-broadcast 725,000
      spontaneous 725,000
        discussions, debates, meetings 375,000
        lessons 350,000
monologue 1,890,000
Corpus design: composition (2) (number of words)
dialogue / multilogue 8,110,000
monologue 1,890,000
  private 40,000
    more or less prepared 40,000
      descriptions of pictures 40,000
  public 1,850,000
    broadcast 950,000
      spontaneous 250,000
        commentary 250,000
      more or less prepared 700,000
        current affairs programmes 250,000
        news 250,000
        opinion programmes, commentaries 200,000
    non-broadcast 900,000
      more or less prepared 900,000
        lectures, speeches 275,000
        read aloud text 625,000 (+ 375,000)
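The two composition slides form one tree in which every category total should equal the sum of its subcategories. The sketch below encodes the figures from the slides and checks that property. The tuple layout and the function name are illustrative choices, and the extra "(+ 375,000)" noted for read aloud text is left out of the sums.

```python
def leaf(name, words):
    """A category without subcategories."""
    return (name, words, [])


# (name, planned words, children); figures copied from the two slides.
TREE = ("corpus", 10_000_000, [
    ("dialogue / multilogue", 8_110_000, [
        ("private", 6_635_000, [
            ("spontaneous", 6_635_000, [
                ("direct", 3_460_000, [
                    leaf("conversations (face-to-face)", 3_000_000),
                    leaf("interviews", 460_000)]),
                ("distanced", 3_175_000, [
                    leaf("telephone conversations", 3_000_000),
                    leaf("business transactions", 175_000)]),
            ]),
        ]),
        ("public", 1_475_000, [
            ("broadcast", 750_000, [
                ("more or less prepared", 750_000, [
                    leaf("interviews and discussions", 750_000)])]),
            ("non-broadcast", 725_000, [
                ("spontaneous", 725_000, [
                    leaf("discussions, debates, meetings", 375_000),
                    leaf("lessons", 350_000)])]),
        ]),
    ]),
    ("monologue", 1_890_000, [
        ("private", 40_000, [
            ("more or less prepared", 40_000, [
                leaf("descriptions of pictures", 40_000)])]),
        ("public", 1_850_000, [
            ("broadcast", 950_000, [
                ("spontaneous", 250_000, [leaf("commentary", 250_000)]),
                ("more or less prepared", 700_000, [
                    leaf("current affairs programmes", 250_000),
                    leaf("news", 250_000),
                    leaf("opinion programmes, commentaries", 200_000)]),
            ]),
            ("non-broadcast", 900_000, [
                ("more or less prepared", 900_000, [
                    leaf("lectures, speeches", 275_000),
                    # The additional "(+ 375,000)" on the slide is not counted.
                    leaf("read aloud text", 625_000)])]),
        ]),
    ]),
])


def mismatches(node, path=""):
    """Return the paths of all categories whose children do not sum to their total."""
    name, words, children = node
    here = f"{path}/{name}"
    found = []
    if children and sum(child[1] for child in children) != words:
        found.append(here)
    for child in children:
        found.extend(mismatches(child, here))
    return found


print(mismatches(TREE))
```

An empty list means that all subtotals are consistent with the 10 million word target.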
Metadata • Information about the speakers, e.g. • Gender • Age • Regional background • Educational background • Information about the samples, incl. • Recording conditions • Medium • Length • Number and IDs of speakers involved • Available transcriptions and annotations
Concerns • Naturalness/spontaneity of the speech • Quality of the audio recordings • Coverage of various registers and speaker populations • Richness of metadata In view of practical complications, e.g. • In order to allow for distribution of the corpus, speakers’ consent is required (as well as settlement of other IPR matters) • In many situations you have only limited control over the recording conditions • Not all recordings were made by the CGN project • Speaker recruitment can be problematic, esp. for elderly people and people with lower education
Realization • Joint Dutch-Flemish project • Distributed project: involving several universities, carried out in several locations • Multiple transcription and annotation layers • Later transcription and annotation layers benefit from preceding annotations • Interaction between annotation layers provides checks and balances (quality, consistency) during the creation of the corpus • Multiple layers allow you to make the most of the data when using the corpus (e.g. orthographic + phonetic transcription) • Theory-neutral (to the extent possible)
Orthographic transcription Importance • Simplest access to speech data • Available for all data • Base for other transcriptions and annotations Properties • Verbatim transcription • Minimal interpretation • Alignment with speech signal (the marking of chunks facilitates subsequent word alignment) • Transcription useful for broad range of researchers
Orthographic transcription Rules • In principle, conventional spelling • Transcription of false starts, hesitations, grammatical errors, etc. (+ codes) • No capital letters to signal the beginning of sentences • Proper names (and parts of names) are capitalised • Punctuation restricted to [.] [?] and […] (Limited number of) codes, e.g. • xxx for unintelligible speech • ggg for speaker sounds • *a for unfinished words
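The codes and the restricted punctuation make it easy to separate ordinary words from coded material automatically. The following is a minimal sketch. It assumes that *a is attached to the end of the unfinished word and that tokens are separated by spaces; the sample chunk is invented for illustration and does not come from the corpus.

```python
PUNCTUATION = {".", "?", "...", "…"}


def classify_token(token: str) -> str:
    """Label one token of an orthographic transcription."""
    if token == "xxx":
        return "unintelligible"
    if token == "ggg":
        return "speaker_sound"
    if token.endswith("*a"):  # assumed convention: code follows the word fragment
        return "unfinished_word"
    if token in PUNCTUATION:
        return "punctuation"
    return "word"


# Invented example chunk (not corpus data).
chunk = "ik ga naar de fie*a ggg naar school ."
labels = [(token, classify_token(token)) for token in chunk.split()]
print(labels)
```

A filter of this kind could, for instance, be used to exclude coded tokens before counting word frequencies.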
Orthographic transcription Problems experienced • New words: dikkemurenkasteel, flunken, haatzaaiartikelen, downgeloade, geplaystationd • +/- word separation, e.g. separable verbs ervan uitgaan – er vanuit gaan – er van uitgaan • Reduced forms • Dialect words or words pronounced with a regional accent • … • Speaker identification
POS tagging and lemmatization • Available for all data: contextually appropriate tag for each word form (token) • POS tagging and lemmatization enable searches for word classes and lemmas and not only for word forms; e.g. naar – ADJ or PREP; fiets – N or WW; lemma ‘fietsen’: fiets, fietsen, fietste, fietsten, gefietst • Queries can remain underspecified for certain aspects; e.g. find all occurrences of naar as PREP followed by an N: naar huis, naar bed, naar keuze, naar school, …
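The kind of query described above can be sketched on a small list of tagged tokens. This is an illustration only, not the COREX query interface: the token layout, the function names and the sample sentences are assumptions, and the tags are the simplified labels used on the slide rather than the full CGN tagset.

```python
from typing import List, Optional, Tuple

Token = Tuple[str, str, str]  # (word form, POS tag, lemma)

# Invented sample data (not corpus data).
SAMPLE: List[Token] = [
    ("ik", "VNW", "ik"), ("fiets", "WW", "fietsen"),
    ("naar", "PREP", "naar"), ("school", "N", "school"),
    ("wat", "VNW", "wat"), ("een", "LID", "een"),
    ("naar", "ADJ", "naar"), ("verhaal", "N", "verhaal"),
    ("hij", "VNW", "hij"), ("fietste", "WW", "fietsen"),
    ("naar", "PREP", "naar"), ("huis", "N", "huis"),
]


def find_bigrams(tokens: List[Token],
                 first_word: Optional[str] = None,
                 first_pos: Optional[str] = None,
                 second_pos: Optional[str] = None) -> List[Tuple[str, str]]:
    """Find adjacent token pairs; any criterion left as None stays underspecified."""
    hits = []
    for (w1, p1, _), (w2, p2, _) in zip(tokens, tokens[1:]):
        if first_word is not None and w1 != first_word:
            continue
        if first_pos is not None and p1 != first_pos:
            continue
        if second_pos is not None and p2 != second_pos:
            continue
        hits.append((w1, w2))
    return hits


def forms_of(tokens: List[Token], lemma: str) -> List[str]:
    """List the distinct word forms that share a lemma."""
    return sorted({word for word, _, lem in tokens if lem == lemma})


prep_hits = find_bigrams(SAMPLE, first_word="naar", first_pos="PREP", second_pos="N")
any_hits = find_bigrams(SAMPLE, first_word="naar", second_pos="N")
print(prep_hits, any_hits, forms_of(SAMPLE, "fietsen"))
```

Leaving out the tag of naar also returns the adjectival use (naar verhaal), which shows why the contextually appropriate tag matters for searching.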
POS tagging and lemmatization Procedure • Automatically, using a tagger-lemmatiser • Check, and possibly manually correct, output Tagset • Specially designed for CGN • Conforms to EAGLES guidelines and ANS • 316 tags (incl. tags for dialect words and * words from orthography) Principles • Word by word; e.g. hij belde hem op • Form over function; e.g. ik heb haar maandag gezien
POS tagging and lemmatization Problems during checking and correction • Errors in orthographic transcription • Easy to miss occasional errors • When what is said deviates from what is prescribed by grammar • Notoriously difficult cases, e.g. - Distinction ADJ – V (ingp/edp) - Idioms - Different tags for a word (token), depending on context
Lexicon link-up Lemmatization of multi-word units • multi-part proper names, e.g. Kim Clijsters, … • separable verbs, e.g. achteruitdeinzen – deinsde … achteruit; dichtmaken – maakte … dicht; navertellen – vertelde … na • foreign multi-word expressions, e.g. pro forma, et cetera, chili con carne
Phonetic transcription • Available for about 1 million words • Broad phonetic: representation of the phonemes that are being pronounced using a limited set of symbols (e.g. no diacritic symbols) • Results from the manual correction of automatically generated transcriptions: - Transcription of phoneme insertions, deletions and substitutions - No transcription of gradual processes such as degree of voicing • Symbol set: Dutch SAMPA set with extras: 1. /J/ for <oranje> 2. /E:/, /Y:/, /O:/ for <scène>, <freule> and <zone> respectively 3. /E~/, /Y~/, /O~/, /A~/ for <vaccin>, <parfum>, <congé> and <croissant> respectively The symbols under 2 and 3 can only be used in loan words
Phonetic transcription General transcription rules • Make sure that there is a one-to-one relation with the orthography • Make sure that the transcription shows which phonemes were pronounced • If in doubt: do not change the automatic transcription • Note if words have to be inserted, removed or substituted Specific transcription rules • If two adjacent words share a phoneme, use an underscore, e.g. /Als_sInt kOmt/, /Ob_bAIEt/ • Use hyphens to link connecting phonemes with preceding and following words, e.g. /pApa-n-Em_mAma/, /du-w-@t/ • Use [] for untranscribable phonemes or words
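The one-to-one relation with the orthography can be checked mechanically once the underscore and hyphen conventions are taken into account. The sketch below treats spaces and underscores as word boundaries and a hyphen-delimited phoneme (as in -n- or -w-) as a linking element that belongs to neither word. This reading of the rules and the orthographic forms paired with the slide's examples are assumptions.

```python
import re
from typing import List

# Word boundaries: whitespace, an underscore (shared phoneme), or a linking
# phoneme enclosed in hyphens (assumed interpretation of the rules above).
BOUNDARY = re.compile(r"\s+|_|-[^\s_-]+-")


def phonetic_units(transcription: str) -> List[str]:
    """Split a broad phonetic transcription into word-sized units."""
    return [unit for unit in BOUNDARY.split(transcription.strip()) if unit]


def is_one_to_one(orthography: str, transcription: str) -> bool:
    """Check that both layers contain the same number of words."""
    return len(orthography.split()) == len(phonetic_units(transcription))


# Orthographic forms are guesses that match the slide's phonetic examples.
print(phonetic_units("pApa-n-Em_mAma"))
print(is_one_to_one("papa en mama", "pApa-n-Em_mAma"))
print(is_one_to_one("doe het", "du-w-@t"))
print(is_one_to_one("als sint komt", "Als_sInt kOmt"))
```

A mismatch in the counts would point to a word that was inserted, removed or merged in one of the two layers and therefore needs a note.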
Phonetic transcription What transcribers find difficult • Syllable-final /r/: deleted or not? • Syllable-final /n/: deleted or not? • Plosives and fricatives: voiced or voiceless? (esp. distinction /G/ - /x/) • Plosives without release: voiced or voiceless? • Use of /S/, /Z/, /J/: these symbols are easily forgotten
Syntactic annotation • Available for approx. 1 million words • Dependency structure • Semi-automatic annotation using @nnotate software (cf. NEGRA corpus)
Prosodic annotation • Available for about 25 hours of speech (~ 250,000 words) • 2 annotations per sample • As much as possible theory-neutral • Perception-based (cf. Portele & Heuft; Grover et al.) • Annotation of 1. prominent syllables, i.e. syllables that are stressed to make a word more important or to indicate contrast with another word 2. prosodic boundaries, strong and weak 3. abnormal lengthening of phonemes
Pronunciation variation: eigenlijk (EN: ‘actually’) Canonical form vs actually observed pronunciations (from rather careful to highly sloppy)
Pronunciation variation: regional differences For example, between Flanders and the Netherlands: dat (EN: that) subordinating conjunction OR demonstrative pronoun
What the CGN project has brought us: Tangible results • A sizeable corpus of spoken Dutch (~ 800 hours recorded speech) with transcriptions, annotations, metadata and documentation • Word frequency lists • COREX: Corpus Exploitation environment Available from the Dutch HLT Centre!
But also ... CGN has had enormous impact on the development of • Standards (e.g. data formats, metadata specifications, definition of tagset, adaptation/validation of SAMPA for Dutch) • Tools (e.g. PRAAT, tagger/lemmatizer, word alignment software, syntactic parser) • Guidelines for e.g. handling IPR, various types of transcription and annotation CGN has set an example for other projects that have benefitted from the pioneering work
Dutch language resources before the Spoken Dutch Corpus Before 1998 • Data • Text collections held by the Institute for Dutch Lexicology (for lexicological and lexicographical purposes) • Corpus Uit den Boogaart (word frequencies) • Private collections of individual researchers (small, widely diverse, no IPR) • Tools for Dutch: hardly any 1998-2003 Spoken Dutch Corpus project
Dutch language resources since the Spoken Dutch Corpus 1998-2003 Spoken Dutch Corpus project: influential as regards corpus design, standardization, IPR, tool development From 2003 onwards • 2003-2004 E-lexicon • 2005-2006 Dutch Language Corpus Initiative (D-Coi) project • 2004-2008 Jasmin (children, non-natives, elderly people for HLT applications) • 2006 CoDAS (Corpus of Dutch Aphasic Speech) • 2008-2011 STEVIN Nederlandstalig Referentiecorpus (SoNaR) project • 2009-2012 BasiLex (corpus of texts for Dutch school children) • 2013-2015 BasiScript (corpus of texts written by Dutch school children) • 2013-2014 CLARIN-NL Data curation service • 2015 OpenCGN • 2015- CLARIAH, incl. CLARIAH DCS
Reflections in retrospect While looking back on ambitions and experiences then (1998-2003), what would/might we do differently? • In view of developments since the corpus came into being • Considering the actual use of the corpus • In the light of user feedback
Experiences from actual practice • Use conventional spelling • Mark chunks • Conform to EAGLES guidelines • Time spent on development of protocols • Use tools to generate initial annotation and then do post-editing (e.g. POS tagging/lemmatization, phonetic transcription, syntactic annotation, word alignment) • Timely compilation of the lexicon so that it could optimally support the various transcription and annotation processes was not achieved • In view of the project duration some types of transcription or annotation had to be carried out in parallel • Prosodic annotation not for non-specialists
Things we would do differently? • In view of developments since the corpus came into being: availability of standards, tools and other resources, guidelines, … • Considering the actual use of the corpus: what has the corpus been used for (and what not, although we had anticipated that)? • In the light of user feedback: spontaneity lacking, corpus not suited for conversation studies, groups of speakers badly represented or not included at all, ...
OpenCGN (2015, ongoing) Aim: To make SoNaR and CGN data available for exploitation within a single environment Funding: CLARIN NL Data curation includes: • Audio: Conversion of WAV to MP3 • CGN XML converted to FoLiA • Metadata: harmonize with SoNaR metadata • Check to what extent current output of FROG conforms to tagset specified and documented in manual
The CGN lexicon • Consists of a single-word and a multi-word lexicon: • Single-word lexicon: 181,579 word form entries (type-word class pairs); 229,104 entries (incl. syntactic patterns) • Multi-word lexicon: 23,567 unique multi-word expressions; 18,593 unique multi-word lemmas; 53,704 multi-word entries • Comprises all types of word forms that occur in the corpus • Contains information about spelling, word class, lemma, canonical pronunciation, subcategorization, status, etc.
The CGN lexicon Lexicon status • B = southern Dutch • INF = informal • *d = dialect • *u = (possibly deliberate) mispronunciation • *v = foreign word without loan word status • *x = unintelligible word • *z = word pronounced with a strong regional accent, transcribed in standard Dutch spelling Corpus status • C = correct spelling of corpus type • I = incorrect spelling of corpus type • O = non-validated spelling of corpus type • V = validated spelling of corpus type
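The two sets of status codes can be represented as simple lookup tables. The following is a minimal sketch that expands a pair of codes into a readable description. The function name and the idea of passing the two codes as separate fields are assumptions; the slides do not specify how the lexicon files store them.

```python
# Code tables copied from the slide.
LEXICON_STATUS = {
    "B": "southern Dutch",
    "INF": "informal",
    "*d": "dialect",
    "*u": "(possibly deliberate) mispronunciation",
    "*v": "foreign word without loan word status",
    "*x": "unintelligible word",
    "*z": "word pronounced with a strong regional accent, "
          "transcribed in standard Dutch spelling",
}

CORPUS_STATUS = {
    "C": "correct spelling of corpus type",
    "I": "incorrect spelling of corpus type",
    "O": "non-validated spelling of corpus type",
    "V": "validated spelling of corpus type",
}


def describe_status(lexicon_code: str, corpus_code: str) -> str:
    """Expand a (lexicon status, corpus status) pair; unknown codes are flagged."""
    lexicon = LEXICON_STATUS.get(lexicon_code, "unknown lexicon status")
    corpus = CORPUS_STATUS.get(corpus_code, "unknown corpus status")
    return f"{lexicon}; {corpus}"


print(describe_status("*d", "V"))
print(describe_status("B", "O"))
```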