
The 10-million-word Spoken Dutch Corpus and its potential use in experimental phonetics

The 10-million-word Spoken Dutch Corpus and its potential use in experimental phonetics. Louis C.W. Pols, Institute of Phonetic Sciences, University of Amsterdam. 100 Years of Experimental Phonetics in Russia, St. Petersburg State Univ., Febr. 1-4, 2001. Amsterdam city center. Herengracht 338.


Presentation Transcript


  1. The 10-million-word Spoken Dutch Corpus and its potential use in experimental phonetics • Louis C.W. Pols, Institute of Phonetic Sciences, University of Amsterdam • 100 Years of Experimental Phonetics in Russia, St. Petersburg State Univ., Febr. 1-4, 2001

  2. Amsterdam city center Herengracht 338

  3. Overview • Introduction • Corpus design, recording, digitization • Orthographic transcription • Part-of-speech tagging, lemmatization and syntactic annotation • Phonetic transcription • Prosodic transcription • Exploration • Potential phonetic benefit

  4. Introduction • appropriate topic given long Russian tradition • Dutch-Flemish initiative • 10 Mƒ, 10 M words (about 1000 hrs of speech) • start June 1998, 5 yrs, 7 releases (audio + ann.) • many speaking styles, also over telephone, only adult speakers, ABN variants but no dialect • for linguistics and speech/language technology • rights with NTU (http://www.taalunie.nl)

  5. Corpus design (number of words x 1000) • dialogues and multilogues • monologues

  6. Recording, digitization • mono or stereo using portable DAT-recorders • 16 kHz and 16 bit (telephone recordings at 8 kHz and 8 bit) • .WAV format in PRAAT • meta data about recording and speaker • 7 audio releases on CD-ROM, or DVD (future?) • annotations updated with each release
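
The sampling figures on this slide allow a rough estimate of the raw audio volume. The sketch below is only back-of-the-envelope arithmetic: it assumes uncompressed PCM, mono recordings, and that all of the roughly 1000 hours mentioned in the introduction were stored at a single format, none of which the slides state. The function name `pcm_bytes` is illustrative.

```python
def pcm_bytes(hours: float, sample_rate_hz: int, bits_per_sample: int,
              channels: int = 1) -> int:
    """Size in bytes of uncompressed PCM audio (header overhead ignored)."""
    seconds = hours * 3600
    bytes_per_sample = bits_per_sample // 8
    return int(seconds * sample_rate_hz * bytes_per_sample * channels)


# Assumption: the whole corpus (about 1000 hours) in one format, mono.
wideband = pcm_bytes(1000, 16_000, 16)   # 16 kHz, 16 bit
telephone = pcm_bytes(1000, 8_000, 8)    # 8 kHz, 8 bit

print(f"wideband:  {wideband / 1e9:.1f} GB")
print(f"telephone: {telephone / 1e9:.1f} GB")
```

Under these assumptions the wideband format needs about 115 GB per 1000 hours, which gives a sense of why a DVD release is raised as a future option next to CD-ROM.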

  7. Orthographic transcription (1) • by trained students, checked by expert • according to fixed protocol; no text interpretations • transcription aligned in chunks of a few seconds; multiple tiers • little punctuation; capitals for names only • standard spelling conventions, checked vs. lexicon • special mark-up symbols: • *d dialect words; *z regionally accented words • *t interjection; *a truncated word; *u mispronunciation • *v foreign words; *n new words; *x hardly intelligible • ggg speaker sounds; xxx unintelligible word(part)(s)

  8. Orthographic transcription (2)

  9. Part-of-speech tagging • all words in the text automatically tagged • discontinuous verbs not recognized at this level • Dutch tag set with 10 major word classes (noun, adjective, verb, pronoun, article, numeral, preposition, adverb, conjunction, and interjection) • additional morpho-syntactic features per class (e.g., singular, diminutive and neuter for nouns) • resulting in some 300 tags • self-learning automatic tagger (given context)
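
The structure of the tag set (one of 10 major classes plus class-specific features) can be modelled as a small data type. This is an illustrative sketch only: the class name `PosTag`, the English feature names and the `class(feature,...)` label format are assumptions, not the notation actually used in the corpus.

```python
from dataclasses import dataclass

# The 10 major word classes named on the slide.
MAJOR_CLASSES = frozenset({
    "noun", "adjective", "verb", "pronoun", "article",
    "numeral", "preposition", "adverb", "conjunction", "interjection",
})


@dataclass(frozen=True)
class PosTag:
    """A major word class plus optional morpho-syntactic features."""
    major_class: str
    features: tuple[str, ...] = ()

    def __post_init__(self) -> None:
        if self.major_class not in MAJOR_CLASSES:
            raise ValueError(f"unknown major class: {self.major_class!r}")

    def label(self) -> str:
        # Hypothetical label format, chosen for readability.
        if not self.features:
            return self.major_class
        return f"{self.major_class}({','.join(self.features)})"


tag = PosTag("noun", ("singular", "diminutive", "neuter"))
print(tag.label())
```

Because each class carries its own feature combinations, the number of distinct labels grows quickly, which is consistent with the figure of some 300 tags.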

  10. Lemmatization • all words autom. paired with base form (lemma) • verbs → infinitive (gedaan → doen); other forms → stem (vijfde → vijf); truncated forms → full forms (z’n → zijn) • base form must be an independently existing form (so not hersenen → hersen or meisje → meis) • discontinuous verbs and split prepositions are not recognized at this level (op...bellen; van...uit) • one and only one base form per word (vliegen → verb vliegen, or noun vlieg, depending on POS)
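
The last point, one base form per word chosen by part of speech, amounts to a lookup keyed on the pair (word form, POS). A minimal sketch using the examples from the slide follows; the POS labels given to `vijfde` and `z'n` are assumptions made for illustration, and a real lemmatizer would of course not be a hand-written table.

```python
# (word form, part of speech) -> lemma, using the slide's examples.
LEMMAS = {
    ("gedaan", "verb"): "doen",        # verb -> infinitive
    ("vijfde", "numeral"): "vijf",     # other form -> stem (POS label assumed)
    ("z'n", "pronoun"): "zijn",        # truncated -> full form (POS label assumed)
    ("vliegen", "verb"): "vliegen",    # same form, two lemmas:
    ("vliegen", "noun"): "vlieg",      # the POS tag decides
}


def lemmatize(word: str, pos: str) -> str:
    """Return the single lemma for this word/POS pair.

    Unknown pairs fall back to the word itself (a simplification).
    """
    return LEMMAS.get((word, pos), word)


print(lemmatize("vliegen", "verb"))
print(lemmatize("vliegen", "noun"))
```

Keying on the POS tag is what makes the result unique: the form alone would leave `vliegen` ambiguous.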

  11. Broad phonetic transcription (1) • on 10% of the data (mainly dialogues) • hand correction of automatic phonetic transcription • across-word assimilation, levels of reduction? • use of extended SAMPA • within PRAAT • word level respected: die ik wel vind dat ze kloppen → di k wEl fInt_tAt s@ klOp@ • no hand segmentation at phoneme level
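
Because the word level is respected, each orthographic word can be paired with its own stretch of the phonetic transcription. The sketch below uses the example from this slide and assumes that the underscore in `fInt_tAt` links two words affected by across-word assimilation and therefore still counts as a word boundary; the slide does not spell out that convention.

```python
import re


def pair_words(orthography: str, transcription: str) -> list[tuple[str, str]]:
    """Pair orthographic words with their phonetic counterparts.

    Assumption: words in the transcription are separated by a space or,
    where assimilation links them, by an underscore.
    """
    words = orthography.split()
    phones = [p for p in re.split(r"[ _]+", transcription) if p]
    if len(words) != len(phones):
        raise ValueError(
            f"{len(words)} words but {len(phones)} transcribed units"
        )
    return list(zip(words, phones))


pairs = pair_words(
    "die ik wel vind dat ze kloppen",
    "di k wEl fInt_tAt s@ klOp@",
)
for word, phon in pairs:
    print(f"{word:8s} {phon}")
```

Pairs such as `ik` / `k` and `ze` / `s@` show directly how much reduction occurs relative to the citation form, without any segmentation at phoneme level.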

  12. Broad phonetic transcription (2)

  13. Signal coupling, word alignment • the phonetically transcribed part (1 M words) will be automatically aligned at word level • using ASR techniques (forced alignment) • this word alignment will be hand corrected • pauses and noises will also be aligned • geminate plosives are aligned separately, others shared (komt terug → kom t erug; is zeker → isseker) • inserted phonemes are shared with neighbouring words (toen belde n ie naar huis → belden nie) • all the rest may be automatically aligned only • few-second chunks are always accessible
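
One way to read "shared" is that the intervals of two neighbouring words overlap on the segment they have in common. The sketch below represents a word alignment under that assumption; the interval type, the function name and all time values are hypothetical and serve only to show how a lookup by time could return more than one word.

```python
from typing import NamedTuple


class WordInterval(NamedTuple):
    word: str
    start: float  # seconds
    end: float    # seconds


def words_at(intervals: list[WordInterval], t: float) -> list[str]:
    """Return every word whose interval contains time t.

    A shared segment yields two words for the same instant.
    """
    return [iv.word for iv in intervals if iv.start <= t < iv.end]


# Hypothetical times for "is zeker", with the merged /s/ shared.
alignment = [
    WordInterval("is", 0.00, 0.25),
    WordInterval("zeker", 0.18, 0.60),
]

print(words_at(alignment, 0.10))
print(words_at(alignment, 0.20))
```

With non-overlapping intervals the same function simply returns a single word, so the representation also covers the geminate plosives that are aligned separately.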

  14. Syntactic annotation • 10% will be semi-automatically annotated • procedure still under development • interactive annotation software from NEGRA project (Saarbrücken) will be used • taking into account idiosyncrasies of speech, such as hesitations, false starts, clause extensions • functional information (dependency labels) • category information (in form of node labels)

  15. Prosodic annotation • manually, on 250K words subset only • procedure still under development • prosodic markers in orthography • 1) prosodic boundaries long silences () phrase boundaries () other discontinuities, like (filled) pauses (%) • 2) prominence (^ before vowel in prominent syllable) sp. A: n^ee  Jan heeft n^egen % medailles  z^even medailles.  sp. B: z^even 
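
Since the prosodic markers are embedded in the orthography, prominent words can be pulled out with plain string handling. This sketch relies on two conventions from the slide, `^` before the vowel of the prominent syllable and `%` for (filled) pauses, and additionally assumes that `%` stands as a separate token. The function names are illustrative.

```python
def prominent_words(utterance: str) -> list[str]:
    """Words carrying a prominence marker, with the marker removed."""
    return [tok.replace("^", "") for tok in utterance.split() if "^" in tok]


def plain_text(utterance: str) -> str:
    """Orthography with prominence and pause markers stripped."""
    tokens = [tok.replace("^", "") for tok in utterance.split() if tok != "%"]
    return " ".join(tokens)


# Speaker A's first phrase from the slide.
line = "n^ee Jan heeft n^egen % medailles"

print(prominent_words(line))
print(plain_text(line))
```

Stripping the markers recovers the plain orthographic tier, so the prosodic annotation can be layered on top of the transcription without duplicating it.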

  16. Exploration software • COREX tool under development (Max Planck Inst.) • both locally and internet-based (Java) • 1) browser • 2) viewer for orthography and annotations, plus waveform display and audio player (time synchr.) • 3) search module, also on metadata

  17. Potential phonetic benefit • huge database, many speakers/styles, ‘real’ speech • easily accessible via orthography, plus audio • partly accessible via phonetic transcription • no segmentation at phoneme level (automatic?) • automatic segmentation at word level • after COREX search: own additions possible • e.g. spectro-temporal analyses via PRAAT scripts • e.g. svarabhakti vowel, final n-deletion, assimilation • e.g. vowel reduction, turn-taking behavior, etc.

  18. More information • see references in paper • see websites mentioned in paper • second release Oct. 2000 • new releases every half year • feedback from users group (workshops) • useful for proposed INTAS project “Spontaneous speech of typologically unrelated languages (Russian, Finnish and Dutch): Comparison of phonetic properties” (De Silva, 2000)
