BAStat: New Statistical Resources at the Bavarian Archive for Speech Signals
Florian Schiel
Bavarian Archive for Speech Signals, Institute of Phonetics and Speech Processing, Ludwig-Maximilians-Universität München, Germany
Outline
• Motivation
• Structure and Format
• Sources
• Phones
• Syllables
• Words
• CELEX <-> BAStat
Motivation
Traditional usage of language resources (LRs)
[Diagram elements: LR, customized LRs, machine learning; targets: experiment, model, application, …]
Motivation
Better: usage / recycling of standard LRs
[Diagram elements: LR repository (e.g. LDC, ELDA) holding several LRs, machine learning; targets: experiment, model, application, …]
Motivation
Even better: usage of statistical LRs based on recycled LRs
[Diagram elements: recycled LRs, statistic (e.g. CELEX, BAStat); targets: experiment, model, application, …]
BAStat: Structure, Format
BAStat = statistics derived from multiple speech corpora of conversational German
• phones: duration, monograms, bigrams, word position
• syllables: duration, monograms, bigrams, position / accent / function word
• words: duration, monograms, bigrams, pronunciation
BAStat: Structure, Format
BAStat = freely available at BAS
• 7-bit ASCII tables
• probability matrices (scientific notation)
• non-ASCII coding: LaTeX
• phonetic coding: SAM-PA
www.bas.uni-muenchen.de/Bas/BAStat.html
BAStat: Source LRs
BAStat = based on conversational German
[Table columns, values in extracted order]
corpus: RVG1, Verbmobil 2, Verbmobil 1, SmartKom
speakers: 450, 233, 780, 259
setting: dialogue, WOZ, interview, dialogue
word tokens: 63162, 285280, 153438, 55681
BAStat: Source LRs
BAS standard annotation and segmentation
• orth. transcript / tagging: Verbmobil (manually)
• lexicon: SAM-PA (manually)
• phonetic segmentation: MAUS (automatic)
• syllabification: U. Reichel (automatic)
BAStat: Phone Statistic
• two phoneme sets: basic (52) + extended (76), including all possible vocalized /r/ diphthongs (e.g. /E6/ ('er'), /u:6/ ('Uhr') etc.)
• phone probability P(phon)
• phone bigram probability P(phon2|phon1)
• position probability: word initial / medial / final
• duration statistics: total, word initial / medial / final
BAStat: Phone Statistic
Examples: estimates for phone sequences
Probability of phone sequence /En/ vs. /an/?
P(En) = P(n|E) P(E) = 0.00238
P(an) = P(n|a) P(a) = 0.0129
Probability of word-final phone sequence /vOYs/?
P = P(OY|v) P(s|OY) P(v) P(word-final|s) = 1.8 × 10^-9
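The estimates above multiply a phone monogram probability by the bigram probabilities along the sequence and, optionally, by a position probability. A minimal Python sketch of that product follows. The function name `sequence_probability`, the dictionary layout, and all numeric values are illustrative assumptions; they are not BAStat data and do not reflect the BAStat file format.

```python
from math import prod

def sequence_probability(phones, monogram, bigram, final_position=None):
    """Estimate P(sequence) as P(p1) times the product of P(p_i | p_{i-1}).

    bigram maps (previous, current) to P(current | previous).
    If final_position is given, the result is also multiplied by
    P(word-final | last phone), as in the /vOYs/ example.
    """
    if not phones:
        raise ValueError("need at least one phone")
    p = monogram[phones[0]]
    p *= prod(bigram[(prev, cur)] for prev, cur in zip(phones, phones[1:]))
    if final_position is not None:
        p *= final_position[phones[-1]]
    return p

# Toy values, invented for illustration only (not taken from BAStat).
monogram = {"v": 0.02}
bigram = {("v", "OY"): 0.001, ("OY", "s"): 0.05}
word_final = {"s": 0.3}

p = sequence_probability(["v", "OY", "s"], monogram, bigram, word_final)
```

Note that this treats the sequence as a first-order Markov chain, so it is an estimate for unseen or rare sequences, not a count-based probability.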
BAStat: Syllable Statistic
• rule-based syllabification of phonetic transcript
• tagging: lexically accented + function word
• list of syllable tokens (1038588)
• statistics of syllable types (6397)
• probability P(syl)
• bigram probability P(syl2|syl1)
• word position probability
• duration statistics
• probability of lex. accentuation / function word
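How P(syl) and P(syl2|syl1) could be derived from a list of syllable tokens is sketched below in Python. Plain relative-frequency estimation is assumed here; the slides do not state which estimator or smoothing was used. The function name `estimate_probabilities` and the toy token list are hypothetical, and the sketch ignores word and utterance boundaries.

```python
from collections import Counter

def estimate_probabilities(tokens):
    """Relative-frequency estimates of P(syl) and P(syl2 | syl1)."""
    unigram_counts = Counter(tokens)
    pairs = list(zip(tokens, tokens[1:]))
    bigram_counts = Counter(pairs)
    # How often each syllable occurs as the first element of a bigram.
    first_counts = Counter(first for first, _ in pairs)
    total = len(tokens)
    p_syl = {s: c / total for s, c in unigram_counts.items()}
    p_bigram = {
        (s1, s2): c / first_counts[s1] for (s1, s2), c in bigram_counts.items()
    }
    return p_syl, p_bigram

# Toy token list, invented for illustration only.
tokens = ["ja", "ich", "ja", "wir", "ja", "ich"]
p_syl, p_bigram = estimate_probabilities(tokens)
```

Here `p_bigram[("ja", "ich")]` holds the estimate of P(ich|ja), i.e. the key order is (syl1, syl2).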
BAStat: Syllable Statistic
Top-ranking German syllables: ja (yes), ich (I), wir (we)
-> Germans mostly talk affirmatively about themselves!
BAStat: Syllable Statistic
Syllable duration / coverage
[Figure: syllable duration / coverage; annotations: 94.4%, rank 1000, 0.21 sec]
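Assuming "coverage" means the share of all syllable tokens accounted for by the N top-ranked syllable types, it could be computed as in the following Python sketch. The function name `coverage` and the toy data are hypothetical, and this reading of the figure is an assumption.

```python
from collections import Counter

def coverage(tokens, top_n):
    """Fraction of all tokens accounted for by the top_n most frequent types."""
    counts = Counter(tokens)
    covered = sum(count for _, count in counts.most_common(top_n))
    return covered / len(tokens)

# Toy token list, invented for illustration only.
tokens = ["ja", "ich", "ja", "wir", "ja", "ich", "das"]
cov = coverage(tokens, 2)
```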
BAStat: Word Statistic
• probabilities, bigrams based on 689966 tokens
• 'para-words': silence, filled pauses, laugh, articulatory noise, spelling, breath
• pronunciation statistics (16431 word types)
• citation form + variants
• transcripts in SAM-PA
• probability P(pronunciation|wordtype)
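The conditional probability P(pronunciation|wordtype) can be illustrated with a short Python sketch that counts observed variants per word type. The function name `pronunciation_probabilities`, the input layout, and the toy observations are assumptions for illustration; the transcripts are SAM-PA-style strings with invented counts, not BAStat data.

```python
from collections import Counter, defaultdict

def pronunciation_probabilities(observations):
    """Map each word type to {pronunciation: P(pronunciation | word type)}."""
    counts = defaultdict(Counter)
    for word, pron in observations:
        counts[word][pron] += 1
    return {
        word: {pron: c / sum(prons.values()) for pron, c in prons.items()}
        for word, prons in counts.items()
    }

# Invented toy observations of (word type, observed pronunciation).
observations = [
    ("Abend", "a:b@nt"),
    ("Abend", "a:bmt"),
    ("Abend", "a:bmt"),
    ("ja", "ja:"),
]
probs = pronunciation_probabilities(observations)
```

For each word type the resulting probabilities sum to 1 over its observed variants.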
BAStat: Word Statistic
Example: pronunciation of 'Abend' (evening)
[Figure: pronunciation variants of 'Abend', with the citation form marked]
CELEX <-> BAStat
CELEX: lexical database (Baayen et al., 1995)
• based on text corpora
• phonological syllables
CELEX <-> BAStat
Top syllable ranking
Top 1000 ranking syllables: 47.5% overlap
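The slides do not define how the overlap was computed. One plausible reading, sketched here in Python, is the share of the top-N syllables of one ranking that also appear in the top-N of the other. The function name `top_n_overlap` and the toy rankings are hypothetical, and both rankings are assumed to contain at least N distinct entries.

```python
def top_n_overlap(ranking_a, ranking_b, n):
    """Share of the top-n items of ranking_a that are also in the top-n of ranking_b."""
    top_a = set(ranking_a[:n])
    top_b = set(ranking_b[:n])
    return len(top_a & top_b) / n

# Toy rankings (most frequent first), invented for illustration only.
ranking_a = ["ja", "ich", "wir", "das"]
ranking_b = ["di:", "ich", "ja", "Unt"]
overlap = top_n_overlap(ranking_a, ranking_b, 4)
```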
Conclusion
• statistics of German phones, syllables, words
• based on conversational speech corpora
• better statistical representation of speech
• applicable to experimental design (e.g. exemplar theory) or technical speech processing (e.g. ASR, synthesis)
• continual extension by new source corpora
• hopefully similar resources for other languages
The End www.bas.uni-muenchen.de/Bas/BAStat.html