A quick walk through phonetic databases
• Read English
  • TIMIT
  • Boston University Radio News
• Spontaneous English
  • Switchboard ICSI transcriptions
  • Buckeye Corpus (VIC)
TIMIT
• Read, phonetically balanced sentences
• Good coverage of different phonetic environments
• Does not exhibit the more radical reductions and disfluencies seen in spontaneous speech
• Transcribers started from forced alignments, then realigned
• Roughly 5 hours of speech
• 630 speakers, 8 dialects, 10 sentences apiece
• Uses ARPAbet symbols
  • Separate stop/closure symbols
  • Symbol for epenthetic stop
• Cost: $100 for non-1993 LDC members
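As a quick sanity check on the TIMIT figures above, the speaker and sentence counts can be turned into an utterance total and a rough average utterance length. This is only back-of-the-envelope arithmetic from the slide's numbers; "roughly 5 hours" is taken as exactly 5.0, which is an assumption, and the function name is hypothetical.

```python
# Figures taken from the slide; APPROX_HOURS treats "roughly 5 hours" as 5.0 (assumption).
SPEAKERS = 630
SENTENCES_PER_SPEAKER = 10
APPROX_HOURS = 5.0


def corpus_stats(speakers, per_speaker, hours):
    """Return (total utterances, mean utterance length in seconds)."""
    utterances = speakers * per_speaker
    mean_seconds = hours * 3600 / utterances
    return utterances, mean_seconds


utterances, mean_seconds = corpus_stats(SPEAKERS, SENTENCES_PER_SPEAKER, APPROX_HOURS)
print(f"{utterances} utterances, about {mean_seconds:.1f} s each on average")
```

Under these assumptions the corpus works out to 6300 utterances of a little under 3 seconds each, which fits short read sentences.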
BU Radio Corpus
• Radio announcers reading news
• 4 male, 3 female; reading in both “non-studio” and “studio” voices
• Originally intended for speech synthesis work
• Marked with prosody in addition to phonetics
• Marked with ARPAbet (similar to TIMIT)
• More than 7 hours of speech
• Cost: $400 for non-1996/1997 LDC members
Switchboard ICSI Transcriptions
• Spontaneous speech, many dialect regions
• Transcribed “segmented turns,” some of which may be cutoffs, from 2-party conversations
• 4 hours of speech transcribed
• 2 stages:
  • Initial 1 hour phonetically transcribed
  • Hours 2-4: phonetic markers and syllable boundaries (back-aligned with phonetic markers)
• Similar phone set to TIMIT
  • No separate closure/release
  • Voiced hesitations (pn/pv)
• Cost: possibly free, possibly $2k for non-1993/7 LDC members
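Because TIMIT labels stop closures separately while the Switchboard transcriptions do not, comparing the two usually means collapsing TIMIT's closure and release labels into a single stop. The sketch below shows one way that might be done. The closure labels (bcl, dcl, gcl, pcl, tcl, kcl) are TIMIT's; the function name, the decision to drop a closure before an affricate (jh/ch), and the choice to keep an unreleased closure as a plain stop are all assumptions made for illustration, not a documented mapping.

```python
# TIMIT closure labels mapped to the stop they belong to.
CLOSURE_TO_STOP = {
    "bcl": "b", "dcl": "d", "gcl": "g",
    "pcl": "p", "tcl": "t", "kcl": "k",
}
# Assumption: a closure directly before an affricate is folded into the affricate.
AFFRICATES = {"jh", "ch"}


def merge_closures(phones):
    """Collapse closure + release pairs into single stop labels (hypothetical helper)."""
    merged = []
    i = 0
    while i < len(phones):
        label = phones[i]
        stop = CLOSURE_TO_STOP.get(label)
        nxt = phones[i + 1] if i + 1 < len(phones) else None
        if stop is None:
            merged.append(label)      # not a closure: keep as-is
        elif nxt in AFFRICATES:
            pass                      # closure absorbed by the following affricate
        else:
            merged.append(stop)       # closure (released or not) becomes the stop
            if nxt == stop:
                i += 1                # skip the matching release label
        i += 1
    return merged


print(merge_closures(["dcl", "d", "ow", "n", "tcl"]))
print(merge_closures(["dcl", "jh", "ah", "s", "tcl", "t"]))
```

Mapping an unreleased closure to the plain stop loses the released/unreleased distinction, so this direction only suits experiments scored on the coarser, Switchboard-style phone set.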
VIC (Buckeye) Corpus
• Spontaneous interview speech
• Age and gender balanced
• All speakers from Ohio
• Currently in transcription
  • NIH grant involving Keith, me, and Mark Pitt
  • 10 hours completed, 30 hours total
• Based on ARPAbet with a few additions
  • Nasalized vowels, glottal stop replacing /t/,…
• Cost: free (to us); licensing may need to be worked out, but that shouldn’t be an issue
Evaluating with Corpora
• The clear starting point is TIMIT
  • Facilitates comparison with other work
• However, we should really try to bring spontaneous data into the research ASAP
  • Maybe move to some combination of TIMIT/SWB/VIC?
• Have only talked about (American) English
  • Other languages in year 4?
  • Chin has done some work in Mandarin?
  • CASS corpus: phonetically transcribed, but is it available?