Phonetic Dissection of Switchboard-Corpus Automatic Speech Recognition Systems Steven Greenberg and Shuangyu Chang International Computer Science Institute 1947 Center Street, Berkeley, CA 94704 {steveng, shawnc}@icsi.berkeley.edu http://www.icsi.berkeley.edu/~steveng

Presentation Transcript


  1. Phonetic Dissection of Switchboard-Corpus Automatic Speech Recognition Systems Steven Greenberg and Shuangyu Chang International Computer Science Institute 1947 Center Street, Berkeley, CA 94704 {steveng, shawnc}@icsi.berkeley.edu http://www.icsi.berkeley.edu/~steveng Large Vocabulary Continuous Speech Recognition Workshop Maritime Institute of Technology, Linthicum Heights, MD, May 4, 2001

  2. PHONETIC CLASSIFICATION APPEARS TO BE A PRIMARY FACTOR UNDERLYING THE ABILITY TO CORRECTLY RECOGNIZE WORDS Many different analyses (to follow) support this conclusion Consonants appear to be more important than vowels SYLLABLE STRUCTURE IS ALSO AN IMPORTANT FACTOR FOR ACCURATE RECOGNITION The pattern of errors differs across the syllable (onset, nucleus, coda) and exhibits consistent patterns that are difficult to discern with other units of analysis STRESS-ACCENT MAY PLAY AN IMPORTANT ROLE, PARTICULARLY FOR UNDERSTANDING THE NATURE OF WORD-DELETION ERRORS Relation among stress-accent, syllable structure, vocalic identity and length THE NATURE OF PRONUNCIATION MODELS and THEIR RELATION TO LEXICAL REPRESENTATIONS IS A POTENTIALLY KEY FACTOR The unit of lexical representation (phones, articulatory features, etc.) is probably of the utmost importance for optimizing ASR performance FUTURE PROGRESS IN ASR SYSTEM DEVELOPMENT IS LIKELY TO DEPEND ON DEEP INSIGHT INTO THE NATURE OF SPOKEN LANGUAGE Take Home Messages

  3. DESCRIPTION OF THE CORPUS MATERIALS FOR THE 2000 AND 2001 EVALUATIONS 2000 – Brief (2-17 s) utterances spoken by hundreds of different speakers. No relation to competitive evaluation 2001 – A subset of the competitive evaluation BRIEF OVERVIEW OF THE ANALYSIS REGIME COMMON TO THE 2000 AND 2001 PHONETIC EVALUATIONS File formats, time-mediated alignment, statistical analysis of the corpora, etc. Details are contained in “Linguistic Dissection …..” (in workshop notebook) and in “An Introduction ….” (NIST Speech Transcription Workshop, 2000) ANALYSES AND PATTERNS COMMON TO BOTH 2000 and 2001 EVALUATIONS Syllable structure, phonetic segments, articulatory-acoustic features. Details pertaining to the 2000 evaluation are in the papers cited above PHONETIC CONFUSION MATRICES FOR THE 2001 EVALUATION FUTURE ANALYSIS PLANNED FOR THIS SPRING WHEN REMAINING 2001 SUBMISSIONS ARRIVE Relationship between phonetic classification, pronunciation and language models Structure of the Presentation

  4. SWITCHBOARD PHONETIC TRANSCRIPTION CORPUS Switchboard contains informal telephone dialogues 54 minutes of material that was previously phonetically transcribed (by highly trained phonetics students from UC-Berkeley) All of this material was hand-segmented at either the phonetic-segment or syllabic level by the transcribers The syllabic-segmented material was subsequently segmented at the phonetic-segment level by a special-purpose neural network trained on 72-minutes of hand-segmented Switchboard material. This automatic segmentation was manually verified. THE PHONETIC SYMBOL SET and STP TRANSCRIPTIONS USED IN THE CURRENT PROJECT ARE AVAILABLE ON THE PHONEVAL WEB SITE: http://www.icsi.berkeley.edu/real/phoneval THE ORIGINAL FOUR HOURS OF TRANSCRIPTION MATERIAL ARE AVAILABLE AT: http://www.icsi.berkeley.edu/real/stp Evaluation Material - 2000

  5. 581 DIFFERENT SPEAKERS AN EQUAL BALANCE OF MALE AND FEMALE SPEAKERS BROAD DISTRIBUTION OF UTTERANCE DURATIONS 2-4 sec - 40%, 4-8 sec - 50%, 8-17 sec - 10% COVERAGE OF ALL (7) U.S. DIALECT REGIONS IN SWITCHBOARD A WIDE RANGE OF DISCUSSION TOPICS VARIABILITY IN DIFFICULTY (VERY EASY TO VERY HARD) Evaluation Material Details - 2000 By Dialect Region By Subjective Difficulty Number of Utterances Subjective Difficulty Dialect Region

  6. SWITCHBOARD PHONETIC TRANSCRIPTION CORPUS Seventy-four minutes of material phonetically labeled by five highly trained phonetics students from UC-Berkeley plus S. Greenberg The material was hand-segmented at the syllabic level by the transcribers The syllabic-segmented material was subsequently segmented at the phonetic-segment level by a special-purpose neural network trained originally on 72-minutes of hand-segmented Switchboard material (similar to the process performed the previous year) THE PHONETIC SYMBOL SET and STP TRANSCRIPTIONS USED ARE AVAILABLE ON THE PHONEVAL WEB SITE: http://www.icsi.berkeley.edu/real/phoneval Evaluation Material - 2001

  7. A SUBSET OF THE HUB-5 COMPETITIVE EVALUATION CORPUS A representative selection from the evaluation set, including an even distribution of data from the three main recording conditions (cellular and 2 land-line conditions) 21 SEPARATE CONVERSATIONS (2 speakers per conversation) 42 DIFFERENT SPEAKERS A TOTAL OF 74 MINUTES OF SPOKEN LANGUAGE MATERIAL (including FILLED PAUSES, JUNCTURES, etc.) AVERAGE LENGTH OF SPEECH PER SPEAKER – 106 seconds RANGE OF LENGTH PER SPEAKER – 48 s (least) to 226 s (most) STANDARD DEVIATION – 38 s APPROXIMATELY ONE-THIRD OF THE MATERIAL FROM CELL PHONES Evaluation Material Details - 2001
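
The per-speaker average quoted on this slide follows from the totals it gives. A quick check, using only the slide's own figures (the variable names are illustrative):

```python
# Figures quoted on the slide.
conversations = 21
speakers_per_conversation = 2
total_minutes = 74

speakers = conversations * speakers_per_conversation
mean_seconds_per_speaker = total_minutes * 60 / speakers

print(speakers)                         # 42
print(round(mean_seconds_per_speaker))  # 106
```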

  8. EIGHT SITES PARTICIPATED IN THE EVALUATION All eight provided material for the unconstrained-recognition phase Six sites also provided sufficient forced-alignment-recognition material (i.e., phone/word labels and segmentation given the word transcript for each utterance) for a detailed analysis AT&T (forced-alignment recognition incomplete, not analyzed) Bolt, Beranek and Newman Cambridge University Dragon (forced-alignment recognition incomplete, not analyzed) Johns Hopkins University Mississippi State University SRI International University of Washington Evaluation Sites - 2000

  9. SEVEN SITES ARE PARTICIPATING IN THE EVALUATION Unconstrained-recognition phase – 6 Sites Forced-alignment – 7 Sites Phone classification confidence scores – 5 Sites Variable condition recognition – 2 Sites Phone strings to words - 1 Site AT&T Bolt, Beranek and Newman IBM Johns Hopkins University Mississippi State University Philips SRI International Evaluation Sites - 2001

  10. However … NOT ALL OF THE MATERIAL REQUIRED TO PERFORM THE ANALYSES HAS MATERIALIZED The tables below summarize the commitments and currently usable data (certain data arrived in not-quite-ready-for-prime-time form) Evaluation Data Status - 2001 Commitments Current (usable data)

  11. Parameter Key START - Begin time (in seconds) of phone DUR - Duration (in sec) of phone PHN - Hypothesized phone ID WORD - Hypothesized Word ID Format is for all 674 files in the evaluation set (Example courtesy of MSU) Initial Recognition File - Example
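
The example file itself did not survive in this transcript, so the following is only a minimal sketch of how records with the four fields in the parameter key might be read. The whitespace-separated column order (START, DUR, PHN, WORD), the `PhoneRecord` class and the sample lines are all assumptions made for illustration, not the sites' actual file layout.

```python
from dataclasses import dataclass


@dataclass
class PhoneRecord:
    start: float  # START: begin time of the phone, in seconds
    dur: float    # DUR: duration of the phone, in seconds
    phn: str      # PHN: hypothesized phone ID
    word: str     # WORD: hypothesized word ID

    @property
    def end(self) -> float:
        """End time of the phone, derived from START + DUR."""
        return self.start + self.dur


def parse_line(line: str) -> PhoneRecord:
    # Assumed layout: four whitespace-separated columns per phone.
    start, dur, phn, word = line.split()
    return PhoneRecord(float(start), float(dur), phn, word)


# Invented sample lines, for illustration only.
sample = """\
0.00 0.08 dh the
0.08 0.05 ax the
"""

records = [parse_line(line) for line in sample.splitlines() if line.strip()]
for record in records:
    print(record.word, record.phn, record.start, record.end)
```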

  12. EACH SUBMISSION SITE USED A (QUASI) CUSTOM PHONE SET Most of the phone sets are available on the PHONEVAL web site THE SITES’ PHONE SETS WERE MAPPED TO A COMMON “REFERENCE” PHONE SET The reference phone set is based on the ICSI Switchboard transcription material (STP), but is adapted to match the less granular symbol sets used by the submission sites The set of mapping conventions relating to the STP (and reference) sets is also available on the PHONEVAL web site THE REFERENCE PHONE SET WAS ALSO MAPPED TO THE SUBMISSION SITE PHONE SETS This reverse mapping was done in order to ensure that variants of a phone were given due “credit” in the scoring procedure For example - [em] (syllabic nasal) is mapped to [ix] + [m], the vowel [ix] maps in certain instances to both [ih] and [ax], depending on the specifics of the phone set Phone Mapping Procedure
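
The two-way mapping can be pictured with a small sketch. Only the [em] and [ix] entries come from the slide; the dictionary names, the helper functions and the idea of storing the maps as Python dictionaries are assumptions for illustration, and the real mapping tables are the ones published on the PHONEVAL web site.

```python
# Forward map (assumed structure): STP symbol -> sequence of reference phones.
STP_TO_REFERENCE = {
    "em": ["ix", "m"],  # syllabic nasal expands to vowel + nasal (from the slide)
}

# Reverse map (assumed structure): reference phone -> site symbols that
# should receive credit when scored against it.
REFERENCE_TO_SITE = {
    "ix": {"ih", "ax"},  # [ix] can correspond to [ih] or [ax] (from the slide)
}


def to_reference(stp_phones):
    """Expand an STP phone string into the reference phone set."""
    reference = []
    for phone in stp_phones:
        reference.extend(STP_TO_REFERENCE.get(phone, [phone]))
    return reference


def gets_credit(reference_phone, site_phone):
    """True if the site's phone is an accepted variant of the reference phone."""
    if reference_phone == site_phone:
        return True
    return site_phone in REFERENCE_TO_SITE.get(reference_phone, set())


print(to_reference(["b", "aa", "dx", "em"]))  # ['b', 'aa', 'dx', 'ix', 'm']
print(gets_credit("ix", "ax"))                # True
print(gets_credit("ix", "iy"))                # False
```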

  13. TWO METHODS WERE USED FOR THE 2001 EVALUATION The “UNCOMPENSATED” form is the same as last year’s scoring method. Only common phone ambiguities (such as [ix], [ih], [ah], [ax], etc.) are allowed The “TRANSCRIPTION-COMPENSATED” form allows for certain phones commonly confused among human transcribers to be scored as “correct,” even though they would otherwise be scored as “wrong” The compensated form of transcription lowers the phone “error” by ca. 10-20% TIME-MEDIATED SCORING WAS OF TWO VARIETIES A “STRICT” form is identical to that used in last year’s evaluation. There is a severe penalty for deviations from time boundaries for words and phones A “LENIENT” form allows for a much looser fit between time markers associated with words and phones. A weighting of 0.15 (relative to the STRICT form) was used (by modifying the penalty algorithm in SC-Lite). The 0.15 weight reduced the number of phone “errors” by ca. 20% without a significant decline in false-positive responses Phone Scoring Procedures - 2001
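
The slide does not spell out how the SC-Lite penalty was modified, so the sketch below only illustrates the idea of a relative weight on time-boundary deviations. The penalty formula (sum of absolute start and end offsets), the function name and the sample times are assumptions; only the 0.15 weight relative to the STRICT form comes from the slide.

```python
STRICT_WEIGHT = 1.0    # baseline: full penalty for boundary deviations
LENIENT_WEIGHT = 0.15  # weight quoted on the slide, relative to STRICT


def time_penalty(ref, hyp, weight):
    """Weighted boundary mismatch between two (start, end) intervals in seconds.

    This is a stand-in formula, not the actual SC-Lite algorithm.
    """
    start_offset = abs(ref[0] - hyp[0])
    end_offset = abs(ref[1] - hyp[1])
    return weight * (start_offset + end_offset)


# Invented reference and hypothesis boundaries for one phone.
reference_interval = (1.00, 1.10)
hypothesis_interval = (1.04, 1.16)

strict = time_penalty(reference_interval, hypothesis_interval, STRICT_WEIGHT)
lenient = time_penalty(reference_interval, hypothesis_interval, LENIENT_WEIGHT)
print(strict, lenient)
```

Under this reading, the same boundary mismatch costs the LENIENT aligner 15% of what it costs the STRICT one, so segments that are displaced in time are more likely to be paired with each other rather than scored as an insertion plus a deletion.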

  14. Visualization of a 3-D Confusion Matrix • When the matrix is sparsely coded, as below, it is more efficient to view the pattern as if squashed against a brick wall (see below) The diagonal is plotted in a linear plane

  15. Interlabeler Agreement (74%) - 3 Transcribers • Highest for consonants (especially the stops) • Lowest for vowels (particularly the lax monophthongs) [Figure: proportion concordance by phonetic segment, for vowels and consonants] Numbers refer to the concordance diagonal in the confusion matrices
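
Concordance read off the diagonal of a confusion matrix can be sketched as follows. The label pairs are invented toy data, and treating the first transcriber's label as the row index is an assumption; the slide's actual figures come from the STP material.

```python
from collections import Counter


def concordance(label_pairs):
    """Per-phone and overall agreement for time-aligned (label_a, label_b) pairs.

    Per-phone concordance is the diagonal cell divided by the row total,
    where rows are indexed by the first transcriber's label.
    """
    row_totals = Counter()
    diagonal = Counter()
    for label_a, label_b in label_pairs:
        row_totals[label_a] += 1
        if label_a == label_b:
            diagonal[label_a] += 1
    per_phone = {phone: diagonal[phone] / row_totals[phone] for phone in row_totals}
    overall = sum(diagonal.values()) / sum(row_totals.values())
    return per_phone, overall


# Toy data: stops agree, a lax vowel does not.
pairs = [("t", "t"), ("t", "t"), ("ih", "ix"), ("ih", "ih"), ("ih", "ax"), ("k", "k")]
per_phone, overall = concordance(pairs)
print(per_phone)
print(overall)
```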

  16. INTERLABELER DISAGREEMENT PATTERNS WERE DERIVED FROM THE 2000 EVALUATION MATERIAL Several minutes of material transcribed in common by 3 transcribers were analyzed (2 from 1996-1997 STP, 1 from 2001 STP) THE FOLLOWING PATTERNS WERE OBSERVED IN THE INTERLABELER DISAGREEMENT ANALYSIS Consonants Stop and nasal consonants exhibit a small amount of disagreement Fricatives exhibit slightly higher amounts of disagreement Liquids show a moderate amount of disagreement Vowels Lax monophthongs exhibit a high amount of disagreement Diphthongs show a relatively small amount of disagreement Tense, low monophthongs show relatively little disagreement (except for [ao], probably a dialect issue) Overall Transcriber Agreement was 70% Interlabeler Disagreement Patterns - 2001

  17. FROM SUCH PATTERNS THE FOLLOWING FORMS OF TOLERANCES WERE ALLOWED IN “TRANSCRIPTION COMPENSATED” SCORING: Interlabeler Disagreement Patterns - 2001
      Segment   UNcompensated     Compensated
      [d]       [d]               [d] [dx]
      [k]       [k]               [k]
      [s]       [s]               [s] [z]
      [n]       [n]               [n] [nx] [ng] [en]
      [r]       [r]               [r] [axr] [er]
      [iy]      [iy]              [iy] [ix] [ih]
      [ao]      [ao]              [ao] [aa] [ow]
      [ax]      [ax]              [ax] [ah] [aa] [ix]
      [ix]      [ix] [ih] [ax]    [ix] [ih] [iy] [ax]
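
A minimal sketch of how the two tolerance tables change the phone error count. The grouping of symbols per segment is a reading of the flattened table on this slide (one group per segment, each beginning with the segment itself), the treatment of the segment as the reference label is an assumption, and the aligned pairs are invented.

```python
# Accepted hypothesis labels per reference segment (as read from the slide).
UNCOMPENSATED = {
    "d": {"d"}, "k": {"k"}, "s": {"s"}, "n": {"n"}, "r": {"r"},
    "iy": {"iy"}, "ao": {"ao"}, "ax": {"ax"},
    "ix": {"ix", "ih", "ax"},
}
COMPENSATED = {
    "d": {"d", "dx"},
    "k": {"k"},
    "s": {"s", "z"},
    "n": {"n", "nx", "ng", "en"},
    "r": {"r", "axr", "er"},
    "iy": {"iy", "ix", "ih"},
    "ao": {"ao", "aa", "ow"},
    "ax": {"ax", "ah", "aa", "ix"},
    "ix": {"ix", "ih", "iy", "ax"},
}


def is_correct(reference, hypothesis, tolerances):
    """A hypothesis counts as correct if it is in the reference's tolerance set."""
    return hypothesis in tolerances.get(reference, {reference})


def phone_error_rate(aligned_pairs, tolerances):
    """Fraction of aligned (reference, hypothesis) pairs scored as wrong."""
    wrong = sum(not is_correct(ref, hyp, tolerances) for ref, hyp in aligned_pairs)
    return wrong / len(aligned_pairs)


# Invented aligned pairs: (reference phone, hypothesized phone).
aligned = [("s", "z"), ("n", "ng"), ("ix", "ih"), ("k", "g")]
print(phone_error_rate(aligned, UNCOMPENSATED))  # 0.75
print(phone_error_rate(aligned, COMPENSATED))    # 0.25
```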

  18. Transcription Compensation Affects Phone Error • COMPENSATING FOR TRANSCRIPTION CONFUSION PATTERNS LOWERS THE PHONE “ERROR” APPRECIABLY FOR MOST SITES STRICT Time Mediation Error Rate

  19. Transcription Compensation Affects Phone Error • COMPENSATING FOR TRANSCRIPTION CONFUSION PATTERNS LOWERS THE PHONE “ERROR” APPRECIABLY FOR MOST SITES LENIENT Time Mediation Error Rate

  20. Generation of Evaluation Data - 1

  21. EACH SITE’S MATERIAL WAS PROCESSED THROUGH SC-LITE TO OBTAIN A WORD-ERROR SCORE AND ERROR ANALYSIS (IN TERMS OF ERROR TYPE) CTM File Format for Word Scoring ERROR KEY C = CORRECT I = INSERTION N = NULL ERROR S = SUBSTITUTION
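
Given one error code per scored word, a word-error score can be tallied as below. This is a sketch under two assumptions: that N (null error) marks a reference word with no hypothesized counterpart, i.e. a deletion, and that the error rate is the usual (substitutions + deletions + insertions) divided by the number of reference words. The code sequence is invented.

```python
from collections import Counter


def word_error_rate(error_codes):
    """Word error rate from a sequence of C / I / N / S codes."""
    counts = Counter(error_codes)
    # Insertions have no reference word, so they are excluded from the denominator.
    reference_words = counts["C"] + counts["S"] + counts["N"]
    errors = counts["S"] + counts["N"] + counts["I"]
    return errors / reference_words


# Invented code sequence: 7 correct, 1 substitution, 1 null, 1 insertion.
codes = list("CCSCNCICCC")
wer = word_error_rate(codes)
print(f"{wer:.3f}")  # 0.333
```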

  22. Generation of Evaluation Data - 2

  23. LEXICAL PROPERTIES Lexical Identity Unigram Frequency Number of Syllables in Word Number of Phones in Word Word Duration Speaking Rate Prosodic Prominence Energy Level Lexical Compounds Non-Words Word Position in Utterance SYLLABLE PROPERTIES Syllable Structure Syllable Duration Syllable Energy Prosodic Prominence Prosodic Context Summary of Corpus Acoustic Properties • PHONE PROPERTIES • Phonetic Identity • Phone Frequency • Position within the Word • Position within the Syllable • Phone Duration • Speaking Rate • Phonetic Context • Contiguous Phones Correct • Contiguous Phones Wrong • Phone Segmentation • Articulatory Features • Articulatory Feature Distance • Phone Confusion Matrices • OTHER PROPERTIES • Speaker (Dialect, Gender) • Utterance Difficulty • Utterance Energy • Utterance Duration

  24. Word- and Phone-Centric “Big Lists” • THE “BIG LISTS” CONTAIN SUMMARY INFORMATION ON 55-65 SEPARATE PARAMETERS ASSOCIATED WITH PHONES, SYLLABLES, WORDS, UTTERANCES AND SPEAKERS SYNCHRONIZED TO EITHER THE WORD (THIS SLIDE) OR THE PHONE

  25. Generation of Evaluation Data - 3

  26. RECOGNITION FILES Converted Submissions ATT, BBN, JHU, MSU, SRI, WASH Word Level Recognition Errors ATT, CU, BBN, JHU, MSU, SRI, WASH Phone Error (Free Recognition) ATT, BBN, JHU, MSU, WASH Word Recognition Phone Mapping ATT, BBN, JHU, MSU, WASH BIG LISTS Word-Centric ATT, CU, BBN, JHU, MSU, SRI, WASH Phone-Centric ATT, BBN, JHU, MSU, WASH Phonetic Confusion Matrices ATT, BBN, JHU, MSU, WASH Phoneval-2000 Web Site FORCED ALIGNMENT FILES • Forced Alignment Files BBN, JHU, MSU, WASH • Word-Level Alignment Errors BBN, CU, JHU, MSU, SRI, WASH • Phone Error (Forced Alignment) CU, BBN, JHU, MSU, SRI, WASH • Alignment Word-Phone Mapping BBN, JHU, MSU, WASH BIG LISTS • Word-Centric BBN, CU, JHU, MSU, SRI, WASH • Phone-Centric BBN, JHU, MSU, WASH • Phonetic Confusion Matrices BBN, JHU, MSU, WASH • Description of the STP Phone Set • STP Transcription Material Phone-Word Reference Syllable-Word Reference • Phone Mapping for Each Site ATT, BBN, JHU, MSU, WASH STP-to-Reference Map STP Phone-to-Articulatory-Feature Map http://www.icsi.berkeley.edu/real/phoneval

  27. A Syllable-Centric Perspective In this presentation we will “drill down” from the lexical to the phonetic tiers by way of the syllable, the phone and articulatory-acoustic features Words Stress-accent Phonetic segment Articulatory-Acoustic Features

  28. THE FOLLOWING SLIDES PROVIDE DETAILS ABOUT THE COARSE WORD AND PHONE SCORES FOR THE 2000 AND 2001 EVALUATIONS ALTHOUGH THE WORD AND PHONE SCORES ARE ROUGHLY COMPARABLE ACROSS YEARS (FOR ANALOGOUS CONDITIONS) THE 2001 EVALUATION HAS FOUR TIMES THE NUMBER OF SCORING CONDITIONS (FOR PHONES) BASED ON THE “LENIENT” vs. STRICT TIME-MEDIATION AND THE COMPENSATED vs. UNCOMPENSATED TRANSCRIPTION SCORING Coarse Word and Phone Recognition

  29. Word Recognition Error (2000) • WORD ERROR RATES VARY BETWEEN 27% AND 43% • Substitutions are the major source of word errors Site Error Rate Error Type

  30. Unstressed Intermediate Stress Fully Stressed Prosodic Stress & Word Error Rate (2000) • The effect of stress is most concentrated among word-deletion errors Data represent averages across all eight ASR systems

  31. Syllable Structure & Word Error Rate (2000) • Vowel-initial forms show the greatest error • Polysyllabic forms exhibit the lowest error C = Consonant V = Vowel Data are averaged across all eight sites

  32. Syllable Structure & Word Error Rate (2000) • VOWEL-INITIAL forms exhibit the HIGHEST error • POLYSYLLABLES have the LOWEST error rate

  33. Word Recognition Error (2001) • WORD ERROR RATES VARY BETWEEN 33% AND 49% • Substitutions are the major source of word errors Site Error Rate Error Type STRICT Time Mediation

  34. Word Recognition Error (2001) • WORD ERROR RATES VARY BETWEEN 31% AND 44% • Substitutions are the major source of word errors Site Error Rate Error Type LENIENT Time Mediation

  35. Prosodic Stress & Word Error Rate (2001) • NOT YET • PROSODIC LABELING OF THIS MATERIAL REQUIRED FIRST • ANALYSIS SCHEDULED FOR JUNE, 2001

  36. Syllable Structure & Word Error Rate (2001) • Vowel-initial forms show the greatest error • Polysyllabic forms exhibit the lowest error, except for CVCV forms (probably due to forms such as “gonna,” etc.) Data are averaged across all five sites

  37. Syllable Structure & Word Error Rate (2001) • VOWEL-INITIAL forms exhibit the HIGHEST error • POLYSYLLABLES have the LOWEST error rate

  38. Are Word and Phone Errors Related? (2000) • COMPARISON OF THE WORD AND PHONE ERROR RATES ACROSS SITES SUGGESTS THAT WORD ERROR IS HIGHLY DEPENDENT ON THE PHONE ERROR RATE • The correlation between the two parameters is 0.78 Pronunciation Models? The differential error rate is probably related to the use of either pronunciation or language models (or both) Error Rate Submission Site
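
A cross-site correlation of this kind can be computed as sketched below. The slide does not say which correlation coefficient was used, so Pearson's r is an assumption, and the per-site error rates in the example are placeholders rather than the evaluation's values (which appeared only in the missing chart).

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return covariance / (spread_x * spread_y)


# Placeholder per-site rates (percent), one entry per submission site.
phone_error = [35.0, 38.0, 41.0, 44.0, 49.0]
word_error = [27.0, 33.0, 30.0, 39.0, 43.0]
print(round(pearson_r(phone_error, word_error), 2))
```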

  39. Are Word and Phone Errors Related? (2001) • COMPARISON OF THE WORD AND PHONE ERROR RATES ACROSS SITES SUGGESTS THAT WORD ERROR IS HIGHLY DEPENDENT ON THE PHONE ERROR RATE Pronunciation Model? Strict Time Mediation Transcription Uncompensated Error Rate

  40. Are Word and Phone Errors Related? (2001) • COMPARISON OF THE WORD AND PHONE ERROR RATES ACROSS SITES SUGGESTS THAT WORD ERROR IS HIGHLY DEPENDENT ON THE PHONE ERROR RATE Pronunciation Model? Lenient Time Mediation Transcription Uncompensated Error Rate

  41. Phonetic - Pronunciation Mismatch • THERE ARE FAR MORE PRONUNCIATIONS IN THE TRANSCRIPTION MATERIALS THAN IN THE ASR LEXICONS • GIVEN THAT MOST WORDS ARE CORRECTLY RECOGNIZED, THIS RESULT IMPLIES THAT PHONETIC CLASSIFICATION IN ASR SYSTEMS IS, BY NECESSITY, HIGHLY AGRANULAR • THUS, UNUSUAL PRONUNCIATIONS ARE UNLIKELY TO BE DECODED CORRECTLY • THE COARSE NATURE OF THE PRONUNCIATION MODELS ALSO MAKES IT DIFFICULT TO FINE-TUNE THE RELATION BETWEEN THE PHONETIC CLASSIFIER AND PRONUNCIATION MODEL COMPONENTS

  42. Pronunciation Variation in ASR Lexicons • MOST WORDS IN THE ASR LEXICONS HAVE A SINGLE PRONUNCIATION • EXCEPTIONS ARE HIGHLY FREQUENT WORDS (SUCH AS “THE” AND “AND”), WHICH HAVE 2 OR 3 PRONUNCIATION VARIANTS. NO WORD HAS MORE THAN 5 PRONUNCIATION VARIANTS (AT LEAST NOT IN THE PHONETIC OUTPUT PROVIDED TO ICSI FOR THE EVALUATION)

  43. Pronunciation Variation in Switchboard (2001) • THERE ARE DOZENS OF DIFFERENT PRONUNCIATIONS FOR THE 100 MOST FREQUENT WORDS IN THE PHONETIC EVALUATION MATERIAL WORD INSTANCES #PRON WORD INSTANCES #PRON

  44. Pronunciation Variation in Switchboard (2001) • THERE ARE DOZENS OF DIFFERENT PRONUNCIATIONS FOR THE 100 MOST FREQUENT WORDS IN THE PHONETIC EVALUATION MATERIAL WORD INSTANCES #PRON WORD INSTANCES #PRON

  45. Phone Error and Word Length (2000) • For CORRECT words, only one phone (on average) is misclassified • Implication – short words are highly tolerant of phone “errors” • For INCORRECT words, phone errors increase linearly with word length Data are averaged across all eight sites

  46. Phone Error and Word Length (2001) • For CORRECT words, only one phone (on average) is misclassified • Implication – short words are highly tolerant of phone “errors” • For INCORRECT words, phone errors increase linearly with word length Data are averaged across all five sites

  47. Phone Error - Forced Alignment (2000) • PHONE ERROR RATES VARY BETWEEN 35% AND 49% • This, despite having the word transcript!!! Site Error Rate • AT&T, Dragon did not provide a complete set of forced alignments Error Type

  48. Phone Error - Forced Alignment (2001) • PHONE ERROR RATES VARY BETWEEN 40% AND 50% • Same picture for 2001. Suggests a potential mismatch between lexical and phonetic representations Site Error Rate Error Type STRICT Time Mediation Transcription UNcompensated

  49. Phone Error - Forced Alignment (2001) • PHONE ERROR RATES VARY BETWEEN 30% AND 44% • Still a poor match between phonetic transcripts and lexical reps Site Error Rate Error Type LENIENT Time Mediation Transcription UNcompensated

  50. Phone Error - Forced Alignment (2001) • PHONE ERROR RATES VARY BETWEEN 32% AND 38% • Still a lack of concordance with a tolerant scoring method Site Error Rate Error Type STRICT Time Mediation Transcription Compensated
