Automatic Lip-Synchronization Using Linear Prediction of Speech Christopher Kohnert SK Semwal University of Colorado, Colorado Springs
Topics of Presentation • Introduction and Background • Linear Prediction Theory • Sound Signatures • Viseme Scoring • Rendering System • Results • Conclusions
Justification • Need: • Existing methods are labor-intensive • Poor results • Expensive • Solution: • Automatic method • “Decent” results
Applications of Automatic System • Typical applications benefiting from an automatic method: • Real-time video communication • Synthetic computer agents • Low-budget animation scenarios: • Video game industry
Automatic Is Possible • Spoken word is broken into phonemes • Phonemes are comprehensive • Visemes are their visual correlates • Used in lip-reading and traditional animation • Physical speech (vocal cords, vocal tract) can be modeled • Source-filter model
Existing Methods of Synchronization • Text Based • Analyze text to extract phonemes • Speech Based • Volume tracking • Speech recognition front-end • Linear Prediction • Hybrids • Text & Speech • Image & Speech
Speech Based is Best • Doesn’t need a script • Fully automatic • Can use original sound sample (best quality) • Can use source-filter model
Source-Filter Model • Models a sound signal as a source passed through a filter • Source: lungs & vocal cords • Filter: vocal tract • Implemented using Linear Prediction
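A minimal sketch of the source-filter idea in Python, assuming a 100 Hz pulse-train source and placeholder filter coefficients; none of these values come from the paper:

```python
# Source-filter sketch: a voiced excitation (pulse train) is passed
# through an all-pole "vocal tract" filter. All numbers are illustrative.
import numpy as np
from scipy.signal import lfilter

fs = 8000                       # assumed sample rate (Hz)
n = int(0.03 * fs)              # one 30 ms window
source = np.zeros(n)
source[:: fs // 100] = 1.0      # 100 Hz pulse train (voiced source)

# Synthesis filter 1 / (1 - a1*z^-1 - a2*z^-2); a1, a2 are placeholders
a1, a2 = 1.3, -0.7
speech = lfilter([1.0], [1.0, -a1, -a2], source)
```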
Speech-Related Topics • Phoneme recognition • How many to use? • Mapping phonemes to visemes • Use visually distinctive ones (e.g. vowel sounds) • The coarticulation effect
The Coarticulation Effect • The blending of adjacent phonemes’ sounds (common in everyday speech) • Not captured by discrete phoneme recognition • Causes poor visual synchronization (transitions are jerky and unnatural)
Speech Encoding Methods • Pulse Code Modulation (PCM) • Vocoding • Linear Prediction
Pulse Code Modulation • Raw digital sampling • High quality sound • Very high bandwidth requirements
Vocoding • Stands for VOice-enCODing • Origins in military applications • Models physical entities (tongue, vocal cord, jaw, etc.) • Poor sound quality (tin can voices) • Very low bandwidth requirements
Linear Prediction • Hybrid method (of PCM and Vocoding) • Models sound source and filter separately • Uses original sound sample to calculate recreation parameters (minimum error) • Low bandwidth requirements • Pitch and intonation independence
Linear Prediction Theory • Source-filter model • $P$ filter coefficients $a_k$ are calculated • [Diagram: source signal passed through the filter]
Linear Prediction Theory (cont.) • Each sample is predicted from the previous $P$ samples: $\hat{s}_t = \sum_{k=1}^{P} a_k\, s_{t-k}$ • The $a_k$ coefficients are found by minimizing the squared error $E = \sum_t (s_t - \hat{s}_t)^2$ between the original sound $s_t$ and the reconstructed sound $\hat{s}_t$ • Can be solved using Levinson-Durbin recursion.
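A sketch of the Levinson-Durbin step the slide mentions, using the standard autocorrelation formulation; the function and variable names are ours, not the paper's:

```python
import numpy as np

def lpc(frame, p):
    """Estimate p prediction coefficients a_k for one analysis window by
    Levinson-Durbin recursion on its autocorrelation (a textbook sketch;
    assumes a non-silent window so the error energy stays positive)."""
    n = len(frame)
    r = np.array([np.dot(frame[: n - k], frame[k:]) for k in range(p + 1)])
    a = np.zeros(p + 1)          # a[1..p] hold the coefficients
    err = r[0]                   # prediction error energy
    for i in range(1, p + 1):
        # Reflection coefficient for order i
        k = (r[i] - np.dot(a[1:i], r[i - 1:0:-1])) / err
        a[1:i] -= k * a[i - 1:0:-1]   # update lower-order coefficients
        a[i] = k
        err *= 1.0 - k * k            # error shrinks at each order
    return a[1:]                      # a_1 ... a_p
```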
Linear Prediction Theory (cont.) • Coefficients represent the filter part • The filter is assumed constant for small “windows” on the original sample (10-30 ms windows) • Each window has its own coefficients • Sound source is either a pulse train (voiced) or white noise (unvoiced)
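One way to split a sample into the short analysis windows described above; a sketch in which the non-overlapping 20 ms choice is ours, within the slide's 10-30 ms range:

```python
def windows(signal, fs, win_ms=20):
    """Split a signal into non-overlapping analysis windows in which the
    filter is assumed constant. win_ms is our choice within 10-30 ms."""
    n = int(fs * win_ms / 1000)
    return [signal[i:i + n] for i in range(0, len(signal) - n + 1, n)]
```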
Linear Prediction for Recognition • Recognition on raw coefficients is poor • Better to take the FFT of the coefficients • Keep only the first half of the FFT values (the spectrum is symmetric) • This is the “signature” of the sound
Sound Signatures • 16 values represent the sound • Speaker independent • Unique for each phoneme • Easily recognized by machine
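Putting the last two slides together, one plausible reading of the signature step; the predictor order and FFT size are our assumptions, and only the 16-value output length comes from the slides:

```python
import numpy as np

def sound_signature(frame, p=12, n_fft=32):
    """LP coefficients are FFT'd and only the first half of the
    (symmetric) magnitude spectrum is kept: 16 values per window.
    p=12 and n_fft=32 are assumptions; lpc() is the earlier sketch."""
    a = lpc(frame, p)
    spectrum = np.abs(np.fft.fft(a, n_fft))  # zero-padded to 32 points
    return spectrum[: n_fft // 2]            # first 16 values
```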
Viseme Scoring • Phonemes were chosen judiciously • Map one-to-one to visemes • Visemes scored independently using history: $V_i = 0.9\,V_{i-1} + 0.1\,m_i$, where $m_i = 1$ if the viseme matched at frame $i$, else $0$ • Ramps up and down with successive matches/mismatches
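The recurrence above is simple to implement; a sketch in which the dictionary layout and function name are ours:

```python
def update_scores(scores, matched, alpha=0.9):
    """Per-frame viseme scoring from the slide:
    V_i = 0.9 * V_{i-1} + 0.1 * (1 if this viseme matched, else 0).
    Repeated matches ramp a score toward 1; mismatches decay it to 0."""
    for viseme in scores:
        m = 1.0 if viseme == matched else 0.0
        scores[viseme] = alpha * scores[viseme] + (1.0 - alpha) * m
    return scores
```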
Rendering System • Uses Alias|Wavefront’s Maya package • Built-in support for “blend shapes” • Mapped directly to viseme scores • Very expressive and flexible • An animation script is generated and later read in • Rendered to a movie; QuickTime is used to add the original sound and produce the final movie
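The paper drives Maya through a generated script; as a hypothetical illustration of keying blend-shape weights from viseme scores (the node and attribute names are invented, and the paper's actual script format is not specified here):

```python
import maya.cmds as cmds

def key_visemes(scores, frame, node="visemeBlendShape"):
    """Hypothetical: key each blend-shape target weight at this frame
    from its viseme score. Node/attribute names are placeholders."""
    for viseme, weight in scores.items():
        cmds.setKeyframe(node, attribute=viseme, t=frame, v=weight)
```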
Results (Timing) • Precise timing can be achieved • Smoothing introduces “lag”
Results (Normal Speech) • Normal speech, moderate pace
Results (Other Examples) • A female speaker using the male phoneme set • Slower speech, male speaker
Results (Other Examples) (cont.) • Accented speech with fast pace
Results (Summary) • Good with basic speech • Good speaker independence for normal speech (cross-gender, multiple male speakers) • Problems with noisy samples (smoothing helps but introduces “lag”) • Poor performance when speech: • Is too fast • Is accented • Contains phonemes not in the reference set (e.g. “w” and “th”)
Conclusion • Linear Prediction provides several benefits: • Speaker independence • Easy to recognize automatically • Results are reasonable, but can be improved
Future Work • Identify best set of phonemes and visemes • Phoneme classification could be improved with better matching algorithm (neural net?) • Larger phoneme reference set for more robust matching