Physics of Human Voice A New Theory with Applications Research Conference, November 16, 2012. C. Julian Chen Department of Applied Physics and Applied Mathematics Columbia University. Outline. The source-filter theory Speech processing based on source-filter theory
Physics of Human Voice: A New Theory with Applications. Research Conference, November 16, 2012. C. Julian Chen, Department of Applied Physics and Applied Mathematics, Columbia University
Outline • The source-filter theory • Speech processing based on source-filter theory • A new theory of human voice production • Timing correlation of voice and EGG signals • The concept of timbrons • Speech processing based on timbrons • Kramers-Kronig relations • Fourier analysis • Laguerre-function expansion • Timbre vectors • Applications: voice transformation, speech synthesis and speech recognition
Scientific Basis of Speech Technology • Understand the physics of speech production • Understand the physics of hearing • Discover accurate and efficient parametric representations of speech signal • Design methods to convert speech signal into a parametric representation • Design methods to accurately recover speech from a parametric representation • Develop methods to modify and manipulate speech through the parametric representation
Source-Filter Theory of Voice Production The first edition of Gunnar Fant’s book was published in 1960.
Source-Filter Theory of Voice Production (1) A quasi-periodic pulsating airflow generated by the opening and closing of the glottis creates a buzzing source, which is then filtered by the spectrum of the upper vocal tract to form the voice (Fant 1960).
Source-Filter Theory of Voice Production (2) Strong airflow occurs during the opening of the glottis. No air flows when there is a glottal stop.
Traditional Speech Processing: Windowing Because processing windows are asynchronous to pitch periods, windowing always compromises the speech signal.
Traditional Speech Processing: Phone alignment errors due to improper windowing Because a substantial number of processing windows cross phone boundaries, automatic phone alignment inevitably produces a large percentage of errors.
The Electroglottograph (EGG) A non-invasive instrument that detects the change of electric conductance between the two vocal cords, and thus monitors the opening and closing of the glottis (circa 1960).
Simultaneously recorded voice and EGG signals (1) (1) Glottal closures (A) are very fast. (2) The strongest voice signal (C) immediately follows the glottal closure. (3) The voice signal in the glottal open phase (B) is much weaker.
Simultaneously recorded voice and EGG signals (2) Showing two individual glottal closures, each of which triggers a timbron that is solely determined by the geometry of the vocal tract.
The Timbron Theory (1) When the glottis is open, there is a continuous airflow in the vocal tract. A glottal closure abruptly stops the airflow supply and excites a d’Alembert wavefront, which resonates in the vocal tract. The resulting waveform represents the instantaneous timbre.
The Timbron Theory (2) Energetics: Velocity of airflow: 0.2 m/sec. Volume of the vocal tract: 2 × 10⁻⁵ m³. Density of air: 1.25 kg/m³. Kinetic energy = ½ × 2 × 10⁻⁵ × 1.25 × 0.2² = 0.5 μJ. At a frequency of 100 Hz, the power is 50 μW. Typical speech power: 10 – 100 μW. The estimate falls within the typical measured range of speech power.
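The energetics estimate can be checked with a few lines of arithmetic. The following is a minimal sketch that uses only the values quoted on the slide; the variable and function names are illustrative and do not come from the talk.

```python
# Values quoted on the slide.
airflow_velocity = 0.2       # m/s
vocal_tract_volume = 2e-5    # m^3
air_density = 1.25           # kg/m^3
closure_rate = 100.0         # glottal closures per second (Hz)


def kinetic_energy(density, volume, velocity):
    """Kinetic energy of a moving air column: 1/2 * (density * volume) * v^2."""
    return 0.5 * density * volume * velocity ** 2


# Energy released per glottal closure, and the power at the given closure rate.
energy_joule = kinetic_energy(air_density, vocal_tract_volume, airflow_velocity)
power_watt = energy_joule * closure_rate

print(f"{energy_joule * 1e6:.2f} uJ per closure, {power_watt * 1e6:.0f} uW")
```

The result depends quadratically on the airflow velocity, so doubling the assumed velocity would quadruple the estimated power.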
Simplified Cartoons on Timbrons In the following, we present two sets of simplified cartoons about the formation of timbrons. The first set is about the vowel [u:], a typical back vowel. The second set is about [ɑ:]. Using typical geometrical values of the vocal tract, the cartoons explain the first formants of [u:] and [ɑ:]. Nevertheless, the cartoons are designed only for an intuitive understanding of the concept of timbrons. To explain the timbrons accurately, numerical solutions of the wave equation are necessary.
Timbron [u:] - preparation Before a glottal closure, t<0, there is a continuous airflow with a typical velocity of 0.2 m/sec. The distance between the glottis and the lips is about 25 cm. Beyond the lips, the cross section greatly expands, so the airflow velocity is very small.
Timbron [u:] – phase 1 A glottal closure abruptly stops the supply of airflow and excites a zero-velocity d’Alembert wavefront, which propagates with the speed of sound. The air behind the wavefront is rarefied. It takes about 0.8 msec for the wavefront to reach the lips. Then the air beyond the lips rushes in to fill the partial vacuum.
Timbron [u:] – phase 2 The d’Alembert wavefront, with air velocity directed towards the glottis, continues to propagate with the speed of sound. Due to radiation loss, the velocity is slightly reduced. It takes 0.8 msec for the wavefront to reach the glottis. The acoustic wave is reflected and propagates towards the lips.
Timbron [u:] – phase 3 The d’Alembert wavefront, with air velocity directed towards the glottis, propagates towards the lips with the speed of sound. The air behind the wavefront is densified, which stores energy. It takes 0.8 msec for the wavefront to reach the lips. The wavefront then starts to propagate towards the glottis.
Timbron [u:] – phase 4 A new wavefront, with air velocity directed towards the lips, propagates towards the glottis with the speed of sound. It reaches the glottis and the entire cycle starts over. The cycle takes 3.2 msec, corresponding to a frequency of about 310 Hz. It is the first formant frequency of the vowel [u:].
Timbron [ɑ:] - preparation Before a glottal closure, t<0, there is a continuous airflow with a typical velocity of 0.2 m/sec. The distance between the glottis and the oropharynx is about 12 cm. Beyond the oropharynx, with a widely open mouth, the airflow velocity is very small.
Timbron [ɑ:] – phase 1 A glottal closure abruptly stops the supply of airflow and excites a zero-velocity d’Alembert wavefront, which propagates with the speed of sound. The air behind the wavefront is rarefied. It takes 0.4 msec for the wavefront to reach the oropharynx. Air in the mouth rushes in to fill the partial vacuum.
Timbron [ɑ:] – phase 2 The d’Alembert wavefront, with air velocity directed towards the glottis, continues to propagate with the speed of sound. Due to radiation loss, the velocity is slightly reduced. It takes 0.4 msec for the wavefront to reach the glottis. The acoustic wave is reflected and propagates towards the oropharynx.
Timbron [ɑ:] – phase 3 A new wavefront, with air velocity directed towards the glottis, propagates towards the oropharynx with the speed of sound. The air behind the wavefront is densified, which stores energy. It takes 0.4 msec for the wavefront to reach the oropharynx. A new wavefront starts to propagate towards the glottis.
Timbron [ɑ:] – phase 4 Finally, a new wavefront, with air velocity directed towards the oropharynx, propagates towards the glottis with the speed of sound. It reaches the glottis and the cycle starts over. The entire cycle takes 1.6 msec, corresponding to a frequency of 625 Hz. It is the first formant frequency of the vowel [ɑ:].
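Both cartoons reduce to the same arithmetic: one full cycle consists of four one-way transits of the tube, so the first resonance is the reciprocal of four transit times. The sketch below reproduces the two frequencies from the transit times quoted on the slides. The function names and the sound speed of 340 m/s are assumptions for illustration; the slides quote rounded transit times rather than a sound speed.

```python
def first_resonance_hz(one_way_transit_s):
    """Frequency of a cycle made of four one-way transits of the tube."""
    cycle_s = 4.0 * one_way_transit_s
    return 1.0 / cycle_s


def transit_time_s(length_m, sound_speed_m_s=340.0):
    """One-way travel time of a wavefront; the 340 m/s default is an assumption."""
    return length_m / sound_speed_m_s


# Transit times as quoted on the slides.
f1_u = first_resonance_hz(0.8e-3)   # [u:], about 25 cm from glottis to lips
f1_a = first_resonance_hz(0.4e-3)   # [a:], about 12 cm from glottis to oropharynx
print(f"[u:] {f1_u:.1f} Hz, [a:] {f1_a:.1f} Hz")

# The same estimate starting from the tube lengths instead of the rounded times.
print(f"[u:] from length: {first_resonance_hz(transit_time_s(0.25)):.0f} Hz")
print(f"[a:] from length: {first_resonance_hz(transit_time_s(0.12)):.0f} Hz")
```

Starting from the lengths gives somewhat higher values than the slides, because the slides round the transit times up.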
Two Mathematical Theorems Theorem 1: The phase spectrum of a timbron is uniquely determined by its amplitude spectrum. Proof: Because the value of a timbron is zero before a glottal closure, the theory of functions of a complex variable allows the phase spectrum to be calculated from the amplitude spectrum using an improper integral, similar to the Kramers-Kronig relations. Theorem 2: For a voice generated by a periodic sequence of glottal closures, the waveform in one complete period contains full information about the underlying timbron. Proof: Again using the fact that the value of a timbron is zero before a glottal closure, the theorem can be proved using basic properties of the Fourier transform.
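Theorem 1 can be illustrated numerically. The sketch below is not the integral used in the talk: it uses the standard cepstral (discrete Hilbert transform) construction, which recovers the phase from the logarithm of the amplitude spectrum under the additional assumption that the signal is minimum-phase. The test signal, a decaying exponential that is zero before its onset, and all names are illustrative.

```python
import numpy as np


def phase_from_amplitude(amplitude):
    """Minimum-phase phase spectrum from a full-length amplitude spectrum.

    Assumes an even number of strictly positive amplitude samples.
    """
    n = amplitude.size
    # Real cepstrum: inverse FFT of the log-amplitude spectrum.
    cepstrum = np.fft.ifft(np.log(amplitude)).real
    # Fold the cepstrum onto non-negative quefrencies (causal part only).
    window = np.zeros(n)
    window[0] = 1.0
    window[1:n // 2] = 2.0
    window[n // 2] = 1.0
    # The imaginary part of the resulting log-spectrum is the phase.
    return np.fft.fft(cepstrum * window).imag


n = 1024
causal_signal = 0.9 ** np.arange(n)            # zero before onset, decays after
amplitude = np.abs(np.fft.fft(causal_signal))  # keep the amplitude, drop the phase

phase = phase_from_amplitude(amplitude)
recovered = np.fft.ifft(amplitude * np.exp(1j * phase)).real
max_error = np.max(np.abs(recovered - causal_signal))
print(f"largest reconstruction error: {max_error:.2e}")
```

For this test signal the waveform rebuilt from the amplitude spectrum alone agrees with the original to numerical precision.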
The New Parameterization Both the voice signal and the electroglottograph signal are used to segment the voice into natural frames.
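One way to picture this segmentation is sketched below. The slides do not state how glottal closures are located in the EGG signal, so the detector here, a simple threshold on the sample-to-sample change of the EGG, is a stand-in chosen for illustration. The synthetic signals and all names are hypothetical.

```python
import numpy as np


def closure_instants(egg, threshold_ratio=0.5):
    """Sample indices just after each abrupt change in the EGG signal.

    Illustrative detector only: a closure is taken to be a local maximum of
    |egg[k+1] - egg[k]| that exceeds a fraction of the largest change.
    """
    strength = np.abs(np.diff(egg))
    threshold = threshold_ratio * strength.max()
    inner = strength[1:-1]
    is_peak = (inner >= threshold) & (inner > strength[:-2]) & (inner >= strength[2:])
    # +1 maps back to indices of `strength`, +1 again to the sample after the jump.
    return np.flatnonzero(is_peak) + 2


def split_into_frames(voice, instants):
    """Cut the voice signal at the closure instants (one frame per period)."""
    return np.split(voice, instants)


# Synthetic stand-ins: an EGG that jumps every 80 samples and decays slowly,
# and an arbitrary voice signal of the same length.
samples = np.arange(400)
egg = 1.0 - (samples % 80) / 80.0
voice = np.sin(2.0 * np.pi * samples / 16.0)

instants = closure_instants(egg)
frames = split_into_frames(voice, instants)
print(instants.tolist(), [len(frame) for frame in frames])
```

Because the frame boundaries follow the closures, each frame spans exactly one pitch period, unlike fixed-length windows.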
Convert Spectrum into Timbre Vectors The timbre vector has some similarity to the state vector in quantum mechanics.
Accuracy of Timbre-Vector Representation Because Laguerre functions are complete and orthonormal, the timbre vector can be as accurate as needed, in stark contrast to the inaccurate and incomplete LPC coefficients.
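The completeness argument can be illustrated with a small numerical experiment. The sketch below builds the Laguerre functions exp(-x/2) L_n(x), checks their orthonormality on a grid, and shows the approximation error shrinking as more coefficients are kept. The target curve is an arbitrary smooth stand-in rather than a real speech spectrum, and the grid, the number of terms and all names are assumptions.

```python
import numpy as np


def laguerre_functions(x, count):
    """Rows are phi_n(x) = exp(-x/2) * L_n(x) for n = 0 .. count-1."""
    polys = np.empty((count, x.size))
    polys[0] = 1.0
    if count > 1:
        polys[1] = 1.0 - x
    for k in range(1, count - 1):
        # Three-term recurrence: (k+1) L_{k+1} = (2k+1-x) L_k - k L_{k-1}
        polys[k + 1] = ((2 * k + 1 - x) * polys[k] - k * polys[k - 1]) / (k + 1)
    return np.exp(-x / 2.0) * polys


# Trapezoidal quadrature weights on a grid wide enough for 12 functions.
x = np.linspace(0.0, 100.0, 100001)
weights = np.full(x.size, x[1] - x[0])
weights[[0, -1]] *= 0.5

phi = laguerre_functions(x, 12)
gram = (phi * weights) @ phi.T          # should be close to the identity

target = x * np.exp(-x)                 # arbitrary smooth stand-in curve
coefficients = (phi * weights) @ target


def l2_error(n_terms):
    """L2 distance between the target and its n_terms-term expansion."""
    approximation = coefficients[:n_terms] @ phi[:n_terms]
    return np.sqrt(np.sum(weights * (target - approximation) ** 2))


errors = [l2_error(n_terms) for n_terms in (2, 4, 8, 12)]
print(["%.1e" % error for error in errors])
```

How many coefficients a real spectrum needs depends on its shape; this experiment only shows that the error can be driven down by adding terms.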
Examples of Timbrons Obtained from the ARCTIC databases, speaker bdl, sentence a0008. The sentence was converted into a sequence of timbre vectors, and the phase was then recovered using Kramers-Kronig relations. The timbrons were then generated by FFT. Each timbron is 15 msec long. The first 2.5 msec is the pre-excitation waveform, which should theoretically be zero. A timbron is a complete representation of the instantaneous timbre. Different vowels show very different waveforms. The starting frame of the plosive [K] can also be represented by a timbron, with a phase spectrum determined by its amplitude spectrum. The subsequent frames of [K] do not have a well-defined phase.
Timbron of Consonant [k]. First frame. 15 msec. (bdl a0008. Frame 155)
Timbron of Consonant [k]. 2nd frame. 15 msec. (bdl a0008. Frame 156)
Timbron of Part of Vowel [AY]. 15 msec. (bdl a0008. Frame 291)
Timbre vectors can be fused to eliminate seams [Figure: speech segment 1 joined to speech segment 2] Using the fusing process, the entire speech section becomes natural.
Voice Demo 1: Speech Regeneration [Audio samples: speakers slt, jmk and bdl, each original and regenerated] The original recorded speech was converted into timbre vector form and regenerated. There is very little quality degradation.
Voice Demo 2: Voice Transformation The pitch and head size can be changed dramatically. By raising the pitch 6 halftones each time, the original voice, a tenor, can be changed to female and child voices: contralto, mezzo-soprano, soprano, child. By lowering the pitch 6 halftones each time, a tenor voice can be changed to very deep male voices: baritone, bass, contra-bass, giant. Although the transformed voices reach deep into the falsetto and vocal-fry registers, they are still clear and human-like.
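The size of each pitch step follows from simple arithmetic. Assuming equal temperament, one halftone (semitone) multiplies the frequency by 2^(1/12), so a step of 6 halftones multiplies it by the square root of 2. The sketch below tabulates the resulting fundamental frequencies. The starting tenor pitch of 130 Hz is an illustrative value that does not come from the slides, and the head-size change is not modeled.

```python
def shift_pitch_hz(f0_hz, halftones):
    """Fundamental frequency after an equal-tempered shift by `halftones`."""
    return f0_hz * 2.0 ** (halftones / 12.0)


tenor_f0_hz = 130.0  # assumed starting pitch, for illustration only

# Four steps down and four steps up, 6 halftones per step.
for step in range(-4, 5):
    halftones = 6 * step
    print(f"{halftones:+3d} halftones: {shift_pitch_hz(tenor_f0_hz, halftones):6.1f} Hz")
```

Four steps in either direction amount to 24 halftones, a shift of two octaves from the starting pitch.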
Voice Demo 3: Speed Variation The speed can be changed from 100 words per minute to 1000 words per minute, and the voice is still clear. The low speed can be used for foreign language education. The high speed is a great advantage for visually impaired people. [Audio samples: 100, 150, 200, 300, 400, 500, 600, 700, 800, 900 and 1000 wpm]
Voice Demo 4: Prosody Modification Change an affirmation into a question. [Audio samples, each as affirmation and as question: mezzo-soprano, tenor, baritone, bass, contrabass, soprano, child]
Summary • A new theory of human voice production • Based on simultaneous voice and EGG signals • The concept of timbrons • Speech processing based on timbrons • Kramers-Kronig relations • Fourier analysis • Laguerre-function expansion • Timbre vectors • Applications: voice transformation, speech synthesis and speech recognition