Snack for Ruby S Legrand
Talk Objectives • Tour of API • Learn the walk and talk • Have Fun
Snack • The Snack library is a tool for learning about sound, voice, and automatic speech recognition (ASR), and hopefully a fun way to experiment • Snack is a Tcl-based API • Snack has also been adapted for Python (the tkSnack module)
Snack • Snack is Swedish for “talk” or “chat” • Kåre Sjölander is the principal investigator for Tcl-based Snack • Tcl Snack is available at http://www.speech.kth.se/snack/
Snack for Ruby • rbSnack is a Ruby wrapper around Tcl Snack • rbSnack has additional Ruby-based utilities • rbSnack has HTML-based help (rdoc + rbTeX) • rbSnack can be found at http://rbsnack.sourceforge.net/
Snack Toolkit Includes • Recording, Playback • Waveform display • Spectrogram: Fourier, LPC • Formant analysis • Power analysis • Filters (will demo)
The Speech Signal • Continuous speech is discretely sampled • The signal consists of rapidly changing data points. • The display of the sampled signal is called the waveform • Snack can display the waveform in real time
Analysis uses frames • The signal is broken into frames • Frames may overlap • Characteristics of the signal are analyzed frame by frame using Fourier and LPC analysis.
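The framing step above can be sketched in plain Ruby. This is a minimal illustration, not rbSnack code: the method names `split_into_frames` and `frame_energy`, and the `frame_size` and `hop` parameters, are assumptions made for this example. A hop smaller than the frame size gives overlapping frames.

```ruby
# Split a sampled signal (an Array of numbers) into frames.
# frame_size: samples per frame; hop: distance between frame starts.
# Both names are illustrative, not rbSnack settings.
def split_into_frames(signal, frame_size, hop)
  starts = (0..signal.length - frame_size).step(hop)
  starts.map { |start| signal[start, frame_size] }
end

# Energy of one frame: the sum of squared samples.
def frame_energy(frame)
  frame.sum { |sample| sample * sample }
end

signal = (1..10).to_a
split_into_frames(signal, 4, 2).each do |frame|
  puts "#{frame.inspect} energy=#{frame_energy(frame)}"
end
```

Any per-frame analysis (energy here, Fourier or LPC later) is then just a `map` over the frames.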
Going in Circles • Complex numbers are just a funny way of multiplying: multiply the lengths and add the angles. • Euler’s formula: e^(iθ) = cos θ + i sin θ
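A quick way to see the "add angles" rule is to try it with Ruby's built-in `Complex` class. The helper name `euler` is an assumption for this sketch; it simply returns the point on the unit circle at a given angle.

```ruby
# Point on the unit circle at angle theta (radians): cos(theta) + i*sin(theta).
def euler(theta)
  Complex(Math.cos(theta), Math.sin(theta))
end

a = 2.0 * euler(Math::PI / 6) # length 2, angle 30 degrees
b = 3.0 * euler(Math::PI / 3) # length 3, angle 60 degrees
product = a * b

# Lengths multiply and angles add.
puts "length: #{product.abs}"
puts "angle in degrees: #{product.arg * 180 / Math::PI}"
```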
Fourier Analysis • The Fourier matrix is a unitary matrix • Multiplying the signal by the Fourier matrix returns the frequency components of the signal, called the Fourier coefficients • The inverse is easy to compute: it is the conjugate transpose, and applying it is called the inverse Fourier transform
The Fourier Matrix Looks Like • Spinning disks • Multiplying it by the signal produces the Fourier coefficients (frequency components)
Examining Fourier components • A spectrogram gives a picture of the Fourier components (coefficients) as they evolve over time. Snack can display it in real time. • Looks like an X-ray • Bands of high activity correspond to formants
Linear Filters • Useful for understanding the nature of speech signals • Generators: generate square waves, sine waves, sawtooth waves, etc. • Composers: compose several filters. • FIR: Finite impulse response • IIR: Infinite impulse response
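The Fourier matrix from the two slides above can be built directly as a small sketch, assuming the unitary normalization 1/sqrt(n). The method names `fourier_matrix`, `multiply`, and `conjugate_transpose` are made up for this example and are not part of rbSnack. This is the slow matrix form, useful for seeing the idea rather than for real signals.

```ruby
# n x n unitary Fourier matrix: entry (j, k) is exp(-2*pi*i*j*k/n) / sqrt(n).
# Each row is a "spinning disk" turning at a different rate.
def fourier_matrix(n)
  Array.new(n) do |j|
    Array.new(n) do |k|
      Complex.polar(1.0 / Math.sqrt(n), -2.0 * Math::PI * j * k / n)
    end
  end
end

# Matrix times vector.
def multiply(matrix, vector)
  matrix.map do |row|
    row.zip(vector).sum(Complex(0, 0)) { |entry, x| entry * x }
  end
end

# For a unitary matrix, the conjugate transpose is the inverse.
def conjugate_transpose(matrix)
  matrix.transpose.map { |row| row.map(&:conj) }
end

n = 8
signal = Array.new(n) { |t| Math.cos(2.0 * Math::PI * t / n) } # one cycle
f = fourier_matrix(n)
coefficients = multiply(f, signal)
coefficients.each_with_index do |c, k|
  puts "k=#{k} magnitude=#{c.abs.round(6)}"
end
restored = multiply(conjugate_transpose(f), coefficients)
```

For the one-cycle cosine, the energy lands in coefficient 1 and its mirror image, coefficient 7; `restored` recovers the original samples up to rounding.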
FIR Filter • Determined completely by its response to a unit impulse. • Response is finite in duration. y(t) = b0 x(t) + b1 x(t-1) + b2 x(t-2) + … + bn x(t-n) (We will demo FIR using rbSnack)
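The FIR equation above translates almost line for line into Ruby. This is a plain-Ruby sketch, not the rbSnack filter API; the method name `fir_filter` is an assumption, and samples before the start of the signal are treated as zero.

```ruby
# y(t) = b0 x(t) + b1 x(t-1) + ... + bn x(t-n)
# b: coefficients [b0, b1, ..., bn]; x: input samples.
# Inputs before t = 0 are taken to be zero.
def fir_filter(b, x)
  Array.new(x.length) do |t|
    (0...b.length).sum(0.0) { |k| t >= k ? b[k] * x[t - k] : 0.0 }
  end
end

impulse = [1, 0, 0, 0, 0]
puts fir_filter([0.5, 0.3, 0.2], impulse).inspect
```

Feeding in a unit impulse returns the coefficients themselves followed by zeros, which is why the filter is determined completely by its impulse response and why that response is finite.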
IIR Filter • Also called a recursive filter • Response is infinite in duration. y(t) = b0 x(t) + b1 x(t-1) + b2 x(t-2) + … + bn x(t-n) + a1 y(t-1) + a2 y(t-2) + … + an y(t-n) (We will demo IIR using rbSnack)
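The recursive equation above can be sketched the same way. Again this is plain Ruby rather than the rbSnack API; `iir_filter` is a hypothetical name, `a` holds [a1, a2, …], and values before the start of the signal are treated as zero.

```ruby
# y(t) = b0 x(t) + ... + bn x(t-n) + a1 y(t-1) + ... + an y(t-n)
# b: feedforward coefficients [b0, b1, ...]
# a: feedback coefficients [a1, a2, ...] (a[0] is a1)
def iir_filter(b, a, x)
  y = []
  x.each_index do |t|
    feedforward = (0...b.length).sum(0.0) { |k| t >= k ? b[k] * x[t - k] : 0.0 }
    feedback = (1..a.length).sum(0.0) { |k| t >= k ? a[k - 1] * y[t - k] : 0.0 }
    y << feedforward + feedback
  end
  y
end

impulse = [1.0, 0.0, 0.0, 0.0, 0.0, 0.0]
puts iir_filter([1.0], [0.5], impulse).inspect
```

With a single feedback coefficient of 0.5, the impulse response halves at every step and never reaches exactly zero, which is the "infinite in duration" part.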
Linear Predictive Analysis • Analogous to Fourier analysis • Assumption: for each frame, the signal is predicted by y(t) = a1 y(t-1) + a2 y(t-2) + … + ap y(t-p) • The LPC coefficients a1, …, ap are the best least-squares fit of this prediction. • Can also be used to estimate formants
What is Sound? What is Speech? • Sound is the signal created by longitudinal waves in some medium, such as air. • Sound waves are continuous • They can be decomposed into a linear combination of sine waves. • Speech is a special noise made by humans
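One common way to find least-squares LPC coefficients for a frame is the autocorrelation method solved with the Levinson-Durbin recursion. The sketch below assumes that method; the slides do not say which one rbSnack or Snack uses, and the names `autocorrelation` and `lpc` are made up for this example.

```ruby
# Autocorrelation of a frame for lags 0..max_lag.
def autocorrelation(frame, max_lag)
  (0..max_lag).map do |lag|
    (lag...frame.length).sum(0.0) { |t| frame[t] * frame[t - lag] }
  end
end

# Levinson-Durbin recursion. Returns [a1, ..., ap] for the predictor
# y(t) = a1 y(t-1) + ... + ap y(t-p). Assumes a non-silent frame (r[0] > 0).
def lpc(frame, order)
  r = autocorrelation(frame, order)
  a = []        # a[k - 1] holds a_k for the current order
  error = r[0]  # prediction error energy so far
  (1..order).each do |i|
    acc = r[i] - (1...i).sum(0.0) { |k| a[k - 1] * r[i - k] }
    reflection = acc / error
    updated = (1...i).map { |k| a[k - 1] - reflection * a[i - k - 1] }
    a = updated << reflection
    error *= (1.0 - reflection * reflection)
  end
  a
end

# A test signal that really does obey y(t) = 1.3 y(t-1) - 0.4 y(t-2).
y = [1.0, 1.3]
(2...200).each { |t| y << 1.3 * y[t - 1] - 0.4 * y[t - 2] }
puts lpc(y, 2).inspect
```

On this synthetic signal the recovered coefficients come out very close to 1.3 and -0.4. Real speech frames are usually windowed first, which this sketch leaves out.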
It’s Just Tubing… • The simplest model of speech is to consider the lungs and trachea as one long tube. • Resonance frequencies are called formants (F1, F2).
Some Speech Recognition Features • Formants • Pitch • Voiced/Unvoiced • Nasality • Frication • Energy Our current work only uses Formants and Energy
Basic Utterances • A basic unit of speech is called a phone • Vowels are utterances with constant formants • A diphthong is a transition from one vowel to another • Vowels and diphthongs are essentially characterized by their first and second formants.
Other Phones: The Consonants • Plosive: closure in the oral cavity /p/ • Nasal: closure of the oral cavity, with air flowing through the nasal cavity /m/ • Fricative: turbulent airstream noise /s/ • Retroflex liquid: vowel-like, tongue high and curled back /r/ • Lateral liquid: vowel-like, tongue central, side airstream /l/ • Glide: vowel-like /y/
Some Problems with Speech Signals • Segmentation: when does a word begin and end? (Noise?) • Wetware: the speaker’s internal configuration, plus lip smacks, breathing, etc. • SegmentationWorkshop demos one approach.
Code Books • A code book consists of code words. • The idea is to search the code book for the code word that best matches the feature sequence. • rbSnack uses the code book approach in word recognition.
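The code book search can be sketched as a nearest-neighbor lookup. This is a simplified illustration, not rbSnack's implementation: each code word here is a single [F1, F2] pair rather than a feature sequence, the formant values are rough textbook-style figures chosen for the example, and the names `distance` and `best_match` are assumptions.

```ruby
# Euclidean distance between two feature vectors of equal length.
def distance(u, v)
  Math.sqrt(u.zip(v).sum(0.0) { |p, q| (p - q)**2 })
end

# Return the label of the code word closest to the measured features.
def best_match(codebook, features)
  codebook.min_by { |_label, code_word| distance(code_word, features) }.first
end

# Hypothetical code book: vowel label => [F1, F2] in Hz (illustrative values).
codebook = {
  "i" => [270.0, 2290.0],
  "a" => [730.0, 1090.0],
  "u" => [300.0, 870.0]
}

puts best_match(codebook, [290.0, 2200.0])
```

The search always returns some code word, however poor the match, which is one reason the approach is insensitive to context and prone to errors.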
Code Book Approach • ++ Easy to implement • + Good for isolated words • +- Works best on small vocabularies • -- Is insensitive to context, prone to errors
Code Book Approach • WhichWay is a simple demo of this approach
More Problems with Speech Signals • Accent: Southern vs. New England vs. California Valley vs. Other. • Variation in rate of speech makes it hard to compare words
Dynamic Time Warping • A pattern comparison technique • A way of stretching or compressing one sequence to match another. • Evaluated using dynamic programming
Dynamic Programming • Form a grid, with start at lower left, end at upper right. • Label each node with difference (error) between pattern 1 at time i and pattern 2 at time j. • Find minimal distance from start to end using
Dynamic Programming Basic Assumption: If best path P(S,E) passes through node N, then P(S,E) is the concatenation of P(S,N) (best from S to N) and P(N,E) (best from N to E) • A possible path
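The grid search described in the slides above can be sketched as follows. This is one simple variant of dynamic time warping, assuming scalar features, absolute difference as the node error, and three allowed moves (right, up, diagonal) with equal weight. It is not necessarily any of the path types shipped with rbSnack, and `dtw_distance` is a hypothetical name.

```ruby
# Minimal-cost alignment of two non-empty sequences of numbers.
# cost[i][j] is the cheapest total error of any path from the start
# node (0, 0) to node (i, j).
def dtw_distance(pattern1, pattern2)
  inf = Float::INFINITY
  cost = Array.new(pattern1.length) { Array.new(pattern2.length, inf) }
  pattern1.each_index do |i|
    pattern2.each_index do |j|
      local_error = (pattern1[i] - pattern2[j]).abs
      best_previous =
        if i.zero? && j.zero?
          0.0
        else
          [
            i > 0 ? cost[i - 1][j] : inf,
            j > 0 ? cost[i][j - 1] : inf,
            i > 0 && j > 0 ? cost[i - 1][j - 1] : inf
          ].min
        end
      cost[i][j] = local_error + best_previous
    end
  end
  cost[-1][-1]
end

# The same shape spoken at two different rates aligns with zero cost.
puts dtw_distance([1, 2, 3], [1, 2, 2, 3])
```

Because the best path to any node is built from the best paths to its predecessors, each node is visited once, so the cost is proportional to the size of the grid.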
Dynamic Programming • rbSnack includes examples for various time alignment approaches (Figure: weighted local path constraints, Type I and Type III)
Dynamic Programming (Figure: weighted local path constraints, Itakura and Type IV)
Hidden Markov Models • Sometimes the second (or third) best match is the right word. Use HMMs to ascertain the correct word in the context of the sentence. (Ditto for phones within a word.) • HMMs are similar to non-deterministic finite state machines, except that their output is also non-deterministic.
Hidden Markov Models • Dynamic programming is used to compute weights. • HMMs look like: (Figure: state diagram with states 1 to 4, transition probabilities such as .4 and .2, and output probabilities P(/i/)=.5, P(/a/)=.2, P(/o/)=.3)
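As one example of dynamic programming on an HMM, the forward algorithm computes the probability that a model produces a given sequence of phones. The sketch below is a hypothetical two-state model: state 1 reuses the output probabilities shown on the slide, while the second state and all transition probabilities are invented for illustration. The name `forward_probability` is also an assumption.

```ruby
# Probability that the model emits the given observation sequence,
# summed over all hidden state paths.
# initial[s]: probability of starting in state s
# transition[s][t]: probability of moving from state s to state t
# emission[s][o]: probability that state s outputs symbol o
def forward_probability(initial, transition, emission, observations)
  states = initial.keys
  alpha = states.to_h { |s| [s, initial[s] * emission[s][observations.first]] }
  observations.drop(1).each do |obs|
    alpha = states.to_h do |s|
      incoming = states.sum(0.0) { |prev| alpha[prev] * transition[prev][s] }
      [s, incoming * emission[s][obs]]
    end
  end
  alpha.values.sum
end

# Hypothetical model; only state 1's output probabilities come from the slide.
initial = { 1 => 1.0, 2 => 0.0 }
transition = { 1 => { 1 => 0.6, 2 => 0.4 }, 2 => { 1 => 0.0, 2 => 1.0 } }
emission = {
  1 => { "i" => 0.5, "a" => 0.2, "o" => 0.3 },
  2 => { "i" => 0.1, "a" => 0.7, "o" => 0.2 }
}

puts forward_probability(initial, transition, emission, %w[i a])
```

Scoring each candidate word or phone sequence this way gives a number that can be compared across candidates, so the second-best acoustic match can still win in context.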
Possible Future Directions • Examine other features (pitch?) • Incorporate other libraries. (Do the computationally hard work in C) • Add more signal processing routines • Add more examples • Use Hidden Markov Models
Lessons Learned/to be learned • Document everything. • Nothing’s perfect • Automate everything • Project is never done
What’s next? • Try it out.