Speech Processing Applications of Images and Signals in High Schools AEGIS RET All-Hands Meeting Florida Institute of Technology July 6, 2012
Contributors • Dr. Veton Këpuska, Faculty Mentor, FIT, vkepuska@fit.edu • Jacob Zurasky, Graduate Student Mentor, FIT, jzuraksy@my.fit.edu • Becky Dowell, RET Teacher, BPS Titusville High, dowell.jeanie@brevardschools.org
Timeline • 1874: Alexander Graham Bell proves that frequency harmonics from an electrical signal can be divided • 1952: Bell Labs develops the first effective speech recognizer • 1971-1976: DARPA research holds that speech should be understood, not just recognized • 1980s: Call center and text-to-speech products become commercially available • 1990s: PC processing power allows ordinary users to run speech recognition software. Timeline of Speech Recognition. http://www.emory.edu/BUSINESS/et/speech/timeline.htm
Motivation • Applications • Call center speech recognition • Speech-to-text applications (e.g., dictation software) • Hands-free user interfaces (e.g., OnStar, XBOX Kinect, Siri) • Science Fiction 1968: Stanley Kubrick’s 2001: A Space Odyssey http://www.youtube.com/watch?v=6MMmYyIZlC4 • Science Fact 2011: Apple iPhone 4S Siri http://www.apple.com/iphone/features/siri.html
Difficulties • Continuous Speech (word boundaries) • Noise • Background • Other speakers • Differences in speakers • Dialects/Accents • Male/female
Motivation • Speech recognition requires speech to first be characterized by a set of “features”. • Features are used to determine what words are spoken. • Our project implements the feature extraction stage of a speech processing application.
Speech Recognition • Front End (pre-processing): reduces the amount of data passed to the back end while keeping enough information to accurately describe the signal; its output is a feature vector. Ex: a 256-sample frame is reduced to 13 features. • Back End (recognition): statistical models classify each feature vector as a particular sound in speech, producing the recognized speech.
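The front end operates on short frames of the signal, each of which is reduced to a small feature vector. As a minimal illustration of the framing step (the project itself is written in MATLAB; this is a Python/NumPy sketch with illustrative frame and hop sizes):

```python
import numpy as np

def split_into_frames(signal, frame_len=256, hop=128):
    """Split a 1-D signal into overlapping frames of frame_len samples,
    advancing by hop samples each time (sizes are illustrative)."""
    n = 1 + (len(signal) - frame_len) // hop
    return np.stack([signal[i * hop : i * hop + frame_len] for i in range(n)])

x = np.random.randn(16000)        # 1 second of audio at 16 kHz
frames = split_into_frames(x)
print(frames.shape)               # (124, 256): 124 frames of 256 samples
```

Each row of `frames` would then pass through the front-end stages and emerge as a 13-element feature vector, so the back end sees 124 × 13 numbers instead of 16,000 samples.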
Front-End Processing of Speech Recognizer
• Pre-emphasis: high-pass filter to compensate for the higher-frequency roll-off in human speech
• Framing: separate the speech signal into short frames
• Window: apply a window to smooth the edges of each framed speech signal
• FFT: transform the signal from the time domain to the frequency domain; the human ear perceives sound based on frequency content
• Mel-Scale: convert linear-scale frequency (Hz) to a logarithmic scale (the mel scale)
• log: take the log of the magnitudes (multiplication becomes addition) to allow separation of signals
• IFFT: inverse FFT to transform into the cepstral domain… the result is the set of “features”
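The stages above can be sketched end to end. The project implements this front end in MATLAB; the following is a hedged Python/NumPy illustration, with assumed parameter values (0.97 pre-emphasis coefficient, Hamming window, 26 mel filters, 8 kHz sample rate). The mel conversion uses the standard formula mel = 2595·log10(1 + f/700). Production MFCC front ends typically use a DCT for the final stage; an inverse FFT is used here to match the slides.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_filterbank(n_filters, n_fft, sr):
    """Triangular filters spaced evenly on the mel scale."""
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_filters + 2)
    hz_pts = 700.0 * (10.0 ** (mel_pts / 2595.0) - 1.0)
    bins = np.floor((n_fft + 1) * hz_pts / sr).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for m in range(1, n_filters + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, center):                    # rising edge
            fb[m - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):                   # falling edge
            fb[m - 1, k] = (right - k) / max(right - center, 1)
    return fb

def front_end(frame, sr=8000, n_filters=26, n_features=13):
    """One 256-sample frame -> one 13-element feature vector,
    following the stages on the slide."""
    emphasized = np.append(frame[0], frame[1:] - 0.97 * frame[:-1])  # pre-emphasis
    windowed = emphasized * np.hamming(len(frame))                   # window
    spectrum = np.abs(np.fft.rfft(windowed))                         # FFT magnitude
    energies = mel_filterbank(n_filters, len(frame), sr) @ spectrum  # mel scale
    logs = np.log(energies + 1e-10)                                  # log
    cepstrum = np.fft.irfft(logs, n=n_filters)                       # IFFT -> cepstral domain
    return cepstrum[:n_features]                                     # keep 13 features

frame = np.random.randn(256)
print(front_end(frame).shape)   # (13,)
```

This makes the data reduction on the earlier slide concrete: 256 samples enter, 13 features leave.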
Speech Analysis and Sound Effects (SASE) Project • Graphical User Interface (GUI) • Speech input • Record and save audio • Sound file (*.wav, *.ulaw, *.au) • Graphs the entire audio signal • Select a “frame” by clicking on graph • Process speech frame and display output for each stage of processing • Displays spectrogram
GUI Components • Plotting Axes • Buttons
MATLAB Code • Graphical User Interface (GUI) • GUIDE (GUI Development Environment) • Callback functions • Work in progress • Extendable • Stages of speech processing • Modular functions for reusability
SASE Lab • Interactive teaching tool • Demo
Future Work • Improve GUI • Audio Effects • Ex: Echo, Reverberation, Chorus, Flange • Noise Filtering
References • Ingle, Vinay K., and John G. Proakis. Digital Signal Processing Using MATLAB. 2nd ed. Toronto, Ont.: Nelson, 2007. • Oppenheim, Alan V., and Ronald W. Schafer. Discrete-Time Signal Processing. 3rd ed. Upper Saddle River: Pearson, 2010. • Weeks, Michael. Digital Signal Processing Using MATLAB and Wavelets. Hingham, Mass.: Infinity Science Press, 2007. • Timeline of Speech Recognition. http://www.emory.edu/BUSINESS/et/speech/timeline.htm
Thank you! Questions?
Unit Plan • Introduction • Lesson #1: The Sound of a Sine Wave • Lesson #2: Frequency Analysis • Lesson #3: Filtering (work in progress) • Lesson #4: SASE Lab (work in progress) • Conclusion