Tools for Sound, Speech, and Multi-modal Interaction Johnny Lee 05-830 Advanced UI Software
Sound • Authoring Tools • Recording, Playback • SFX libraries • Editing, Mixing • MIDI • Developer Tools • Software APIs • FFT libraries
Recording Sound Most laptops have built-in mono microphones (Schoeps)
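As a concrete illustration of capturing from the built-in microphone, here is a minimal sketch using javax.sound.sampled (the Java audio API listed later in the deck); the 16 kHz mono format and the three-second duration are arbitrary choices for the example:

```java
import javax.sound.sampled.*;
import java.io.ByteArrayOutputStream;

public class MicCapture {
    public static void main(String[] args) throws Exception {
        // 16 kHz, 16-bit, mono, signed, little-endian -- a common capture format
        AudioFormat format = new AudioFormat(16000f, 16, 1, true, false);
        TargetDataLine mic = AudioSystem.getTargetDataLine(format);
        mic.open(format);
        mic.start();

        // Read roughly three seconds of audio from the default input device
        ByteArrayOutputStream captured = new ByteArrayOutputStream();
        byte[] buffer = new byte[4096];
        long bytesWanted = (long) (format.getFrameRate() * format.getFrameSize() * 3);
        while (captured.size() < bytesWanted) {
            int n = mic.read(buffer, 0, buffer.length);
            if (n < 0) break;
            captured.write(buffer, 0, n);
        }
        mic.stop();
        mic.close();
        System.out.println("Captured " + captured.size() + " bytes of raw PCM");
    }
}
```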
Playing Sound Most laptops have built-in speakers
Multichannel Audio • Pro Tools by Digidesign – up to 64 channels of 24-bit, 48 kHz audio I/O
Sound Libraries • Sound Ideas (http://www.sound-ideas.com/) • General Series 6000 • Hanna-Barbera (http://gs304.sp.cs.cmu.edu/sfx/) • Lots of other smaller suppliers of stock sound libraries
Editing/Mixing Sounds • Logic Audio, Sound Forge, Peak, SoundEdit 16, and many others. • Edit sound much the way a text editor edits text. • Sophisticated DSP (some of it real-time) • Synchronization with video and MIDI support
MIDI • “Musical Instrument Digital Interface” • Hardware communication layer • 5-pin DIN, uni-directional with pass-thru • Software protocol layer • MIDI commands are 2-3 bytes • Note specification • Device configuration (128 controllers) • Device control/synchronization
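To make the 2-3 byte command format concrete, here is a minimal sketch that sends a 3-byte note-on/note-off pair through javax.sound.midi; the channel, note number, and velocity are arbitrary example values:

```java
import javax.sound.midi.*;

public class MidiNoteDemo {
    public static void main(String[] args) throws Exception {
        // A MIDI "note on" is a 3-byte message: status (0x90 | channel), note number, velocity
        Receiver out = MidiSystem.getReceiver();   // default synthesizer/output port

        ShortMessage noteOn = new ShortMessage();
        noteOn.setMessage(ShortMessage.NOTE_ON, 0, 60, 93);   // channel 0, middle C, velocity 93
        out.send(noteOn, -1);                                 // -1 = send immediately

        Thread.sleep(500);                                    // let the note ring for half a second

        ShortMessage noteOff = new ShortMessage();
        noteOff.setMessage(ShortMessage.NOTE_OFF, 0, 60, 0);
        out.send(noteOff, -1);

        out.close();
    }
}
```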
MIDI • Lots of general-purpose fields • Simple electronics (two resistors and a PIC processor) • A semi-popular option for simple control/robotics applications.
MOD files • File size can be tiny if a MIDI synthesizer is used at playback time. • Playback quality depends on the quality of the synthesizer. • MOD files (module format) combine MIDI data with WAV samples to produce high-quality, consistent playback in a relatively small file.
Microsoft – DirectX 9.0 • DirectX is: • DirectDraw – 2D drawing • Direct3D – 3D drawing • DirectInput – input/haptic devices • DirectPlay – network gaming • DirectShow – video streams • DirectSound – wave audio I/O • DirectMusic – soundtrack management and MIDI • DirectSetup – DirectX installation routines
DirectSound • WAV capture • Multi-channel sound playback • Full duplex • 3D specification of sound sources. • Some real-time DSP: Chorus, Compression, Flange, Distortion, Echo, Reverb
DirectMusic • Coordinates several sound files (MIDI, wav, etc.) into “soundtracks”. • Sequencing (timelines, cueing, and synchronization). • Supports dynamic composition, variation, and transitioning between songs/parts. • Dynamic content authored in DirectMusic Producer
DirectMusic • Compositions can be made with DLS (downloadable sound) files – a cross-platform “smart” audio file format designed for dynamic loading in interactive applications. • DLS = MIDI + WAV for interactive apps
MacOS X – Core Audio • Sound Manager – routines for resource management and playing/recording sound • AudioToolbox – sophisticated DSP architecture, sequencing/composition • MIDI Services – device abstraction, control, and patching • Audio HAL – mid-level I/O access (real-time, low-latency, multi-channel; floating-point access is standard) • IOKit – low-level device access • Drivers, Hardware • Full Java API provided
Java • Basic data structures and routines for loading, playing, and stopping sounds. • java.applet.AudioClip • javax.sound.midi • javax.sound.midi.spi • javax.sound.sampled • javax.sound.sampled.spi • I/O device access is somewhat limited. • I’ve been told that synchronization is a problem in Java.
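A minimal playback sketch using javax.sound.sampled; the file name beep.wav is just a placeholder:

```java
import javax.sound.sampled.*;
import java.io.File;

public class PlayWav {
    public static void main(String[] args) throws Exception {
        // Load a sound file (path is only an example) and play it on the default output
        AudioInputStream in = AudioSystem.getAudioInputStream(new File("beep.wav"));
        Clip clip = AudioSystem.getClip();
        clip.open(in);
        clip.start();

        // Clip playback is asynchronous; block until the sound has finished
        Thread.sleep(clip.getMicrosecondLength() / 1000 + 100);
        clip.close();
    }
}
```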
Voice as Sound • Takeo Igarashi and John F. Hughes, “Voice as Sound: Using Non-verbal Voice Input for Interactive Control,” UIST 2001, pp. 155-156 • STFT, FFT analysis • Extension to SUITEKeys
Fourier Transform (FT) • Simple “properties” of a sound, such as duration and volume, can be read directly from the data file. • More interesting analysis requires some DSP – mainly the Fourier transform.
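For illustration, a small sketch that reads duration from the file header and computes an RMS "volume"; it assumes a 16-bit signed little-endian mono WAV, and the file name is a placeholder:

```java
import javax.sound.sampled.*;
import java.io.File;

public class WavStats {
    public static void main(String[] args) throws Exception {
        AudioInputStream in = AudioSystem.getAudioInputStream(new File("speech.wav"));
        AudioFormat fmt = in.getFormat();

        // Duration falls straight out of the header: frames / frames-per-second
        double seconds = in.getFrameLength() / (double) fmt.getFrameRate();

        // Rough "volume": RMS of the samples (assumes 16-bit signed little-endian mono)
        byte[] bytes = in.readAllBytes();
        double sumSquares = 0;
        int samples = bytes.length / 2;
        for (int i = 0; i < samples; i++) {
            int lo = bytes[2 * i] & 0xff;
            int hi = bytes[2 * i + 1];            // sign-extended high byte
            double s = ((hi << 8) | lo) / 32768.0;
            sumSquares += s * s;
        }
        double rms = Math.sqrt(sumSquares / Math.max(samples, 1));

        System.out.printf("duration = %.2f s, RMS level = %.3f%n", seconds, rms);
    }
}
```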
Fourier Transform • FT extracts the frequency content from a given segment of audio.
Fast Fourier Transform (FFT) • The FFT is a fast algorithm for computing the discrete Fourier transform (DFT). • Implementations are available in most languages. • Good reference source: Numerical Recipes in C++
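Since the FFT only speeds up the DFT, here is a brief sketch of what the transform actually computes: the textbook O(N²) DFT magnitude spectrum of a real signal. A library FFT returns the same magnitudes, just much faster:

```java
public class Dft {
    /** Magnitude spectrum of a real signal via the textbook O(N^2) DFT. */
    static double[] magnitudeSpectrum(double[] x) {
        int n = x.length;
        double[] mag = new double[n / 2];          // bins up to the Nyquist frequency
        for (int k = 0; k < mag.length; k++) {
            double re = 0, im = 0;
            for (int t = 0; t < n; t++) {
                double angle = 2 * Math.PI * k * t / n;
                re += x[t] * Math.cos(angle);
                im -= x[t] * Math.sin(angle);
            }
            mag[k] = Math.hypot(re, im);
        }
        return mag;
    }

    public static void main(String[] args) {
        // 64 samples of a sine at bin 5: the spectrum should peak at index 5
        int n = 64;
        double[] x = new double[n];
        for (int t = 0; t < n; t++) x[t] = Math.sin(2 * Math.PI * 5 * t / n);

        double[] mag = magnitudeSpectrum(x);
        int peak = 0;
        for (int k = 1; k < mag.length; k++) if (mag[k] > mag[peak]) peak = k;
        System.out.println("peak bin = " + peak);   // prints 5
    }
}
```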
Speech Synthesis Three categories of speech synthesizers: • Articulatory synthesis - uses a physical model of the physiology of speech production and the physics of sound generation in the vocal apparatus • Formant synthesis - an acoustic-phonetic approach; applies hundreds of rule-driven “filters” loosely associated with the movement of the articulators • Concatenative synthesis - draws on a segmental database that reflects the major phonological features of a language; creates smooth transitions and applies basic processing to match prosodic patterns (http://cslu.cse.ogi.edu/HLTsurvey/ch5node4.html)
AT&T Natural Voices • US English, UK English, French, Spanish, German, Korean • Can build a new voice font from recordings of an existing person • Examples: • Male Voice • Custom UK English • Voice Font • French
Phoenix Semantic Frame Parser • Center for Spoken Language Research, University of Colorado, Boulder • http://communicator.colorado.edu/phoenix/license.html • System for processing and parsing natural language
Phoenix Details and Syntax for creating frames and networks: http://communicator.colorado.edu/phoenix/Phoenix_Manual.pdf
Universal Speech Interfaces Ronald Rosenfeld, Dan Olsen, and Alex Rudnicky, “Universal Speech Interfaces,” interactions, Volume 8, Issue 6, October 2001 • “In essence, we attempt to do for speech what Palm’s Graffiti™ has done for mobile text entry.” • http://www-2.cs.cmu.edu/~usi/USI-manifesto.htm • “Speech is an ambient medium.” • “Speech is descriptive rather than referential.” • “Speech requires modest physical resources.” • “Only speech will scale as digital technology progresses.” • Three speech interaction techniques: natural language (NLI, NLP), dialog trees, command and control
Universal Speech Interfaces • “Look and feel” becomes “sound and say” • Universal Metaphors – familiar ways of doing things across applications • Universal User Primitives – standard dialog interaction techniques: detecting and recovering from errors, asking for help, navigation, etc. • Universal Machine Primitives – standardized machine responses and meanings to increase user understanding
Java Speech • JSAPI – Java Speech API • Speech Generation • Structure Analysis – Java Synthesis Markup Language (JSML) • Text Pre-Processing – abbreviations, acronyms, numbers such as “1998” • Text-to-Phoneme Conversion • Prosody Analysis • Waveform Production • Speech Recognition • Grammar Design - Java Speech Grammar Format (JSGF) • Signal Processing • Phoneme Recognition • Word Recognition • Result Generation
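A minimal JSAPI synthesis sketch, assuming a JSAPI 1.0 implementation (for example FreeTTS) is installed and registered so that Central can find an English engine:

```java
import java.util.Locale;
import javax.speech.Central;
import javax.speech.synthesis.Synthesizer;
import javax.speech.synthesis.SynthesizerModeDesc;

public class JsapiHello {
    public static void main(String[] args) throws Exception {
        // Ask the engine registry for any English synthesizer (requires a JSAPI
        // implementation such as FreeTTS to be installed and registered)
        Synthesizer synth = Central.createSynthesizer(
                new SynthesizerModeDesc(Locale.ENGLISH));
        synth.allocate();
        synth.resume();

        synth.speakPlainText("Hello from the Java Speech A P I.", null);
        synth.waitEngineState(Synthesizer.QUEUE_EMPTY);   // block until speech finishes

        synth.deallocate();
    }
}
```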
Windows .NET Speech SDK • Basically the .NET-ified SAPI 5.1 (Speech API) • Continuous Speech Recognition (US English, Japanese, and Simplified Chinese) • Concatenative Speech Synthesis (US English and Simplified Chinese) • The interface is split into two components: • Application Programming Interface (API) • Device Driver Interface (DDI)
Windows .NET Speech SDK • Speech Synthesis API • ISpVoice::Speak(“my text”, voice); • Speech Synthesis DDI • Parses text into an XML document • Calls the TTSEngine • Manages sound and threading details
Windows .NET Speech SDK • Speech Recognition API • Define context • Define grammar • Request type (dictation or command/control) • Event is fired when recognized • Speech Recognition DDI • Interfacing and configuring the SREngine • Manages sound and threading details.
Windows .NET Speech SDK • Speech Application Language Tags (SALT) – an extension to HTML for integrating speech into web pages • Speech Recognition Grammar Specification (SRGS) support for field parsing • Telephony Controls – interfaces with telephony technology for developing voice-only apps.
MacOS X Speech • Barely changed since 1996 (MacinTalk 3) • US English only • Full Java API • Speech Synthesis Manager (PlainTalk) • Algorithmic voice generation • Speech Recognition Manager • OS-wide push-to-talk command/control • Customizable vocabulary with scripting • Uses a “language model” (i.e., a grammar) • No dictation support
Dragon NaturallySpeaking • Commercial speech recognition software • Dictation • Command and control • Developer API available for application integration • http://www.scansoft.com/naturallyspeaking/
Sphinx • Open-source speech recognizer from CMU (http://fife.speech.cs.cmu.edu/sphinx/) • Auto-builds a language model/grammar and vocabulary from example sentences • CMU-Cambridge Statistical Language Modeling Toolkit – semi-machine-learning algorithms for digesting a large example corpus into a usable model • Uses the CMU Pronouncing Dictionary • SphinxTrain – builds new acoustic models from audio recordings, transcripts, a pronunciation dictionary/vocabulary, and a phoneme list
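For illustration, a recognition sketch using the later Java-based Sphinx-4 front end (the slide describes the earlier C-based Sphinx, but the workflow is the same: acoustic model + pronouncing dictionary + language model); the resource paths assume the models bundled with the Sphinx-4 distribution:

```java
import edu.cmu.sphinx.api.Configuration;
import edu.cmu.sphinx.api.LiveSpeechRecognizer;
import edu.cmu.sphinx.api.SpeechResult;

public class SphinxDemo {
    public static void main(String[] args) throws Exception {
        // The three inputs named on the slide: acoustic model, pronouncing
        // dictionary, and a language model built from example sentences
        Configuration config = new Configuration();
        config.setAcousticModelPath("resource:/edu/cmu/sphinx/models/en-us/en-us");
        config.setDictionaryPath("resource:/edu/cmu/sphinx/models/en-us/cmudict-en-us.dict");
        config.setLanguageModelPath("resource:/edu/cmu/sphinx/models/en-us/en-us.lm.bin");

        LiveSpeechRecognizer recognizer = new LiveSpeechRecognizer(config);
        recognizer.startRecognition(true);          // true = discard previously cached audio
        SpeechResult result = recognizer.getResult();
        System.out.println("Heard: " + result.getHypothesis());
        recognizer.stopRecognition();
    }
}
```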
SUITEKeys • Manaris, B., McCauley, R., and MacGyvers, V., “An Intelligent Interface for Keyboard and Mouse Control – Providing Full Access to PC Functionality via Speech,” Proceedings of the 14th International Florida AI Research Symposium (www.cs.cofc.edu/~manaris/) • Developed for individuals with motor disabilities • An interface layer that generates keyboard and mouse events for the OS • Recognizes keystrokes/operations: “backspace,” “function twelve,” “control-alt-delete,” “page down,” “press… release” • Recognizes mouse buttons and movement: “left-click,” “move down… stop,” “2 units above clock,” “move to 5-18”
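SUITEKeys is its own system; purely to illustrate the idea of an interface layer that injects keyboard and mouse events into the OS, here is a hypothetical dispatcher using java.awt.Robot (the phrase-to-action mapping is invented for the example):

```java
import java.awt.Robot;
import java.awt.event.InputEvent;
import java.awt.event.KeyEvent;

public class CommandDispatcher {
    private final Robot robot;

    public CommandDispatcher() throws Exception {
        robot = new Robot();   // generates native input events
    }

    /** Map a recognized phrase (hypothetical examples) to a synthetic OS-level event. */
    public void dispatch(String phrase) {
        switch (phrase) {
            case "page down":
                robot.keyPress(KeyEvent.VK_PAGE_DOWN);
                robot.keyRelease(KeyEvent.VK_PAGE_DOWN);
                break;
            case "left click":
                // extended button mask accepted by Robot on Java 9+
                robot.mousePress(InputEvent.BUTTON1_DOWN_MASK);
                robot.mouseRelease(InputEvent.BUTTON1_DOWN_MASK);
                break;
            case "backspace":
                robot.keyPress(KeyEvent.VK_BACK_SPACE);
                robot.keyRelease(KeyEvent.VK_BACK_SPACE);
                break;
            default:
                System.out.println("unrecognized command: " + phrase);
        }
    }

    public static void main(String[] args) throws Exception {
        new CommandDispatcher().dispatch("page down");
    }
}
```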
Suede Scott R. Klemmer, Anoop K. Sinha, Jack Chen, James A. Landay, Nadeem Aboobaker, and Annie Wang, Proceedings of the 13th Annual ACM Symposium on User Interface Software and Technology, November 2000 • A Wizard of Oz tool for prototyping speech interfaces • Lets the developer quickly build a state machine representing the possible paths through a speech interface, with recorded system responses stored at each step. • An operator simulates a functional system during evaluation by stepping through the state machine. • Runtime transcripts are recorded for later analysis.
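Suede itself is a graphical tool; the sketch below only illustrates the underlying idea of a dialogue state machine with recorded prompts that a wizard steps through by hand. All state names, prompts, and transitions are made up for the example:

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Scanner;

/** Toy dialogue graph a "wizard" can step through, in the spirit of Suede. */
public class DialogueGraph {
    record State(String prompt, Map<String, String> transitions) {}

    public static void main(String[] args) {
        Map<String, State> states = new LinkedHashMap<>();
        states.put("start", new State("Welcome. Say 'weather' or 'news'.",
                Map.of("weather", "weather", "news", "news")));
        states.put("weather", new State("It is sunny today. Say 'back'.",
                Map.of("back", "start")));
        states.put("news", new State("No news is good news. Say 'back'.",
                Map.of("back", "start")));

        Scanner in = new Scanner(System.in);
        String current = "start";
        while (true) {
            State s = states.get(current);
            System.out.println("SYSTEM: " + s.prompt());        // the recorded prompt
            System.out.print("wizard selects user response (or 'quit')> ");
            String heard = in.nextLine().trim();
            if (heard.equals("quit")) break;
            current = s.transitions().getOrDefault(heard, current);
        }
    }
}
```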