400 likes | 414 Views
Sound. What is Sound ?. Acoustics is the study of sound. Physical - sound as a disturbance in the air Psychophysical - sound as perceived by the ear Sound as stimulus (physical event) & sound as a sensation. Pressures changes (in band from 20 Hz to 20 kHz) Physical terms Amplitude
E N D
What is Sound ? • Acoustics is the study of sound. • Physical - sound as a disturbance in the air • Psychophysical - sound as perceived by the ear • Sound as stimulus (physical event) & sound as a sensation. • Pressures changes (in band from 20 Hz to 20 kHz) Physical terms • Amplitude • Frequency • Spectrum
Sound Waves • In a free field, an ideal source of acoustical energy sends out sound of uniform intensity in all directions. => Sound is propagating as a spherical wave. • Intensity of sound is inversely proportional to the square of the distance (Inverse distance law). • 6 dB decrease of sound pressure level per doubling the distance.
How we hear • Ear connected to the brain • left brain: speech • right brain: music • Ear's sensitivty to frequency is logarithmic • Varying frequency response • Dynamic range is about 120 dB (at 3-4 kHz) • Frequency discrimination 2 Hz (at 1 kHz) • Intensity change of 1 dB can be detected.
Digital Sampling • Sampling is dictated by the Nyquist sampling theorem which states how quickly samples must be taken to ensure an accurate representation of the analog signal. • The Nyquist sampling theorem states that the sampling frequency must be greater than the highest frequency in the original analog signal.
Sound Sampling Basics • Common Sampling Rates • 8KHz (Phone) or 8.012820513kHz (Phone, NeXT) • 11.025kHz (1/4 CD std) • 16kHz (G.722 std) • 22.05kHz (1/2 CD std) • 44.1kHz (CD, DAT) • 48kHz (DAT) • Bits per Sample • 8 or 16 • Number of Channels • mono/stereo/quad/ etc.
Space Requirements Storage Requirements for One Minute of Sound
Common Sound File Formats • Mulaw (Sun, NeXT) .au • RIFF Wave (MS WAV) .wav • MPEG Audio Layer (MPEG) .mpa .mp3 • AIFC (Apple, SGI) .aiff .aif • HCOM (Mac) .hcom • SND (Sun, NeXT) .snd • VOC (Soundblaster card proprietary standard) .voc • AND MANY OTHERS!
What’s in a Sound File Format • Header Information • Magic Cookie • Sampling Rate • Bits/Sample • Channels • Byte Order • Endian • Compression type • Data
Example File Format (NIST SPHERE) NIST_1A 1024 sample_rate -i 16000 channel_count -i 1 sample_n_bytes -i 2 sample_byte_format -s2 10 sample_sig_bits -i 16 sample_count -i 594400 sample_coding -s3 pcm sample_checksum -i 20129 end_head
WAVe file format (Microsoft) RIFF A collection of data chunks. Each chunk has a 32-bit Id followed by a 32-bit chunk length followed by the chunk data. 0x00 chunk id 'RIFF' 0x04 chunk size (32-bits) 0x08 wave chunk id 'WAVE' 0x0C format chunk id 'fmt ' 0x10 format chunk size (32-bits) 0x14 format tag (currently pcm) 0x16 number of channels 1=mono, 2=stereo 0x18 sample rate in hz 0x1C average bytes per second 0x20 number of bytes per sample 1 = 8-bit mono 2 = 8-bit stereo or 16-bit mono 4 = 16-bit stereo 0x22 number of bits in a sample 0x24 data chunk id 'data' 0x28 length of data chunk (32-bits) 0x2C Sample data
Digital Audio Today • Analog elements in the audio chain are replaced with digital elements. • 16-bit wordlength, 32/44.1/48 kHz sampling rates. • Mostly linear signal processing. • Wide range of digital formats and storage media. • Rapid development of 1-bit conversation technology => better SNR, phase and linearity. • Rapid increase of signal processing power => possibility to implement new, complex features.
CD vs LP • Information is stored digitally. • The length of its data pits represents a series of 1s and 0s. • Both audio channels are stored along the same pit track. • Data is read using laser beam. • Information density about 100 times greater than in LP. • CD player can correct disc errors.
Benefits of CD • Robust • No degradation from repeated playings because data is read by the laser beam. • Error correction • Transport`s performance does not affect the quality of audio reproduction. • Digital circuitry more immune to aging and temperature problems • Data conversion is independent of variations in disc rotational speed, hence wow and flutter are neglible. • SNR over 90 dB. • Subcode for display, control and user information
Digital CD Format • Sampling • 44.1 kHz => 10 % margin with respect to the Nyquist frequency (audible frequencies below 20 kHz) • 16-bit linear => theoretical SNR about 98 dB (for sinusoidal signal with maximum allowed amplitude) • audio bit rate 1.41 Mbit/s (44.1 kHz * 16 bits * 2 channels) • Cross Interleaved Reed-Solomon Code (CIRC) for error correction • Specifications • Playing time max. 74.7 min • Disc diameter 120 mm • Disc thickness 1.2 mm • One sided medium, rotates clockwise • Signal is recorded from inside to outside • Pit is about 0.5 µm wide • Pit edge is 1 and all other areas whether inside or outside a pit, are 0s
Motivation for Sound Compression • need to minimize transmission costs or provide cost efficient storage • demand to transmit over channels of limited capacity such as mobile radio channels • need to share capacity for different services (voice, audio, data, graphics, images) in integrated service network
Compression u-LAW sihttp://shuttle.nasa.gov/askmcc/answers/lence detection ADPCM (adaptive, delta PCM, 24/32/40 kbps) LPC-10E (Linear Predictive Coding 2.4kb/s) CELP 4.8Kb/s - builds on LPC GSM (European Cell Phones, RPE-LPC) 1650 bytes/sec (at 8000 samples/sec) RealAudio (builds on CELP, GSM, proprietary) MPEG Audio Layers (builds on ADPCM) Layer-2: From 32 kbps to 384 kbps - target bit rate of 128 kbps Layer-3: From 32 kbps to 320 kbps - target bit rate of 64 kbps Complex compression, using perceptual models
Compression u-LAW sihttp://shuttle.nasa.gov/askmcc/answers/lence detection ADPCM (adaptive, delta PCM, 24/32/40 kbps) LPC-10E (Linear Predictive Coding 2.4kb/s) CELP 4.8Kb/s - builds on LPC GSM (European Cell Phones, RPE-LPC) 1650 bytes/sec (at 8000 samples/sec) RealAudio (builds on CELP, GSM, proprietary) MPEG Audio Layers (builds on ADPCM) Layer-2: From 32 kbps to 384 kbps - target bit rate of 128 kbps Layer-3: From 32 kbps to 320 kbps - target bit rate of 64 kbps Complex compression, using perceptual models
Sound Editing • GoldWave (www.goldwave.com)- • requires a sound card. • digital audio sound player, recorder and editor • can load, play and edit many different file formats • .wav, .au, .voc, .snd • displays separate graphics for the left and right channels • very easy to use • good sound quality • Others: WHAM, Cool Edit, SOX, WINPLANY, Digital Audio Playback Facility, MOD4Win, etc.
Compression Approaches • Delta coding • Encode differences only • Predictive coding • Predict the next sample • Linear Predictive Coding (LPC) - mostly for speech • Describe fundamental frequencies + ‘error’ • CELP, cell-phone standards • Variable Rate Encoding • Don’t encode silences • Subband coding • Split into frequency bands each encoded separately + efficiently • Psycho-acoustical coding • drop bits where you can’t hear it
Tips for Audio on the Web • There is no generic audio standard on the Web • Listening to 16-bit sounds on an 8-bit system results in strange effects • Users will be annoyed if they spend a lot of time downloading a sound and they can’t play it • Distribute only 8-bit sounds on your Web page • Or, provide different sound files in both 8- and 16-bits • Record in the highest sampling rate and size you can, and then process down to 8-bit • Keep file size small • downsampling to 8-bit • use a lower sampling rate • use mono sounds • Describe what format those sounds are in • WAVE, AIFF, or other format • Providing the file size in the description is a politeness to help estimate download times • If you need high sound quality and have large audio files: • Use a smaller sound clip in m-law format as a preview • or for those who can’t to listen to the higher-quality sample. • Check out http://www.realaudio.com/help/content/audiohints.html.
Acoustic Modeling Describes the sounds that make up speech Speech Recognition Lexicon Describes which sequences of speech sounds make up valid words Language Model Describes the likelihood of various sequences of words being spoken Speech Recognition Knowledge Sources
Speech Recognition O is an acoustical ‘observation’ W is a ‘word’ we are trying to recognize Maximize w = argmax (P(w) | O) P(w|O) is unknown so by Bayes’ rule: P(O|w) P(w) P(w|O) = ------------------------ p(O)
Lexicon - links words to phones in acoustic model Aaron EH R AX N Aaron(2) AE R AX N abandon AX B AE N D AX N abandoned AX B AE N D AX N DD abandoning AX B AE N D AX N IX NG abandonment AX B AE N D AX N M AX N TD abated AX B EY DX IX DD abatement AX B EY TD M AX N TD abbey AE B IY Abbott AE B AX TD Abboud AA B UW DD abby AE B IY abducted AE BD D AH KD T IX DD Abdul AE BD D UW L
CONVERSATIONAL SPEECH 100 Non-English English 50 READ SPEECH 5000 word BROADCAST NEWS 20,000 Word 1000 Word vocabulary Varied microphones Word Error Rate (%) 10 Standard microphone Noisy environment Unlimited Vocabulary All results are Speaker -Independent 1 1988 1992 1993 1989 1990 1991 1994 1995 1996 1997 1998 Continual Progress in Speech Recognition Increasingly Difficult Tasks, Steadily Declining Error Rates NSA/Wayne/Doddington
References • http://www.nlc-bnc.ca/pubs/netnotes/notes24.htm • http://www.spies.com/sox (conversion tool) • http://freebsd.cdrom.com/.5/cica/sounds/gldwav21.zip • Sub-Band Coding: http://www.otolith.com/pub/u/howitt/sbc.tutorial.html • Speech Recognitionhttp://svr-www.eng.cam.ac.uk/comp.speech/Section2/speechlinks.html
Sound That’s all for today