Audio Signal Processing (Fundamentals)


0 Aperitif
References: 1) Academic developments in this field 2) Technical developments in this field

References: the Internet

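Which qualities (skills) matter? What does the R&D process of a project look like?
* English: what is out there
* "Physical" concepts and intuition: what it is
* Mathematics: why it works
* Tools: how to do it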

1 Getting started: raw material for experiments
WAV file example: keep friends with.wav

File layout: format area (header), followed by the data area

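Offset  Bytes  Type  Contents
00H     4      char  "RIFF" marker
04H     4      long  File length minus 8, i.e. data length + 0x24 (file length = data length + 0x2C)
08H     4      char  "WAVE" marker
0CH     4      char  "fmt " marker (note the trailing space)
10H     4      long  Size of the format block that follows (may vary; 10H = 16 for PCM)
14H     2      int   Format tag (1 = PCM audio data)
16H     2      int   Number of channels: 1 = mono, 2 = stereo
18H     4      long  Sample rate (samples per second)
1CH     4      long  Average data rate in bytes per second = channels × sample rate × bits per sample / 8. Playback software uses this value to estimate the buffer size.
20H     2      int   Block alignment in bytes = channels × bits per sample / 8. Playback software processes data in whole multiples of this size when adjusting its buffers.
22H     2      int   Bits per sample, i.e. the number of bits in each sample of each channel. With multiple channels, the sample size is the same for every channel.
24H     4      char  "data" marker
28H     4      long  Length of the audio data in bytes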

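/* Canonical 44-byte header; assumes 4-byte long and 2-byte short (as in VC6.0). */
typedef struct {
    char Riff[4];                   /* "RIFF" */
    unsigned long sizeOfFile;       /* file length - 8 */
    char WAVEfmt[8];                /* "WAVEfmt " */
    unsigned long sizeOfFmt;        /* 16 for PCM */
    short int wFormatTag;           /* 1 = PCM */
    short int nChannels;
    unsigned long nSamplesPerSec;
    unsigned long navgBytesPerSec;
    short int nBlockAlign;
    unsigned short nBitPerSample;
    char Cdata[4];                  /* "data" */
    unsigned long sizeOfData;       /* bytes of audio data */
} HeadOfWave;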
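Reading a file straight into HeadOfWave only works when the compiler uses a 4-byte long and inserts no padding. As a minimal sketch of a more portable alternative, assuming the canonical 44-byte PCM layout described above, each field can be decoded from its byte offset. The names WavInfo, rd_u16, rd_u32 and parse_wav_header are hypothetical and do not come from the original slides.

```c
#include <stdint.h>
#include <string.h>

/* Decoded header fields (hypothetical struct for this sketch). */
typedef struct {
    uint16_t format_tag;       /* 1 = PCM */
    uint16_t channels;
    uint32_t sample_rate;
    uint32_t byte_rate;        /* navgBytesPerSec */
    uint16_t block_align;      /* nBlockAlign */
    uint16_t bits_per_sample;
    uint32_t data_len;         /* bytes of audio data */
} WavInfo;

/* RIFF stores integers little-endian, so assemble them byte by byte. */
static uint16_t rd_u16(const unsigned char *p)
{
    return (uint16_t)(p[0] | (p[1] << 8));
}

static uint32_t rd_u32(const unsigned char *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Returns 0 on success, -1 if a marker is not where the canonical
 * layout puts it. Assumption: exactly the 44-byte header; files that
 * carry extra chunks would need a loop over chunks instead. */
int parse_wav_header(const unsigned char h[44], WavInfo *out)
{
    if (memcmp(h + 0x00, "RIFF", 4) != 0 ||
        memcmp(h + 0x08, "WAVE", 4) != 0 ||
        memcmp(h + 0x0C, "fmt ", 4) != 0 ||
        memcmp(h + 0x24, "data", 4) != 0)
        return -1;

    out->format_tag      = rd_u16(h + 0x14);
    out->channels        = rd_u16(h + 0x16);
    out->sample_rate     = rd_u32(h + 0x18);
    out->byte_rate       = rd_u32(h + 0x1C);
    out->block_align     = rd_u16(h + 0x20);
    out->bits_per_sample = rd_u16(h + 0x22);
    out->data_len        = rd_u32(h + 0x28);
    return 0;
}
```

A caller would fread the first 44 bytes of the file into an unsigned char buffer and pass that buffer to parse_wav_header.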

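A few notes.
* File length vs. data length
* Key quantities: sample rate / number of channels / quantization mode / quantization bits
* How navgBytesPerSec and nBlockAlign are computed
* Program example and explanation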
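The relations between these header fields can be sketched in a few lines of C. This is only an illustration under stated assumptions: PCM data, a bit depth that is a multiple of 8, and the plain 44-byte header. The function names are hypothetical.

```c
#include <stdint.h>

/* nBlockAlign: bytes per sample frame (one sample for every channel).
 * Assumes bits_per_sample is a multiple of 8. */
uint16_t wav_block_align(uint16_t channels, uint16_t bits_per_sample)
{
    return (uint16_t)(channels * bits_per_sample / 8);
}

/* navgBytesPerSec = channels * sample rate * bits per sample / 8. */
uint32_t wav_avg_bytes_per_sec(uint32_t sample_rate, uint16_t channels,
                               uint16_t bits_per_sample)
{
    return sample_rate * wav_block_align(channels, bits_per_sample);
}

/* Value stored at offset 04H: file length - 8 = data length + 0x24,
 * valid for the 44-byte (0x2C) header only. */
uint32_t wav_riff_size(uint32_t data_len)
{
    return data_len + 0x24;
}

/* Playing time in seconds implied by the data length. */
double wav_duration_sec(uint32_t data_len, uint32_t avg_bytes_per_sec)
{
    return avg_bytes_per_sec ? (double)data_len / avg_bytes_per_sec : 0.0;
}
```

For example, 16 kHz mono 16-bit speech gives a block alignment of 2 bytes and 32000 bytes per second, so 32000 bytes of data last one second.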

2 Basic concepts: sample rate, quantization bits

2.1 Sample rate
48k/44k/32k/22k/16k/11k/8 kHz (44k, 22k and 11k are shorthand for 44.1, 22.05 and 11.025 kHz)
Two families: 44k/22k/11k and 32k/16k/8k
Why these values?

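2.2 Bandwidth of audio signals
[Figure: spectrum of keep_friend_with.wav (44 kHz sample rate). The axis represents frequency; 32 corresponds to 22 kHz. Marked frequencies: 22 kHz, 7 kHz, 4 kHz.]
[Figure: spectrum of keep_friend_with_8k.wav (8 kHz sample rate). Marked frequency: 4 kHz.]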

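The files above are special cases: the recording environment was very good. It is generally accepted that:
* Speech: 300-3400 Hz, 8 kHz sample rate
* Wide-band speech: 7 kHz bandwidth (50 Hz-7 kHz), 16 kHz sample rate
* Audio: 20 kHz bandwidth (20 Hz-20 kHz), 44.1 kHz or 48 kHz sample rate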

2.2 Bandwidth of audio signals (continued)
Why do sample rates take those values? Nyquist Sampling Theorem.
Why 44.1 kHz? 20 kHz -> (Nyquist) 40 kHz -> (roll-off from passband to stopband) 44 kHz -> 44.1 kHz?

At the time the choice was made, the only recorders capable of storing such high rates were VCRs.
NTSC: 490 lines/frame, 3 samples/line, 30 frames/s = 44100 samples/s
PAL: 588 lines/frame, 3 samples/line, 25 frames/s = 44100 samples/s
(Source: Prof. Brian L. Evans, Dept. of Electrical and Computer Engineering, The University of Texas at Austin)
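The arithmetic behind these figures is easy to check. The sketch below simply restates the two relations used on these slides (sample rate of at least twice the bandwidth, and lines × samples per line × frames per second for a video recorder); the function names are hypothetical.

```c
/* Minimum sample rate (Hz) that the Nyquist theorem requires
 * for a signal of the given bandwidth (Hz). */
unsigned long nyquist_rate(unsigned long bandwidth_hz)
{
    return 2UL * bandwidth_hz;
}

/* Audio samples per second stored by a video recorder that puts
 * samples_per_line audio samples on each usable video line. */
unsigned long vcr_sample_rate(unsigned long lines_per_frame,
                              unsigned long samples_per_line,
                              unsigned long frames_per_sec)
{
    return lines_per_frame * samples_per_line * frames_per_sec;
}
```

With the slide's numbers, nyquist_rate(20000) is 40000, and both vcr_sample_rate(490, 3, 30) and vcr_sample_rate(588, 3, 25) come out at 44100, which leaves roughly 4 kHz of margin above the Nyquist minimum for the filter roll-off.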

Listen to the sounds…
* keep_friends_with(44k_mono).wav
* keep_friends_with(22k_mono).wav
* keep_friends_with(16k_mono).wav
* keep_friends_with(11k_mono).wav
* keep_friends_with(8k_mono).wav

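For speech, sampling at 8 kHz or 11 kHz sounds about the same, and sampling at 16 kHz or above sounds about the same. So for speech signals it is enough to distinguish two classes: voice and wideband speech.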

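2.3 Quantization bits
Linear quantization / non-linear quantization
Quantization signal-to-noise ratio: about 6b dB for b bits; more precisely 6.02b + 1.76 dB for linear quantization of a full-scale sine wave.
Specification for language-learning repeaters: when sound is copied from tape onto the chip and the chip's output is then heard through earphones, the level difference between the useful signal and the noise must be ≥ 34 dB.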
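As a small worked example of the formula, the sketch below evaluates 6.02b + 1.76 dB and searches for the smallest bit depth that reaches a target SNR. It assumes the ideal case the formula describes (linear quantization, full-scale sine input); real recordings rarely use the full scale, so practical requirements are higher. The function names are hypothetical.

```c
/* Ideal SNR in dB of b-bit linear quantization (full-scale sine input). */
double quant_snr_db(int bits)
{
    return 6.02 * bits + 1.76;
}

/* Smallest bit depth whose ideal SNR reaches target_db.
 * Returns -1 if more than 32 bits would be needed. */
int min_bits_for_snr(double target_db)
{
    int b;
    for (b = 1; b <= 32; b++) {
        if (quant_snr_db(b) >= target_db)
            return b;
    }
    return -1;
}
```

Under these ideal assumptions the 34 dB requirement is met from 6 bits upward (5 bits give about 31.9 dB, 6 bits about 37.9 dB), while 8 and 16 bits give about 49.9 dB and 98.1 dB.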

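Listen to the sounds…
* keep_friends_with(16k_mono).wav
* keep_friends_with(16k_mono)_8b.wav
The file with 8-bit linear quantization has clearly audible background noise. Judging from experience, what is an acceptable number of quantization bits?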
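One way to reproduce this kind of comparison is to reduce the resolution of 16-bit samples in place. The sketch below is not necessarily how the 8-bit file above was produced: it keeps the 16-bit container and rounds each sample down to a multiple of the coarser step, which simulates b-bit linear quantization by truncation. The function name requantize_16bit is hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Reduce 16-bit samples to `bits` bits of resolution, in place.
 * Each sample is rounded down (toward minus infinity) to a multiple
 * of step = 2^(16 - bits). Calls with bits outside 1..16 do nothing. */
void requantize_16bit(int16_t *x, size_t n, int bits)
{
    int32_t step;
    size_t i;

    if (bits < 1 || bits > 16)
        return;
    step = (int32_t)1 << (16 - bits);

    for (i = 0; i < n; i++) {
        int32_t v = x[i];
        /* Floor division written out, because "/" on negative
         * integers truncates toward zero in C. */
        int32_t q = (v >= 0) ? v / step : -((-v + step - 1) / step);
        x[i] = (int16_t)(q * step);
    }
}
```

Calling requantize_16bit(buf, n, 8) on the samples of the 16 kHz file and writing them back gives a file that can be compared by ear with the original; trying 10, 12 or 14 bits in the same way is a quick test of where the noise stops being noticeable.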

Getting started: raw material for experiments
* Speech files sampled at 16 kHz or 8 kHz
* Music files sampled at 44.1 kHz
* 16-bit or 14-bit linear quantization

3 Audio processing tools I use regularly
* VC6.0 (using C)
* Matlab
* Cool Edit

Matlab (Mathworks)
Math environment. Signal Processing Toolbox: filter design, spectral analysis, waveform generation, linear prediction. Voicebox.

Matlab (Mathworks)
pros: open, powerful, scripting, excellent plotting
cons: weak speech community and standards, not designed for big files

Other speech analysis tools?
• Goldwave (audio editor)
• Esps Xwaves (routines + visualization)
• Praat (speech analysis)
• Wavesurfer (speech editor)
• Transcriber (annotation tool)
• OGI speech tools (routines + application development)
• … winpitch, pitchworks, phonedit …

Goldwave
Self-described as a “top rated, professional digital audio editor”.

Goldwave
pros: editing (good memory management for big files), many effects, noise reduction, real-time spectrum and VU meters, various formats, batch conversion, effect chains, easy interface
cons: nothing for speech (pitch, formants), Windows only, no scripting
Good for file editing, not for speech analysis.

Esps - Waves
Developed by Entropic + AT&T. Now public.
The comp.speech FAQ says: Esps is a comprehensive set of speech analysis/processing tools. Waves is a graphical front-end for speech processing (waveforms, spectrograms, pitch) and includes a signal labeling utility.

Esps - Waves
pros: powerful, designed for big files
cons: UNIX only (free BSD), non-standard formats, requires programming skills, development has stopped

Praat
Developed by P. Boersma and D. Weenink at the Institute of Phonetic Sciences, University of Amsterdam.
General-purpose speech tool: editing, segmentation and labeling, prosodic manipulation.

Praat
pros: designed for speech analysis (not only sound editing or spectrogram visualization), nice GUI, scripting, active development and community, prosodic manipulation
cons: limited scripting language, uses its own native formats for transcription and pitch files

WaveSurfer
Open-source tool for sound visualization and manipulation.
* Speech/sound analysis and sound annotation/transcription
* Platform for more advanced or specialized applications: extend WaveSurfer with custom plug-ins, or embed WaveSurfer visualization components in other applications
* Requires the Snack Toolkit

Transcriber
Authors: C. Barras, E. Geoffrois
* Relies on Snack (Tcl/Tk)
* Good for annotation
* Nice, simple GUI
* No speech analysis

OGI speech tools / CSLU Toolkit
Development started in 1992, in C on Unix, at the Center for Spoken Language Understanding (CSLU) at OGI.
Includes:
• An X Windows display tool (LYRE) to display and edit speech signals, spectrograms, phoneme labels, and other information
• A set of C library routines (LIBNSPEECH)
• Utilities for converting file formats, filtering, neural network training, vector quantization
• A database utility to automate speech-database queries
• A set of Perl scripts, used mainly to automate the use of the OGI Speech Tools
• Man pages
RAD (rapid application development): points of entry at three levels: package (C), script (Tcl), GUI (Tk).
Free for research use.

Summary = yes but requires some dev.
