Chapter 3 Time Domain Analysis of Speech Signal
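3.1 Short-time windowing signal (1)
• Three types of windows: rectangular, Hanning, and Hamming
• Rectangular window
• $h_r[n] = u[n] - u[n-N]$
• $H_r(e^{j\omega}) = \dfrac{\sin(\omega N/2)}{\sin(\omega/2)}\, e^{-j\omega(N-1)/2}$
• General Hamming window
• $h_h[n] = (1-\alpha) - \alpha\cos(2\pi n/N)$, $0 \le n < N$
• $h_h[n] = (1-\alpha)\, h_r[n] - \alpha\, h_r[n]\cos(2\pi n/N)$
• $H_h(e^{j\omega}) = (1-\alpha) H_r(e^{j\omega}) - (\alpha/2) H_r(e^{j(\omega-2\pi/N)}) - (\alpha/2) H_r(e^{j(\omega+2\pi/N)})$
• $\alpha = 0.5$ gives the Hanning window; $\alpha = 0.46$ gives the Hamming window
• The windowed signal is $x_w(n) = x(n)\, w(n)$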
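As a minimal sketch of the window definitions on this slide, the following Python computes the rectangular and general Hamming windows and applies a window to one frame. The function names are illustrative choices, not from the slides.

```python
import math

def rectangular_window(N):
    """h_r[n] = u[n] - u[n-N]: ones for 0 <= n < N."""
    return [1.0] * N

def general_hamming_window(N, alpha):
    """h_h[n] = (1 - alpha) - alpha*cos(2*pi*n/N) for 0 <= n < N.

    alpha = 0.5 gives the Hanning window, alpha = 0.46 the Hamming window.
    """
    return [(1.0 - alpha) - alpha * math.cos(2.0 * math.pi * n / N)
            for n in range(N)]

def apply_window(x, w):
    """Windowed frame x_w(n) = x(n) * w(n); x and w must have equal length."""
    if len(x) != len(w):
        raise ValueError("frame and window must have the same length")
    return [xi * wi for xi, wi in zip(x, w)]
```

For example, `apply_window(frame, general_hamming_window(len(frame), 0.46))` yields a Hamming-windowed frame.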
Short-time windowing signal (2)
• $Q_n = \sum_{m=n-N+1}^{n} T[x(m)]\, w(n-m)$
• This is another representation for analysis. Because the window length is finite, $Q_n$ is a sequence of local weighted averages of the sequence $T[x(m)]$.
• $T[\cdot]$ is a linear or nonlinear transformation.
• $Q_n$ describes the short-time properties of the speech signal.
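The general form of $Q_n$ can be sketched directly from the sum above. Here `transform` stands for $T[\cdot]$ and `w` holds the $N$ window values $w(0), \dots, w(N-1)$; both names are illustrative assumptions.

```python
def short_time_q(x, n, w, transform):
    """Q_n = sum over m = n-N+1 .. n of T[x(m)] * w(n - m), with N = len(w)."""
    N = len(w)
    if n < N - 1 or n >= len(x):
        raise ValueError("n must satisfy N-1 <= n < len(x)")
    return sum(transform(x[m]) * w[n - m] for m in range(n - N + 1, n + 1))
```

With `transform=lambda v: v * v` and a rectangular window this reduces to a short-time energy; with `transform=abs` it gives a weighted magnitude sum.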
3.2 Time domain parameters (1)
• 3.2.1 Short-time energy and short-time average amplitude
• $E_n = \sum_{m=n}^{n+N-1} x_w^2(m)$ (using a rectangular window)
• The summation runs from $n$ to $n+N-1$
• For a voiced segment (or frame) $E_n$ is large; for an unvoiced segment it is small
• Because the samples are squared, $E_n$ is very sensitive to large signal levels
• $M_n = \frac{1}{N} \sum_{m=n}^{n+N-1} |x_w(m)|$
• $M_n$ also describes the average intensity of the signal
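A small sketch of both measures, assuming a rectangular window so that $x_w(m) = x(m)$ inside the frame (function names are illustrative):

```python
def short_time_energy(x, n, N):
    """E_n = sum of x(m)^2 for m = n .. n+N-1 (rectangular window)."""
    return sum(v * v for v in x[n:n + N])

def short_time_magnitude(x, n, N):
    """M_n = (1/N) * sum of |x(m)| for m = n .. n+N-1 (rectangular window)."""
    return sum(abs(v) for v in x[n:n + N]) / N
```

Doubling the signal amplitude multiplies `short_time_energy` by four but `short_time_magnitude` only by two, which illustrates why $E_n$ reacts more strongly to large signal levels.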
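Time domain parameters (2)
• 3.2.2 Short-time average zero-crossing rate
• $Z_n = \sum_{m=n}^{n+N-1} \left| \mathrm{sgn}[x_w(m)] - \mathrm{sgn}[x_w(m-1)] \right|$
• where $\mathrm{sgn}(x) = 1$ for $x \ge 0$ and $\mathrm{sgn}(x) = -1$ for $x < 0$
• $Z_n$ gives a rough estimate of the frequency of the signal
• Multiple thresholds for zero-crossing:
• $Z_{ni} = \sum_{m=n}^{n+N-1} \left\{ \left| \mathrm{sgn}[x_w(m) - T_i] - \mathrm{sgn}[x_w(m-1) - T_i] \right| + \left| \mathrm{sgn}[x_w(m) + T_i] - \mathrm{sgn}[x_w(m-1) + T_i] \right| \right\}$, $i = 1, 2, 3, \dots$
• This offers some immunity to low-frequency interference. Random noise whose amplitude stays below $T_i$ does not contribute to $Z_{ni}$.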
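The two counts can be sketched as below, assuming a rectangular window and $n \ge 1$ so that $x(m-1)$ exists. Note that with the formula as written each sign change adds 2 to the sum. Function names are illustrative.

```python
def sgn(v):
    """sgn(x) = 1 for x >= 0, -1 for x < 0."""
    return 1 if v >= 0 else -1

def zero_crossing_count(x, n, N):
    """Z_n = sum over m = n .. n+N-1 of |sgn x(m) - sgn x(m-1)|; needs n >= 1."""
    return sum(abs(sgn(x[m]) - sgn(x[m - 1])) for m in range(n, n + N))

def threshold_crossing_count(x, n, N, T):
    """Z_ni: crossings of the levels +T and -T instead of zero; needs n >= 1."""
    return sum(abs(sgn(x[m] - T) - sgn(x[m - 1] - T))
               + abs(sgn(x[m] + T) - sgn(x[m - 1] + T))
               for m in range(n, n + N))
```

A signal that stays strictly inside $(-T, T)$ never crosses either level, so `threshold_crossing_count` returns 0 for it even if it crosses zero at every sample.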
Time domain parameters (3)
• 3.2.3 Short-time auto-correlation function
• $R_w(k) = \sum_{m=0}^{N-k-1} x_w(m)\, x_w(m+k)$
• $R_w(k) = R_w(-k) = \sum_{m=k}^{N-1} x_w(m)\, x_w(m-k)$
• $R_w(k) = 0$ for $k < -N+1$ or $k > N-1$
• $R_w(0) = \sum_{m=0}^{N-1} x_w^2(m) \ge R_w(k)$
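A direct sketch of the short-time autocorrelation for one windowed frame `xw` of length $N$ (the function name is illustrative):

```python
def short_time_autocorrelation(xw, k):
    """R_w(k) = sum over m = 0 .. N-|k|-1 of xw(m) * xw(m + |k|).

    The function is even in k and is zero for |k| > N - 1.
    """
    N = len(xw)
    k = abs(k)
    if k > N - 1:
        return 0
    return sum(xw[m] * xw[m + k] for m in range(N - k))
```

For the frame `[1, 2, 3]` this gives `R(0) = 14`, `R(1) = R(-1) = 8`, `R(2) = 3`, and `R(3) = 0`.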
Time domain parameters (4)
• 3.2.4 Short-time frequency and power spectrum
• $X_w(e^{j\omega}) = \sum_{n=0}^{N-1} x_w(n)\, e^{-j\omega n}$ is the short-time frequency spectrum
• $|X_w(e^{j\omega})|^2$ is called the short-time power spectrum density
• $|X_w(e^{j\omega})|^2 = \sum_{n=-N+1}^{N-1} R_w(n)\, e^{-j\omega n}$
• The short-time auto-correlation function and the short-time power spectrum form an important pair of parameters: one is the Fourier transform of the other
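The relation between the power spectrum and the autocorrelation can be checked numerically. The sketch below evaluates both sides at a single frequency `omega`; the function names are illustrative.

```python
import cmath

def short_time_spectrum(xw, omega):
    """X_w(e^{j*omega}) = sum over n of xw(n) * exp(-j*omega*n)."""
    return sum(v * cmath.exp(-1j * omega * n) for n, v in enumerate(xw))

def power_from_autocorrelation(xw, omega):
    """Sum over k = -N+1 .. N-1 of R_w(k) * exp(-j*omega*k)."""
    N = len(xw)
    total = 0
    for k in range(-N + 1, N):
        r = sum(xw[m] * xw[m + abs(k)] for m in range(N - abs(k)))
        total += r * cmath.exp(-1j * omega * k)
    # The imaginary parts cancel because R_w(k) is even in k.
    return total.real
```

For any frame, `abs(short_time_spectrum(xw, omega)) ** 2` and `power_from_autocorrelation(xw, omega)` agree up to rounding error.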
Time domain parameters (5)
• 3.2.5 Short-time Average Magnitude Difference Function (AMDF)
• $r_w(k) = \sum_{m=0}^{N-k-1} |x_w(m+k) - x_w(m)|$
• The AMDF is implemented with subtraction, addition, and absolute-value operations, in contrast to the addition and multiplication operations needed for the auto-correlation function.
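A minimal sketch of the AMDF for one frame, using only subtraction, addition, and absolute value as the slide describes (the function name is illustrative):

```python
def amdf(xw, k):
    """r_w(k) = sum over m = 0 .. N-k-1 of |xw(m + k) - xw(m)|, for 0 <= k < N."""
    N = len(xw)
    return sum(abs(xw[m + k] - xw[m]) for m in range(N - k))
```

For a frame that repeats every 4 samples, `amdf` is zero at lag 4 and positive at lags that are not multiples of the period, so the period shows up as a dip rather than a peak.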
3.3 S/U/V detection
• S (silence), U (unvoiced), and V (voiced) are the three basic speech states
• The parameters measured in the S, U, and V states are random and follow different distributions (each close to a normal distribution)
• For voiced speech, M is largest and Z is smallest (20/160)
• For unvoiced speech, Z is largest (70/160) and M is intermediate
• For silence, M is smallest and Z is intermediate
3.4 Endpoint detection
• 3.4.1 Double-threshold beginning detection
• Set two thresholds $T_h$ and $T_l$ on $E_n$ or $M_n$ to locate the actual starting and ending points; for unvoiced speech, $Z_n$ is used to distinguish the starting point from silence.
• 3.4.2 Multiple zero-crossing threshold beginning detection
• Set $T_1 < T_2 < T_3$; for every frame compute $Z_1, Z_2, Z_3$ and $Z = W_1 Z_1 + W_2 Z_2 + W_3 Z_3$
• If $Z > Z_0$ the frame is voiced; otherwise it is unvoiced
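The slide does not spell out how the two thresholds are combined. One common realization, given here only as an assumption, finds the first frame whose energy exceeds the high threshold and then backs up while the energy stays above the low threshold. The function name and the backtracking rule are illustrative, not taken from the slides.

```python
def double_threshold_start(energies, t_low, t_high):
    """Return the index of the estimated starting frame, or None.

    Assumed rule: locate the first frame with energy > t_high, then move
    the start earlier while the preceding frame's energy is still > t_low.
    """
    for i, e in enumerate(energies):
        if e > t_high:
            start = i
            while start > 0 and energies[start - 1] > t_low:
                start -= 1
            return start
    return None
```

The ending point could be found the same way by running the function on the reversed energy sequence.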
3.5 Pitch period (Tp) estimation (1)
• 3.5.1 Preprocessing
• 1. Center clipping
• $y(n) = C[x(n)] = \begin{cases} x(n) - C_L, & x(n) > C_L \\ 0, & |x(n)| \le C_L \\ x(n) + C_L, & x(n) < -C_L \end{cases}$
• 2. Low-pass filter (900 Hz) with linear phase
• 3. Three-level clipping
• $y'(n) = C'[y(n)] = 1, 0, -1$ for $y(n) > 0$, $y(n) = 0$, $y(n) < 0$ respectively
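The two clipping operations can be sketched as follows; `cl` stands for the clipping level $C_L$, and the function names are illustrative.

```python
def center_clip(x, cl):
    """C[x(n)]: shrink samples toward zero by cl and zero out |x(n)| <= cl."""
    out = []
    for v in x:
        if v > cl:
            out.append(v - cl)
        elif v < -cl:
            out.append(v + cl)
        else:
            out.append(0.0)
    return out

def three_level_clip(y):
    """C'[y(n)]: map each sample to 1, 0, or -1 according to its sign."""
    return [1 if v > 0 else -1 if v < 0 else 0 for v in y]
```

Applying `three_level_clip(center_clip(x, cl))` keeps only the positions and signs of the strongest peaks in the frame.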
Pitch period (Tp) estimation (2)
• 3.5.2 Pitch detection by auto-correlation function
• 1. Apply a 900 Hz low-pass filter and discard the first 20 samples: $\{x(n)\} \to \{x'(n)\}$
• 2. $C_L = 0.68 \max \{x'(n)\}$
• 3. $y(n) = C[x'(n)]$ and $y'(n) = C'[y(n)]$, for $20 < n < 300$
• 4. $R(k) = \sum_{n} y(n)\, y'(n+k)$, $k = 0, 20, 21, \dots, 150$
Pitch period (Tp) estimation (3)
• 5. $R_{max} = \max \{ R(20), \dots, R(150) \}$
• 6. If $R_{max} < 0.25\, R(0)$ then $T_p = 0$ (unvoiced); otherwise $T_p = T \cdot \arg\max_{20 < k < 150} R(k)$ (voiced)
• 3.5.3 Pitch detection by the average magnitude difference function
• 1. Same as above: 900 Hz low-pass filtering
• 2. $r(k) = \frac{1}{140} \sum_{m} |x'(n+m) - x'(n+m-k)|$, $k = 21, 22, \dots, 140$
Pitch period (Tp) estimation (3)
• 3. $T_p' = \arg\min_k r(k)$, $r_{min} = \min_k r(k)$
• 4. If $r_{min}/M > a_1$ then $T_p = 0$ (unvoiced); if $r_{min}/M < a_1$ the frame is voiced, where $M = \frac{1}{280} \sum_{n} |x'(n)|$
• 5. If $r(T_p'/2)/M < a_2$ then $T_p = T_p'/2$; else if $r(T_p'/3)/M < a_2$ then $T_p = T_p'/3$
• The thresholds $a_i$ are determined from experimental statistics
• If there are $I$ frames and $p_i$ is the correct pitch estimate of frame $i$, then $a_2 = \min_i r_i(p_i)/M_i$
• $a_2 < a_1$ (the threshold for unvoiced)
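The auto-correlation procedure of 3.5.2 can be sketched as below. This is a sketch under stated assumptions: the input frame is taken to be already low-pass filtered with its first 20 samples discarded, the clipping level uses the maximum absolute value of the frame, and the function returns the lag in samples (multiply by the sampling period to get $T_p$). The function and parameter names are illustrative.

```python
def autocorr_pitch_lag(x, k_min=20, k_max=150, clip_ratio=0.68, voiced_ratio=0.25):
    """Return the pitch lag in samples, or 0 if the frame is judged unvoiced."""
    # Step 2: clipping level (assumption: based on the maximum |x(n)|).
    cl = clip_ratio * max(abs(v) for v in x)
    # Step 3: center clipping, then three-level clipping.
    y = [v - cl if v > cl else v + cl if v < -cl else 0.0 for v in x]
    y3 = [1 if v > 0 else -1 if v < 0 else 0 for v in y]

    # Step 4: cross-correlate the clipped signal with its three-level version.
    def R(k):
        return sum(y[n] * y3[n + k] for n in range(len(x) - k))

    # Steps 5 and 6: pick the strongest lag and apply the voicing test.
    r0 = R(0)
    best_k = max(range(k_min, k_max + 1), key=R)
    if r0 <= 0 or R(best_k) < voiced_ratio * r0:
        return 0
    return best_k
```

Because `y3` only takes the values 1, 0, and -1, each term of `R(k)` needs an addition or subtraction rather than a multiplication, which is the motivation for three-level clipping.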
Pitch period (Tp) estimation (4)
• 3.5.4 Post-processing of pitch detection
• Smoothing by median filtering: $y(n) = \mathrm{median}_{m=n-L}^{n+L}\, [x(m)]$
• Linear smoothing: $y(n) = \sum_{m=-L}^{L} x(n-m)\, w(m)$, with $\sum_{m=-L}^{L} w(m) = 1$
• Smoothing by dynamic programming: the pitch sequence $P_1, P_2, \dots, P_N$ is to be smoothed
• Define the cost function ($B > 0$): $C(i,j) = \left| \dfrac{P_i - P_j}{i - j} \right| - B$ for $i \ne j$, and $C(i,j) = -B$ for $i = j$
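Median smoothing of a pitch track can be sketched as follows. The slide does not say how the ends of the track are handled; truncating the window at the boundaries is an assumption made here, and the function name is illustrative.

```python
def median_smooth(p, L):
    """y(n) = median of p(n-L) .. p(n+L), with the window truncated at the ends."""
    out = []
    for n in range(len(p)):
        window = sorted(p[max(0, n - L):n + L + 1])
        # For an even-length (truncated) window this picks the upper median.
        out.append(window[len(window) // 2])
    return out
```

Isolated errors such as a doubled or halved pitch value in one frame are removed, while a sustained change in pitch passes through unchanged.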
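Pitch period (Tp) estimation (5)
• $D(i)$ is the cost at the $i$-th step. The tracking steps are:
• 1. Set $i = 1$ and $D(j) = 0$ for $j = 1, \dots, N$
• 2. Calculate $C(i,j)$ for $j = 1, \dots, i$
• 3. $d(i,j) = D(j) + C(i,j)$ for $j = 1, \dots, i$
• 4. Find the optimal path: $D(i) = \min_{j=1,\dots,i} d(i,j)$ and $J(i) = \arg\min_{j=1,\dots,i} d(i,j)$
• 5. If $i < N$, increment $i$ and go to step 2
• 6. Smoothed result: $P_i = P_{J(i)}$ for $i = 1, \dots, N$
• $C(i,j)$ is the cost of replacing $P_i$ with $P_j$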