Subband cocktail-party speech separation: CASA vs. BSS • Seungjin Choi • Department of Computer Science and Engineering • POSTECH, Korea • seungjin@postech.ac.kr • Joint work with Frederic Berthommier • ICP, INPG, France
Number95 Stereo Database • ST-Numbers95 Database, ICP/INP Grenoble • Authors: E. Tessier and F. Berthommier • [Figure: left source, right source, reference, mixture] • A large database of binary mixtures of sentences (n = 613) was recorded by [Tessier and Berthommier, 1999]. The Numbers95 signals are played through loudspeakers and recorded. The temporal overlap between words is about 75% and the relative level is 0 dB. The setup is static. Only 332 mixture sentences, truncated at 1 s, are used in the present study.
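To make the "relative level" figure concrete, here is a minimal sketch of a digital stand-in for it. The database itself was recorded through loudspeakers, so this is only an illustration. The function names and the convention that the level is the left-to-right power ratio in dB are assumptions, not taken from the slides.

```python
import math

def power(x):
    """Mean power of a sample sequence."""
    return sum(v * v for v in x) / len(x)

def mix_at_relative_level(left, right, level_db=0.0):
    """Scale `right` so that 10*log10(P_left / P_right) equals level_db, then add.

    Assumed convention: relative level = left power over right power, in dB.
    """
    target_ratio = 10.0 ** (level_db / 10.0)
    scale = math.sqrt(power(left) / (target_ratio * power(right)))
    scaled_right = [scale * v for v in right]
    mixture = [a + b for a, b in zip(left, scaled_right)]
    return mixture, scaled_right
```

With `level_db=0.0`, both sources contribute equal power to the mixture, which corresponds to the 0 dB condition stated above.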
Filterbank decomposition • [Figure: filter gain versus frequency (100 to 4000 Hz), one panel for each of nbsb = 1, 2, 3, 4] • Subband processing
The CASA Model • Filterbank decomposition • TDOA estimation and weighting • Resynthesis
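The slide names TDOA estimation without saying how it is done. One common estimator, given here purely as an illustrative assumption and not necessarily the authors' method, picks the lag that maximizes the cross-correlation between the two channels. In a subband system it would be applied to each subband signal. The function name and the `max_lag` parameter are hypothetical.

```python
def estimate_tdoa(left, right, max_lag):
    """Return the lag (in samples) that maximizes sum_t left[t] * right[t + lag].

    A positive result means the right channel lags behind the left one.
    Plain cross-correlation is an assumed estimator, used here for illustration.
    """
    best_lag, best_score = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        score = 0.0
        for t in range(len(left)):
            u = t + lag
            if 0 <= u < len(right):
                score += left[t] * right[u]
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag
```

For example, if `right` is a copy of `left` delayed by two samples, `estimate_tdoa(left, right, 3)` returns 2, and swapping the arguments returns -2.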
Reconstruction Accuracy • [Figure: spectrograms (frequency versus time) of the reference left source Rl and of the left output Yl, with RA (output) and RA (mixture) plotted per frame of 1024 bins with half overlap]
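The framing and the gain evaluation can be sketched as follows. A later slide notes that the RA of the mixture is subtracted for gain evaluation, so the gain is taken here as RA (output) minus RA (mixture). The RA formula itself is not given in the slides and is deliberately left out. The function names are hypothetical, and the 8000-sample signal length in the usage note assumes an 8 kHz sampling rate for the 1 s excerpts, which the slides do not state.

```python
def frame_starts(n_samples, frame_len=1024, hop=None):
    """Start indices of full frames; the default hop is half a frame (half overlap)."""
    hop = frame_len // 2 if hop is None else hop
    return list(range(0, n_samples - frame_len + 1, hop))

def gain_db(ra_output_db, ra_mixture_db):
    """Gain in dB: RA of the separated output minus RA of the mixture."""
    return ra_output_db - ra_mixture_db
```

Under the 8 kHz assumption, `frame_starts(8000)` yields 14 frames, which is consistent with the frame axis of the RA plots running up to 14.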
Gain of CASA: Relative Level • [Figure: gain left (dB), with RAY and RAX]
Subband effect for CASA • [Figure: three panels, RA left, RA right and RA left+right (dB) versus nbsb, with curves for 256 and 512] • Effect of the number of subbands (nbsb) for the CASA model on the RA (in dB). From left to right: averaged left source RA, averaged right source RA, averaged left+right RA over all frames. The number of subbands varies from 1 to 5 and the two curves correspond to duration = 256 and 512 bins. The RA of the mixture, which is subtracted for gain evaluation, is labelled (*).
Effect of nbsb: RA • [Figure: RA (dB) per frame (1024 bins with half overlap) for the left and right sources; curves for the mixture and for nbsb = 1, 2, 4]
Subband effect for CASA: Gain • [Figure: gain (dB) versus relative level (dB, -50 to 50) for the left and right sources, nbsb = 1 and nbsb = 4]
The BSS Model • [Figure: demixing network with inputs Xl(t), Xr(t), outputs Yl(t), Yr(t) and cross filters Wlr, Wrl; blocks labelled gain, nonlinear function and delayed output; nbp; spectrograms (frequency versus time, 1 second) of Yl(t) and Yr(t)]
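The diagram names cross filters Wlr and Wrl, a nonlinear function and delayed outputs, but the slides give no equations. The sketch below shows one plausible reading, a two-channel feedback network with FIR cross filters and a decorrelation-style update. It is an assumption throughout: the network form, the tanh nonlinearity, the delays starting at one sample, the step size `mu` and the function name are all hypothetical, and this is not the authors' C++ program.

```python
import math

def feedback_bss(xl, xr, n_taps=4, mu=0.0, w_lr=None, w_rl=None):
    """Two-channel feedback demixing sketch (assumed form, for illustration).

    yl(t) = xl(t) - sum_k w_lr[k] * yr(t - k - 1)
    yr(t) = xr(t) - sum_k w_rl[k] * yl(t - k - 1)
    Assumed update: w_lr[k] += mu * tanh(yl(t)) * yr(t - k - 1), same for w_rl.
    """
    w_lr = list(w_lr) if w_lr is not None else [0.0] * n_taps
    w_rl = list(w_rl) if w_rl is not None else [0.0] * n_taps
    taps = len(w_lr)
    yl, yr = [], []
    for t in range(len(xl)):
        # Delayed outputs y(t-1) ... y(t-taps); zero before the signal starts.
        past_l = [yl[t - k] if t - k >= 0 else 0.0 for k in range(1, taps + 1)]
        past_r = [yr[t - k] if t - k >= 0 else 0.0 for k in range(1, taps + 1)]
        out_l = xl[t] - sum(w * p for w, p in zip(w_lr, past_r))
        out_r = xr[t] - sum(w * p for w, p in zip(w_rl, past_l))
        yl.append(out_l)
        yr.append(out_r)
        # Nonlinear function of one output times the delayed other output.
        for k in range(taps):
            w_lr[k] += mu * math.tanh(out_l) * past_r[k]
            w_rl[k] += mu * math.tanh(out_r) * past_l[k]
    return yl, yr, w_lr, w_rl
```

With `mu=0.0` the filters stay fixed, which makes the structure easy to check: if the left input contains the right source delayed by one sample and scaled by 0.5, then setting `w_lr=[0.5, 0, 0, 0]` removes that cross-talk exactly.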
Gain of BSS: Relative Level • [Figure: gain left (dB), with RAY and RAX]
Subband effect for BSS • [Figure: three panels, left, right and left+right RA (dB) versus nbsb, with curves for nbp = 2, 3, 10, 100] • Effect of the number of subbands (nbsb) for the BSS model on the RA (in dB). From left to right: averaged left source RA, averaged right source RA, averaged left+right RA over all frames. The number of subbands varies from 1 to 4 and the curves correspond to nbp = 2, 3, 10, 100. The RA of the mixture is labelled (*). In each figure, two points are added at nbsb = 1 for the "BSS giv" condition and for the "BSS ori" data.
RA and Gain for BSS • Speech Separation Program (C++), POSTECH. Authors: S. Choi and H. Hong • [Figure: RL (dB) per frame; RA (dB) per frame (1024 bins with half overlap) for the left and right sources, showing RAX (mixture) and RAY]
Subband effect for BSS: Gain • Gain of BSS (nbp=100) • [Figure: gain (dB) versus relative level (dB, -50 to 50) for the left and right sources, nbsb = 1 and nbsb = 2]
Demixing filters • [Figure: demixing filters Wlr and Wrl for nbsb=1, shown against time (bin) and against frequency]
Coherence spectrograms • NBP=10, frames of 256 bins with half overlap, Mean(Coh)=0.65 • [Figure: coherence spectrograms (frequency versus time); left: Yl(n), Yl(n+1); right: Yr(n), Yr(n+1)]
Effect of nbp: Coherence spectrograms • [Figure: left and right coherence spectrograms for each value of nbp] • Coh: NBP=3: 0.68 • NBP=10: 0.65 • NBP=100: 0.60
Coherence statistic • [Figure: two panels versus nbsb, left+right RA (dB) and Coh, with curves for nbp = 2, 3, 10, 100] • Effect of the number of subbands (nbsb) on the coherence index for the BSS model. Left: average left+right RA over all frames. Right: coherence, defined as the mean of the coherence spectrogram. The number of subbands varies from 1 to 4 and the curves correspond to nbp = 2, 3, 10, 100. The RA of the mixture is labelled (*). The CohX coherence between the two mixture channels is labelled (*) in the right figure. In each figure, two points are added at nbsb = 1 for the "BSS giv" condition and for the "BSS ori" data.
Summary results … • Hearing: REF, CASA, BSS • Left, Right, mean