Multiple Audio Sources Detection and Localization. Guillaume Lathoud, IDIAP. Supervised by Dr Iain McCowan, IDIAP.
Outline • Context and problem. • Approach. • Discretize: ( sector, time frame, frequency bin ). • Example. • Experiments. • Multiple loudspeakers. • Multiple humans. • Conclusion.
Context • Automatic analysis of recordings: • Meeting annotation. • Speaker tracking for speech acquisition. • Surveillance applications. • Questions to answer: • Who? What? Where? When? • Location can be used for very precise segmentation.
Why Multiple Sources? • Spontaneous multi-party speech: • Short. • Sporadic. • Overlaps. • Problem: frame-level multisource localization and detection. One frame = 16 ms. • Many localization methods exist, but: • Speech is wideband. • Detection issue: how many sources are active?
Sector-based Approach • Question: is there at least one active source in a given sector? • Answer it for each frequency bin separately.
Frame-level Analysis • One time frame every 16 ms. • Discretize both space and frequency. • Sparsity assumption [Roweis 03]. [Figure: grid over sector of space (s) and frequency bin (f); values shown on the slide: 0, 9, 2, 0, 10, 0, 1.]
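One possible reading of the frame-level analysis, sketched in Python: within a single time frame, each frequency bin is attributed to exactly one sector (the sparsity assumption), and the bins are then counted per sector. The per-sector count, the function name `count_active_bins`, and the toy table of values are assumptions made for illustration; the slides do not state how the per-bin decisions are combined.

```python
from typing import List


def count_active_bins(distances: List[List[float]]) -> List[int]:
    """Count, per sector, the frequency bins attributed to that sector.

    distances[f][s] is a hypothetical pseudo-distance between the
    measurement in frequency bin f and sector s (smaller = better match).
    Under the sparsity assumption each bin is given to one sector only.
    """
    n_sectors = len(distances[0])
    counts = [0] * n_sectors
    for row in distances:
        # Index of the sector with the smallest pseudo-distance in this bin.
        best = min(range(n_sectors), key=row.__getitem__)
        counts[best] += 1
    return counts


# Toy frame (made-up numbers): 4 frequency bins, 3 sectors.
frame = [
    [0.10, 0.90, 0.50],
    [0.70, 0.20, 0.60],
    [0.30, 0.80, 0.90],
    [0.40, 0.60, 0.05],
]
counts = count_active_bins(frame)
print(counts)
```

Every bin contributes to exactly one sector, so the counts always sum to the number of frequency bins in the frame.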
Frequency Bin Analysis • Compute the phase between 2 microphones: $\theta(f) \in [-\pi, +\pi]$. • Repeat for all $P = M(M-1)/2$ microphone pairs: $\Theta(f) = [\theta_1(f) \dots \theta_P(f)]$. • For each sector $s$, compare the measured phases $\Theta(f)$ with the centroid $\Phi_s$: pseudo-distance $d(\Theta(f), \Phi_s)$. • Apply the sparsity assumption: only the best sector (smallest pseudo-distance) is active. [Figure: pseudo-distances $d(\Theta(f), \Phi_1)$ … $d(\Theta(f), \Phi_7)$, one per sector, for frequency bin $f$.]
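The per-bin steps above can be sketched as follows, using only the Python standard library. The pseudo-distance is the sum of squared half-angle sines given on the backup slide; everything else (function names, the three-microphone toy array, and the way centroids are built from made-up per-microphone phases) is a hypothetical illustration, not the author's implementation.

```python
import cmath
import math
from itertools import combinations


def pairwise_phases(spectrum):
    """Phase difference in [-pi, +pi] for every microphone pair.

    spectrum holds one complex FFT coefficient per microphone, all taken
    at the same frequency bin. M microphones give P = M(M-1)/2 pairs.
    """
    return [cmath.phase(a * b.conjugate()) for a, b in combinations(spectrum, 2)]


def pseudo_distance(theta, centroid):
    """Sum over pairs of sin^2((theta_p - centroid_p) / 2)."""
    return sum(math.sin((t - c) / 2.0) ** 2 for t, c in zip(theta, centroid))


def centroid_from_mic_phases(mic_phases):
    """Hypothetical centroid: pairwise phases of ideal unit phasors."""
    return pairwise_phases([cmath.exp(1j * p) for p in mic_phases])


# Two made-up sectors for a 3-microphone array (P = 3 pairs).
centroids = [
    centroid_from_mic_phases([0.0, 0.5, 1.0]),
    centroid_from_mic_phases([0.0, -0.4, -0.8]),
]

# Made-up measurement at one frequency bin, close to sector 1.
measured = [
    1.0 * cmath.exp(0.10j),
    0.8 * cmath.exp(-0.35j),
    1.2 * cmath.exp(-0.65j),
]

theta = pairwise_phases(measured)
distances = [pseudo_distance(theta, c) for c in centroids]
# Sparsity assumption: keep only the sector with the smallest distance.
sector = min(range(len(distances)), key=distances.__getitem__)
print(sector, distances)
```

Only phases enter the comparison, so the differing magnitudes in `measured` have no effect on which sector is selected.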
Real Data: Single Speaker • [Figure panel] Without sparsity assumption [SAPA 04], similar to [ICASSP 01]. • [Figure panel] With sparsity assumption (this work).
Task 2: Multiple Loudspeakers 2 loudspeakers simultaneously active
Real Data: Multiple Loudspeakers 2 loudspeakers simultaneously active
Real Data: Multiple Loudspeakers 3 loudspeakers simultaneously active
Real data: Humans 2 speakers simultaneously active (includes short silences)
Real data: Humans 3 speakers simultaneously active (includes short silences)
Conclusion • Sector-based approach. • Localization and detection. • Effective on real multispeaker data. • Current work: • Optimize centroids. • Multi-level implementation. • Compare multilevel with existing methods. • Possible integration with Daimler.
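Pseudo-distance • Measured phases $\Theta(f) = [\theta_1(f) \dots \theta_P(f)]$ in $[-\pi, +\pi]^P$. • For each sector $s$, a centroid $\Phi_s = [\Phi_{s,1} \dots \Phi_{s,P}]$. • $d(\Theta(f), \Phi_s) = \sum_{p=1}^{P} \sin^2\left( (\theta_p(f) - \Phi_{s,p}) / 2 \right)$ • $\cos(x) = 1 - 2 \sin^2(x/2)$, so argmax of the beamformed energy = argmin of $d$.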
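The step from the cosine identity to "argmax energy = argmin $d$" can be spelled out as below. This is a sketch under assumptions the slide does not state: the beamformer is taken to be a phase-only delay-sum (all microphone magnitudes normalised to 1), $\varphi_m(f)$ denotes the measured phase at microphone $m$, $\psi_{s,m}$ a hypothetical steering phase for sector $s$, and for pair $p = (i, j)$ we set $\theta_p = \varphi_i - \varphi_j$ and $\Phi_{s,p} = \psi_{s,i} - \psi_{s,j}$ (modulo $2\pi$).

```latex
\begin{align*}
E_s(f) &= \Bigl| \sum_{m=1}^{M} e^{\,j(\varphi_m(f) - \psi_{s,m})} \Bigr|^2 \\
       &= M + 2 \sum_{p=1}^{P} \cos\bigl(\theta_p(f) - \Phi_{s,p}\bigr) \\
       &= M + 2 \sum_{p=1}^{P} \Bigl[ 1 - 2 \sin^2\bigl( (\theta_p(f) - \Phi_{s,p}) / 2 \bigr) \Bigr] \\
       &= M + 2P - 4\, d\bigl(\Theta(f), \Phi_s\bigr)
\end{align*}
```

Since $M$ and $P$ do not depend on the sector, the sector that maximises $E_s(f)$ is the one that minimises $d(\Theta(f), \Phi_s)$. With unequal microphone magnitudes the cosine terms would be weighted and the equivalence would hold only approximately.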
Delay-sum vs Proposed (1/3) • [Figure panel] With delay-sum centroids (this work). • [Figure panel] With optimized centroids (this work).
Delay-sum vs Proposed (2/3) • [Figure panel] 2 loudspeakers simultaneously active. • [Figure panel] 3 loudspeakers simultaneously active.
Delay-sum vs Proposed (3/3) • [Figure panel] 2 humans simultaneously active. • [Figure panel] 3 humans simultaneously active.