Spatio-Temporal Analysis of Multimodal Speaker Activity. Guillaume Lathoud, IDIAP. Supervised by Dr Iain McCowan, IDIAP.
Context • Spontaneous multi-party speech. • Goal: extract salient information: • Who? What? When? Where? • Automatic meeting annotation/transcription. • Speaker tracking, speech acquisition. • Surveillance. • Approach: based on speaker location. • Multisource problem (overlaps, noise).
How? • Audio location: microphone array. • Audio content: speaker identification. • Video: one or several cameras. • Combination.
Audio: Global Picture • Multiple waveforms (resolution 0.0000625 s). Task 1: data acquisition and annotation (complete). • Active speakers’ locations (resolution 0.016 s). Task 2: develop robust multisource strategies (in progress). • Link locations across space and time (resolution 0.016 s, links up to 0.25 s). Task 3: joint segmentation and tracking (complete).
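The three time scales on this slide can be related with a little arithmetic. The sketch below assumes that the 0.0000625 s waveform resolution is one sampling period and that 0.016 s is the spacing between successive location estimates; the slides state the numbers but not this interpretation, and the variable names are illustrative only.

```python
# Assumption: 0.0000625 s is the sampling period of each microphone waveform.
SAMPLE_PERIOD_S = 0.0000625
# Assumption: 0.016 s is the spacing between successive location estimates.
FRAME_PERIOD_S = 0.016
# From the slide: locations are linked across at most 0.25 s.
LINK_SPAN_S = 0.25

sample_rate_hz = round(1.0 / SAMPLE_PERIOD_S)
samples_per_frame = round(FRAME_PERIOD_S / SAMPLE_PERIOD_S)
# Whole frames that fit in the linking span (small epsilon guards float rounding).
frames_per_link = int(LINK_SPAN_S / FRAME_PERIOD_S + 1e-9)

print(sample_rate_hz, samples_per_frame, frames_per_link)  # 16000 256 15
```

Under these assumptions, each location estimate summarizes 256 samples per channel, and a link spans at most 15 whole frames.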
Task 1: AV16.3 corpus • At IDIAP: 16 microphones, 3 cameras. • 40 short recordings, about 1h30 overall. • ‘meeting’: seated. • ‘surveillance’: standing. • pathological test cases (A, V, AV). • 3D mouth annotation. • Used in the AMI project. • http://mmm.idiap.ch/Lathoud/av16.3_v6
Task 2: Multisource Localization • Problem: • Detect: how many speakers? • Localize: where?
Sector-based Approach Question: is there at least one active source in a given sector?
Task 2: Multisource Localization • Problem: • Detect: how many speakers? • Localize: where? • Sectors (coarse-to-fine). • Tested on real data: AV16.3 corpus. • To do: • Finalize (optimization, multi-level). • Compare with existing methods.
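The coarse-to-fine sector idea can be sketched as a recursive subdivision: ask whether a sector contains at least one active source, and split only the sectors that answer yes. This is a minimal illustration, not the method used in the work itself. The function names, the azimuth-only geometry, the 22.5 degree stopping width and the toy detector are all assumptions; a real detector would be computed from the microphone array signals, which the slides do not describe.

```python
from typing import Callable, List, Sequence, Tuple

Sector = Tuple[float, float]  # [start, end) azimuth range in degrees


def coarse_to_fine(
    is_active: Callable[[float, float], bool],
    start: float = 0.0,
    end: float = 360.0,
    min_width: float = 22.5,  # hypothetical stopping resolution
) -> List[Sector]:
    """Return the narrowest sectors reported to contain at least one active source."""
    if not is_active(start, end):
        return []  # inactive sector: no need to look any closer
    if end - start <= min_width:
        return [(start, end)]
    mid = (start + end) / 2.0
    return (coarse_to_fine(is_active, start, mid, min_width)
            + coarse_to_fine(is_active, mid, end, min_width))


def make_toy_detector(true_azimuths: Sequence[float]) -> Callable[[float, float], bool]:
    """Hypothetical stand-in for a real sector activity detector."""
    def is_active(lo: float, hi: float) -> bool:
        return any(lo <= az < hi for az in true_azimuths)
    return is_active


sectors = coarse_to_fine(make_toy_detector([30.0, 200.0]))
print(sectors)  # [(22.5, 45.0), (180.0, 202.5)]
```

The number of returned sectors gives a lower bound on the number of active sources, since two sources falling in the same final sector are counted once.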
Task 2: Multiple Loudspeakers 2 loudspeakers simultaneously active
Task 2: Multiple Loudspeakers 3 loudspeakers simultaneously active
Real data: Humans 2 speakers simultaneously active (includes short silences)
Real data: Humans 3 speakers simultaneously active (includes short silences)
Task 3: Segmentation/Tracking • Speech: • Short and sporadic utterances. • Overlaps. • Filtering is difficult (Kalman filter, particle filter). • Alternative: short-term clustering. • Short-term = 0.25 s. • Threshold-free, online, unsupervised. • Unknown number of objects.
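To give a feel for short-term clustering, the sketch below links location estimates that are at most 0.25 s apart whenever a "same source" hypothesis explains their angular difference better than a "different sources" hypothesis, then reads clusters off as connected components. It is a simplified illustration and not the algorithm of the talk: the two Gaussian widths are fixed hypothetical values, so the comparison reduces to an implicit distance cutoff and does not reproduce the threshold-free property; it also runs in batch rather than online. All names are assumptions.

```python
import math
from typing import List, Sequence, Tuple

Obs = Tuple[float, float]  # (time in seconds, azimuth in degrees)


def gaussian_pdf(x: float, sigma: float) -> float:
    """Zero-mean Gaussian density."""
    return math.exp(-0.5 * (x / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))


def short_term_clusters(
    obs: Sequence[Obs],
    max_gap_s: float = 0.25,   # from the slide: short-term = 0.25 s
    sigma_same: float = 5.0,   # hypothetical spread for one source (degrees)
    sigma_diff: float = 60.0,  # hypothetical spread across sources (degrees)
) -> List[int]:
    """Return one cluster label per observation (union-find over likely links)."""
    parent = list(range(len(obs)))

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(len(obs)):
        for j in range(i + 1, len(obs)):
            if abs(obs[j][0] - obs[i][0]) > max_gap_s:
                continue  # only short-term links are considered
            d = abs(obs[j][1] - obs[i][1])
            d = min(d, 360.0 - d)  # azimuth wraps around
            if gaussian_pdf(d, sigma_same) > gaussian_pdf(d, sigma_diff):
                parent[find(j)] = find(i)

    labels = {}
    return [labels.setdefault(find(i), len(labels)) for i in range(len(obs))]


observations = [(0.000, 30.0), (0.000, 200.0), (0.016, 31.0),
                (0.016, 199.0), (0.032, 32.0), (1.000, 30.5)]
print(short_term_clusters(observations))  # [0, 1, 0, 1, 0, 2]
```

The last observation is near 30 degrees again but arrives about 1 s later, so it starts a new cluster: short-term links describe speech segments at a location, not speaker identity.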
Task 3: Application • Data: annotated IDIAP corpus of short meetings (total 1h45), http://mmm.idiap.ch • Single source localization. • Task 3: joint segmentation and tracking (complete): link locations across space and time (resolution 0.016 s, links up to 0.25 s).
Task 3: Metrics • Precision (PRC): • An active speaker is detected in the result. • PRC = probability that this speaker is truly active. • Recall (RCL): • A speaker is truly active. • RCL = probability that this speaker is detected in the result. • F-measure: $F = \frac{2 \cdot \mathrm{PRC} \cdot \mathrm{RCL}}{\mathrm{PRC} + \mathrm{RCL}}$
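The three metrics can be computed from counts as in the sketch below. It assumes that scoring is done per (frame, speaker) pair, which the slides do not specify; the function name and the toy data are illustrative only.

```python
from typing import Set, Tuple

Event = Tuple[int, str]  # (frame index, speaker label); the scoring unit is an assumption


def precision_recall_f(detected: Set[Event], truth: Set[Event]) -> Tuple[float, float, float]:
    """Return (PRC, RCL, F) for detected versus truly active speaker events."""
    correct = len(detected & truth)
    prc = correct / len(detected) if detected else 0.0
    rcl = correct / len(truth) if truth else 0.0
    f = 2.0 * prc * rcl / (prc + rcl) if (prc + rcl) > 0.0 else 0.0
    return prc, rcl, f


# Toy example: 4 true events, 4 detected events, 3 in common.
truth = {(0, "A"), (1, "A"), (1, "B"), (2, "B")}
detected = {(0, "A"), (1, "A"), (2, "A"), (2, "B")}
prc, rcl, f = precision_recall_f(detected, truth)
print(prc, rcl, f)  # 0.75 0.75 0.75
```

Because F is the harmonic mean of PRC and RCL, it is high only when both are high; a system that detects everyone all the time gets perfect recall but a poor F.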
Task 3: Results • Reported as F-measure, $F = \frac{2 \cdot \mathrm{PRC} \cdot \mathrm{RCL}}{\mathrm{PRC} + \mathrm{RCL}}$, for two conditions: entire data, and overlaps only.
Conclusion • Spontaneous speech = multisource problem. • AV16.3 corpus recorded, annotated. • Approach: detect, localize, track, segment. • Location is not identity! • Fusion with monochannel analysis. • Fusion with video.