
Spatio-Temporal Analysis of Multimodal Speaker Activity






Presentation Transcript


  1. Spatio-Temporal Analysis of Multimodal Speaker Activity Guillaume Lathoud, IDIAP Supervised by Dr Iain McCowan, IDIAP

  2. Context • Spontaneous multi-party speech. • Goal: extract salient information: • Who? What? When? Where? • Automatic meeting annotation/transcription. • Speaker tracking, speech acquisition. • Surveillance.

  3. Context • Spontaneous multi-party speech. • Goal: extract salient information: • Who? What? When? Where? • Automatic meeting annotation/transcription. • Speaker tracking, speech acquisition. • Surveillance. • Approach: based on speaker location.

  4. Context • Spontaneous multi-party speech. • Goal: extract salient information: • Who? What? When? Where? • Automatic meeting annotation/transcription. • Speaker tracking, speech acquisition. • Surveillance. • Approach: based on speaker location. • Multisource problem (overlaps, noise).

  5. How? • Audio location: microphone array. • Audio content: speaker identification. • Video: one or several cameras. • Combination.

  6. How? • Audio location: microphone array. • Audio content: speaker identification. • Video: one or several cameras. • Combination.

  7. Audio: Global Picture Multiple waveforms Resolution 0.0000625 s Task 1: data acquisition and annotation (complete)

  8. Audio: Global Picture Multiple waveforms Resolution 0.0000625 s Task 1: data acquisition and annotation (complete) Active speakers’ locations Resolution 0.016 s Task 2: develop robust multisource strategies (in progress)

  9. Audio: Global Picture Multiple waveforms Resolution 0.0000625 s Task 1: data acquisition and annotation (complete) Active speakers’ locations Resolution 0.016 s Task 2: develop robust multisource strategies (in progress) Task 3: joint segmentation and tracking (complete) Link locations across space and time Resolution 0.016 s Links up to 0.25 s

  10. Audio: Global Picture Multiple waveforms Resolution 0.0000625 s Task 1: data acquisition and annotation (complete) Active speakers’ locations Resolution 0.016 s Task 2: develop robust multisource strategies (in progress) Task 3: joint segmentation and tracking (complete) Link locations across space and time Resolution 0.016 s Links up to 0.25 s
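The three resolutions on this slide fit together arithmetically. A quick check, assuming (my inference, not stated on the slide) that the 0.0000625 s waveform resolution is the sample period of a 16 kHz recording:

```python
sample_period = 0.0000625   # s, waveform resolution from the slide
frame_period = 0.016        # s, resolution of the location estimates
link_span = 0.25            # s, maximum link length across time

fs = 1 / sample_period                                   # implied sampling rate (Hz)
samples_per_frame = round(frame_period / sample_period)  # samples per location frame
frames_per_link = round(link_span / frame_period)        # frames spanned by a link

print(round(fs))            # 16000
print(samples_per_frame)    # 256
print(frames_per_link)      # 16
```

So one location estimate covers 256 samples, and a link may bridge up to about 16 consecutive frames.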

  11. Task 1: AV16.3 corpus • At IDIAP: 16 microphones, 3 cameras. • 40 short recordings, about 1h30 overall. • ‘meeting’: seated. • ‘surveillance’: standing. • pathological test cases (A, V, AV).

  12. Task 1: AV16.3 corpus • At IDIAP: 16 microphones, 3 cameras. • 40 short recordings, about 1h30 overall. • ‘meeting’: seated. • ‘surveillance’: standing. • pathological test cases (A, V, AV). • 3D mouth annotation. • Used in the AMI project. • http://mmm.idiap.ch/Lathoud/av16.3_v6

  13. AV16.3 corpus

  14. Audio: Global Picture Multiple waveforms Resolution 0.0000625 s Task 1: data acquisition and annotation (complete) Active speakers’ locations Resolution 0.016 s Task 2: develop robust multisource strategies (in progress) Task 3: joint segmentation and tracking (complete) Link locations across space and time Resolution 0.016 s Links up to 0.25 s

  15. Task 2: Multisource Localization • Problem: • Detect: how many speakers? • Localize: where?

  16. Sector-based Approach Question: is there at least one active source in a given sector?

  17. Task 2: Multisource Localization • Problem: • Detect: how many speakers? • Localize: where? • Sectors (coarse-to-fine).

  18. Task 2: Multisource Localization • Problem: • Detect: how many speakers? • Localize: where? • Sectors (coarse-to-fine). • Tested on real data: AV16.3 corpus. • To do: • Finalize (optimization, multi-level). • Compare with existing methods.
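The sector-based question ("is there at least one active source in a given sector?") can be illustrated with a toy delay-and-sum sketch. Everything below — the two-microphone setup, the integer sample delays per sector, and the power-ratio activity criterion — is a hypothetical stand-in for the actual sector measure, not the method from the talk:

```python
import numpy as np

def sector_powers(signals, sector_delays):
    """Delay-and-sum power for each candidate sector.

    signals: (n_mics, n_samples) array.
    sector_delays: per sector, one integer sample delay per microphone
    that compensates propagation for a source located in that sector.
    """
    powers = []
    for delays in sector_delays:
        aligned = [np.roll(sig, -d) for sig, d in zip(signals, delays)]
        beam = np.mean(aligned, axis=0)   # coherent sum only for the right sector
        powers.append(float(np.mean(beam ** 2)))
    return powers

# Toy scene: a broadband source reaching microphone 2 three samples late.
rng = np.random.default_rng(0)
src = rng.standard_normal(4096)
signals = np.stack([src, np.roll(src, 3)])

# Two coarse sectors, each represented by one compensating delay set.
powers = sector_powers(signals, sector_delays=[[0, 3], [0, -3]])
active = [p > 0.8 * max(powers) for p in powers]   # hypothetical criterion
print(active)   # [True, False] — only the true sector passes
```

The aligned sector sums the source coherently (power near 1), while the wrong sector adds two nearly uncorrelated copies (power near 0.5), which is what the per-sector test exploits.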

  19. Task 2: Single Speaker Example

  20. Task 2: Multiple Loudspeakers 2 loudspeakers simultaneously active

  21. Task 2: Multiple Loudspeakers 2 loudspeakers simultaneously active

  22. Task 2: Multiple Loudspeakers 3 loudspeakers simultaneously active

  23. Real data: Humans 2 speakers simultaneously active (includes short silences)

  24. Real data: Humans 3 speakers simultaneously active (includes short silences)

  25. Audio: Global Picture Multiple waveforms Resolution 0.0000625 s Task 1: data acquisition and annotation (complete) Active speakers’ locations Resolution 0.016 s Task 2: develop robust multisource strategies (in progress) Task 3: joint segmentation and tracking (complete) Link locations across space and time Resolution 0.016 s Links up to 0.25 s

  26. Task 3: Segmentation/Tracking • Speech: • Short and sporadic utterances. • Overlaps. • Filtering is difficult (Kalman filter, particle filter).

  27. Task 3: Segmentation/Tracking • Speech: • Short and sporadic utterances. • Overlaps. • Filtering is difficult (Kalman, PF). • Alternative: short-term clustering.

  28. Task 3: Segmentation/Tracking • Speech: • Short and sporadic utterances. • Overlaps. • Filtering is difficult (Kalman, PF). • Alternative: short-term clustering. • Short-term = 0.25 s. • Threshold-free, online, unsupervised. • Unknown number of objects.
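The short-term clustering idea can be sketched as linking location estimates that are close in both time and space. Note the real method is threshold-free and proceeds by iterated partition/merge steps (next slides); the explicit 0.25 s gap and the hypothetical angular threshold below are for illustration only:

```python
def short_term_cluster(observations, max_gap=0.25, max_sep=10.0):
    """Toy short-term clustering of (time_s, azimuth_deg) estimates.

    Attach each observation to the nearest track whose last point lies
    within max_gap seconds and max_sep degrees; otherwise start a new
    track. Handles an unknown number of objects, online.
    """
    tracks = []
    for t, az in sorted(observations):
        best = None
        for track in tracks:
            last_t, last_az = track[-1]
            if t - last_t <= max_gap and abs(az - last_az) <= max_sep:
                if best is None or abs(az - last_az) < abs(az - best[-1][1]):
                    best = track
        if best is None:
            tracks.append([(t, az)])       # new speaker/track
        else:
            best.append((t, az))           # link across space and time
    return tracks

# Two overlapping speakers near 30° and 120°; the second pauses 0.45 s.
obs = [(0.000, 30.0), (0.000, 120.0), (0.016, 31.0),
       (0.016, 119.0), (0.048, 30.5), (0.500, 121.0)]
print(len(short_term_cluster(obs)))   # 3: the pause exceeds the 0.25 s link span
```

Because links never span more than 0.25 s, a long silence splits a speaker into separate short-term clusters, which are then associated at a higher level.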

  29. Example: iteration 1 (partition)

  30. Example: iteration 1 (merge)

  31. Example: iteration 2 (partition)

  32. Example: iteration 2 (merge)

  33. Example: iteration 3 (partition)

  34. Example: iteration 3 (merge)

  35. Example: iteration 4 (partition)

  36. Example: iteration 4 (merge)

  37. Example: result

  38. Example: result

  39. Example: result

  40. Task 3: Application Annotated IDIAP corpus of short meetings (total 1h45) http://mmm.idiap.ch Single source localization Task 3: joint segmentation and tracking (complete) Link locations across space and time Resolution 0.016 s Links up to 0.25 s

  41. Application (2)

  42. Task 3: Metrics • Precision (PRC): • An active speaker is detected in the result. • PRC = probability that the speaker is truly active.

  43. Task 3: Metrics • Precision (PRC): • An active speaker is detected in the result. • PRC = probability that the speaker is truly active. • Recall (RCL): • A speaker is truly active. • RCL = probability that the speaker is detected in the result.

  44. Task 3: Metrics • Precision (PRC): • An active speaker is detected in the result. • PRC = probability that the speaker is truly active. • Recall (RCL): • A speaker is truly active. • RCL = probability that the speaker is detected in the result. • F-measure: F = 2 * PRC * RCL / (PRC + RCL)
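The F-measure on the slide is the harmonic mean of precision and recall; as a one-liner (the zero-denominator guard is my addition):

```python
def f_measure(prc, rcl):
    """Harmonic mean of precision (PRC) and recall (RCL)."""
    if prc + rcl == 0:
        return 0.0   # guard: undefined when both are zero
    return 2 * prc * rcl / (prc + rcl)

print(f_measure(1.0, 1.0))             # 1.0
print(round(f_measure(0.8, 0.6), 3))   # 0.686
```

The harmonic mean punishes imbalance: a system with high recall but poor precision (or vice versa) scores low.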

  45. Task 3: Results Entire data: F = 2 * PRC * RCL / (PRC + RCL)

  46. Task 3: Results Entire data: F = 2 * PRC * RCL / (PRC + RCL) Overlaps only:

  47. Conclusion • Spontaneous speech = multisource problem. • AV16.3 corpus recorded, annotated. • Approach: detect, localize, track, segment. • Location is not identity! • Fusion with monochannel analysis. • Fusion with video.

  48. Thank you!
