Collection of multimodal data: Face – Speech – Body • George Caridakis (ICCS) • Ginevra Castellano (DIST) • Loic Kessous (TAU)
Overview • Objectives • Scenario • Equipment specifications • Subjects & Procedure • Visual aspects • Acoustic aspects • Future processing • Please try this at home…
Objectives • Collection of emotional multimodal data • Process different modalities • Holy Grail: “EMOTION RECOGNITION”
Scenario • Inspired by the GEMEP corpus • Pseudo-language sentence (“Toko, damato ma gali sa”) • Standing body posture • 10 subjects • 8 emotions uniformly distributed across the quadrants of the 2D valence-arousal emotion space • 3 repetitions of an emotion-specific gesture • 3 repetitions of an emotion-independent gesture
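The size of the resulting corpus can be sketched by enumerating the takes. This is only an illustration: it assumes that every subject performs both gesture types three times for each of the eight emotions (the slide does not say so explicitly), and it uses the emotion labels given on the Subjects & Procedure slide. The names `SUBJECTS`, `GESTURE_TYPES` and `recording_plan` are hypothetical.

```python
from itertools import product

SUBJECTS = range(1, 11)  # 10 subjects, numbered 1..10 (numbering is an assumption)
EMOTIONS = ["despair", "hot anger", "irritation", "sadness",
            "interest", "pleasure", "joy", "pride"]
GESTURE_TYPES = ["emotion-specific", "emotion-independent"]
REPETITIONS = 3


def recording_plan():
    """Yield one (subject, emotion, gesture_type, repetition) tuple per take."""
    for subject, emotion, gesture, rep in product(
            SUBJECTS, EMOTIONS, GESTURE_TYPES, range(1, REPETITIONS + 1)):
        yield subject, emotion, gesture, rep


plan = list(recording_plan())
print(len(plan))  # 10 * 8 * 2 * 3 = 480 takes under the assumption above
```

A list like this can double as a checklist during recording and as an index for naming the archived clips.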
Equipment specifications • 2 DV cameras (one full body, one face) • Wireless microphone (shirt-mounted) • PC + external sound card • Uniform dark background • 2 artificial light sources • Light-coloured, long-sleeved shirt ;)
Subjects & Procedure • Subjects • 10 “actors” (6 males, 4 females) • Emotions: despair, hot anger, irritation, sadness, interest, pleasure, joy, pride • Procedure • Subject instructions • Clap before every execution to synchronize the streams
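One way the clap could be used for synchronization is sketched below: find the first loud sample in each audio track and compare the times. This is a minimal illustration, not the procedure used by the authors. The function names, the fixed threshold and the sample rates of the synthetic tracks (44.1 kHz for the microphone, 48 kHz for the camera audio) are all assumptions.

```python
def first_onset(samples, threshold):
    """Return the index of the first sample whose magnitude exceeds threshold, or None."""
    for i, s in enumerate(samples):
        if abs(s) > threshold:
            return i
    return None


def clap_offset_seconds(track_a, rate_a, track_b, rate_b, threshold=0.5):
    """Return how many seconds later the clap occurs in track_b than in track_a."""
    a = first_onset(track_a, threshold)
    b = first_onset(track_b, threshold)
    if a is None or b is None:
        raise ValueError("no clap found above threshold")
    return b / rate_b - a / rate_a


# Synthetic tracks with normalised amplitudes in [-1, 1] (illustrative data only).
mic = [0.0] * 44100 + [0.9] + [0.0] * 100   # clap 1.0 s into the microphone track
cam = [0.0] * 72000 + [0.8] + [0.0] * 100   # clap 1.5 s into the camera audio track
offset = clap_offset_seconds(mic, 44100, cam, 48000)
print(offset)  # seconds to trim from the camera track to align it with the microphone
```

A simple threshold is fragile on real recordings with background noise; a short-time energy envelope or cross-correlation around the clap would be more robust.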
Video quality issues • Highest possible resolution • Progressive video (not interlaced) • Correct exposure • Good color quality • No compression artifacts • Uniform lighting
Interlacing / Over-exposure • Interlacing / De-interlacing • Over-exposure • 70% zebra pattern • Prefer lower exposure so that the signal is not clipped
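The same exposure check can be approximated offline on a captured frame. The sketch below counts the share of pixels at or above 70% of full scale. It rests on simplifying assumptions: 8-bit luma with 255 as 100%, and a zebra that marks everything at or above the chosen level (real cameras differ in how they define the zebra band). The function name `overexposed_fraction` is hypothetical.

```python
def overexposed_fraction(luma, level=0.70, white=255):
    """Return the fraction of pixels whose luma is at or above level * white.

    luma is a list of rows, each row a list of integer luma values.
    """
    flat = [v for row in luma for v in row]
    if not flat:
        raise ValueError("empty frame")
    cutoff = level * white
    return sum(1 for v in flat if v >= cutoff) / len(flat)


# Tiny illustrative "frame": three of the six pixels are above 70% of 255.
frame = [[10, 100, 200],
         [255, 178, 179]]
print(overexposed_fraction(frame))             # share of pixels a 70% zebra would mark
print(overexposed_fraction(frame, level=1.0))  # share of pixels at full scale (clipped)
```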
Colour/Lighting • Medium Y/C Resolution • Compression Artifacts • Exposure • Good Video quality • Source: DV
Archiving • PAL: 720×576 @ 25 frames/second • DV format: ~36 Mbit/sec (~16 GBytes/hour) • MPEG-2 @ 4-8 Mbit/sec, DVD quality (~1.8-3.5 GBytes/hour) • MPEG-1 @ 1.1 Mbit/sec (~500 MBytes/hour)
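The storage figures follow directly from the bitrates: Mbit/sec × 3600 seconds ÷ 8 bits per byte. A short sketch of that arithmetic, assuming decimal units (1 GByte = 1000 MBytes); the helper name `gbytes_per_hour` is hypothetical.

```python
def gbytes_per_hour(mbit_per_sec):
    """Convert a bitrate in Mbit/s to decimal gigabytes per hour."""
    return mbit_per_sec * 3600 / 8 / 1000


for name, rate in [("DV", 36), ("MPEG-2 (low)", 4),
                   ("MPEG-2 (high)", 8), ("MPEG-1", 1.1)]:
    print(f"{name}: {gbytes_per_hour(rate):.2f} GB/hour")
```

With these units the results are 16.2, 1.8, 3.6 and about 0.5 GB/hour, which matches the rounded figures on the slide.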
Visual Aspects Summary • Video Camera • DV or better • Progressive scan capability • Over-exposure indication (zebra patterns) • Shooting • Use the zebra patterns at 70% • Zoom in as much as possible to increase the subject’s resolution • Facial features must be visible for facial analysis • Try to avoid occlusions (hair, glasses, clothes, hand movement) • Uniform lighting conditions • Archive DV tapes, DV video or frames (not MPEG-1)
Acoustic aspects • Why “Toko, damato ma gali sa”? • Toko: solicitation by naming the interlocutor • Vowels found in the majority of languages • Meaning: “Toko, can you open it?” (a request), to maintain a semantic aspect • Sampling frequency: 44.1 kHz • 16-bit mono • Uncompressed .wav files
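The audio format above can be checked programmatically before a file is added to the corpus. The sketch below uses Python's standard `wave` module to write a one-second test tone as 44.1 kHz, 16-bit mono PCM and then verify those parameters on read-back. The helper names `write_test_tone` and `matches_spec` and the test tone itself are illustrative assumptions, not part of the original recording setup.

```python
import io
import math
import struct
import wave

SAMPLE_RATE = 44100   # 44.1 kHz
SAMPLE_WIDTH = 2      # bytes per sample, i.e. 16 bits
CHANNELS = 1          # mono


def write_test_tone(fileobj, seconds=1.0, freq=440.0):
    """Write a sine tone as uncompressed PCM in the format listed above."""
    n = int(SAMPLE_RATE * seconds)
    frames = b"".join(
        struct.pack("<h", int(0.5 * 32767 *
                              math.sin(2 * math.pi * freq * i / SAMPLE_RATE)))
        for i in range(n))
    with wave.open(fileobj, "wb") as w:
        w.setnchannels(CHANNELS)
        w.setsampwidth(SAMPLE_WIDTH)
        w.setframerate(SAMPLE_RATE)
        w.writeframes(frames)


def matches_spec(fileobj):
    """Return True if the WAV data is 44.1 kHz, 16-bit, mono."""
    with wave.open(fileobj, "rb") as w:
        return (w.getframerate(), w.getsampwidth(), w.getnchannels()) == (
            SAMPLE_RATE, SAMPLE_WIDTH, CHANNELS)


buf = io.BytesIO()        # in-memory stand-in for a .wav file on disk
write_test_tone(buf)
buf.seek(0)
ok = matches_spec(buf)
print(ok)
```

At this format the data rate is 44100 × 2 × 1 = 88,200 bytes per second, roughly 318 MBytes per hour of speech.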
Future processing • Process different modalities • Facial feature extraction • Gesture expressiveness analysis • Acoustic analysis • Gesture recognition • Synchronization • Modalities fusion • RNN • RSOM + Markov • SVM • … • Emotion recognition