This work introduces a framework for extracting and analyzing social networks from audio-visual interactions, offering insights into human behavior through nonverbal communication cues. By combining audio-visual fusion techniques with social network analysis, the study provides a novel approach to understanding social interactions. The experimental results showcase the efficacy of the proposed methodology in characterizing human behavior based on nonverbal signals and social network structures.
Modelling and Analyzing Multimodal Dyadic Interactions Using Social Networks Sergio Escalera, Petia Radeva, Jordi Vitrià, Xavier Barò and Bogdan Raducanu
Outline • Introduction • Audio – Visual cues extraction and fusion • Social Network extraction and analysis • Experimental Results • Conclusions and future work
Introduction • Social interactions play a very important role in people’s daily lives • Present trend: analysis of human behavior based on electronic communications (SMS, e-mails, chat) • New trend: analysis of human behavior based on nonverbal communication (social signals) • The quantification of social signals represents a powerful cue to characterize human behavior: facial expressions, hand and body gestures, focus of attention, voice prosody, etc.
Social Network Analysis (SNA) has been developed as a tool to model social interactions in terms of a graph-based structure: - ‘Nodes’ represent the ‘actors’: persons, communities, institutions, etc. - ‘Links’ represent a specific type of interdependency: friendship, familiarity, business transactions, etc. A common way to characterize the information ‘encoded’ in a social network is to use several centrality measures.
Our contribution: • In this work, we propose an integrated framework for the extraction and analysis of a social network from multimodal (A/V) dyadic interactions* • Its main advantage is that it relies on a completely non-intrusive technology • First: we perform speech segmentation through an audio/visual fusion scheme - In the audio domain, speech is detected through clustering of audio features - In the visual domain, speech is detected through differential-based feature extraction from the segmented mouth region - The fusion scheme is based on stacked sequential learning *We used a set of videos belonging to the New York Times’ Blogging Heads opinion blog. The videos depict two persons talking about different subjects in front of a webcam
- Second: To quantify the dyadic interaction, we used the ‘Influence Model’, whose states encode the previously integrated audio-visual data - Third: The Social Network is extracted based on the estimated influence values* and its properties are characterized through several centrality measures Block-diagram representation of our integrated framework * The use of the term ‘influence’ is inspired by the previous work of Choudhury: T. Choudhury, 2003. “Sensing and Modelling Human Networks”, Ph.D. Thesis, MIT Media Lab
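As a rough, non-authoritative outline of the block diagram above, the three stages could be chained as follows; every function name here is a hypothetical placeholder, not part of the original implementation.

```python
# Hedged outline of the overall framework; all function names are placeholders.
def analyze_conversations(videos):
    segments = {}
    for video in videos:
        audio_labels = diarize_audio(video)           # audio-based speech segments
        visual_labels = detect_mouth_activity(video)  # visual speech detection
        segments[video] = fuse_stacked(audio_labels, visual_labels)

    influences = fit_influence_model(segments)        # pairwise influence estimates
    network = build_social_network(influences)        # directed, weighted graph
    return compute_centralities(network)              # degree, closeness, betweenness, ...
```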
2. Audio – Visual cues extraction and fusion • Audio cue • Description • First 12 MFCC coefficients • Signal energy • Temporal cepstral derivatives (Δ and Δ²)
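As a rough illustration of this audio descriptor, the sketch below computes 12 MFCCs, frame energy, and their Δ/Δ² derivatives; librosa, the window parameters (25 ms, 50% overlap, matching the experimental setup described later), and the file name are our own assumptions, not the authors' code.

```python
# Hedged sketch: 12 MFCCs + energy + delta/delta-delta per audio frame.
# librosa and the exact parameters are assumptions, not the authors' implementation.
import numpy as np
import librosa

y, sr = librosa.load("dyadic_interaction.wav", sr=16000)  # hypothetical file
frame_len = int(0.025 * sr)          # 25 ms analysis window
hop_len = frame_len // 2             # 50% overlap

mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=12,
                            n_fft=frame_len, hop_length=hop_len)
energy = librosa.feature.rms(y=y, frame_length=frame_len, hop_length=hop_len)
base = np.vstack([mfcc, energy])                 # 13 features per frame
delta = librosa.feature.delta(base)              # Δ
delta2 = librosa.feature.delta(base, order=2)    # Δ²
features = np.vstack([base, delta, delta2]).T    # shape: (n_frames, 39)
```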
Audio cue • Diarization process • Segmentation • Coarse segmentation according to the Generalized Likelihood Ratio (GLR) between consecutive windows • Clustering • Agglomerative hierarchical clustering with a BIC-based stopping scheme • Segment boundaries are adjusted at the end
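To make the GLR-based change detection concrete, here is a minimal sketch under the usual single-Gaussian assumption: each of two adjacent feature windows is modelled by a full-covariance Gaussian and compared against a single Gaussian fitted to their union. The function and the way it would be thresholded are illustrative assumptions, not the authors' implementation.

```python
# Hedged sketch: GLR between two adjacent feature windows, assuming each
# window is modelled by a single full-covariance Gaussian.
import numpy as np

def glr(X1, X2):
    """X1, X2: (n_frames, n_features) windows. Higher GLR => more likely boundary."""
    X = np.vstack([X1, X2])

    def logdet_cov(Z):
        _, logdet = np.linalg.slogdet(np.cov(Z, rowvar=False))
        return logdet

    n1, n2, n = len(X1), len(X2), len(X)
    # GLR = n*log|Sigma| - n1*log|Sigma1| - n2*log|Sigma2| (up to a constant factor)
    return n * logdet_cov(X) - n1 * logdet_cov(X1) - n2 * logdet_cov(X2)

# A speaker-change boundary is hypothesized where the GLR over a sliding
# pair of windows peaks above a chosen threshold.
```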
Visual cue • Description: • Face segmentation based on the Viola-Jones detector • Mouth region segmentation • Vector of HOG descriptors for the mouth region
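A rough sketch of this visual descriptor is shown below, using OpenCV's Haar-cascade face detector as a common stand-in for Viola-Jones and scikit-image's HOG; the exact mouth crop, the HOG cell layout, and the libraries are assumptions for illustration only.

```python
# Hedged sketch: Viola-Jones-style face detection, mouth-region crop, HOG descriptor.
# OpenCV's Haar cascade and skimage's hog are stand-ins for the authors' pipeline.
import cv2
from skimage.feature import hog

face_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def mouth_hog(frame_bgr):
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    faces = face_cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    if len(faces) == 0:
        return None
    x, y, w, h = faces[0]
    # Assumption: the mouth lies in the lower third of the detected face box.
    mouth = gray[y + 2 * h // 3 : y + h, x : x + w]
    mouth = cv2.resize(mouth, (64, 32))
    # 4 cells x 8 orientations = 32 oriented features; the exact layout is an assumption.
    return hog(mouth, orientations=8, pixels_per_cell=(32, 16),
               cells_per_block=(1, 1))
```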
Visual cue • Classification: • Non-Speech class modelling • One-class Dynamic Time Warping (DTW), based on the dynamic programming recurrence shown below
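A common way to write the DTW recurrence that such one-class matching relies on is the standard formulation below; the notation is ours, a generic statement of the recurrence rather than the slide's exact equation.

$$D(i,j) = d(x_i, y_j) + \min\{\, D(i-1,j),\; D(i,j-1),\; D(i-1,j-1) \,\}$$

where $d(x_i, y_j)$ is the local distance between frame descriptors $x_i$ and $y_j$, and $D(i,j)$ is the accumulated cost of the best alignment of the first $i$ and $j$ frames of the two sequences.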
Fusion scheme • Stacked sequential learning (suitable for problems characterized by long runs of identical labels) • Fusion of the audio-visual modalities • Determining the temporal relations of both feature sets to learn a two-stage classifier (based on AdaBoost)
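Below is a minimal sketch of the stacked sequential learning idea, assuming scikit-learn's AdaBoost as the base classifier: a first-stage classifier produces per-frame speech predictions, and a second-stage classifier is trained on the original features extended with the first stage's predictions over a temporal neighbourhood. The window size and helper names are illustrative assumptions, not the paper's exact setup.

```python
# Hedged sketch of stacked sequential learning for audio-visual speech labels.
# sklearn's AdaBoostClassifier stands in for the AdaBoost variant in the paper.
import numpy as np
from sklearn.ensemble import AdaBoostClassifier

def neighbourhood(preds, w=5):
    """Stack each frame's prediction with its w past and w future predictions.
    np.roll wraps at the boundaries; a careful implementation would pad instead."""
    cols = [np.roll(preds, k) for k in range(-w, w + 1)]
    return np.stack(cols, axis=1)

def train_stacked(X, y, w=5):
    stage1 = AdaBoostClassifier().fit(X, y)
    # Extend the features with first-stage predictions over a temporal window.
    X_ext = np.hstack([X, neighbourhood(stage1.predict(X), w)])
    stage2 = AdaBoostClassifier().fit(X_ext, y)
    return stage1, stage2

def predict_stacked(stage1, stage2, X, w=5):
    X_ext = np.hstack([X, neighbourhood(stage1.predict(X), w)])
    return stage2.predict(X_ext)
```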
3. Social Network extraction and analysis • The Influence Model (IM) is a tool introduced for the quantification of interacting processes using a coupled Hidden Markov Model (HMM) • In the case of social interaction, the states of the IM encode the automatically extracted audio-visual features, and its parameters represent the ‘influences’ Influence Model Architecture
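For reference, a standard formulation of the Influence Model's coupled transition (following the usual notation in the literature, e.g. Choudhury's work, rather than the exact slide) writes each participant's state transition as a convex combination of pairwise chain-to-chain transitions:

$$P\bigl(S_t^{(i)} \mid S_{t-1}^{(1)}, \ldots, S_{t-1}^{(N)}\bigr) = \sum_{j=1}^{N} \theta_{ij}\, P\bigl(S_t^{(i)} \mid S_{t-1}^{(j)}\bigr), \qquad \sum_{j} \theta_{ij} = 1$$

where $S_t^{(i)}$ is the state of participant $i$ at time $t$ and the weight $\theta_{ij}$ is read as the ‘influence’ of participant $j$ on participant $i$.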
- The construction of the Social Network is based on the ‘influence’ values • A directed link between two nodes A and B (designated by A → B) implies that ‘A has influence over B’ • The SNA is based on several centrality measures: - degree centrality (in-degree and out-degree) - Refers to the number of direct connections with other persons - closeness centrality - Refers to how easily a person can reach and communicate with the other persons in the network - betweenness centrality - Refers to the relevance of a person as a ‘bridge’ between two sub-groups of the network - eigenvector centrality - Refers to the overall importance of a person in the network
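As an illustrative sketch (not the authors' code), these centrality measures can be computed with networkx on a directed graph whose edge weights are the estimated influence values; the node names and weights below are made-up placeholders.

```python
# Hedged sketch: build the influence graph and compute the centrality measures.
# networkx is an assumption; nodes and weights are made-up placeholders.
import networkx as nx

G = nx.DiGraph()
# Edge A -> B with weight theta means "A has influence theta over B".
G.add_weighted_edges_from([("A", "B", 0.7), ("B", "A", 0.3),
                           ("B", "C", 0.6), ("C", "A", 0.4)])

centralities = {
    "in-degree":   nx.in_degree_centrality(G),
    "out-degree":  nx.out_degree_centrality(G),
    "closeness":   nx.closeness_centrality(G),
    "betweenness": nx.betweenness_centrality(G),
    "eigenvector": nx.eigenvector_centrality(G, max_iter=1000),
}
for name, values in centralities.items():
    print(name, values)
```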
4. Experimental results • We collected a subset of videos from the New York Times’ Blogging Heads opinion blog • We used 17 videos featuring 15 persons • The videos depict two persons having a conversation in front of their webcam on different topics (politics, economy, …) • The conversations have an informal character and frequent interruptions can occur Snapshot from a video
Audio features - The audio stream has been analyzed using sliding windows of 25 ms with an overlapping factor of 50% - Each window is characterized by 13 features (12 MFCC + energy), complemented with Δ and Δ² - The shortest length of a valid audio segment was set to 2.5 s • Video features - 32 oriented features (corresponding to the mouth region) have been extracted using the HOG descriptor - The length of the DTW sequences has been set to 18 frames (which corresponds to 1.5 s) • Fusion process - Stacked sequential learning was used to fuse the audio-visual features - AdaBoost was chosen as the classifier
The extracted social network showing participants’ label and influence directions
5. Conclusions and future work • We presented an integrated framework for the automatic extraction and analysis of a social network from implicit input (multimodal dyadic interactions), based on the integration of audio/visual features • In the future, we plan to extend the current work to study the problem of social interactions at a larger scale and in different scenarios • Starting from the premise that people's lives are more structured than they might seem a priori, we plan to study long-term interactions between persons, with the aim of discovering the underlying behavioral patterns present in our day-to-day existence