Speaking Faces Verification
Kevin McTait, Raphaël Blouet, Gérard Chollet, Silvia Colón, Guido Aversano
SecurePhone Workshop - 24/25 June 2004
Outline
- Speaking faces verification problem
- State of the art in speaking faces verification
- Choice of system architecture
- Fusion of audio and visual modalities
- Initial results using the BANCA database (Becars: voice-only system)
Problem definition
- Detection and tracking of lips in the video sequence:
  - Locate head/face in image frame
  - Locate mouth/lips area (Region of Interest)
  - Determine/calculate lip contour coordinates and intensity parameters (visual feature extraction)
  - Other parameters: visible teeth, tongue, jaw movement, eyebrows, cheeks, etc.
- Modelling parameters:
  - Model deformation of lip (or other) parameters over time: HMMs, GMMs…
- Fusion of visual and acoustic parameters/models
- Calculate likelihood of model relative to client/world model in order to accept/reject
- Augment in-house speaker verification system (Becars) with visual parameters
Limitations
- Limited device (storage and CPU processing power)
- Subject variability (aging, beard, glasses…), pose, illumination
- Low complexity algorithms
- Subspace transforms, learning methods
- Image based approaches, hue colouration/chromaticity clues
- Model based approaches
Active Shape Models
- Identification: based on spatio-temporal analysis of the video sequence
- Person represented by a deformable parametric model of the visible speech articulators (usually lips) with their temporal characteristics
- Active Shape Model consists of shape parameters (lip contours) and greyscale/colour intensity (for illumination)
- Model trained on a training set using PCA to recover the principal modes of deformation of the model
- Model used to track lips over time; model parameters recovered from lip tracking results
- Shape and intensity modelled by GMMs, temporal dependencies (state transition probabilities) by HMMs
- Verification: using a Viterbi algorithm, accept if the estimated likelihood that the client's model generated the observed feature sequence is above a threshold, else reject
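The accept/reject step can be sketched as a log-likelihood ratio test between a client model and a world model. This is a simplified illustration, not the Becars implementation: it scores frames independently with diagonal-covariance GMMs instead of running Viterbi over an HMM, and the model values, the `verify` helper and the threshold of 0.0 are all assumptions made for the example.

```python
import math

def log_gauss_diag(x, mean, var):
    """Log density of a diagonal-covariance Gaussian at feature vector x."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )

def log_gmm(x, gmm):
    """Log density of a GMM given as a list of (weight, mean, var) tuples."""
    terms = [math.log(w) + log_gauss_diag(x, m, v) for w, m, v in gmm]
    top = max(terms)
    # log-sum-exp keeps the sum numerically stable
    return top + math.log(sum(math.exp(t - top) for t in terms))

def verify(frames, client_gmm, world_gmm, threshold=0.0):
    """Accept when the average per-frame log-likelihood ratio exceeds threshold."""
    llr = sum(log_gmm(x, client_gmm) - log_gmm(x, world_gmm) for x in frames)
    llr /= len(frames)
    return llr > threshold, llr

# Toy models (hypothetical values, two-dimensional features)
client = [(0.5, [0.0, 0.0], [1.0, 1.0]), (0.5, [1.0, 1.0], [1.0, 1.0])]
world = [(1.0, [4.0, 4.0], [2.0, 2.0])]

accepted, score = verify([[0.1, -0.2], [0.9, 1.1], [0.4, 0.6]], client, world)
rejected, impostor_score = verify([[4.0, 4.0]], client, world)
```

In a real system the threshold would be tuned on development data to balance false acceptances against false rejections.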
Active Shape Models
- Robust detection, tracking & parameterisation of visual features
- Statistical; avoids use of constraints, thresholds, penalties
- Model only allowed to deform to shapes similar to those seen in the training set (trained using PCA)
- Represent object by a set of labelled points representing contours, height, width, area, etc.
- Model consists of 5 Bézier curves (B-spline functions), each defined by two end points P0 and P2 and one control point P1: P(t) = θ0(t)P0 + θ1(t)P1 + θ2(t)P2
- Figures: points distribution model; shape approximation
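The curve formula P(t) = θ0(t)P0 + θ1(t)P1 + θ2(t)P2 can be evaluated as follows. The slide does not define the weights θ; this sketch assumes they are the standard quadratic Bernstein polynomials, with P0 and P2 as end points and P1 as the control point. The function names and the sample coordinates are illustrative only.

```python
def bezier_point(p0, p1, p2, t):
    """Point on a quadratic Bezier curve at parameter t in [0, 1]."""
    # Assumed weights: quadratic Bernstein polynomials
    b0 = (1.0 - t) ** 2
    b1 = 2.0 * t * (1.0 - t)
    b2 = t ** 2
    return tuple(b0 * a + b1 * b + b2 * c for a, b, c in zip(p0, p1, p2))

def sample_curve(p0, p1, p2, n=5):
    """Return n evenly spaced points (in t) along the curve, n >= 2."""
    return [bezier_point(p0, p1, p2, i / (n - 1)) for i in range(n)]

# Hypothetical upper-lip segment: mouth corners at (0, 0) and (4, 0)
upper_lip = sample_curve((0.0, 0.0), (2.0, 1.0), (4.0, 0.0))
```

The curve passes through both end points but only bends towards the control point, so a lip contour built from a few such segments is described by a small number of coordinates.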
Spatio-temporal model
- Visual observation of speaker: O = o1, o2, …, oT
- Assumption: feature vectors follow a normal distribution as in the acoustic domain, modelled by GMMs
- Assumption: temporal changes are piece-wise stationary and follow a first order Markov process
- Each state in the HMM represents several consecutive feature vectors
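A minimal Viterbi sketch shows how a first order Markov model assigns runs of consecutive feature vectors to the same state. It makes several assumptions not stated on the slide: scalar features, a single Gaussian per state instead of a GMM, a two-state left-to-right topology, and made-up parameter values.

```python
import math

NEG_INF = float("-inf")

def log_gauss(x, mean, var):
    """Log density of a one-dimensional Gaussian."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)

def viterbi(obs, log_init, log_trans, states):
    """Best state path for obs; states is a list of (mean, var) pairs.

    Returns (log probability of the best path, list of state indices).
    """
    n = len(states)
    delta = [log_init[j] + log_gauss(obs[0], *states[j]) for j in range(n)]
    back = []
    for x in obs[1:]:
        prev, delta, ptr = delta, [], []
        for j in range(n):
            # Best predecessor of state j under first order Markov transitions
            best_i = max(range(n), key=lambda i: prev[i] + log_trans[i][j])
            ptr.append(best_i)
            delta.append(prev[best_i] + log_trans[best_i][j]
                         + log_gauss(x, *states[j]))
        back.append(ptr)
    last = max(range(n), key=lambda j: delta[j])
    path = [last]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    path.reverse()
    return delta[last], path

# Hypothetical two-state left-to-right model
states = [(0.0, 1.0), (5.0, 1.0)]
log_init = [0.0, NEG_INF]
log_trans = [[math.log(0.7), math.log(0.3)], [NEG_INF, 0.0]]
score, path = viterbi([0.1, -0.3, 0.2, 4.8, 5.1, 5.3], log_init, log_trans, states)
```

Here the first three observations align with state 0 and the last three with state 1, which is the piece-wise stationary behaviour the slide assumes.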
ASM: Training
ASM: Tracking
ASM: Lip Tracking Examples
Image Based Approach
- Hue and saturation levels to find lip region (ROI)
- Eliminate outliers (red blobs) by constraints (geometric, gradient, saturation)
- Motion constraints: difference image (1d), the pixelwise absolute difference between two adjacent frames
- Figure panels: a) greyscale image b) hue image c) binary hue/saturation thresholding d) accumulated difference image e) binary image after thresholding f) combined binary image (c AND e)
- Find largest connected region
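The motion and colour steps can be sketched on small nested-list images. This is an illustration under assumptions: the hue/saturation mask is given directly as a binary image, the motion threshold of 30 is arbitrary, regions use 4-connectivity, and all function names and pixel values are hypothetical.

```python
from collections import deque

def accumulated_diff(frames):
    """Sum of pixelwise absolute differences between adjacent frames."""
    h, w = len(frames[0]), len(frames[0][0])
    acc = [[0] * w for _ in range(h)]
    for a, b in zip(frames, frames[1:]):
        for r in range(h):
            for c in range(w):
                acc[r][c] += abs(a[r][c] - b[r][c])
    return acc

def threshold(img, t):
    """Binary image: 1 where the pixel value exceeds t."""
    return [[1 if v > t else 0 for v in row] for row in img]

def mask_and(a, b):
    """Pixelwise AND of two binary images."""
    return [[x & y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def largest_region(mask):
    """Pixels (row, col) of the largest 4-connected region of ones."""
    h, w = len(mask), len(mask[0])
    seen = [[False] * w for _ in range(h)]
    best = []
    for r in range(h):
        for c in range(w):
            if mask[r][c] and not seen[r][c]:
                region, queue = [], deque([(r, c)])
                seen[r][c] = True
                while queue:
                    y, x = queue.popleft()
                    region.append((y, x))
                    for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                        if (0 <= ny < h and 0 <= nx < w
                                and mask[ny][nx] and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
                if len(region) > len(best):
                    best = region
    return best

hue_mask = [[0, 0, 0, 0, 1],
            [0, 1, 1, 1, 0],
            [0, 1, 1, 1, 0],
            [0, 0, 0, 0, 0]]
prev = [[10] * 5 for _ in range(4)]
curr = [[10, 10, 10, 10, 10],
        [10, 60, 70, 60, 10],
        [10, 65, 10, 55, 10],
        [10, 10, 10, 90, 10]]

motion_mask = threshold(accumulated_diff([prev, curr]), 30)
combined = mask_and(hue_mask, motion_mask)
lips = largest_region(combined)
```

The AND step discards lip-coloured pixels that do not move and moving pixels that are not lip-coloured, so only the region satisfying both cues survives.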
Image Based Approach (2)
- Derive lip dimensions using colour and edge information
- Markov random field framework to combine the two sources of information and segment lips from background
- Implementation close to completion
Other Approaches
- Deformable template/model/contour based:
  - Geometric shapes, shape models, eigenvectors, appearance models
  - Deform in order to minimise an energy/distance function relating template parameters and image
  - Template matching (correlation), best-fit template, active shape models, active appearance models, model fitting problem
- Learning based approach: MLP, SVMs…
- Knowledge based approach: subject rules or information to find and extract features, eye/nose detection, symmetry
- Visual motion analysis:
  - Motion analysis techniques, motion cues, difference images after thresholding and filtering
  - Optical flow, filter tracking (computationally expensive)
- Hue and saturation thresholding: intensity of ruddy areas, problem of outlier removal
- Image subspace transforms: DCT, PCA, Discrete Wavelet, KLT (DWT + PCA analysis of ROI), FFT
Fusion of audio-visual information
- Instance of general classifier problem (bimodal classifier)
- 2 observation streams: audio + video, providing information about hidden class labels
- Typically each observation stream is used to train a single-modality classifier
- Aim: combine both streams to produce a bimodal classifier that recognises pertinent classes with a higher level of accuracy
- 2 general types/levels of fusion:
  - Feature fusion
  - Decision fusion
Feature Fusion
- Feature fusion: HMM classifier, concatenated feature vector of audio and visual parameters (time-synchronous features, possibly including upsampling)
- Generation process of feature vector
- Using single stream HMM with emission (class conditional observation) probabilities given by Gaussian distribution:
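Building the concatenated, time-synchronous feature vector can be sketched as below. The slide only says upsampling is possible; linear interpolation of the slower visual stream to the audio frame count is an assumption here, and the function names and toy values are hypothetical.

```python
def upsample_linear(frames, n_out):
    """Linearly interpolate a list of feature vectors to n_out frames."""
    n_in = len(frames)
    if n_in == 1:
        return [list(frames[0]) for _ in range(n_out)]
    out = []
    for k in range(n_out):
        # Position of output frame k on the input time axis
        pos = k * (n_in - 1) / (n_out - 1) if n_out > 1 else 0.0
        i = min(int(pos), n_in - 2)
        frac = pos - i
        out.append([(1.0 - frac) * a + frac * b
                    for a, b in zip(frames[i], frames[i + 1])])
    return out

def fuse_features(audio, visual):
    """Concatenate each audio frame with the time-aligned visual frame."""
    visual_up = upsample_linear(visual, len(audio))
    return [list(a) + v for a, v in zip(audio, visual_up)]

# Toy streams: 5 audio frames (2 coefficients), 3 visual frames (1 coefficient)
audio = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8], [0.9, 1.0]]
visual = [[0.0], [1.0], [2.0]]
fused = fuse_features(audio, visual)
```

Each fused frame then feeds a single-stream HMM, so the two modalities are modelled jointly at the cost of a larger feature dimension.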
Decision Fusion
- State synchronous decision fusion
- Captures reliability of each stream
- At HMM state level, combine single-modality HMM classifier outputs
- Class conditional log-likelihoods from the 2 classifiers linearly combined with appropriate weights
- Various levels: state (phone, syllable, word…)
- Multi-stream HMM classifier, state emission probs:
- Product HMMs, factorial HMMs…
- Other classifiers (SVMs, Bayesian classifiers, MLP…)
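The linear combination of class conditional log-likelihoods can be sketched as follows. The slide only says "appropriate weights"; constraining the two weights to sum to one, the default audio weight of 0.7, and the score values are assumptions for illustration.

```python
def fuse_scores(audio_ll, visual_ll, audio_weight=0.7):
    """Weighted sum of audio and visual log-likelihoods (weights sum to 1)."""
    if not 0.0 <= audio_weight <= 1.0:
        raise ValueError("audio_weight must lie in [0, 1]")
    return audio_weight * audio_ll + (1.0 - audio_weight) * visual_ll

def decide(audio_scores, visual_scores, audio_weight=0.7):
    """Pick the class with the highest fused log-likelihood.

    Both arguments map class label -> log-likelihood from one modality.
    """
    fused = {c: fuse_scores(audio_scores[c], visual_scores[c], audio_weight)
             for c in audio_scores}
    return max(fused, key=fused.get), fused

# Hypothetical scores: audio slightly favours "world", video favours "client"
audio_scores = {"client": -42.0, "world": -40.0}
visual_scores = {"client": -10.0, "world": -20.0}

label, fused = decide(audio_scores, visual_scores, audio_weight=0.7)
audio_only_label, _ = decide(audio_scores, visual_scores, audio_weight=1.0)
```

The weight expresses how much each stream is trusted: lowering the audio weight in noisy acoustic conditions lets the visual stream dominate the decision.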
BANCA: results