Learn about model-based lip feature extraction and tracking techniques for combining audio and visual cues in speech recognition systems. Explore energy function minimization, statistical shape models, and active shape models.
Model-Based Facial Feature Extraction Multimodal Interaction Dr. Mike Spann m.spann@bham.ac.uk http://www.eee.bham.ac.uk/spannm
Contents • Introduction • Lip feature extraction and tracking • Summary
Lip feature extraction and tracking • Lip feature tracking is an important step in combining audio and visual cues for speech recognition systems • Typically the lip boundaries (inner/outer/both) are tracked and shape features passed to the speech recognition module • Previous approaches • Active contour model (snakes) • Energy function minimisation used to control contour shape (curvature) and local greylevel (colour) gradient • Can be dependent on weighting parameters which need to be tuned
Lip feature extraction and tracking • Typically an energy function E is defined in terms of the parameterised snake v(s)=(x(s),y(s)) where s is the distance along the snake: • The first two terms represent the snake’s internal energy and control its tension and rigidity • The third term attracts the snake to object boundaries with high greylevel gradient • Often an additional term is added for a ‘balloon’ snake to either inflate or deflate the snake
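The energy expression itself is not reproduced in this text. A commonly used snake energy that matches the three terms described above is sketched here as an assumption; the weights α, β, γ and the image intensity I are symbols introduced for illustration, not taken from the slide:

```latex
E = \int \left( \alpha \left| \frac{\partial v}{\partial s} \right|^{2}
              + \beta \left| \frac{\partial^{2} v}{\partial s^{2}} \right|^{2}
              - \gamma \left| \nabla I\big(v(s)\big) \right|^{2} \right) ds
```

In this form α weights the tension term, β the rigidity term, and γ the attraction to high greylevel gradient; these are the weighting parameters that typically need tuning.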
Lip feature extraction and tracking • More recent approaches to lip localisation and tracking have been model-based • A statistical shape model of the inner and outer lip contours can be built from training data • Landmarks on the contour form pointsets: • We need to align the pointsets and then build a statistical model using PCA
Lip feature extraction and tracking • Pointsets of lip feature landmarks must be normalized for translation, scale and rotation • We can use a simple iterative algorithm to align to the mean pointset
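The iterative alignment mentioned above can be sketched as follows. This is a minimal illustration, assuming each pointset is an (n, 2) NumPy array and using a least-squares similarity (Procrustes) fit; the function names `align_to` and `align_pointsets` are hypothetical, and the slide does not specify which alignment variant was used.

```python
import numpy as np

def align_to(shape, target):
    """Least-squares similarity alignment of one (n, 2) pointset to a target."""
    a = shape - shape.mean(axis=0)           # remove translation
    b = target - target.mean(axis=0)
    u, _, vt = np.linalg.svd(a.T @ b)        # optimal rotation via SVD
    r = u @ vt
    if np.linalg.det(r) < 0:                 # forbid reflections
        u[:, -1] *= -1
        r = u @ vt
    a_rot = a @ r
    scale = np.sum(a_rot * b) / np.sum(a * a)
    return scale * a_rot + target.mean(axis=0)

def align_pointsets(shapes, n_iter=10, tol=1e-8):
    """Iteratively align all pointsets to their (re-estimated) mean."""
    shapes = np.asarray(shapes, dtype=float)
    reference = shapes[0] - shapes[0].mean(axis=0)
    reference /= np.linalg.norm(reference)   # fix scale of the reference
    mean = reference.copy()
    for _ in range(n_iter):
        aligned = np.array([align_to(s, mean) for s in shapes])
        # Re-anchor the new mean to the reference so pose cannot drift
        new_mean = align_to(aligned.mean(axis=0), reference)
        new_mean -= new_mean.mean(axis=0)
        new_mean /= np.linalg.norm(new_mean)
        converged = np.linalg.norm(new_mean - mean) < tol
        mean = new_mean
        if converged:
            break
    return aligned, mean
```

Normalising the mean to unit size on every pass is one simple way to stop the aligned shapes from shrinking towards zero.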
Lip feature extraction and tracking • PCA is based on the mean and covariance of the pointset vectors computed across the training set: • We then compute our shape model by solving the eigenvector/eigenvalue equation: • where Λ is a diagonal matrix of eigenvalues:
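A minimal sketch of this PCA step, assuming the aligned pointsets are stacked as rows of an (N, 2n) array X. The name `build_shape_model` is hypothetical, and the 1/(N-1) covariance normalisation used by `np.cov` is an assumption, since the slide's formula is not reproduced here.

```python
import numpy as np

def build_shape_model(X, n_modes):
    """Return mean, eigenvector matrix P and eigenvalues of the shape model.

    X: (N, 2n) array, one aligned pointset vector per row.
    """
    X = np.asarray(X, dtype=float)
    mean = X.mean(axis=0)
    S = np.cov(X, rowvar=False)               # covariance across the training set
    eigvals, eigvecs = np.linalg.eigh(S)      # symmetric solver, ascending order
    order = np.argsort(eigvals)[::-1][:n_modes]
    return mean, eigvecs[:, order], eigvals[order]
```

The returned columns of P satisfy S P = P Λ for the retained eigenvalues, sorted from largest to smallest variance.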
Lip feature extraction and tracking • We can represent each landmark pointset x by a corresponding shape vector b • The set of bi values across all of the pointsets in the database represents the ith mode of variation of the original data • We can vary each bi to get realistic versions of lip shapes • Typically for the ith eigenvalue λi:
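The mapping between a pointset x and its shape vector b can be sketched as below. This assumes the usual linear form x ≈ mean + P b with orthonormal columns in P, which the slide implies but does not print. The bound k·√λi with k = 3 is an assumption chosen for illustration, since the limit given on the slide is not reproduced; all function names are hypothetical.

```python
import numpy as np

def shape_to_params(x, mean, P):
    """Project a pointset vector x onto the shape modes (columns of P)."""
    return P.T @ (x - mean)

def params_to_shape(b, mean, P):
    """Reconstruct a pointset vector from shape parameters b."""
    return mean + P @ b

def clamp_params(b, eigvals, k=3.0):
    """Keep each b_i within +/- k*sqrt(lambda_i) (k is an assumed value)."""
    limit = k * np.sqrt(eigvals)
    return np.clip(b, -limit, limit)
```

Varying one component of b while holding the others at zero, then calling `params_to_shape`, gives the lip shapes along a single mode of variation.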
Lip feature extraction and tracking • An active shape model samples greylevels perpendicular to the lip contour and centred at the model points
Lip feature extraction and tracking • We sample the profiles perpendicular to each model point j • Training image i then gives us a vector of greylevels gij • We concatenate all these greylevel vectors to give us a global profile vector hi • We build a statistical model out of these profile vectors to enable the main modes of variation of the profiles about the model boundaries to be computed
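Profile sampling for one training image can be sketched as follows. This is an illustration only: it assumes a 2-D greylevel array indexed as image[row, col], model points given as (x, y), precomputed unit normals, and nearest-pixel sampling; the function name and the `half_len` parameter are hypothetical.

```python
import numpy as np

def sample_profiles(image, points, normals, half_len=3):
    """Return the global profile vector h for one image.

    image:   2-D greylevel array, indexed [row, col]
    points:  (n, 2) model points as (x, y)
    normals: (n, 2) unit normals to the contour at each point
    """
    points = np.asarray(points, dtype=float)
    normals = np.asarray(normals, dtype=float)
    offsets = np.arange(-half_len, half_len + 1)
    # Sample positions along each normal, shape (n, 2*half_len + 1, 2)
    pos = points[:, None, :] + offsets[None, :, None] * normals[:, None, :]
    cols = np.clip(np.rint(pos[..., 0]).astype(int), 0, image.shape[1] - 1)
    rows = np.clip(np.rint(pos[..., 1]).astype(int), 0, image.shape[0] - 1)
    g = image[rows, cols]                 # row j holds the profile g_ij
    return g.reshape(-1).astype(float)    # concatenate into h_i
```

Stacking the vectors returned for every training image gives the data matrix from which the profile model is built by PCA, in the same way as the shape model.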
Lip feature extraction and tracking • The weight vectors bh can be used as a parameter in a cost function to determine how well the actual profile fits the model
Lip feature extraction and tracking • The greylevels between profile vectors can be interpolated to visualise the greylevel models • Some smoothing using a median filter helps remove any artefacts of the interpolation • We can visualise several modes corresponding to the first few eigenvectors • The corresponding components of the weight vector bh can be varied according to: • For example we can set bhi to ±2√λi for i=1,2,3
Lip feature extraction and tracking • Mode 1 • Global illumination differences • Mode 2 • Lower/Upper lip intensity difference • Mode 3 • Skin/lip contrast differences • Higher modes • Illumination variations, visibility of teeth and tongue etc
Lip feature extraction and tracking • In order to apply an ASM search algorithm, a coarse estimate of the region of interest containing the lips is found • Can be input interactively or computed automatically using a segmentation or edge-based feature extraction algorithm • Provides an estimate of the scale of the lips • Limits the search area
Lip feature extraction and tracking • To use the greylevel and shape models in a search algorithm, we fit the model greylevel profile to the greylevel profile measured in the image • Shape and pose parameters can then be updated • We need a cost function which describes the fit between the model greylevel profile and the profile measured in the image at the current model position • Several statistical approaches are possible • Maximizing the probability assuming Gaussian distributions • Minimizing the mean square error between the profiles
Figure labels: sample profile h; current model position
Lip feature extraction and tracking • We can define an error function E defining the mismatch between the actual profile h measured at the current position estimate and our model profile hm: • Substituting for hm : • Typically hm would comprise only the first few modes of variation
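One way to compute such a mismatch is sketched below. It assumes E is the sum of squared differences between h and hm (the slide's exact expression is not reproduced, though minimizing the squared error between profiles is one of the approaches it lists), and that hm is the mean profile plus a weighted sum of the retained profile modes. The names `profile_fit_error`, `h_mean` and `P_h` are hypothetical.

```python
import numpy as np

def profile_fit_error(h, h_mean, P_h):
    """Squared error between a measured profile and its model approximation.

    h:      measured global profile vector
    h_mean: mean profile vector from the training set
    P_h:    matrix whose orthonormal columns are the retained profile modes
    """
    b_h = P_h.T @ (h - h_mean)      # profile weights for this sample
    h_m = h_mean + P_h @ b_h        # model profile from the retained modes
    residual = h - h_m
    return float(residual @ residual), b_h
```

Because only the first few modes are kept, E measures the part of the measured profile that the model cannot explain.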
Lip feature extraction and tracking • The model is initialized with the mean shape computed over aligned shapes in the training set • Our goal is to minimize our energy function E in terms of translations tx and ty, a scale parameter s and a rotation angle θ along with the profile parameter vector:
Lip feature extraction and tracking • Optimization is carried out by perturbing individual parameters and evaluating their effects on the energy function E • Only a few (typically 10-20) shape modes are used in the search to ease the computational burden • Perturbations in bi are limited to: • For a given position of the model landmarks, the profile h is sampled and bh computed according to:
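A perturbation-based search of this kind can be sketched as a simple coordinate-wise descent. This is an illustrative stand-in rather than the optimiser used in the original work: `energy` is a hypothetical callable that evaluates E for a parameter vector (pose and shape parameters concatenated), and `lower`/`upper` stand for whatever limits are placed on the bi.

```python
import numpy as np

def perturbation_search(energy, params, steps, lower=None, upper=None, n_sweeps=50):
    """Perturb one parameter at a time and keep changes that reduce E."""
    params = np.array(params, dtype=float)
    steps = np.asarray(steps, dtype=float)
    lower = np.full(params.shape, -np.inf) if lower is None else np.asarray(lower, dtype=float)
    upper = np.full(params.shape, np.inf) if upper is None else np.asarray(upper, dtype=float)
    best = energy(params)
    for _ in range(n_sweeps):
        improved = False
        for i in range(params.size):
            for delta in (steps[i], -steps[i]):
                trial = params.copy()
                trial[i] = np.clip(trial[i] + delta, lower[i], upper[i])
                e = energy(trial)
                if e < best:                  # accept only strict improvements
                    params, best, improved = trial, e, True
                    break
        if not improved:                      # no parameter helped: stop
            break
    return params, best
```

Restricting the parameter vector to a few shape modes keeps the number of energy evaluations per sweep small, which is the computational saving the slide refers to.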
Lip feature extraction and tracking • We can devise an iterative algorithm to update the pose and shape parameters sequentially based on our error measure • The algorithm alternates between ‘model space’ and ‘image space’ • The object boundary in model space is defined by the shape parameters • We can use the greylevel or colour profile information to measure the error in image space • Conversion between the two spaces is done via the pose parameters
Lip feature extraction and tracking • Figure labels: model space (b); image space (bh)
Lip feature extraction and tracking 1. Initialize the shape parameters b to zero and the image points y 2. Generate the model point positions: 3. Find the pose parameters tx, ty, s, θ to best fit the model points to the image points y 4. Project the model points into the image frame, x->T(x) 5. Compute the image profile vector h and, at each projected model point, search normal to the model boundary to find the image points y’ which minimize E, producing a new image profile vector h’
Lip feature extraction and tracking 6. Project the image points y’ into the model coordinate frame by inverting the transformation T 7. Update the model parameters 8. If not converged, set y->y’ and go to step 2
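The eight steps can be sketched as the loop below. Several parts are assumptions made for illustration: the model points are generated as mean + P b, the parameter update is the projection b = Pᵀ(x' − mean) clamped to k·√λi, convergence is judged by the change in b, and the profile search of step 5 is delegated to a hypothetical callback `find_targets` that returns the best-fitting image point for each projected model point.

```python
import numpy as np

def fit_pose(x, y):
    """Least-squares similarity transform T(x) = A x + t mapping x onto y."""
    xc, yc = x.mean(axis=0), y.mean(axis=0)
    xs, ys = x - xc, y - yc
    denom = np.sum(xs ** 2)
    a = np.sum(xs * ys) / denom                                    # s*cos(theta)
    b = np.sum(xs[:, 0] * ys[:, 1] - xs[:, 1] * ys[:, 0]) / denom  # s*sin(theta)
    A = np.array([[a, -b], [b, a]])
    return A, yc - A @ xc

def asm_search(mean, P, eigvals, y, find_targets, n_iter=20, k=3.0, tol=1e-6):
    """Alternate between model space (b) and image space (points y)."""
    b = np.zeros(P.shape[1])                          # step 1
    limit = k * np.sqrt(eigvals)
    for _ in range(n_iter):
        x = (mean + P @ b).reshape(-1, 2)             # step 2
        A, t = fit_pose(x, y)                         # step 3
        projected = x @ A.T + t                       # step 4
        y_new = find_targets(projected)               # step 5 (assumed callback)
        x_new = (y_new - t) @ np.linalg.inv(A).T      # step 6
        b_new = P.T @ (x_new.reshape(-1) - mean)      # step 7
        b_new = np.clip(b_new, -limit, limit)
        converged = np.linalg.norm(b_new - b) < tol   # step 8
        b, y = b_new, y_new
        if converged:
            break
    return b, (A, t), y
```

In a full implementation `find_targets` would sample profiles along each normal and return the positions that minimize E; here it is left abstract so that the alternation between the two spaces is easy to follow.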
Lip feature extraction and tracking • Figure labels: model point; nearest image point to model point; image boundary
Lip feature extraction and tracking • It is easier to track the outer lips than the inner ones • More constant greylevel profile • Easier to model, for example in active shape modelling • But less appropriate for lip gesture recognition and speech recognition algorithms • Often using a full appearance model rather than just a shape model gives better speech recognition performance • For example the teeth and tongue appearance give clues to particular types of vocal sounds
Lip feature extraction and tracking • Results of off-centre initialization of ASM using local greylevel profiles after 5, 10 and 20 iterations
Lip feature extraction and tracking • Results using ASM search with local greylevel profiles
Lip feature extraction and tracking • Demo • http://www.ee.surrey.ac.uk/Projects/M2VTS/experiments/lip_tracking/index.html
Summary • We have looked at how a shape model, together with a model describing greylevel or colour variation local to the shape model landmark positions, can be used for finding the lip contour location in face images • We have described an iterative model-based search algorithm for lip contour location • We have shown lip tracking results based on this algorithm