Decoding Human Face Processing Ankit Awasthi Prof. Harish Karnick
Motivation • One of the most important goals of computer vision research is to come up with an algorithm that can process face images and classify them into different categories (based on gender, emotion, identity, etc.) • Humans are extremely good at these tasks • To match human performance, and eventually beat it, it is imperative that we understand how humans do it
Motivation • Moreover, similar cognitive processes might be involved in the processing of other kinds of visual data, or even data from other modalities • Discovering the computational basis of face processing might point towards generic cognitive structures
Where does our work fit in? • There is a large number of neurological and psychological experimental findings • These have implications for computer vision algorithms • Closing the loop
Neural Networks (~1985) • Compare outputs with the correct answer to get an error signal • Back-propagate the error signal to get derivatives for learning • [Diagram: input vector → hidden layers → outputs]
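The training loop sketched on this slide can be written out in a few lines. This is a minimal illustration only: the tiny layer sizes, the toy data, and the cross-entropy-style error signal are assumptions of mine, not details from the slides.

```python
# Minimal backpropagation sketch: compare outputs with the correct
# answer to get an error signal, then propagate it backwards to get
# derivatives for learning. All sizes and data here are illustrative.
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# One hidden layer: input(4) -> hidden(8) -> output(1)
W1 = rng.normal(0, 0.5, (4, 8))
W2 = rng.normal(0, 0.5, (8, 1))

x = rng.normal(0, 1, (16, 4))                 # 16 toy input vectors
y = (x.sum(axis=1, keepdims=True) > 0) * 1.0  # toy target: sign of the sum

lr = 1.0
for _ in range(1000):
    # Forward pass
    h = sigmoid(x @ W1)
    out = sigmoid(h @ W2)
    # Error signal at the output: compare with the correct answer
    d_out = out - y
    # Back-propagate the error signal to the hidden layer
    d_h = (d_out @ W2.T) * h * (1 - h)
    # Derivatives for learning (averaged over the batch)
    W2 -= lr * h.T @ d_out / len(x)
    W1 -= lr * x.T @ d_h / len(x)

print(float(np.mean((out > 0.5) == (y > 0.5))))  # training accuracy
```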
Why Deep Learning? • Brains have a deep architecture • Humans organize their ideas hierarchically, through composition of simpler ideas • Insufficiently deep architectures can be exponentially inefficient • Deep architectures facilitate feature and sub-feature sharing
Restricted Boltzmann Machines (RBM) • We restrict the connectivity to make learning easier • Only one layer of hidden units • No connections between hidden units • Energy of a joint configuration (v, h): • E(v, h) = − Σ_i a_i v_i − Σ_j b_j h_j − Σ_i,j v_i w_ij h_j (for binary visible units) • E(v, h) = Σ_i (v_i − a_i)² / 2σ_i² − Σ_j b_j h_j − Σ_i,j (v_i / σ_i) w_ij h_j (for real, i.e. Gaussian, visible units) • [Diagram: bipartite graph, hidden units h (index j) fully connected to visible units v (index i)]
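Because the conditionals factorize in an RBM (no hidden-hidden or visible-visible connections), one learning step is cheap. Below is a sketch of a single contrastive-divergence (CD-1) update for the binary-visible case; CD-1 is the standard RBM learning rule from the cited Hinton references, and the sizes and toy batch here are illustrative assumptions.

```python
# One CD-1 update for a binary RBM: positive phase on data, one Gibbs
# step for the negative phase, then the <vh>_data - <vh>_recon update.
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

n_vis, n_hid = 6, 4
W = rng.normal(0, 0.1, (n_vis, n_hid))
a = np.zeros(n_vis)   # visible biases
b = np.zeros(n_hid)   # hidden biases

v0 = rng.integers(0, 2, (10, n_vis)).astype(float)  # toy binary batch

# Positive phase: P(h_j = 1 | v) = sigmoid(b_j + sum_i v_i w_ij)
p_h0 = sigmoid(v0 @ W + b)
h0 = (rng.random(p_h0.shape) < p_h0).astype(float)

# Negative phase: one Gibbs step back to the visibles and up again
p_v1 = sigmoid(h0 @ W.T + a)
v1 = (rng.random(p_v1.shape) < p_v1).astype(float)
p_h1 = sigmoid(v1 @ W + b)

# CD-1 approximation to the likelihood gradient
lr = 0.1
W += lr * (v0.T @ p_h0 - v1.T @ p_h1) / len(v0)
a += lr * (v0 - v1).mean(axis=0)
b += lr * (p_h0 - p_h1).mean(axis=0)
```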
Sparse DBNs (Lee et al., 2007) • To obtain a sparse hidden layer, the average activation of each hidden unit over the training set is constrained to a small target value p • The optimization problem in the learning algorithm then takes the form: minimize −Σ_l log P(v^(l)) + λ Σ_j ( p − (1/m) Σ_l E[h_j | v^(l)] )², where m is the number of training examples and λ weights the sparsity penalty
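The sparsity term penalizes the squared gap between a target activation p and each hidden unit's mean activation over the batch. A small sketch of computing it is below; the target value, penalty weight, and toy activations are illustrative assumptions, and applying the gradient only to the hidden biases is a common practical shortcut rather than something stated on the slide.

```python
# Sketch of the sparsity penalty: lambda * sum_j (p - mean_l E[h_j|v^(l)])^2
import numpy as np

p_target = 0.05   # desired mean activation of each hidden unit (assumed)
lam = 0.1         # penalty weight (assumed)

p_h = np.array([[0.9, 0.1, 0.02],
                [0.7, 0.0, 0.04]])   # E[h_j | v] for a toy batch of 2

mean_act = p_h.mean(axis=0)          # per-unit average activation over batch
penalty = lam * np.sum((p_target - mean_act) ** 2)
# Gradient w.r.t. the mean activations, typically pushed into hidden biases:
bias_grad = -2 * lam * (p_target - mean_act)
print(round(penalty, 4))  # → 0.0563
```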
Important observations about DBNs • In our experiments we found that: • Fine-tuning was important only for the construction of the autoencoder • The final softmax layer can then be learned on top of the learned representation with only marginal loss in accuracy
Neural Underpinnings (Sinha et al., 2006) • The human visual system appears to devote specialized resources to face perception • The latency of responses to faces in infero-temporal cortex is about 120 ms, suggesting a largely feed-forward computation • Facial identity and emotion might be processed separately • This is one of the reasons we restricted ourselves to emotion and gender classification
Experiments and Dataset • Gender and emotion recognition (happy vs. neutral) • Training images: 300 images of size 50x50 • Test images: 98 images of size 50x50
Results on Normal Images • Same network architecture used for all experiments (3000 → 1000 → 500 → 200 → 100) • Gender recognition: 94% • Emotion recognition: 93%
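The architecture quoted above (3000 → 1000 → 500 → 200 → 100, with a softmax on top of the deepest code) can be sketched as a forward pass. This is a minimal illustration: random weights stand in for the layer-wise pretrained RBM stack, and the two-class softmax (e.g. male/female) is my assumption about the output layer.

```python
# Forward pass through a stack of sigmoid encoder layers with the
# slide's layer sizes, plus a 2-way softmax classifier on the top code.
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

sizes = [3000, 1000, 500, 200, 100]
weights = [rng.normal(0, 0.01, (m, n)) for m, n in zip(sizes, sizes[1:])]
biases = [np.zeros(n) for n in sizes[1:]]

def encode(v):
    """Deterministic up-pass through the stack of sigmoid layers."""
    for W, b in zip(weights, biases):
        v = sigmoid(v @ W + b)
    return v

# Softmax layer on the 100-d code (2 classes, e.g. male vs. female)
W_out = rng.normal(0, 0.01, (100, 2))

def classify(v):
    z = encode(v) @ W_out
    e = np.exp(z - z.max(axis=1, keepdims=True))  # stable softmax
    return e / e.sum(axis=1, keepdims=True)

probs = classify(rng.random((5, 3000)))   # 5 stand-in 'images'
```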
Low vs. High Spatial Frequency • A number of contradictory results • General consensus: • Low spatial frequencies are more important than higher spatial frequencies, hinting at the importance of configural information • High-frequency information by itself does not lead to good performance • How do we reconcile this with the everyday recognizability of line drawings? • The spatial frequency band employed for emotion classification is higher than that employed for gender classification (Deruelle and Fagot, 2004)
Experiments • We cut off all spatial frequencies above 8 cycles per face • Two cases each for gender and emotion recognition: • A model trained on 'normal' images is tested on low spatial frequency (LSF) images • A model trained on LSF images is tested on LSF images
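The 8 cycles/face cutoff can be sketched as a radial low-pass filter in the Fourier domain. The circular-mask implementation, and the assumption that the face fills the 50x50 image (so cycles/face = cycles/image), are mine; the slides only state the cutoff value.

```python
# Low-pass filter a 50x50 face image at ~8 cycles per face by zeroing
# all Fourier components outside a radius of 8 around the DC term.
import numpy as np

def low_pass(img, cycles=8):
    h, w = img.shape
    F = np.fft.fftshift(np.fft.fft2(img))        # DC term at the centre
    yy, xx = np.mgrid[0:h, 0:w]
    # Radial spatial frequency in cycles per image (face fills the image)
    r = np.hypot(yy - h / 2, xx - w / 2)
    F[r > cycles] = 0                            # drop high frequencies
    return np.real(np.fft.ifft2(np.fft.ifftshift(F)))

img = np.random.default_rng(0).random((50, 50))  # stand-in 'face'
lsf = low_pass(img, cycles=8)
print(lsf.shape)  # → (50, 50)
```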
Results • Gender Recognition • Model trained on ‘normal’ images ~ 89% • Model trained on LSF images ~ 91% • Emotion Recognition • Model trained on ‘normal’ images ~ 87% • Model trained on LSF images ~ 90.5%
Discussion • The decrease in accuracy is small considering the significant reduction in the amount of information • This implies that low spatial frequency information can be used to classify a majority of images • Tests with different spatial frequency cutoffs are needed to reach a conclusive answer • The importance of HSF is not apparent here because of the simplicity of the task • In other experiments where we looked at HSF-only images, the results were poor
Component and Configural Information • Facial features are processed holistically in identity recognition (Sinha et al., 2006) and in emotion recognition (Durand et al., 2007) • The configural information affects how individual features are processed • On the other hand, there is evidence that we process face images by matching parts • Thatcher illusion: configural information affects how individual features are processed
Experiments • Two kinds of experiments: • Models trained on 'normal' images are tested on the new (manipulated) images • The same manipulation is applied to both training and test images
Results (Gender Classification) • Models trained on 'normal' images: ~91%, ~80%, ~70%, ~random!! (one value per part-manipulation condition; condition images omitted)
Results (Gender Classification) • Same training and test images: ~93%, ~85%, ~79%
Results (Emotion Classification) • Models trained on 'normal' images: ~87%, ~81%, ~87%, ~random!!
Results (Emotion Classification) • Same training and test images: ~92%, ~84%, ~82%
Agreement with Human Performance • Preliminary results show that humans are: • Perfect on the normal images we are using • Error-prone when parts are removed (3 out of 20 images on average) • Accuracy depends a lot on the exposure time • Properly timed experiments are expected to yield results much more similar to the algorithm's
Discussion • The importance of key features (eyes, mouth) is evident • Eyes/eyebrows are important for gender recognition • When trained on 'normal' images the algorithm learns features corresponding to these important parts • In the absence of these features the algorithm learns to extract other features to increase accuracy
Inversion Effect • One of the first findings that hinted at a dedicated face processing pathway • Another indicator of configural processing of face images • Inverted images take significantly longer to process
Experiments and Results • Models trained on 'normal' images: the results are "random"!! • Training and testing on inverted images gives the same results as doing both on 'normal' images • These results show that the model's face processing is not purely part-based
Experiments and Results • Models trained on 'normal' images: random for both tasks!! • Same (inverted) training and test images: • Gender: 92% • Emotion: 91%
High-Level Features • Only a few connections to the previous layer have weights that are either very high or very low • Some of the largest-weighted connections are used to form a linear combination of lower-layer features • This overlooks the non-linearity in the network from one layer to the next
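The visualization described above, a linear combination of the largest-weighted connections that ignores the non-linearity, can be sketched as follows. The layer sizes, the number of connections kept, and the 5x5 'image' shape are illustrative assumptions.

```python
# Approximate a deep unit's input-space 'filter' by propagating its
# strongest weights down one layer linearly (ignoring the sigmoid).
import numpy as np

rng = np.random.default_rng(0)

W1 = rng.normal(0, 0.1, (25, 10))   # input (5x5 patch) -> layer 1
W2 = rng.normal(0, 0.1, (10, 4))    # layer 1 -> layer 2

def visualize_unit(unit, top_k=3):
    """Linear input-space approximation of one layer-2 unit."""
    w = W2[:, unit]
    # Keep only the few strongest connections to the layer below
    keep = np.argsort(np.abs(w))[-top_k:]
    mask = np.zeros_like(w)
    mask[keep] = w[keep]
    # Linear combination of the corresponding layer-1 filters
    return (W1 @ mask).reshape(5, 5)

patch = visualize_unit(0)
print(patch.shape)  # → (5, 5)
```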
Natural Extensions • A more exhaustive set of experiments is needed to verify our preliminary observations • It would be interesting to compare other models with deep networks • Some of the problems or inconsistencies are due to the lack of translation-invariant features • The best solution is to use a convolutional model: • Natural regularizer • Translational invariance • Biologically plausible
Conclusion • We have carried out preliminary investigations of various phenomena • The observed results certainly hint at the cognitive relevance of the model
References • Geoffrey E. Hinton, Simon Osindero and Yee-Whye Teh, A Fast Learning Algorithm for Deep Belief Nets. Neural Computation, pages 1527-1554, Volume 18, 2006. • Geoffrey E. Hinton (2010). A Practical Guide to Training Restricted Boltzmann Machines, Technical Report, Volume 1 • Dumitru Erhan, Yoshua Bengio, Aaron Courville, and Pascal Vincent (2010). Visualizing Higher-Layer Features of a Deep Network, Technical Report 1341 • Honglak Lee, Roger Grosse, Rajesh Ranganath, Andrew Y. Ng. Convolutional Deep Belief Networks for Scalable Unsupervised Learning of Hierarchical Representations, ICML 2009 • Geoffrey E. Hinton, Learning Multiple Layers of Representation, Trends in Cognitive Sciences, Vol. 11, No. 10, 2007 • Honglak Lee, Chaitanya Ekanadham, Andrew Y. Ng, Sparse Deep Belief Net Model for Visual Area V2, NIPS, 2007
References • Olshausen, B. A., Field, D. J. (1997). Sparse Coding with an Overcomplete Basis Set: A Strategy Employed by V1? Vision Research, 37: 3311-3325. • Karine Durand, Mathieu Gallay, Alix Seigneuric, Fabrice Robichon, Jean-Yves Baudouin, The Development of Facial Emotion Recognition: The Role of Configural Information, Journal of Experimental Child Psychology, 2007 • Pawan Sinha, Benjamin Balas, Yuri Ostrovsky, Richard Russell, Face Recognition by Humans: Nineteen Results All Computer Vision Researchers Should Know About, Proceedings of the IEEE, 2006 • Christian Wallraven, Adrian Schwaninger, Heinrich H. Bülthoff, Learning from Humans: Computational Modeling of Face Recognition, Network: Computation in Neural Systems • Christine Deruelle and Joël Fagot, Categorizing Facial Identities, Emotions, and Genders: Attention to High- and Low-Spatial Frequencies by Children and Adults