Attentive People Finding

James Elder, Centre for Vision Research, York University, Toronto, Canada
Joint work with: Simon Prince, Bob Hou

Research Context
Collaborative Project: “Monitoring Changes to Urban Environments with a Network of Sensors”
Funding: GEOIDE (Geomatics for Informed Decisions), a Canadian agency
"This ‘network of networks’ brings together the skills, technology and people from different communities of practice, in order to develop and consolidate the Canadian competences in geomatics."

What is our project?
Monitoring Changes to Urban Environments
"This project will study visual detection and interpretation of changes to urban environments using continuous and non-continuous sensing from a multiplicity of diverse sensors using networks of video cameras, augmented with high-resolution satellite imagery. It will also investigate the problem of how such information can be integrated and managed within a computer, leading to the development of a prototype information system for monitoring urban environments."

Project Team
University Principal Investigators:
• David Clausi, Waterloo
• Geoffrey Edwards, Laval
• James Elder, York
• Frank Ferrie, McGill
• Jim Little, UBC
Main Industry Partners:
• CAE
• Genetec
• Aimetis

Timeframe: April 2005 – March 2009

Objectives
1. Establishment of urban test facilities involving networks of multi-sensor wireless cameras with associated satellite data, and development of intercalibration software. (Elder, Ferrie, Little)
2. Development of algorithms for fusing offline satellite data with streaming video from terrestrial sensors for the construction of more complete 3D urban models. (Clausi)
3. Development of algorithms for inferring approximate intrinsic images from monocular video (ordinal depth maps, reflectance maps, …). (Elder, Ferrie, Little)
4. Development of algorithms for identifying and modeling typical dynamic events (e.g. pedestrian and automobile traffic, changes in climate, air quality, seasonal changes) and detecting unusual events. (Elder, Ferrie, Little)
5. Development of algorithms for deriving and updating navigational maps based upon derived models. (Edwards)
6. Development of integrated demonstration system. (Ferrie)

Possible Application Areas
• Disaster management (e.g., earthquakes)
• Traffic monitoring (e.g., automobile, trucking, pedestrian)
• Security (e.g., people tracking, activity and identity recognition)
• Urban planning (e.g., 3D dynamic scene visualization)
• Environmental monitoring (e.g., air quality)

Pre-Attentive and Attentive Sensing (with S. Prince, Y. Hou, M. Sizinitsev, E. Olevskey)
[Figure labels: foveal image (pan, tilt); wide-field image.]

Homographic fusion of attentive and pre-attentive streams
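
A homography is a 3x3 projective mapping between two image planes. The sketch below, in Python with NumPy, shows how foveal pixel coordinates could be mapped into the wide-field image once such a matrix is known. The matrix values, the image size, and the name apply_homography are hypothetical illustrations and are not taken from the slides.

```python
import numpy as np

def apply_homography(H, points):
    """Map an (N, 2) array of pixel coordinates through a 3x3 homography H."""
    pts = np.asarray(points, dtype=float)
    homogeneous = np.hstack([pts, np.ones((pts.shape[0], 1))])  # (N, 3)
    mapped = homogeneous @ np.asarray(H, dtype=float).T
    return mapped[:, :2] / mapped[:, 2:3]  # divide out the projective scale

# Hypothetical homography: the foveal image is 10x finer than the wide-field
# image and its origin lands at wide-field pixel (120, 80).
H_fovea_to_wide = np.array([[0.1, 0.0, 120.0],
                            [0.0, 0.1, 80.0],
                            [0.0, 0.0, 1.0]])

# Corners of an assumed 640 x 480 foveal frame, mapped into the wide-field image.
foveal_corners = np.array([[0, 0], [640, 0], [640, 480], [0, 480]])
wide_corners = apply_homography(H_fovea_to_wide, foveal_corners)
```

In practice the matrix would be estimated from corresponding points in the two streams and updated as the attentive sensor pans and tilts.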

Wide-Field Body Detection. Min: 15x2 pixels; Max: 98x78 pixels; Median: 52x14 pixels.

Wide-Field Face Detection. Min: 2x2 pixels; Max: 34x31 pixels; Median: 6x6 pixels.

Detecting people in realistic environments

Biological vision?

Motion scaling From Johnston & Wright, 1986

Biological Motion From Ikeda, Blake & Watanabe, 2005

Structural Coherence (with L. Velisavljevic)
Psychophysical Method
[Trial sequence figure; display durations: 1000 ms, 59 ms, 506 ms, then until response.]

Image Conditions: Scrambled or Coherent; Colour or Monochrome

Results
[Bar chart: % correct (axis 58 to 82), data and model, for four conditions: Colour Coherent, Colour Incoherent, BW Coherent, BW Incoherent.]

Spatial Coherence
[Plot: percent correct (axis 50 to 90) vs. mean distance from fixation (3° to 18°); series: Unscrambled and Scrambled, Colour and Monochromatic.]

Summary
Pre-Attentive (Peripheral) Vision:
• Motion discrimination
• Colour discrimination
• Biological motion
• Contour integration
• Coherent structure

Preattentive System Design
[Block diagram: three parallel channels (Motion, Foreground, Skin). In each channel: raw pixel → pixel model → pixel posterior → spatial integrator → region model → region response → region likelihood ratio. The three region likelihood ratios are combined with the system priors (marked X in the diagram) to give the system posterior.]
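
One common way to combine per-channel likelihood ratios with a prior is to add their logs to the prior log-odds. The Python sketch below does this under the assumption that the motion, foreground, and skin channels are conditionally independent. The slides do not state that assumption, and the name system_posterior is hypothetical.

```python
import numpy as np

def system_posterior(log_likelihood_ratios, prior):
    """Combine per-channel region log likelihood ratios with a prior.

    log_likelihood_ratios: array of shape (n_channels, ...), one map per channel.
    prior: array broadcastable to the map shape, with values strictly in (0, 1).
    Assumes the channels are conditionally independent (an assumption made
    for this sketch, not something stated in the slides).
    """
    llr = np.asarray(log_likelihood_ratios, dtype=float)
    prior = np.asarray(prior, dtype=float)
    joint_llr = llr.sum(axis=0)                       # product of ratios, in log form
    log_odds = joint_llr + np.log(prior) - np.log1p(-prior)
    return 1.0 / (1.0 + np.exp(-log_odds))            # logistic: odds back to probability
```

With all log likelihood ratios at zero the posterior equals the prior, which is a quick sanity check on the combination rule.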

Priors as Attentive Feedback
[Block diagram components: likelihood, spatial prior, prior, posterior, non-max suppression, random sampler, gaze control, gaze command, attentive sensor, high-resolution face detection, confirmed face location, mean body indicator, motion kernel.]
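
The diagram names a random sampler that turns the posterior into a gaze command. A minimal Python sketch of that single step follows, assuming the posterior is a 2D map over wide-field pixels and that a fixation is drawn with probability proportional to it. The function name and the omission of non-max suppression and of the feedback path are simplifications made here, not details from the slides.

```python
import numpy as np

def sample_fixation(posterior_map, rng):
    """Draw one (row, col) fixation with probability proportional to the map."""
    p = np.asarray(posterior_map, dtype=float)
    weights = p.ravel() / p.sum()              # normalise to a distribution over pixels
    flat_index = rng.choice(p.size, p=weights)
    row, col = np.unravel_index(flat_index, p.shape)
    return int(row), int(col)

# Usage: a toy 4 x 5 posterior with all of its mass on one pixel.
rng = np.random.default_rng(0)
toy_posterior = np.zeros((4, 5))
toy_posterior[2, 3] = 0.7
fixation = sample_fixation(toy_posterior, rng)
```

Sampling, rather than always fixating the maximum, lets lower-probability regions receive an occasional look.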

Pixel Posteriors
[Figure panels: original frame; Motion, Foreground, and Skin pixel posterior maps, each on a 0 to 1 scale.]

Spatial Integration

Spatial Integration
[Plot: area under ROC curve (axis 0.70 to 0.86) vs. exponent g (10^-1 to 10^1, log scale), for the Motion, Foreground, and Skin channels.]

Spatial Integration
[Figure panels: Motion, Foreground, Skin, and Joint region log likelihood ratio maps, each on a -4 to 4 scale.]

Combining Detectors
[ROC plot: p(Hit) vs. p(False Positive), both 0 to 1, for Foreground (13 x 20), Skin (4 x 5), Motion (20 x 20), Combined, and Xiong & Jaynes.]
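
Detectors compared on an ROC plot are often summarised by the area under the curve. The Python sketch below computes that area from detector scores using the rank (Mann-Whitney) formulation: the probability that a randomly chosen positive scores higher than a randomly chosen negative, with ties counted as one half. The function name and the example scores are hypothetical and are not data from the slides.

```python
import numpy as np

def roc_auc(scores_positive, scores_negative):
    """Area under the ROC curve from scores of positive and negative examples."""
    pos = np.asarray(scores_positive, dtype=float)[:, None]   # column vector
    neg = np.asarray(scores_negative, dtype=float)[None, :]   # row vector
    # Compare every positive with every negative; ties contribute one half.
    wins = (pos > neg) + 0.5 * (pos == neg)
    return float(wins.mean())

# Usage with made-up scores: perfectly separated classes give an area of 1.
auc_separated = roc_auc([0.9, 0.8], [0.1, 0.2])
```

An area of 0.5 corresponds to chance performance.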

Performance
System evaluation on distinct test database:
• 74% of fixations capture human heads
• 83% of people are fixated at least once

Automatically Confirmed High-Resolution Faces

3D POSE PROBLEM
• Capture training and test database.
• Horizontal pose (known) varies over 180 degrees.
• Pose for each image known precisely.
• Points on each face identified.
• Image regions extracted.
• Features are weighted sums of pixels in region.

An Alternate Approach: 2D to 3D (with VisionSphere Technologies)

Simon Prince

Attentive People Finding
• Realistic environments and behaviour → hard problem.
• Humans: primitive mechanisms are preserved in periphery, more complex mechanisms are not.
• Our approach: probabilistic combination of simple, weak cues.
• Ongoing work: attentive feedback.

Colour Scaling From Rovamo & Iivanainen, 1991

Contour Integration From Hess & Dakin, 1999

Interactive Attentive Sensing Needed: Fast Saccadic Programming Algorithms!


3D Hugh

Sal Khan (VisionSphere)

SUMMARY
EIGEN-LIGHTFIELDS (Gross, Matthews, Baker) < INVARIANCE (Prince, Elder) << 3D MODEL (Blanz et al.)
A supervised method to make a feature set more invariant to a known nuisance parameter.
Fast; no knowledge of faces; no knowledge of 3D transformations.
Slower; uses lots of domain-specific knowledge; better results.

Algorithm Summary
TO TRAIN:
• Estimate mean and covariance of manifold as a function of distractor variable.
• Alternately estimate: invariant vectors Ci; transformations F1..n.
TO CALCULATE INVARIANT VECTORS:
• Estimate nuisance value, v.
• Transform by appropriate Fv.
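
The alternating estimation step can be illustrated with a simple linear model. The Python sketch below assumes each feature vector is approximately F[v] @ C[i], where i indexes identity and v indexes a discretised nuisance value, and fits both factors by alternating least squares. The linear model, the least-squares cost, the function names, and the assumption that v is already known are all choices made for this sketch. The slides do not specify them, and the mean and covariance step used to estimate v is not shown.

```python
import numpy as np

def fit_invariant_model(X, dim, n_iters=20, seed=0):
    """Fit X[i, v] ~= F[v] @ C[i] by alternating least squares.

    X: array of shape (n_ids, n_bins, D), one feature vector per identity
       per nuisance bin (hypothetical training layout).
    Returns F with shape (n_bins, D, dim) and C with shape (n_ids, dim).
    """
    n_ids, n_bins, D = X.shape
    rng = np.random.default_rng(seed)
    F = rng.standard_normal((n_bins, D, dim))
    for _ in range(n_iters):
        # Fix F: solve for every invariant vector C[i] using all bins at once.
        F_stack = F.reshape(n_bins * D, dim)
        X_stack = X.reshape(n_ids, n_bins * D).T
        C = np.linalg.lstsq(F_stack, X_stack, rcond=None)[0].T
        # Fix C: solve for each transformation F[v] using all identities.
        for v in range(n_bins):
            F[v] = np.linalg.lstsq(C, X[:, v, :], rcond=None)[0].T
    return F, C

def invariant_vector(F, x, v):
    """Map feature vector x, observed at nuisance bin v, to an invariant vector."""
    return np.linalg.lstsq(F[v], x, rcond=None)[0]
```

Under this model, two images of the same identity seen at different nuisance values should map to nearly the same invariant vector.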

Attentive Snapshots

PROBLEM STATEMENT
Problem: Image variation due to nuisance parameters such as pose change is greater than variation due to identity. This is reflected in most “features”.
[Figure: Feature Space]

GOAL: Decompose Conventional Feature Vector to Invariant Feature + Nuisance Parameter
[Diagram: CONVENTIONAL FEATURE VECTORS X1 and X2, with NUISANCE PARAMETERS f1,q1 and f2,q2, map to a single INVARIANT VECTOR C.]

TOY DATA SET – IN PLANE ORIENTATION
• TRAINING IMAGES – angle known, several images of each face present
• TEST IMAGES – angle unknown
• PROBE IMAGE – angle unknown
Choice of features: first few EIGENVECTORS
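
Features built from the first few eigenvectors can be computed with a standard principal component analysis. The Python sketch below assumes each image has been flattened to a row vector. The function name and the array layout are hypothetical, and the slides do not say how many eigenvectors were kept.

```python
import numpy as np

def eigen_features(images, k):
    """Project images onto the first k eigenvectors of their sample covariance.

    images: array of shape (n_images, n_pixels), one flattened image per row.
    Returns (mean, basis, features) with shapes (n_pixels,), (k, n_pixels),
    and (n_images, k).
    """
    X = np.asarray(images, dtype=float)
    mean = X.mean(axis=0)
    centred = X - mean
    # Rows of Vt are covariance eigenvectors, ordered by decreasing variance.
    _, _, Vt = np.linalg.svd(centred, full_matrices=False)
    basis = Vt[:k]
    return mean, basis, centred @ basis.T
```

A new image would be projected with (image - mean) @ basis.T, using the mean and basis estimated from the training images.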

THE FIRST TWO FEATURE DIMENSIONS
[Plot: X1 vs. X2, with an arrow labelled "Increasing q".]

ESTIMATE NUISANCE PARAMETER
[Plot: X1 vs. X2.]
