This paper presents a method for detecting and tracking people in stereo images using adaptive plan-view statistical templates. The method provides accurate physical locations in real units and is suitable for use in arbitrary environments. The paper also discusses the advantages of using plan-view images over real overhead camera views.
Stereo Person Tracking with Adaptive Plan-View Statistical Templates Michael Harville HP Laboratories Palo Alto, CA, United States
Person Detection and Tracking: Motivation • Fundamental technology enabling many apps in pervasive computing and intelligent environments • Automatic personal diary / memory aid • Computer/phone/speakers/lights moving with person • HCI/PUI • Usually, need to find person before analyzing face, gestures, etc. • Activity-monitoring and surveillance • Security • Shopper behavior in retail stores • Video coding, indexing, compression • Special treatment for the people in the scene
Why Vision? • No special equipment, clothing, or behaviors required of user • People are passive participants, not active drivers. No special effort needed. • Works on everyone, not just the “special” ones • Video is a rich (the richest?) source of information for tasks beyond person tracking • Provides information not just for detection and tracking, but also for identification, activity analysis, mood, etc. • How many active sensors can the world stand?
Goals of Method • Detect people and track their locations in space • Provide physical locations in real units (e.g. meters) • Handle multiple people, complex behavior • Arbitrary environments • Compact tracking unit, easy setup • Real-time
Key Contributions • New substrate of image statistics on which to do tracking • Transformations and refinement of raw, dense “camera-view” depth images to “plan-view” • Suitable for use with many different tracking techniques • Tracking framework based on adaptive templates • Better use of plan-view features • Can be used with other plan-view image substrates • Methods for avoiding typical adaptive template problems
Outline • Introduction and Motivation for Plan-View Maps • Plan-View Map Construction • Tracking Method • Implementation and Results
Input: Color and Depth from Stereo Unit • Spatially- and temporally-registered color + depth
Real-time Stereo Becoming Practical • Tyzx ( www.tyzx.com; from Interval ) • ASIC costs <$5 in volume, uses little power • Point Grey Digiclops ( www.ptgrey.com ) • SRI Small Vision System ( www.videredesign.com ) • 3DV Systems Zcam ( www.3dvsystems.com ) • Canesta ( www.canesta.com ) • Sarnoff Acadia vision processor
Tracking in “Camera View” with Depth • Depth helpful in many ways: • Powerful cue for foreground segmentation • Gives physical size and shape information • Allows for better occlusion detection and handling • Provides new types of features for tracking • Provides third dimension of prediction in tracking • Several recent papers have illustrated this: • Eveland & Konolige (1997): depth only, single person • Darrell et al. (1998): color+depth, multi-person • Haritaoglu et al. (1998): W4S • Beymer & Konolige (1999); Krumm et al. (2000): multi-camera
Problem: Depth Images Very Noisy! • Unreliable depth in areas of low visual texture • Poor depth contour accuracy • For static scene: std. dev. of depth at a pixel typically 10% of mean or more
A Solution: Use Depth to Render New Views • Depth image coordinate and value (u, v, D) + camera calibration params → 3D scene location (X, Y, Z) • Construct 3D point cloud of “interesting” part of image (e.g. foreground, people). • Render images of statistics of this point cloud, from new view points and with arbitrary projection models.
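A minimal sketch of the (u, v, D) → (X, Y, Z) back-projection, assuming a standard pinhole model; the focal lengths fu, fv, principal point (cu, cv), and function name are illustrative, not the paper's implementation.

```python
import numpy as np

def back_project_image(depth, mask, fu, fv, cu, cv):
    """Back-project every valid foreground depth pixel into a 3D point cloud,
    assuming a pinhole camera with focal lengths (fu, fv) in pixels and
    principal point (cu, cv). Points are in camera coordinates, in the same
    units as the depth values (e.g. cm)."""
    v, u = np.nonzero(mask & (depth > 0))      # foreground pixels with valid depth
    Z = depth[v, u].astype(np.float64)
    X = (u - cu) * Z / fu
    Y = (v - cv) * Z / fv
    return np.stack([X, Y, Z], axis=1)         # (N, 3) point cloud
```

To place the cloud in world coordinates (ground plane at height zero, vertical axis up), the points would then be transformed by the camera's extrinsic calibration.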
“Plan-View” Statistical Images • Virtual overhead view, with orthographic projection • Easier, more reliable separation of people
“Plan-View” Statistical Images • Stereo camera produces registered color and depth • Background model → foreground (in color + depth) • Use depth + camera calibration to do 3D back-projection → 3D point cloud • Quantize space into 3D vertical bins • Plan-view projection: image of one statistic per vertical bin
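A minimal sketch of the vertical-bin quantization, assuming the point cloud is already in world coordinates with height above the ground as its third component; the bin size, map extent, and function name are illustrative choices.

```python
import numpy as np

def quantize_to_bins(points, bin_size=2.0, x_range=(-500.0, 500.0), y_range=(-500.0, 500.0)):
    """Assign each world-coordinate point (X, Y, height) to a vertical bin
    on the ground plane. Returns per-point (row, col) bin indices, the points
    that fall inside the map, and the plan-view map shape."""
    n_rows = int((y_range[1] - y_range[0]) / bin_size)
    n_cols = int((x_range[1] - x_range[0]) / bin_size)
    cols = ((points[:, 0] - x_range[0]) / bin_size).astype(int)
    rows = ((points[:, 1] - y_range[0]) / bin_size).astype(int)
    inside = (rows >= 0) & (rows < n_rows) & (cols >= 0) & (cols < n_cols)
    return rows[inside], cols[inside], points[inside], (n_rows, n_cols)
```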
Why Not Just Use a Real Overhead Camera? • Sometimes, there is no “ceiling”! • For example, outdoors • Cannot see faces easily, yet face visibility is desirable in many applications that employ person tracking • Also…
Advantages Over a Real Overhead Camera • Real camera perspective projection • Along image periphery (most of image), projection axis far from parallel to ground normal; much inter-person occlusion Orthographic projection better
Advantages Over a Real Overhead Camera • Overhead camera typically sacrifices on ground coverage (particularly when ceiling is low)
Outline • Introduction and Motivation for Plan-View Maps • Plan-View Map Construction • Tracking Method • Implementation and Results
What (Vertical Bin) Statistic to Image? • Count of 3D points in each bin → plan-view “occupancy” or “density” maps
Scaling Occupancy to Get Surface Area • Scale each increment to the occupancy map by Z² / (fu · fv) • Occupancy map now represents object surface area visible to the camera, in real units (e.g. cm²) • Occupancy map representations now invariant to distance from camera (except for noise) • Intuition: an imager pixel of a camera with focal lengths fu, fv subtends a real-world area of roughly Z²/(fu·fv) at distance Z from the camera's center of projection
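A minimal sketch of accumulating the surface-area-scaled occupancy map for the binned points; here Z_cam is each point's distance from the camera (its depth), and the use of np.add.at for the accumulation is an implementation choice, not from the paper.

```python
import numpy as np

def occupancy_map(rows, cols, Z_cam, map_shape, fu, fv):
    """Plan-view occupancy map in which each foreground point contributes
    Z^2 / (fu * fv) -- the real-world area its pixel subtends at depth Z --
    so every bin holds the visible surface area (e.g. cm^2) it contains."""
    occ = np.zeros(map_shape, dtype=np.float64)
    np.add.at(occ, (rows, cols), (Z_cam ** 2) / (fu * fv))
    return occ
```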
Plan-View Occupancy Maps • Applied to person tracking by • Interval researchers (1999) - unpublished • Beymer (2000) • Darrell et al. (2001) • Advantages • Good indicator of where people are likely to be • Disadvantages • Discards shape information in dimension normal to ground • Occupancy statistical representations of people are very sensitive to partial occlusions
An Alternative Statistic: Maximum Height • Z-coordinate (height above ground) of highest point in each bin → plan-view height maps
Height Map Computation Notes • Can be done in a single pass through depth image data • Ignore data at heights above some Hmax that is reasonable for people • Scene ground need not be planar • Add in height offset map Ho constructed from background model depth
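A minimal sketch of the single-pass height map, following the notes above; the ceiling Hmax and ground-offset map Ho come from the slide, while the array-based formulation, default values, and function name are assumptions.

```python
import numpy as np

def height_map(rows, cols, heights, map_shape, h_max=230.0, H_o=None):
    """Plan-view height map: record, per vertical bin, the height above ground
    of the highest foreground point, ignoring data above h_max (a reasonable
    ceiling for people, e.g. 230 cm). H_o is an optional per-bin height offset
    map built from the background model depth, for non-planar ground."""
    h = heights.astype(np.float64)
    if H_o is not None:
        h = h - H_o[rows, cols]                 # height relative to local ground
    keep = h <= h_max                           # discard implausibly high points
    H = np.zeros(map_shape, dtype=np.float64)
    np.maximum.at(H, (rows[keep], cols[keep]), h[keep])
    return H
```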
Plan-View Height Maps • Not previously applied to person-tracking • But used in other contexts: path-planning for Mars rover, military target recognition • Advantages • Preserves about as much 3D shape as possible in a 2D image • Fast computation (e.g. compared to 90th percentile height) • For high camera mounts and typical environments, height map statistical representations of people are less affected by partial occlusions. • Disadvantages • Very sensitive to depth noise • Easy to confuse person upper body with small foreground objects placed at the same height
Can We Combine Them and Get the Best of Both? • Idea: Restrict use of height data to map locations where we believe something “significant” is present, as indicated by the local occupancy data.
Plan-View Map Refinement • Occupancy: Oraw → smooth → Osm → threshold → Othresh • Height: Hraw → smooth → Hsm → mask (with Othresh) → Hmasked
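A minimal sketch of this refinement stage, assuming Gaussian smoothing and a fixed occupancy threshold; the kernel width, threshold value, and use of scipy are illustrative choices, not the paper's parameters.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def refine_maps(O_raw, H_raw, sigma=1.5, occ_thresh=150.0):
    """Refine the raw plan-view maps:
      occupancy: O_raw -> smooth -> O_sm -> threshold -> O_thresh
      height:    H_raw -> smooth -> H_sm -> mask with O_thresh -> H_masked"""
    O_sm = gaussian_filter(O_raw, sigma)
    O_thresh = np.where(O_sm >= occ_thresh, O_sm, 0.0)    # keep "significant" bins only
    H_sm = gaussian_filter(H_raw, sigma)
    H_masked = np.where(O_thresh > 0.0, H_sm, 0.0)        # trust height only where occupied
    return O_thresh, H_masked
```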
Height Map: Before and After [figure: raw height map vs. masked, smoothed height map]
Example Plan-View Map Data [figure: Oraw, Hraw, Othresh, Hmasked]
Statistical Substrate for Tracking • Othresh (Oraw → smooth → threshold): object surface area visible to the camera • Hmasked (Hraw → smooth → mask with Othresh): object shape, as viewed from above
Outline • Introduction and Motivation for Plan-View Maps • Plan-View Map Construction • Tracking Method • Implementation and Results
How to Track People in this Feature Space? Two important choices to make: • Person model • Tracking method
Options for a Person Model • “Blob” / connected component • Not very descriptive; pray for good person separation and/or an excellent tracking framework. • Good ol’ Gaussian • Tried and true, lots of techniques and algorithms based on it from which to draw ideas • But not a very good use of our feature data • Fixed template(s) • For instance, use common shape(s) of head+shoulders in a height map • People of shapes or in poses inconsistent with template(s) will not be tracked well
Our Person Model: Adaptive Templates • Use patches of the plan-view statistical image data itself as the model TH (height template) TO (occupancy template)
Adaptive Templates (continued) • Allow model to evolve as person changes pose or becomes (dis)occluded: update the templates from the image data • Still need initialization criterion to decide that a patch of plan-view image data is a person • Currently: • significant occupancy (at least half a person’s worth) • max height above some reasonable minimum for people • not a completely static object (according to inter-frame diffs) • Future: Compare plan-view data to “person-like” templates learned from training
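A minimal sketch of the current initialization test, combining the three criteria listed above; the specific threshold values and the motion-fraction formulation are illustrative assumptions.

```python
import numpy as np

def looks_like_new_person(occ_patch, height_patch, motion_patch,
                          min_surface_area=4500.0, min_height=100.0,
                          min_moving_frac=0.2):
    """Decide whether a candidate plan-view patch should start a new track:
    - significant occupancy (roughly half a person's visible surface, cm^2)
    - maximum height above a reasonable minimum for people (cm)
    - not a completely static object (fraction of bins changed between frames)"""
    enough_surface = occ_patch.sum() >= min_surface_area
    tall_enough = height_patch.max() >= min_height
    moving = (motion_patch > 0).mean() >= min_moving_frac
    return enough_surface and tall_enough and moving
```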
How to Track People in this Feature Space? Two important choices to make: • Person model • Tracking method
Tracking Method: Simple Kalman Filter-Based Approach • Loop (Prediction → Measurement → Update of state) for each person individually on each frame, in order of tracking confidence (equal to inverse of Kalman variance in location estimate) • Prediction: constant velocity, no template change • Measurement: find image location that minimizes match energy; measurements are data from match location • Update: standard Kalman update for position, velocity; update templates directly from image data (faster) • Fast frame-rate desirable!
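A minimal sketch of a constant-velocity Kalman filter for one plan-view coordinate, of the kind the framework above assumes (one filter per coordinate, with only the matched position measured); the noise and initial-covariance values are illustrative, not the paper's.

```python
import numpy as np

class ConstantVelocityKF:
    """Constant-velocity Kalman filter for one plan-view coordinate.
    State x = [position, velocity]; only the position is measured, as the
    matched plan-view location on each frame."""
    def __init__(self, pos, q=1.0, r=4.0):
        self.x = np.array([pos, 0.0])
        self.P = np.eye(2) * 10.0                       # state covariance
        self.F = np.array([[1.0, 1.0], [0.0, 1.0]])     # constant-velocity transition
        self.H = np.array([[1.0, 0.0]])                 # observe position only
        self.Q = np.eye(2) * q                          # process noise (illustrative)
        self.R = np.array([[r]])                        # measurement noise (illustrative)

    def predict(self):
        self.x = self.F @ self.x
        self.P = self.F @ self.P @ self.F.T + self.Q
        return self.x[0]                                # predicted position

    def update(self, z):
        y = z - self.H @ self.x                         # innovation
        S = self.H @ self.P @ self.H.T + self.R
        K = self.P @ self.H.T @ np.linalg.inv(S)        # Kalman gain
        self.x = self.x + (K @ y).ravel()
        self.P = (np.eye(2) - K @ self.H) @ self.P
```

The tracking confidence used to order people can then be read off as the inverse of the positional variance P[0, 0].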
Match Energy Minimization • Match energy for the ith person combines three terms: surface area difference (occupancy), shape difference (height), and distance from the predicted location • Search in restricted image area • centered around predicted location • size determined by positional uncertainty • Do not match multiple people to the same place
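A minimal sketch of the search: the energy below combines the three terms named on the slide, but the particular distance measures and weights are assumptions, as is skipping locations already claimed by more confident people so that two tracks never match the same place.

```python
import numpy as np

def match_energy(T_O, T_H, O, H, r, c, pred, w_occ=1.0, w_shape=1.0, w_dist=0.1):
    """Energy of placing a person's templates at plan-view location (r, c):
    surface-area difference + shape difference + distance from prediction."""
    h, w = T_H.shape
    occ_patch, hgt_patch = O[r:r + h, c:c + w], H[r:r + h, c:c + w]
    e_occ = abs(occ_patch.sum() - T_O.sum())             # surface area difference
    e_shape = np.abs(hgt_patch - T_H).mean()             # shape difference
    e_dist = np.hypot(r - pred[0], c - pred[1])          # distance from predicted location
    return w_occ * e_occ + w_shape * e_shape + w_dist * e_dist

def best_match(T_O, T_H, O, H, pred, radius, claimed):
    """Minimize the match energy over a window centered on the predicted
    location, with its size set by the positional uncertainty (radius);
    locations already claimed by other people are skipped."""
    h, w = T_H.shape
    best_loc, best_e = None, np.inf
    for r in range(max(0, pred[0] - radius), min(O.shape[0] - h, pred[0] + radius) + 1):
        for c in range(max(0, pred[1] - radius), min(O.shape[1] - w, pred[1] + radius) + 1):
            if (r, c) in claimed:
                continue
            e = match_energy(T_O, T_H, O, H, r, c, pred)
            if e < best_e:
                best_loc, best_e = (r, c), e
    return best_loc, best_e
```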
“Lost” People • Set a maximum on tracking match energy • If maximum exceeded, report Kalman prediction as person location • Put person on “lost people” list • Only use prediction in absence of data for limited time • Attempt to match “new” and “lost” people • For now, just check temporal and spatial nearness • Future: compare shape and color features
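A minimal sketch of the lost/new re-matching step, which for now uses only temporal and spatial nearness as the slide says; the attribute names and threshold values are hypothetical.

```python
import math

REACQUIRE_FRAMES = 150    # temporal nearness: how recently the track was lost
REACQUIRE_DIST = 30.0     # spatial nearness, in plan-view bins

def reacquire(lost_track, new_track, frame_now):
    """Decide whether a newly detected person is a previously lost one, using
    only temporal and spatial nearness (shape/color comparison is future work)."""
    dt = frame_now - lost_track.last_seen_frame
    dist = math.hypot(new_track.pos[0] - lost_track.pos[0],
                      new_track.pos[1] - lost_track.pos[1])
    return dt <= REACQUIRE_FRAMES and dist <= REACQUIRE_DIST
```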