This paper presents a method for detecting and tracking people in stereo images using adaptive plan-view statistical templates. The method provides accurate physical locations in real units and is suitable for use in arbitrary environments. The paper also discusses the advantages of using plan-view images over real overhead camera views.
Stereo Person Tracking with Adaptive Plan-View Statistical Templates Michael Harville HP Laboratories Palo Alto, CA, United States
Person Detection and Tracking: Motivation • Fundamental technology enabling many apps in pervasive computing and intelligent environments • Automatic personal diary / memory aid • Computer/phone/speakers/lights moving with person • HCI/PUI • Usually, need to find person before analyzing face, gestures, etc. • Activity-monitoring and surveillance • Security • Shopper behavior in retail stores • Video coding, indexing, compression • Special treatment for the people in the scene
Why Vision? • No special equipment, clothing, or behaviors required of user • People are passive participants, not active drivers. No special effort needed. • Works on everyone, not just the “special” ones • Video is a rich (the richest?) source of information for tasks beyond person tracking • Provides information not just for detection and tracking, but also for identification, activity analysis, mood, etc. • How many active sensors can the world stand?
Goals of Method • Detect people and track their locations in space • Provide physical locations in real units (e.g. meters) • Handle multiple people, complex behavior • Arbitrary environments • Compact tracking unit, easy setup • Real-time
Key Contributions • New substrate of image statistics on which to do tracking • Transformations and refinement of raw, dense “camera-view” depth images to “plan-view” • Suitable for use with many different tracking techniques • Tracking framework based on adaptive templates • Better use of plan-view features • Can be used with other plan-view image substrates • Methods for avoiding typical adaptive template problems
Outline • Introduction and Motivation for Plan-View Maps • Plan-View Map Construction • Tracking Method • Implementation and Results
Input: Color and Depth from Stereo Unit • Spatially- and temporally-registered color + depth
Real-time Stereo Becoming Practical • Tyzx ( www.tyzx.com; from Interval ) • ASIC costs <$5 in volume, uses little power • Point Grey Digiclops ( www.ptgrey.com ) • SRI Small Vision System ( www.videredesign.com ) • 3DV Systems Zcam ( www.3dvsystems.com ) • Canesta ( www.canesta.com ) • Sarnoff Acadia vision processor
Tracking in “Camera View” with Depth • Depth helpful in many ways: • Powerful cue for foreground segmentation • Gives physical size and shape information • Allows for better occlusion detection and handling • Provides new types of features for tracking • Provides third dimension of prediction in tracking • Several recent papers have illustrated this: • Eveland & Konolige (1997): depth only, single person • Darrell et al. (1998): color+depth, multi-person • Haritaoglu et al. (1998): W4S • Beymer & Konolige (1999); Krumm et al. (2000): multi-camera
Problem: Depth Images Very Noisy! • Unreliable depth in areas of low visual texture • Poor depth contour accuracy • For static scene: std. dev. of depth at a pixel typically 10% of mean or more
A Solution: Use Depth to Render New Views • Depth image coordinate and value (u, v, D) + camera calibration params → 3D scene location (X, Y, Z) • Construct 3D point cloud of “interesting” part of image (e.g. foreground, people). • Render images of statistics of this point cloud, from new view points and with arbitrary projection models.
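A minimal sketch of the (u, v, D) → (X, Y, Z) back-projection, assuming a standard pinhole model; the focal lengths fu, fv, principal point (cu, cv), and function name are illustrative, not the paper's implementation.

```python
import numpy as np

def back_project_image(depth, mask, fu, fv, cu, cv):
    """Back-project every valid foreground depth pixel into a 3D point cloud,
    assuming a pinhole camera with focal lengths (fu, fv) in pixels and
    principal point (cu, cv). Points are in camera coordinates, in the same
    units as the depth values (e.g. cm)."""
    v, u = np.nonzero(mask & (depth > 0))      # foreground pixels with valid depth
    Z = depth[v, u].astype(np.float64)
    X = (u - cu) * Z / fu
    Y = (v - cv) * Z / fv
    return np.stack([X, Y, Z], axis=1)         # (N, 3) point cloud
```

To place the cloud in world coordinates (ground plane at height zero, vertical axis up), the points would then be transformed by the camera's extrinsic calibration.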
“Plan-View” Statistical Images • Virtual overhead view, with orthographic projection • Easier, more reliable separation of people
“Plan-View” Statistical Images • Stereo camera produces registered color and depth • Background model → foreground (in color + depth) • Use depth + camera calibration to do 3D back-projection → 3D point cloud • Quantize space into 3D vertical bins • Plan-view projection: image of one statistic per vertical bin
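A minimal sketch of the vertical-bin quantization, assuming the point cloud is already in world coordinates with height above the ground as its third component; the bin size, map extent, and function name are illustrative choices.

```python
import numpy as np

def quantize_to_bins(points, bin_size=2.0, x_range=(-500.0, 500.0), y_range=(-500.0, 500.0)):
    """Assign each world-coordinate point (X, Y, height) to a vertical bin
    on the ground plane. Returns per-point (row, col) bin indices, the points
    that fall inside the map, and the plan-view map shape."""
    n_rows = int((y_range[1] - y_range[0]) / bin_size)
    n_cols = int((x_range[1] - x_range[0]) / bin_size)
    cols = ((points[:, 0] - x_range[0]) / bin_size).astype(int)
    rows = ((points[:, 1] - y_range[0]) / bin_size).astype(int)
    inside = (rows >= 0) & (rows < n_rows) & (cols >= 0) & (cols < n_cols)
    return rows[inside], cols[inside], points[inside], (n_rows, n_cols)
```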
Why Not Just Use a Real Overhead Camera? • Sometimes, there is no “ceiling”! • For example, outdoors • Cannot see faces easily, yet face visibility is desirable in many applications that employ person tracking • Also…
Advantages Over a Real Overhead Camera • Real camera perspective projection • Along image periphery (most of image), projection axis far from parallel to ground normal; much inter-person occlusion Orthographic projection better
Advantages Over a Real Overhead Camera • Overhead camera typically sacrifices on ground coverage (particularly when ceiling is low)
Outline • Introduction and Motivation for Plan-View Maps • Plan-View Map Construction • Tracking Method • Implementation and Results
What (Vertical Bin) Statistic to Image? • Count of 3D points in each bin → plan-view “occupancy” or “density” maps
Scaling Occupancy to Get Surface Area • Scale each increment to the occupancy map by Z² / (fu · fv) • Occupancy map now represents object surface area visible to the camera, in real units (e.g. cm²) • Occupancy map representations now invariant to distance from camera (except for noise) • Intuition: an imager pixel of a camera with focal lengths fu, fv subtends a real-world area of roughly Z²/(fu·fv) at distance Z from the camera's center of projection
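A minimal sketch of accumulating the surface-area-scaled occupancy map for the binned points; here Z_cam is each point's distance from the camera (its depth), and the use of np.add.at for the accumulation is an implementation choice, not from the paper.

```python
import numpy as np

def occupancy_map(rows, cols, Z_cam, map_shape, fu, fv):
    """Plan-view occupancy map in which each foreground point contributes
    Z^2 / (fu * fv) -- the real-world area its pixel subtends at depth Z --
    so every bin holds the visible surface area (e.g. cm^2) it contains."""
    occ = np.zeros(map_shape, dtype=np.float64)
    np.add.at(occ, (rows, cols), (Z_cam ** 2) / (fu * fv))
    return occ
```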
Plan-View Occupancy Maps • Applied to person tracking by • Interval researchers (1999) - unpublished • Beymer (2000) • Darrell et al. (2001) • Advantages • Good indicator of where people are likely to be • Disadvantages • Discards shape information in dimension normal to ground • Occupancy statistical representations of people are very sensitive to partial occlusions
An Alternative Statistic: Maximum Height • Z-coordinate (height above ground) of highest point in each bin → plan-view height maps
Height Map Computation Notes • Can be done in a single pass through depth image data • Ignore data at heights above some Hmax that is reasonable for people • Scene ground need not be planar • Add in height offset map Ho constructed from background model depth
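A minimal sketch of the single-pass height map, following the notes above; the ceiling Hmax and ground-offset map Ho come from the slide, while the array-based formulation, default values, and function name are assumptions.

```python
import numpy as np

def height_map(rows, cols, heights, map_shape, h_max=230.0, H_o=None):
    """Plan-view height map: record, per vertical bin, the height above ground
    of the highest foreground point, ignoring data above h_max (a reasonable
    ceiling for people, e.g. 230 cm). H_o is an optional per-bin height offset
    map built from the background model depth, for non-planar ground."""
    h = heights.astype(np.float64)
    if H_o is not None:
        h = h - H_o[rows, cols]                 # height relative to local ground
    keep = h <= h_max                           # discard implausibly high points
    H = np.zeros(map_shape, dtype=np.float64)
    np.maximum.at(H, (rows[keep], cols[keep]), h[keep])
    return H
```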
Plan-View Height Maps • Not previously applied to person-tracking • But used in other contexts: path-planning for Mars rover, military target recognition • Advantages • Preserves about as much 3D shape as possible in a 2D image • Fast computation (e.g. compared to 90th percentile height) • For high camera mounts and typical environments, height map statistical representations of people are less affected by partial occlusions. • Disadvantages • Very sensitive to depth noise • Easy to confuse person upper body with small foreground objects placed at the same height
Can We Combine Them and Get the Best of Both? • Idea: Restrict use of height data to map locations where we believe something “significant” is present, as indicated by the local occupancy data.
Plan-View Map Refinement • Occupancy: Oraw → smooth → Osm → threshold → Othresh • Height: Hraw → smooth → Hsm → mask (with Othresh) → Hmasked
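A minimal sketch of this refinement stage, assuming Gaussian smoothing and a fixed occupancy threshold; the kernel width, threshold value, and use of scipy are illustrative choices, not the paper's parameters.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def refine_maps(O_raw, H_raw, sigma=1.5, occ_thresh=150.0):
    """Refine the raw plan-view maps:
      occupancy: O_raw -> smooth -> O_sm -> threshold -> O_thresh
      height:    H_raw -> smooth -> H_sm -> mask with O_thresh -> H_masked"""
    O_sm = gaussian_filter(O_raw, sigma)
    O_thresh = np.where(O_sm >= occ_thresh, O_sm, 0.0)    # keep "significant" bins only
    H_sm = gaussian_filter(H_raw, sigma)
    H_masked = np.where(O_thresh > 0.0, H_sm, 0.0)        # trust height only where occupied
    return O_thresh, H_masked
```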
Height Map: Before and After [figure: raw height map vs. masked, smoothed height map]
Example Plan-View Map Data [figure: Oraw, Hraw, Othresh, Hmasked]
Statistical Substrate for Tracking • Othresh (Oraw → smooth → threshold): object surface area visible to the camera • Hmasked (Hraw → smooth → mask with Othresh): object shape, as viewed from above
Outline • Introduction and Motivation for Plan-View Maps • Plan-View Map Construction • Tracking Method • Implementation and Results
How to Track People in this Feature Space? Two important choices to make: • Person model • Tracking method
Options for a Person Model • “Blob” / connected component • Not very descriptive; pray for good person separation and/or an excellent tracking framework. • Good ol’ Gaussian • Tried and true, lots of techniques and algorithms based on it from which to draw ideas • But not a very good use of our feature data • Fixed template(s) • For instance, use common shape(s) of head+shoulders in a height map • People of shapes or in poses inconsistent with template(s) will not be tracked well
Our Person Model: Adaptive Templates • Use patches of the plan-view statistical image data itself as the model TH (height template) TO (occupancy template)
Adaptive Templates (continued) • Allow model to evolve as person changes pose or becomes (dis)occluded: update the templates from the image data • Still need initialization criterion to decide that a patch of plan-view image data is a person • Currently: • significant occupancy (at least half a person’s worth) • max height above some reasonable minimum for people • not a completely static object (according to inter-frame diffs) • Future: Compare plan-view data to “person-like” templates learned from training
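A minimal sketch of the current initialization test, combining the three criteria listed above; the specific threshold values and the motion-fraction formulation are illustrative assumptions.

```python
import numpy as np

def looks_like_new_person(occ_patch, height_patch, motion_patch,
                          min_surface_area=4500.0, min_height=100.0,
                          min_moving_frac=0.2):
    """Decide whether a candidate plan-view patch should start a new track:
    - significant occupancy (roughly half a person's visible surface, cm^2)
    - maximum height above a reasonable minimum for people (cm)
    - not a completely static object (fraction of bins changed between frames)"""
    enough_surface = occ_patch.sum() >= min_surface_area
    tall_enough = height_patch.max() >= min_height
    moving = (motion_patch > 0).mean() >= min_moving_frac
    return enough_surface and tall_enough and moving
```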
How to Track People in this Feature Space? Two important choices to make: • Person model • Tracking method
Tracking Method: Simple Kalman Filter-Based Approach • Loop (Prediction → Measurement → Update of state) for each person individually on each frame, in order of tracking confidence (equal to inverse of Kalman variance in location estimate) • Prediction: constant velocity, no template change • Measurement: find image location that minimizes match energy; measurements are data from match location • Update: standard Kalman update for position, velocity; update templates directly from image data (faster) • Fast frame-rate desirable!
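A minimal sketch of a constant-velocity Kalman filter for one plan-view coordinate, of the kind the framework above assumes (one filter per coordinate, with only the matched position measured); the noise and initial-covariance values are illustrative, not the paper's.

```python
import numpy as np

class ConstantVelocityKF:
    """Constant-velocity Kalman filter for one plan-view coordinate.
    State x = [position, velocity]; only the position is measured, as the
    matched plan-view location on each frame."""
    def __init__(self, pos, q=1.0, r=4.0):
        self.x = np.array([pos, 0.0])
        self.P = np.eye(2) * 10.0                       # state covariance
        self.F = np.array([[1.0, 1.0], [0.0, 1.0]])     # constant-velocity transition
        self.H = np.array([[1.0, 0.0]])                 # observe position only
        self.Q = np.eye(2) * q                          # process noise (illustrative)
        self.R = np.array([[r]])                        # measurement noise (illustrative)

    def predict(self):
        self.x = self.F @ self.x
        self.P = self.F @ self.P @ self.F.T + self.Q
        return self.x[0]                                # predicted position

    def update(self, z):
        y = z - self.H @ self.x                         # innovation
        S = self.H @ self.P @ self.H.T + self.R
        K = self.P @ self.H.T @ np.linalg.inv(S)        # Kalman gain
        self.x = self.x + (K @ y).ravel()
        self.P = (np.eye(2) - K @ self.H) @ self.P
```

The tracking confidence used to order people can then be read off as the inverse of the positional variance P[0, 0].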
Match Energy Minimization • Match energy for the ith person combines three terms: surface area difference (occupancy), shape difference (height), and distance from the predicted location • Search in restricted image area • centered around predicted location • size determined by positional uncertainty • Do not match multiple people to the same place
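A minimal sketch of the search: the energy below combines the three terms named on the slide, but the particular distance measures and weights are assumptions, as is skipping locations already claimed by more confident people so that two tracks never match the same place.

```python
import numpy as np

def match_energy(T_O, T_H, O, H, r, c, pred, w_occ=1.0, w_shape=1.0, w_dist=0.1):
    """Energy of placing a person's templates at plan-view location (r, c):
    surface-area difference + shape difference + distance from prediction."""
    h, w = T_H.shape
    occ_patch, hgt_patch = O[r:r + h, c:c + w], H[r:r + h, c:c + w]
    e_occ = abs(occ_patch.sum() - T_O.sum())             # surface area difference
    e_shape = np.abs(hgt_patch - T_H).mean()             # shape difference
    e_dist = np.hypot(r - pred[0], c - pred[1])          # distance from predicted location
    return w_occ * e_occ + w_shape * e_shape + w_dist * e_dist

def best_match(T_O, T_H, O, H, pred, radius, claimed):
    """Minimize the match energy over a window centered on the predicted
    location, with its size set by the positional uncertainty (radius);
    locations already claimed by other people are skipped."""
    h, w = T_H.shape
    best_loc, best_e = None, np.inf
    for r in range(max(0, pred[0] - radius), min(O.shape[0] - h, pred[0] + radius) + 1):
        for c in range(max(0, pred[1] - radius), min(O.shape[1] - w, pred[1] + radius) + 1):
            if (r, c) in claimed:
                continue
            e = match_energy(T_O, T_H, O, H, r, c, pred)
            if e < best_e:
                best_loc, best_e = (r, c), e
    return best_loc, best_e
```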
“Lost” People • Set a maximum on tracking match energy • If maximum exceeded, report Kalman prediction as person location • Put person on “lost people” list • Only use prediction in absence of data for limited time • Attempt to match “new” and “lost” people • For now, just check temporal and spatial nearness • Future: compare shape and color features
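A minimal sketch of the lost/new re-matching step, which for now uses only temporal and spatial nearness as the slide says; the attribute names and threshold values are hypothetical.

```python
import math

REACQUIRE_FRAMES = 150    # temporal nearness: how recently the track was lost
REACQUIRE_DIST = 30.0     # spatial nearness, in plan-view bins

def reacquire(lost_track, new_track, frame_now):
    """Decide whether a newly detected person is a previously lost one, using
    only temporal and spatial nearness (shape/color comparison is future work)."""
    dt = frame_now - lost_track.last_seen_frame
    dist = math.hypot(new_track.pos[0] - lost_track.pos[0],
                      new_track.pos[1] - lost_track.pos[1])
    return dt <= REACQUIRE_FRAMES and dist <= REACQUIRE_DIST
```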