Using a Webcam as a Game Controller Jonathan Blow GDC 2002
Motivation • A potentially rich control paradigm, allowing for nuance. • Removes the barrier of some funny plastic controller. • Successful experiment: Konami’s Police 911
My game: Air Guitar • A “beat-matching” game where you stand and play air guitar to your favorite songs. • Previous beat-matching games (Parappa, DDR) are very digital; I want to use a webcam to make Air Guitar more organic and to allow the user to be expressive. • Technically demanding as a vision app (needs semantics about what is what).
Real-World Concerns • Noise • Illumination changes • Camera auto-adjusts • Background changes / camera moves • Shadows • Camera saturation / under-excitement
Varying Lighting Conditions • Can’t rely on RGB values to identify pixels • Need context… hmm… this becomes a hard AI problem.
Vision Techniques That Suck • Background subtraction (shadows, motion!) • Noise reduction by smoothing (resolution!) • Turning functions (unstable) • Frame coherence (just a band-aid) • Edge detection • Hysteresis (Latin for “cheap hack”) • Discreteness
General Paradigm • Technique should: • Work on a still image • Be robust: avoid discrete decisions wherever possible. • Work in as general a case as we can manage, but we won’t strive to be ideally general. • We will do “whatever it takes” to get the job done.
Restrained Ambition • Only trying to roughly determine the positions of torso and arms • Okay to say “the user must wear a long-sleeved shirt of uniform color that contrasts with the background” • We won’t dictate the color of the shirt (too restrictive!) • We won’t dictate colors of other things (user’s skin, background).
Early Segmentation • Divide up the image into regions of “like” pixels to ease computation. • Ad hoc technique: iterate over scanlines potentially adding each pixel to its neighbor’s group. • This technique sucks.
The Unreasonable Instability of Approximate Clustering • “Real” clustering is slow • “Loose” clustering is interactively unstable • Even just the small amount of camera noise makes things go berserk… motion is even worse. • Clustering is about continuous ==> discrete. We wanted to avoid that so we should be very careful.
My solution: Be Inflexible • Simply divide the image into square regions of constant size. • If any region needs more detail, subdivide it. • Noise still affects this system (some regions subdividing / recombining from frame to frame) but it’s relatively stable.
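The fixed-grid idea above can be sketched as follows. This is a hypothetical illustration, not the talk's actual code: start from constant-size squares and split any square whose pixels vary too much. `REGION_SIZE` and the variance threshold are my assumptions.

```python
VAR_THRESHOLD = 40.0    # per-channel variance threshold (assumed)

def region_variance(pixels):
    """Mean per-channel variance of a list of (r, g, b) tuples."""
    n = len(pixels)
    total = 0.0
    for ch in range(3):
        vals = [p[ch] for p in pixels]
        mean = sum(vals) / n
        total += sum((v - mean) ** 2 for v in vals) / n
    return total / 3.0

def subdivide(image, x, y, size, min_size=4):
    """Return (x, y, size) regions covering the square at (x, y).
    image is a dict keyed by (x, y) holding (r, g, b) pixels."""
    pixels = [image[(i, j)] for i in range(x, x + size)
                            for j in range(y, y + size)]
    if size <= min_size or region_variance(pixels) <= VAR_THRESHOLD:
        return [(x, y, size)]
    half = size // 2
    regions = []
    for dx in (0, half):
        for dy in (0, half):
            regions += subdivide(image, x + dx, y + dy, half, min_size)
    return regions
```

A flat image stays one region; a high-contrast one splits, but only where the detail is, which is what keeps the scheme comparatively stable under noise.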
Which color space do we work in? • Want to group pixels that are “alike”: nearby in some color space. • Choices: nonlinear RGB, linear-light RGB, CIE LAB, many others. • CIE LAB produced nicer results for some ad hoc segmentation experiments, but is expensive to compute. • Linear-light RGB is the right thing for inverse rendering techniques; it is cheap to compute. • I started with CIE LAB, but now use linear RGB.
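Working in linear-light RGB means decoding the camera's gamma-encoded values first. A minimal sketch using the standard piecewise sRGB decoding; the talk does not specify which decoding it used, so treat the formula choice as an assumption.

```python
def srgb_to_linear(c):
    """Decode one gamma-encoded sRGB channel in [0, 1] to linear light
    (standard IEC 61966-2-1 piecewise formula)."""
    if c <= 0.04045:
        return c / 12.92
    return ((c + 0.055) / 1.055) ** 2.4

def pixel_to_linear(rgb8):
    """Convert an 8-bit (r, g, b) pixel to linear-light floats."""
    return tuple(srgb_to_linear(v / 255.0) for v in rgb8)
```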
Simple Inverse Rendering • Assume all surfaces have Lambertian reflectance • ρ = m · l · cos θ, where θ is the angle between the light direction and the surface normal • Can’t disambiguate material color m from illuminant color l • The compound color ml, under varying scale, forms a vector through the origin in RGB space. • This is a much more specific relation than e.g. Euclidean distance.
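The "vector through the origin" relation suggests a concrete membership test: measure a pixel's perpendicular distance to the line through the RGB origin along the compound color, rather than its Euclidean distance to a reference color. A sketch (my naming, not the talk's):

```python
import math

def dist_to_color_line(pixel, line_color):
    """Perpendicular distance from `pixel` to the line through the
    origin along `line_color`; both are linear-light RGB triples."""
    dot = sum(p * c for p, c in zip(pixel, line_color))
    norm2 = sum(c * c for c in line_color)
    # Project the pixel onto the line, then measure the residual.
    proj = [dot / norm2 * c for c in line_color]
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(pixel, proj)))
```

A shadowed copy of the same material (the same color at half brightness) has line distance zero even though its Euclidean distance to the unshadowed color is large.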
Covariance Bodies: • 5 numbers’ worth of storage • Ellipsoid-shaped (take eigenvectors of matrix) • Statistical significance: expected value of points • Advantage: consistency under summation • Can use them to vaguely characterize shapes. • Generalizes to n dimensions.
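A 2-D sketch of the idea (names mine): the slide's 5 numbers are the 2-D mean plus the symmetric 2×2 covariance; storing raw moment sums plus a count, as below, spends one extra number but makes the consistency-under-summation property trivial, since two bodies merge by plain addition.

```python
import math

class CovBody:
    def __init__(self):
        self.n = self.sx = self.sy = 0.0
        self.sxx = self.syy = self.sxy = 0.0

    def add_point(self, x, y):
        self.n += 1; self.sx += x; self.sy += y
        self.sxx += x * x; self.syy += y * y; self.sxy += x * y

    def merge(self, other):
        # Raw moment sums of a union are just the sums of the sums.
        self.n += other.n; self.sx += other.sx; self.sy += other.sy
        self.sxx += other.sxx; self.syy += other.syy; self.sxy += other.sxy

    def ellipse_axes(self):
        """Eigenvalues of the covariance matrix: squared semi-axis
        lengths of the characterizing ellipse (major, minor)."""
        mx, my = self.sx / self.n, self.sy / self.n
        cxx = self.sxx / self.n - mx * mx
        cyy = self.syy / self.n - my * my
        cxy = self.sxy / self.n - mx * my
        tr, det = cxx + cyy, cxx * cyy - cxy * cxy
        d = math.sqrt(max(tr * tr / 4 - det, 0.0))
        return tr / 2 + d, tr / 2 - d
```

A long thin point set gives one large and one near-zero eigenvalue; a round blob gives two similar ones, which is the shape characterization used later for arm-vs-torso.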
Covariance Bodies for Color Plane Fitting • Least-squares plane fit uses the same matrix. • Track the right-hand side as well: 3 more numbers. • Sum these to get group plane fits. • (example)
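One way to realize this (a sketch under my own conventions): fit the plane b = αr + βg + γ through a region's pixels. The normal-equation matrix is built from the same moment sums as the covariance body over (r, g), and the right-hand side is three extra sums (Σrb, Σgb, Σb); both add across regions, so merged sums give group fits directly.

```python
def plane_fit(pixels):
    """pixels: list of (r, g, b). Least-squares fit of b = a*r + B*g + c;
    returns (a, B, c)."""
    n = float(len(pixels))
    sr = sum(p[0] for p in pixels); sg = sum(p[1] for p in pixels)
    srr = sum(p[0] * p[0] for p in pixels)
    sgg = sum(p[1] * p[1] for p in pixels)
    srg = sum(p[0] * p[1] for p in pixels)
    # The three extra right-hand-side numbers:
    srb = sum(p[0] * p[2] for p in pixels)
    sgb = sum(p[1] * p[2] for p in pixels)
    sb = sum(p[2] for p in pixels)

    # Solve the 3x3 normal equations by Cramer's rule.
    m = [[srr, srg, sr], [srg, sgg, sg], [sr, sg, n]]
    rhs = [srb, sgb, sb]

    def det3(a):
        return (a[0][0] * (a[1][1] * a[2][2] - a[1][2] * a[2][1])
              - a[0][1] * (a[1][0] * a[2][2] - a[1][2] * a[2][0])
              + a[0][2] * (a[1][0] * a[2][1] - a[1][1] * a[2][0]))

    d = det3(m)
    out = []
    for col in range(3):
        mc = [row[:] for row in m]
        for i in range(3):
            mc[i][col] = rhs[i]
        out.append(det3(mc) / d)
    return tuple(out)
```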
Calibration Mode • Stand in a fixed pose • Pose designed to be easily recognizable • Gives us things that help later: • Body measurements • Background of scene • Shirt color (and histogram) • Skin color • Coarse model of environment illumination
How We Recognize This Pose • Pick a color to look for; isolate it. • Project this color to the X and Y image axes • Find spikes in projection • Use heuristics to judge shape and give a confidence value: • Outliers • Relative spike sizes • Screen real-estate occupied • (example)
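The projection step can be sketched like this (threshold and spike definition are my assumptions): count matching pixels per column, then call any run of columns above a fraction of the peak a spike. The calibration pose should produce a characteristic spike pattern that the heuristics then score.

```python
def project_to_x(mask, width):
    """mask: set of (x, y) pixels that matched the chosen color.
    Returns per-column match counts."""
    counts = [0] * width
    for x, _ in mask:
        counts[x] += 1
    return counts

def find_spikes(counts, frac=0.5):
    """Return (start, end) column runs whose counts exceed
    frac * max(counts)."""
    thresh = frac * max(counts)
    spikes, start = [], None
    for i, c in enumerate(counts + [0]):   # sentinel closes the last run
        if c > thresh and start is None:
            start = i
        elif c <= thresh and start is not None:
            spikes.append((start, i - 1))
            start = None
    return spikes
```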
Try many colors. • Sort colors present in scene by popularity; cluster them. • Create a fuzzy color cone through each cluster. • Vary the cone radius. • Do the recognition listed on previous slide; select the color cone with the best score. • Fixed color grid (to combat instability!)
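A fuzzy color cone can be scored as below (a sketch; the linear falloff is my assumption): the cone passes through the RGB origin around a cluster's mean color, its radius is angular, and membership fades toward the boundary instead of cutting off discretely, in keeping with the avoid-discrete-decisions paradigm.

```python
import math

def cone_membership(pixel, axis, half_angle):
    """1.0 on the cone axis, falling linearly to 0.0 at `half_angle`
    radians away; pixel and axis are linear RGB triples."""
    dot = sum(p * a for p, a in zip(pixel, axis))
    pn = math.sqrt(sum(p * p for p in pixel))
    an = math.sqrt(sum(a * a for a in axis))
    if pn == 0.0 or an == 0.0:
        return 0.0
    cosang = max(-1.0, min(1.0, dot / (pn * an)))
    return max(0.0, 1.0 - math.acos(cosang) / half_angle)
```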
Head Finding • Many heuristics: • Medium-detail region (Flatness + sharpness) • But not a long sharp edge • Compact body • Skin-colored • Not the background
Skin color? • Fit points in RGB space with an approximating surface? • Where do I get a good skin color database?
www.hotornot.com! • I get to work and check people out at the same time. • (app demo)
Gameplay Recognition Mode • Goal: Find positions of user’s torso and arms. • When we’re actually playing the game, we use the info provided by calibration to help us. • Currently only use shirt + skin color.
Body Shape Analysis • Slide a square window across the image; for each window position, use the pixel regions falling within the window to perform a local shape analysis. • Examine the resulting ellipses to find the arms. These are long, centered ellipses; round regions are the torso. (example) • Path-trace these to get an ordered series of points representing each arm. • Fit one or two line segments to this series of points (one segment = straight arm, two = bent).
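The one-vs-two segment fit can be realized minimally like this (my sketch, with an assumed bend threshold): try a single segment from first to last point of the traced arm; if some point deviates too much, split at the worst point, which serves as the elbow.

```python
import math

def point_seg_dist(p, a, b):
    """Distance from 2-D point p to segment a-b."""
    ax, ay = a; bx, by = b; px, py = p
    dx, dy = bx - ax, by - ay
    l2 = dx * dx + dy * dy
    t = 0.0 if l2 == 0 else max(0.0, min(1.0,
        ((px - ax) * dx + (py - ay) * dy) / l2))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))

def fit_arm(points, bend_thresh=2.0):
    """Return [(a, b)] for a straight arm, or [(a, m), (m, b)] for a
    bent one, where m is the point of maximum deviation (the elbow)."""
    a, b = points[0], points[-1]
    worst_i, worst_d = 0, 0.0
    for i, p in enumerate(points):
        d = point_seg_dist(p, a, b)
        if d > worst_d:
            worst_i, worst_d = i, d
    if worst_d <= bend_thresh:
        return [(a, b)]
    m = points[worst_i]
    return [(a, m), (m, b)]
```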
Hands in front of body? • The arm will blend into the body. • The hands will look like “holes” in the body. • This messes up arm detection.
Multi-step Process: • Do a sliding window pass; approximate extents of torso using initial set of regions (holes may be there). • Look for hand-colored blobs in this area. • Merge those blobs with the set of torso regions. • Do another sliding window pass, now detecting elongated shapes (for arms).
Creating a 3D character pose from 2D information • Resolve ambiguities with game-domain constraints (e.g. hands always within some plane in front of torso). • Use inverse kinematics and some simple body knowledge to recover 3D joint angles. • See the column “The Inner Product” in the April 2002 issue of Game Developer for an explanation of 3D IK, and source code.
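The cited column covers the full 3D treatment; for flavor, here is the standard analytic two-bone IK in 2-D via the law of cosines (not the talk's code): given upper- and lower-arm lengths and a target, recover a shoulder angle and an elbow bend angle.

```python
import math

def two_bone_ik(l1, l2, tx, ty):
    """Return (shoulder, elbow_bend) in radians so the chain
    root -> elbow -> hand, rooted at the origin, reaches (tx, ty).
    The target is clamped to the reachable annulus."""
    d = math.hypot(tx, ty)
    d = max(abs(l1 - l2), min(l1 + l2, d))
    d = max(d, 1e-9)   # guard the degenerate target-at-root case
    # Interior elbow angle from the law of cosines.
    cos_elbow = (l1 * l1 + l2 * l2 - d * d) / (2 * l1 * l2)
    elbow = math.acos(max(-1.0, min(1.0, cos_elbow)))
    # Shoulder: direction to target, corrected by the triangle angle.
    cos_a = (l1 * l1 + d * d - l2 * l2) / (2 * l1 * d)
    a = math.acos(max(-1.0, min(1.0, cos_a)))
    shoulder = math.atan2(ty, tx) - a
    return shoulder, math.pi - elbow   # bend of 0 means a straight arm
```

The clamps on the acos arguments are the same avoid-discrete-failure instinct: slightly out-of-range targets degrade gracefully instead of producing NaNs.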
Method Advantages • It’s reasonably fast • Works with moving background / camera • Doesn’t care much about shadows
Method Shortcomings • Currently confused by similar colors (low clustering resolution) • Requires a few more technical solutions before it will be truly robust (e.g. auto gamma detection).
Future Work • Performance: 640x480 @ 30fps • More inverse rendering work (specularity) • Local surface modeling (eliminate confusion due to similar colors) • Texture classification • Mental model feedback
Coding Issues • How do you get video images from a webcam in Windows? • VFW code by Nathan d’Obrenan in Game Programming Gems 2 • Unfortunately, VFW is a legacy API • DirectShow is the thing you need to use for future compatibility.
DirectShow is terrible! • Needlessly complex and bloated. • The base classes provided in the DirectX SDK induce a lot of latency (latency = death) • A minimal implementation of “just give me a damn frame from the camera” took 1,500 lines of code; should have taken 8. • Ask me if you want the source code (jon@bolt-action.com) • Or use VFW or a proprietary API.
Blatant Plug • Experimental Gameplay Workshop • Friday, 4pm-7pm, Fairmont Regency I