From User-friendly to User’s Friend. Dr. Eric Petajan Founder and Chief Scientist face2face animation, inc. www.f2fanimation.com eric@f2f-inc.com. Why vision is required for the ideal HCI design. Problem Statement.
From User-friendly to User’s Friend Dr. Eric Petajan Founder and Chief Scientist face2face animation, inc. www.f2fanimation.com eric@f2f-inc.com Why vision is required for the ideal HCI design
Problem Statement The electronic extension of human capabilities is primarily limited by Human-Computer Interaction (HCI) systems that fail to meet our needs for fast, reliable, and secure input of information using the most comfortable human communication modes
Your computer should emulate your best friend • It should know who you are and if you are present • It should see and hear you in adverse conditions • It should respond to you quickly • It should tell you the truth • It should keep your secrets • It should be pleasant or entertaining • It should follow you around
A humanoid agent is a necessary component for the ultimate HCI
Humanoids can provide: • Clear focus for audio and visual attention • Easier capture of user behavior • Less strain on the user • Perception of credibility • Engagement and entertainment • Increased comprehension • Guidance with traditional information display
The quality of the virtual human is critically dependent on the amount of real human behavior that informs the humanoid model. Autonomous humanoid agents can’t pass the Turing test today.
The non-invasive capture and machine understanding of human behavior are grand challenges that have yet to be fully accomplished. We are still tethered to the keyboard and mouse.
Significant Human Behaviors Available without Contact • Audio/Visual Speech • Gestures • Facial expressions • Gaze direction • Posture
Ideal HCI Process Graph [Diagram: Capture Complete Human Behavior, Build Humanoid Model, Present Humanoid To Human, with an “AI” Engine block (Knowledge, Motive, Power) and a Capture Human Behavior block] What has been achieved to date?
The Good News • Processing hardware is fast and cheap • HD cameras now 10 times cheaper • Displays are good and cheap enough • Mobile data bandwidth is reliable enough for audio plus animation streams • Individual recognition technologies are approaching maturity (if not utility)
The Bad News • Computers can’t reliably “hear” humans with a single fixed microphone • Computers can’t reliably “see” humans with a single cheap video camera • HCI constraints exhaust and encumber users • Large segments of the population are unwilling or unable to engage in HCI
Steps in the Right Direction • Use one or more HD video cameras • Use steered microphone array with face tracking • Track and control user’s attention with humanoid • Continuously identify the user • Train the user with entertainment • Use dedicated hardware to minimize the impact of the HCI system on general computing and communication tasks
Multi-modal Speech Recognition • Audio-visual speech and speaker recognition provides robustness in noise • Use of visual speech removes need for close-talking microphone and provides robust steering of microphone array • MPEG-4 Face Animation Parameters (FAPs) accurately encode visual speech
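As a rough illustration of how a tracked face position could steer a microphone array, the sketch below computes per-microphone alignment delays for a delay-and-sum beamformer. Delay-and-sum is a common technique chosen here for illustration; the slides do not say which steering method is used. The function name, the array geometry, and the assumption that the face tracker yields a 3D position in meters are all hypothetical.

```python
import math

SPEED_OF_SOUND_M_S = 343.0  # dry air at roughly 20 degrees C


def steering_delays(mic_positions, source_position):
    """Return the delay in seconds to add to each microphone channel so
    that sound from source_position lines up across all channels.

    mic_positions: list of (x, y, z) tuples in meters.
    source_position: (x, y, z) in meters, e.g. a tracked mouth position
    (hypothetical input; the tracker itself is not shown).
    """
    distances = [math.dist(m, source_position) for m in mic_positions]
    farthest = max(distances)
    # The farthest microphone hears the sound last, so it gets zero delay;
    # nearer microphones are delayed to match it.
    return [(farthest - d) / SPEED_OF_SOUND_M_S for d in distances]


if __name__ == "__main__":
    mics = [(-0.1, 0.0, 0.0), (0.1, 0.0, 0.0)]   # two mics, 20 cm apart
    mouth = (0.5, 0.0, 1.0)                       # speaker off to one side
    for mic, delay in zip(mics, steering_delays(mics, mouth)):
        print(mic, f"{delay * 1e6:.1f} microseconds")
```

Summing the delayed channels then reinforces sound arriving from the tracked face and attenuates sound from other directions.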
People want information and communication wherever they happen to be • Mobile devices need to be small (thin client) • Device and service costs must be low • Must be fast and reliable • Bandwidth must be used efficiently for low latency and cost
People want to be entertained • Entertaining information is retained better • Personality attracts attention and is the main component of entertainment • Personality is manifested mostly in face and voice • Face and voice must be synced and delivered with quality (high frame rate)
People like animated characters • Entertaining/relationship forming • Can be efficiently delivered anywhere • Graphical faces scale well to small screens • Character design limited only by imagination • Any person can drive any character (with FAPs) • Emotional response to animated faces is hardwired
Mobile devices today • Can deliver animated characters • Are cheap • Can deliver low bit-rate content reliably • Are communicators and entertainers • Are very popular
User Input to Mobile Devices • Keyboards are impractical for mobile devices • Best user interface is speech and face • Little room for text/menus on small screens • Acoustic speech recognition is unreliable in mobile environments • Visual speech and face recognition are needed for robust mobile user interface
Low bit-rate is the key to mobile happiness • Reliable delivery of wireless video will not happen for a very long time • Only 20-30 kilobits/sec can be sustained everywhere • MPEG-4 animation streams fit in available bandwidth with audio • 2 kilobits/sec for face animation data • 6-10 kilobits/sec for body animation data
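The bandwidth claim above can be checked with simple arithmetic. The sketch below adds the face and body animation rates from the slide to an audio rate and compares the total with the 20-30 kilobits/sec channel. The audio rates tried are hypothetical, since this slide does not give one, and the function names are illustrative only.

```python
# Bandwidth budget sketch using the figures quoted on the slide.
FACE_KBPS = 2.0                      # face animation data (from the slide)
BODY_KBPS_RANGE = (6.0, 10.0)        # body animation data (from the slide)
CHANNEL_KBPS_RANGE = (20.0, 30.0)    # sustainable everywhere (from the slide)


def total_kbps(audio_kbps, body_kbps, face_kbps=FACE_KBPS):
    """Sum the three streams that share the channel."""
    return audio_kbps + body_kbps + face_kbps


def headroom_kbps(audio_kbps, body_kbps, channel_kbps):
    """Positive result: the streams fit. Negative result: overflow."""
    return channel_kbps - total_kbps(audio_kbps, body_kbps)


if __name__ == "__main__":
    # ASSUMPTION: candidate speech codec rates, not taken from the slides.
    for audio in (4.0, 8.0, 12.0):
        worst = headroom_kbps(audio, BODY_KBPS_RANGE[1], CHANNEL_KBPS_RANGE[0])
        best = headroom_kbps(audio, BODY_KBPS_RANGE[0], CHANNEL_KBPS_RANGE[1])
        print(f"audio {audio:4.1f} kbps: headroom {worst:+.1f} to {best:+.1f} kbps")
```

Under these assumed rates, face plus body animation plus audio fits the worst-case 20 kilobits/sec channel only when audio stays at or below 8 kilobits/sec. That is why the audio codec choice matters as much as the animation coding.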
Mobile Character Player Demo • Facial expressions, lip movements and head motion extracted from ordinary video automatically as FAPs • FAPs streamed to player with compressed audio at 10 kbps total • 300 triangle 3D mesh face model renders in real time on phone • FAPs and audio decoded in parallel with graphics rendering in software
Standards • Facilitate collaboration • Minimize reinvention of wheels • Decrease costs with economies of scale • Allow database sharing • Provide free or cheap source code • Enable low latency communication
The MPEG-4 Standard • Provides comprehensive framework for 2D and 3D multimedia communication • Provides Face and Body Animation (FBA) representation and coding • Low bit-rate coding eliminates network bottlenecks • Optimized implementations increase speed and reduce costs to consumers
MPEG-4 Face Animation • Face model is independent of Face Animation Parameters (FAPs) • FAPs contain high quality animation data for driving all types of face models from broadcast to wireless • FAPs displace feature points from neutral position
Body Animation • Harmonized with VRML Hanim spec • Body Animation Parameters (BAPs) are humanoid skeleton joint Euler angles • Body Animation Table (BAT) can be downloaded to map BAPs to skin deformation • BAPs can be highly compressed for streaming
Body Animation Parameters (BAPs) • 186 humanoid skeleton Euler angles • 110 free parameters for use with downloaded body surface mesh • Coded using same codecs as FAPs • Typical bitrate for coded BAPs is 5-10 kbps
Neutral Face Definition • Head axes parallel to the world axes • Gaze is in direction of Z axis • Eyelids tangent to the iris • Pupil diameter is one third of iris diameter • Mouth is closed and the upper and lower teeth are touching • Tongue is flat, horizontal with the tip of tongue touching the boundary between upper and lower teeth
Face Feature Points [Figure: numbered feature points on the head, right eye, left eye, nose, teeth, tongue and mouth, with x, y, z axes; the legend distinguishes feature points affected by FAPs from other feature points]
Face Model Independence • FAPs are always normalized for model independence • FAPs (and BAPs) can be used without MPEG-4 systems/BIFS • Private face models can be accurately animated with FAPs • Face models can be simple or complex depending on terminal resources
Face Animation Parameter Normalization • Face Animation Parameters (FAPs) are normalized to facial dimensions • Each FAP is measured as a fraction of neutral face mouth width, mouth-nose distance, eye separation, or iris diameter • 3 Head and 2 eyeball rotation FAPs are Euler angles
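The normalization described above is what lets motion captured from one face drive a differently proportioned model. The sketch below converts a measured displacement into a normalized value and back onto another face. The class and function names are hypothetical, and the divisor of 1024 is an assumption about the size of the "fraction"; the slide does not state it.

```python
from dataclasses import dataclass

# ASSUMPTION: each unit is the neutral-face distance divided by 1024.
# The slide only says each FAP is "a fraction of" the distance.
UNIT_DIVISOR = 1024.0


@dataclass
class NeutralFace:
    """Neutral-face distances, in whatever units that face is measured in
    (pixels for a captured face, model units for a 3D mesh)."""
    mouth_width: float
    mouth_nose: float
    eye_separation: float
    iris_diameter: float

    def unit(self, name: str) -> float:
        """Size of one normalized step for the named distance."""
        return getattr(self, name) / UNIT_DIVISOR


def encode_fap(displacement: float, face: NeutralFace, unit_name: str) -> int:
    """Convert a measured feature-point displacement to a normalized value."""
    return round(displacement / face.unit(unit_name))


def decode_fap(fap_value: int, face: NeutralFace, unit_name: str) -> float:
    """Convert a normalized value to a displacement on the given face."""
    return fap_value * face.unit(unit_name)


if __name__ == "__main__":
    captured = NeutralFace(64.0, 30.0, 70.0, 12.0)   # pixels (made-up numbers)
    model = NeutralFace(2.0, 1.0, 2.2, 0.4)          # model units (made-up)
    fap = encode_fap(6.0, captured, "mouth_width")    # lip corner moved 6 px
    print(fap, decode_fap(fap, model, "mouth_width"))
```

The displacement is the same proportion of mouth width on both faces, which is the sense in which the parameters are independent of the face model.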
Lip FAPs • Mouth is closed if the sum of the upper and lower lip FAPs = 0
FAP Compression • FAPs are adaptively quantized to desired quality level • Quantized FAPs are differentially coded • Adaptive arithmetic coding further reduces bitrate • Typical compressed FAP bitrate is less than 2 kilobits/second
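FAP Predictive Coding [Block diagram: FAP(t), a subtractor (+/−), quantizer Q, arithmetic coder producing the bitstream, and a feedback path through inverse quantizer Q⁻¹ and a frame delay]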
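The predictive coding loop can be sketched in a few lines. The code below is a minimal illustration under stated assumptions: a fixed quantizer step rather than the adaptive quantization mentioned above, a prediction equal to the previous reconstructed value, and no arithmetic coder (the quantized symbols are returned as a plain list). Function names are hypothetical.

```python
def encode(values, step):
    """Quantize the difference between each value and the previous
    reconstructed value (the Q block plus the feedback path)."""
    symbols = []
    prediction = 0  # encoder and decoder must start from the same state
    for value in values:
        q = round((value - prediction) / step)  # Q: quantize the error
        symbols.append(q)
        # Q^-1 plus frame delay: track what the decoder will reconstruct,
        # so quantization error does not accumulate across frames.
        prediction += q * step
    return symbols


def decode(symbols, step):
    """Rebuild the values by accumulating dequantized differences."""
    reconstructed = []
    prediction = 0
    for q in symbols:
        prediction += q * step
        reconstructed.append(prediction)
    return reconstructed


if __name__ == "__main__":
    fap_track = [0, 4, 9, 15, 14, 14]   # made-up values for one FAP over time
    coded = encode(fap_track, step=2)
    print(coded)
    print(decode(coded, step=2))
```

Because the encoder predicts from its own reconstruction rather than from the original input, each decoded value stays within half a quantizer step of the original. Slowly changing parameters produce many small symbols, which is what an entropy coder such as the arithmetic coder can then compress well.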
General Bandwidth Issues • Broadband deployment is happening slowly • 3G will not be ubiquitous for many years • DSL availability is limited and cable is shared • Talking heads need high frame-rate • Consumer graphics hardware is cheap and powerful • MPEG-4 FBA tools are matched to available bandwidth and terminals
Markerless Facial Motion Capture for Animation Production • Track/analyze face features in each video frame • Captured face feature motion easily converted to FAPs • Face model is “puppeteered” by FAPs • MPEG-4 FAPs only specify motion of feature points (not surrounding surface)
Automatic Face Animation Demonstration • FAPs extracted from camcorder video • Inner lip, eye region and head rotation FAPs compressed to less than 2 kbits/sec • 30 frames/sec animation generated automatically • Face models developed with face2face plugin for Maya
Conclusions • Humanoid agents are required for best HCI • Vision-based facial capture is required for humanoid design and human behavior capture • MPEG-4 Face and Body Animation coding enables high quality mobile communication • Ultimate HCI systems must continuously see, hear and identify the user for best reliability and security