This paper discusses the usage, technological challenges, and architecture of a Reactive Virtual Trainer (RVT) that integrates reactive and proactive actions with multi-modal synchronization. It explores the application of the RVT as a medium and empathic consultant in various scenarios, such as preventing RSI, preserving or restoring physical condition, and acting as a physiotherapist. The relevance of the RVT for society is highlighted in the context of an ageing population and unhealthy lifestyles. The paper also examines the challenges of multi-modal synchronization and real-time planning.
Towards a Reactive Virtual Trainer
Zsófia Ruttkay, Job Zwiers, Herwin van Welbergen, Dennis Reidsma
HMI, Dept. of CS, University of Twente
Enschede, The Netherlands
zsofi@cs.utwente.nl
Overview
• RVT usage
• Related work
• RVT technological challenges
• Architecture
• Integration of reactive and proactive actions
• Multi-modal sync
• A close look at clapping – demos
RVT usage
• RVT = an IVA with the expert and psychological knowledge of a real physiotherapist, to be used e.g. to:
  • prevent RSI for computer workers
  • preserve/restore weight and physical condition, as a (personal) trainer
  • act as a physiotherapist to cure illnesses affecting motion
• The RVT is a medium and an empathic consultant
• Relevance for society
  • ageing population, unhealthy lifestyle
  • human experts: few in number, expensive, only at certain locations
• RVT usage context
  • PC + 1–2 cameras in a normal setting (homes, offices)
  • ‘instructed’ by an authorized person (may be the user as well as the developer)
  • can be adapted/extended
RVT technological challenges
• Vision-based perception, possibly extended with biosignals
• Reactive to exercise performance, physical state and overall performance
• Small talk, exercise correction, plan revision
• RVT body and motion parameters adaptable/calibrated
• Authoring by a human
• Extensible by an expert (new exercises)
• Motion with music, speech or clapping (also as input for tempo – see the sketch below)
• Playground for multi-modal output generation
• “Exercise motion intelligence”: timing, concatenation, idle poses, …
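Since the tempo can also be taken from acoustic input, here is a minimal Python sketch of how a tempo estimate could be derived from detected clap or beat onsets. The `estimate_tempo` function and its use of the median inter-onset interval are illustrative assumptions, not the system's actual beat tracker.

```python
# Minimal sketch: estimating the exercise tempo from onset times.
# `onsets` would come from an acoustic onset detector (not shown);
# the median inter-onset interval gives a robust tempo estimate.

def estimate_tempo(onsets: list[float]) -> float:
    """Return tempo in beats per minute from onset timestamps (seconds)."""
    if len(onsets) < 2:
        raise ValueError("need at least two onsets")
    intervals = sorted(b - a for a, b in zip(onsets, onsets[1:]))
    median_ioi = intervals[len(intervals) // 2]  # robust to missed/spurious claps
    return 60.0 / median_ioi

# Example: claps roughly every 0.5 s -> ~120 BPM
print(estimate_tempo([0.00, 0.51, 0.99, 1.50, 2.01]))
```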
RVT architecture
[Architecture diagram. Main blocks: Human expert (authoring scenario, calibration of the user); Monitoring the user (biosensing module(s), optical motion tracking, acoustic beat tracking, multi-sensor integration, motion interpretation); Planning action of the VT (exercise scenario revision, motion specification); Presentation of feedback of the VT (multi-modal feedback, motion demonstration); interfaces between the VT and the user.]
Multi-modal sync
• Exercises are executed using several modalities
  • Body movement
  • Speech
  • Music
  • Sound (clap, foot tap)
• Challenges
  • Synchronization
  • Monitoring the user => real-time (re)planning (see the sketch below)
  • Exaggeration to point out details
  • Speed up / slow down
  • Feedback/correction
  • …
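To make the real-time (re)planning challenge concrete, the following hypothetical Python sketch schedules three modalities of a counted clap (motion stroke, clap sound, spoken count) against a shared beat grid; rerunning it with a new period is the simplest possible form of replanning. All class and function names are invented for illustration.

```python
# Hedged sketch: aligning clap strokes, clap sounds and spoken counts
# on metronome beats. Not the actual RVT planner.

from dataclasses import dataclass

@dataclass
class ScheduledEvent:
    modality: str    # "motion", "sound" or "speech"
    sync_point: str  # e.g. "stroke", "clap", "peak_center"
    time: float      # absolute time in seconds

def schedule_claps(start: float, period: float, count: int) -> list[ScheduledEvent]:
    """Pin all three modalities to the same beat instant; calling this
    again with a new period is the simplest form of replanning."""
    events = []
    for i in range(count):
        beat = start + i * period
        events.append(ScheduledEvent("motion", "stroke", beat))
        events.append(ScheduledEvent("sound", "clap", beat))
        events.append(ScheduledEvent("speech", "peak_center", beat))
    return events

print(schedule_claps(start=0.0, period=0.5, count=3))
```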
Synchronization: related work
• Classic approach in speech/gesture synchronization:
  • Speech leads, gesture follows
• MURML (Kopp et al.)
  • No leading modality
  • Planning in sequential chunks containing one piece of speech and one aligned gesture
  • Co-articulation at the borders of chunks
• BML (Kopp, Krenn, Marsella, Marshall, Pelachaud, Pirker, Thórisson, Vilhjálmsson)
  • No leading modality
  • Synchronized alignment points in behavior phases (see the sketch below)
  • For now, aimed mainly at speech/gesture synchronization
  • Still in development
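The BML idea of synchronized alignment points can be illustrated with a small sketch: each behavior exposes named sync points over its phases, and cross-behavior constraints tie them together without designating a leading modality. The phase names below follow the later BML specification; the `Constraint` class is a hypothetical stand-in, not a real BML realizer API.

```python
# Illustrative sketch of BML-style synchronization (not a real BML API).
# Every behavior exposes the same named sync points over its phases:
SYNC_POINTS = ("start", "ready", "stroke_start", "stroke",
               "stroke_end", "relax", "end")

class Constraint:
    """Ties a sync point of one behavior to a sync point of another,
    with no modality designated as the leader."""
    def __init__(self, behavior_a, sync_a, behavior_b, sync_b, offset=0.0):
        assert sync_a in SYNC_POINTS and sync_b in SYNC_POINTS
        self.a, self.sync_a = behavior_a, sync_a
        self.b, self.sync_b = behavior_b, sync_b
        self.offset = offset  # seconds: a.sync_a occurs at b.sync_b + offset

# e.g. the clap gesture's stroke coincides with a metronome tick:
align = Constraint("clap1", "stroke", "tick3", "start", 0.0)
```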
Synchronization: own previous work
• Virtual Dancer
  • Synchronization between music (beats) and dance animation
  • Dance move selection by user interaction
• Virtual Presenter
  • Synchronization between speech, gesture, posture and sheet display
  • Leading modality can change over time
• GESTYLE markup language with par/seq and wait constructs (see the sketch below)
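The par/seq/wait idea can be illustrated with a tiny scheduler; the expression format and the `schedule` function below are hypothetical sketches of the concept, not GESTYLE's actual XML syntax.

```python
# Hypothetical mini-interpreter for par/seq/wait timing constructs
# (GESTYLE's real syntax is an XML markup; this only shows the idea).

def schedule(node, t0=0.0, out=None):
    """node = ("atom", name, dur) | ("wait", dur) | ("seq", *kids) | ("par", *kids).
    Appends (name, start_time) pairs to `out`; returns the end time."""
    out = out if out is not None else []
    kind = node[0]
    if kind == "atom":
        _, name, dur = node
        out.append((name, t0))
        return t0 + dur
    if kind == "wait":                # pure delay, produces no behavior
        return t0 + node[1]
    if kind == "seq":                 # children run back to back
        t = t0
        for kid in node[1:]:
            t = schedule(kid, t, out)
        return t
    if kind == "par":                 # children start together
        return max(schedule(kid, t0, out) for kid in node[1:])
    raise ValueError(kind)

timeline = []
schedule(("seq",
          ("par", ("atom", "raise_arms", 1.0), ("atom", "say_ready", 0.6)),
          ("wait", 0.2),
          ("atom", "clap", 0.4)), 0.0, timeline)
print(timeline)  # [('raise_arms', 0.0), ('say_ready', 0.0), ('clap', 1.2)]
```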
Close look at clapping
[Figure: phases of a clap cycle – stroke, (hold), retraction, (hold)]
Close look at clapping
• Start with a simple clap exercise and see what we run into
• The clap exercise:
  • Clap in the tempo of the beat of a metronome (later: of music)
  • When the palms touch, a clap sound is heard
  • Count while clapping, using speech synthesis
• Possible alignment points: word start/end, phonological peak start/center/end
• For now, we pick the center of the phonological peak, but we also generate the other alignment points for easy adaptation (see the sketch below)
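A hedged sketch of this alignment: if the speech synthesizer reports timing marks inside the counting word, the word can be shifted so that the chosen alignment point (here, the center of the phonological peak) lands exactly on the clap instant. The `Word` fields and all numbers are hypothetical.

```python
# Sketch: aligning a synthesized counting word to the clap instant.
from dataclasses import dataclass

@dataclass
class Word:
    duration: float    # total synthesized duration (s)
    peak_start: float  # phonological peak, relative to word start (s)
    peak_end: float

def word_start_time(clap_time: float, w: Word, align: str = "peak_center") -> float:
    """Return when to start the word so `align` coincides with clap_time."""
    offsets = {
        "word_start": 0.0,
        "word_end": w.duration,
        "peak_start": w.peak_start,
        "peak_center": 0.5 * (w.peak_start + w.peak_end),
        "peak_end": w.peak_end,
    }
    return clap_time - offsets[align]

one = Word(duration=0.40, peak_start=0.08, peak_end=0.20)
print(word_start_time(2.0, one))  # start "one" at t = 1.86 s
```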
Two examples of multi-modal sync
• Specification in BMLT (a sketch follows)
• Planning in real time – under/overspecification!
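A BML-flavoured sketch of such a specification might look as follows; the element and attribute names are invented for illustration and are not verbatim BMLT syntax. Note the deliberate underspecification: only one sync point per behavior is pinned to the beat, leaving the planner freedom, whereas pinning too many points would overspecify the timing.

```python
# Illustrative, BML-flavoured specification of one counted clap
# (NOT verbatim BMLT syntax; element/attribute names are invented).
clap_spec = """
<bml id="clap-exercise">
  <gesture id="clap1" type="clap" stroke="beat1"/>   <!-- palms touch on the beat -->
  <audio   id="sound1" start="clap1:stroke"/>        <!-- clap sound at the stroke -->
  <speech  id="count1" peakCenter="clap1:stroke">    <!-- peak center on the stroke -->
    <text>one</text>
  </speech>
</bml>
"""
```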
What if we speed up the tempo?
• The clapping animation should be faster
• Possibilities (sketched below):
  • Lower amplitude?
  • Linear speedup?
  • Speedup of the stroke?
  • Speedup of the retraction?
  • A combination of the above?
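A sketch of these speed-up options, assuming a clap cycle split into a stroke and a retraction phase: a single weight decides how the compression is shared between the phases, so linear speedup, stroke-only and retraction-only speedup are special cases. Amplitude reduction would be an orthogonal parameter. All names and numbers are illustrative.

```python
# Sketch of per-phase compression when the beat period shrinks.

def speed_up_clap(period: float, stroke: float, retraction: float,
                  stroke_weight: float = 0.0):
    """stroke_weight = 0: only the retraction absorbs the compression;
    stroke_weight = stroke / (stroke + retraction): linear speedup."""
    overshoot = (stroke + retraction) - period
    if overshoot <= 0:
        return stroke, retraction  # motion already fits the beat
    new_stroke = stroke - overshoot * stroke_weight
    new_retraction = retraction - overshoot * (1.0 - stroke_weight)
    return new_stroke, new_retraction

# Retraction-only speedup: the stroke keeps its "punch"
print(speed_up_clap(period=0.3, stroke=0.12, retraction=0.28))  # (0.12, 0.18)
```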
What if we slow down the metronome?
• Slower clapping? (demo movies here)
• Possibilities (sketched below):
  • Linear slowdown?
  • Slowdown of the stroke?
  • Slowdown of the retraction?
  • Hold at the end of the retraction (hands open)?
  • Hold after the stroke (clap)?
  • A combination of the above?
  • Back to idle position?
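A sketch of the hold-placement options for a slower tempo: the slack between the motion and the longer beat period can sit after the stroke (palms together), after the retraction (hands open), or be split between the two. Returning to an idle pose would be a further option not shown here. Names and phase durations are hypothetical.

```python
# Sketch: where to put the hold when the beat period grows.

def plan_slow_clap(period: float, stroke: float, retraction: float,
                   hold_at: str = "retraction_end"):
    """Return (phase, duration) segments filling one beat period."""
    slack = period - (stroke + retraction)
    assert slack >= 0, "use the speed-up strategy instead"
    if hold_at == "after_stroke":    # freeze with palms touching
        return [("stroke", stroke), ("hold", slack), ("retraction", retraction)]
    if hold_at == "retraction_end":  # freeze with hands open
        return [("stroke", stroke), ("retraction", retraction), ("hold", slack)]
    half = slack / 2                 # split: brief hold at both points
    return [("stroke", stroke), ("hold", half),
            ("retraction", retraction), ("hold", half)]

print(plan_slow_clap(1.0, stroke=0.12, retraction=0.28))
```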
Open issues in planning
• What do real humans do?
• Do the semantics of a motion (a clap) change if we change its amplitude or velocity profile? E.g. emotions, individual features
• Smooth tempo changes
• Automatic concatenation and inserted idle poses
• Appropriate high-level parameters
  • Related (e.g. amplitude/speed)?
  • Different from the parameters for communicative gestures (e.g. by Pelachaud)?
  • Amplitude and motion path specification
• Is our synchronization system capable of re-planning in real time?