CONFUCIUS: an Intelligent MultiMedia storytelling interpretation & presentation system

Minhua Eunice Ma
Supervisor: Prof. Paul Mc Kevitt
School of Computing and Intelligent Systems, Faculty of Informatics
University of Ulster, Magee

Objectives of CONFUCIUS
• To interpret natural language story and movie (drama) script input and to extract conceptual semantics from the natural language
• To generate 3D animation and virtual worlds automatically from natural language
• To integrate 3D animation with speech and non-speech audio, forming an intelligent multimedia storytelling system for presenting multimodal stories

CONFUCIUS’ context diagram
[Figure: context diagram. Labels: storywriter/playwright; story in natural language; movie/drama script; tailored menu for script input; CONFUCIUS; 3D animation; speech (dialogue); non-speech audio; user/story listener.]

Previous systems
• Schank’s CD Theory (1972)
    • Primitives & scripts
    • SAM & PAM
• Automatic text-to-graphics systems
    • WordsEye (Coyne & Sproat, 2001)
    • ‘Micons’ and CD-based language animation (Narayanan et al., 1995)
    • Spoken Image (Ó Nualláin & Smith, 1994) and its successor SONAS (Kelleher et al., 2000)

MultiModal interactive storytelling
• AesopWorld
• KidsRoom
• Larsen & Petersen’s Interactive Storytelling
• Oz
• Computer games
• Virtual humans & embodied agents
    • BEAT (Cassell et al., 2000)
    • Jack (University of Pennsylvania)
    • Improv (Perlin and Goldberg, 1996)
    • SimHuman
    • Gandalf
    • PPP persona

Architecture of CONFUCIUS
[Figure: architecture diagram. Labels: natural language stories; script writer; script parser; natural language processing; language knowledge (lexicon, grammar, etc.); semantic representations; text to speech; sound effects; mapping; animation generation; visual knowledge (3D graphic library); prefabricated objects (knowledge base); 3D authoring tools, existing 3D models & character models; synchronizing & fusion; 3D world with audio in VRML.]

Semantic representations

MultiModal semantic representation
[Figure: levels of multimodal semantics. Labels: high-level multimodal semantic representation (XML/frame-based); media-independent representation; intermediate level; visual media-dependent representation; audio media-dependent representation; language modality; visual modality; non-speech audio modality.]

Mental imagery & meaning processing
[Figure: mental imagery diagram. Labels: physical world; virtual world; mental world (twice); cognition; re-cognition; communication; simulation: image recognition; simulation: language understanding; simulation: presentation via language or other modalities; meanings, communicable ideas, thoughts, manifestable messages, proverbs, examples, parables, etc.]

Knowledge base of CONFUCIUS
• Language knowledge
    • Semantic knowledge: lexicons (e.g. WordNet)
    • Syntactic knowledge: grammars
    • Statistical models of language
    • Associations between words
• Visual knowledge
    • Object model (nouns)
        • Functional information
        • Internal coordinate axes (for spatial reasoning)
        • Associations between objects
    • Event model (event verbs, describes the motion of objects)
• World knowledge
• Spatial & qualitative reasoning knowledge

Graphic library
[Figure: graphic library diagram. Labels: objects/props; characters; simple geometry files; geometry & joint hierarchy files; motions; animation library (key frames); instantiation.]

Data Flow Diagram
[Figure: data flow diagram. Labels: script writer; story; script; script parser; natural language processor; visual semantics; Scene & Actor descriptions; primitives library; animation generator; VRML without sound nodes; TTS; dialogues; sound effect driver; non-speech audio; music library; media coordination; synthesized animation.]

Animation generator
[Flowchart, in order of appearance:]
• LCS representation
• Verb semantic analysis: use lexical relations (WordNet) to replace synonyms, scripts application, etc.
• Match basic motions in library? (Y / N)
    • N: motion decomposition
• Animation controller
• Motion instantiation
• Environment placement
• VRML format of the virtual story world
• Examples; demo
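The decision step in the flowchart above (replace synonyms, match against the motion library, otherwise decompose) can be sketched roughly as follows. This is only an illustration, not CONFUCIUS' actual code: the tables `SYNONYMS`, `BASIC_MOTIONS`, and `DECOMPOSITIONS` and the function `plan_motions` are hypothetical stand-ins for the WordNet lookup and the motion library, and their contents are invented for the example.

```python
# Hypothetical stand-ins for the WordNet lookup and the motion library.
SYNONYMS = {"leap": "jump", "hop": "jump"}          # synonym -> canonical verb
BASIC_MOTIONS = {"walk", "jump", "turn", "moveTo"}  # motions assumed to be in the library
DECOMPOSITIONS = {"bounce": ["moveTo", "moveTo"]}   # complex verb -> simpler motions


def plan_motions(verb):
    """Return the list of basic motions assumed to realise `verb`."""
    verb = SYNONYMS.get(verb, verb)      # use lexical relations to replace synonyms
    if verb in BASIC_MOTIONS:            # "match basic motions in library?" -> Y
        return [verb]
    if verb in DECOMPOSITIONS:           # -> N: motion decomposition
        motions = []
        for part in DECOMPOSITIONS[verb]:
            motions.extend(plan_motions(part))
        return motions
    raise KeyError(f"no visual definition for verb: {verb!r}")
```

Under these assumptions, `plan_motions("hop")` resolves to the library motion `jump`, while `plan_motions("bounce")` is decomposed into two `moveTo` steps before being handed to the animation controller.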

Categories of events
• Atomic entities
    • Change physical location such as position and orientation, e.g. “bounce”, “turn”
    • Change intrinsic attributes such as shape, size, color, and texture, e.g. “bend”, and even visibility, e.g. “disappear”, “fade” (in/out)
• Non-atomic entities
    • Non-character events
        • Two or more individual objects fuse together, e.g. “melt” (in)
        • One object divides into two or more individual parts, e.g. “break” (into pieces)
        • Change sub-components (their position, size, color), e.g. “blossom”
        • Environment events (weather verbs), e.g. “snow”, “rain”
    • Character events
        • Action verbs
            • Intransitive verbs
            • Transitive verbs
        • Non-action verbs (stative, emotion, possession, mental activities, cognition & perception)
        • Idioms & metaphor verbs

Categories of action verbs
• Intransitive verbs
    • Biped kinematics, e.g. “walk”, “swim”, & other motion models like “fly”
    • Facial expressions, e.g. “laugh”, “anger”
    • Lip movement, e.g. “speak”, “say” (involve speech modality)
• Transitive verbs
    • single object, e.g. “throw”, “push”, “kick”
    • multiple objects
        • direct and indirect objects, e.g. “give”, “pass”, “show”
        • indirect object & the instrument, e.g. “cut”, “hammer”

Visual definition & word sense
[Diagram: verb to word sense is one-to-many (polysemy); word sense to visual definition entry is many-to-many (mapping, synonymy).]
• word sense: minimal complete unit of meaning in the language modality
• visual definition entry: minimal complete unit of meaning in the visual modality
• Example: “close” (a door)
    • a normal door (rotation on y axis)
    • a sliding door (moving on x axis)
    • a rolling shutter door (a combination of rotation on x axis and moving on y axis)
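The “close” (a door) example shows one word sense mapping to several visual definition entries, selected by the kind of object involved. A minimal sketch of such a lookup, assuming a simple table keyed by verb and object type, might look like this. The table `VISUAL_DEFINITIONS`, the function `visual_definition`, and the `(operation, axis)` encoding are hypothetical; only the three door cases come from the slide.

```python
# (verb, object type) -> list of (operation, axis) steps.
# The three entries mirror the door example; the encoding itself is an assumption.
VISUAL_DEFINITIONS = {
    ("close", "normal door"): [("rotate", "y")],
    ("close", "sliding door"): [("translate", "x")],
    ("close", "rolling shutter door"): [("rotate", "x"), ("translate", "y")],
}


def visual_definition(verb, object_type):
    """Look up the visual definition entry for one word sense applied to one object type."""
    try:
        return VISUAL_DEFINITIONS[(verb, object_type)]
    except KeyError:
        raise KeyError(f"no visual definition for {verb!r} applied to {object_type!r}")
```

Keying on the object type as well as the verb is one simple way to model the many-to-many relation: the same sense of “close” yields a rotation for a normal door and a translation for a sliding door.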

Troponyms & verbs derived from adjectives/nouns
• Troponyms
    • a troponym elaborates the manner of a base verb (Fellbaum 1998)
    • examples: “trot”-“walk” (fast), “gulp”-“eat” (quickly)
    • base verb + adverb: present the base verb and modify the manner (speed, the agent’s state, duration of the activity, iteration, etc.)
• Verbs derived from adjectives or nouns
    • change objects’ properties (size, color, shape) or the world state
    • verbs with affixes such as -en, -ify, or -ize, e.g. “lengthen”
    • using predicates scale(), squash() or changing the corresponding property fields of the object in VRML
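Both ideas on this slide reduce a verb to something simpler: a troponym becomes a base verb plus a manner modifier, and a derived verb such as “lengthen” becomes a change to a property field. A rough sketch under stated assumptions follows. The names `TROPONYMS`, `expand_troponym`, and `lengthen` are hypothetical, `scale` here is a plain Python function standing in for the scale() predicate named on the slide, and the choice of the x axis and factor 1.5 for “lengthen” is invented for illustration.

```python
# troponym -> (base verb, manner modifiers); the two entries follow the slide's examples.
TROPONYMS = {
    "trot": ("walk", {"speed": "fast"}),
    "gulp": ("eat", {"speed": "quick"}),
}


def expand_troponym(verb):
    """Return (base verb, manner); verbs that are not troponyms map to themselves."""
    base, manner = TROPONYMS.get(verb, (verb, {}))
    return base, dict(manner)


def scale(size, axis, factor):
    """Return a copy of an (x, y, z) scale triple with one axis multiplied by factor."""
    index = "xyz".index(axis)
    scaled = list(size)
    scaled[index] *= factor
    return tuple(scaled)


def lengthen(size, factor=1.5):
    """Assumed reading of "lengthen": stretch the object along its x axis."""
    return scale(size, "x", factor)
```

The resulting triple could then be written into the `scale` field of the object's VRML Transform node, which is one way of "changing the corresponding property fields" mentioned above.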

Representing active & passive voice
• active and passive voice
• converse verb pairs such as “give/take”, “buy/sell”, “lend/borrow”
• the same activity seen from different points of view
• use of the VRML Viewpoint node
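One way to read the converse-pair idea is that “John gives Mary a book” and “Mary takes a book from John” describe the same transfer, differing only in whose point of view is adopted. The sketch below normalises both to one event record and keeps the grammatical subject as the viewpoint. Everything here is an assumption made for illustration: the `CONVERSES` table, the `TransferEvent` record, the `normalise` function, and the rule that the viewpoint follows the subject.

```python
from dataclasses import dataclass

# verb -> (canonical action, role played by the grammatical subject)
CONVERSES = {
    "give": ("give", "source"), "take": ("give", "recipient"),
    "sell": ("sell", "source"), "buy": ("sell", "recipient"),
    "lend": ("lend", "source"), "borrow": ("lend", "recipient"),
}


@dataclass(frozen=True)
class TransferEvent:
    action: str
    source: str
    recipient: str
    theme: str
    viewpoint: str  # whose point of view the scene is shown from (assumed: the subject)


def normalise(verb, subject, other, theme):
    """Map either member of a converse pair onto the same underlying event."""
    action, subject_role = CONVERSES[verb]
    if subject_role == "source":
        source, recipient = subject, other
    else:
        source, recipient = other, subject
    return TransferEvent(action, source, recipient, theme, viewpoint=subject)
```

With this encoding the two sentences produce the same action, source, recipient, and theme, so the same animation could be reused, and only the `viewpoint` field (which would drive the choice of VRML Viewpoint node) differs.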

Implementation: semanticsVRML DEF ball Transform { translation 0 0 0 children [ DEF ball-TIMER TimeSensor { loop TRUE cycleInterval 0.5 }, DEF ball-POS-INTERP PositionInterpolator { key [0, 0.5, 1 ] keyValue [0 0 0, 0 20 0, 0 0 0 ] }, Shape { appearance Appearance { material Material {} } geometry Sphere { radius 5 } }] ROUTE ball-TIMER.fraction_changed TO ball-POS-INTERP.set_fraction ROUTE ball-POS-INTERP.value_changed TO ball.set_translation } (c) Output  VRML code of a bouncing ball Example: “A ball is bouncing” bounce(ball):- [moveTo(ball, [0,0,0]), moveTo(ball,[0,20,0])]L. (a) visual definition of “bounce” DEF ball Transform { translation 0 0 0 children [ Shape { appearance Appearance{ material Material{} } geometry Sphere { radius 5 } } ] } (b) VRML code of a static ball

Categories of adjectives
• Visually observable
    • Objects’ attributes/states: dark/light, large/small, big/little, white/black (color adj.), long/short, new/old, high/low, full/empty, open/closed
    • Observable human attributes
        • Feelings: happy/sad, angry, excited, surprised, terrified
        • Others: old/young, beautiful/ugly, strong/weak, poor/rich, fat/thin
    • Relational adj.: nasal (nose), mural (wall), dental (teeth)
• Visually unobservable
    • Perceivable by other modalities: wet/dry, warm/cold, coarse/smooth, hard/soft, heavy/light
    • Unobservable human attributes (virtue): good/evil, kind, mean, ambitious
    • Abstract attributes
        • Others: easy/difficult, real, important, particular, right/wrong, early/late
        • Reference-modifying adj.: possible/impossible, former, past/present, last, other, different/same

Software Analysis
• Java programming language
    • parsing intermediate representation
    • changing VRML code to create/modify animation
    • integrating modules
• Natural language processing tools
    • GATE (pre-processing)
    • PC-PARSE (morphological and syntactic analysis)
    • WordNet (lexicon, semantic inference)
• 3D graphic modelling
    • existing 3D models on the Internet
    • 3D Studio Max (props & stage)
    • VRML (Virtual Reality Modelling Language) 97, H-Anim 2001 spec.
• The actors: using embodied agents
    • Microsoft Agent (the narrator and minor actors)
    • Character Studio, Internet Character Animator (protagonists)

Natural Language Processing
[Figure: natural language processing diagram. Labels: pre-processing; part-of-speech tagger; morphological parser; lexicon & morphological rules; syntactic parser; PC-PARSER; features; semantic inference; WordNet 1.6; coreference resolution; temporal reasoning.]

Contribution & prospective applications
• Contributions
    • multimodal semantic representation of natural language
    • automatic animation generation
    • multimodal fusion and coordination
• Prospective applications
    • Children’s education
    • Multimedia presentation
    • Movie/drama production
    • Script writing
    • Computer games
    • Virtual Reality

Conclusion
• The objectives of CONFUCIUS address challenging problems in language visualisation:
    • formalizing the meaning of action verbs and states
    • mapping language primitives to visual primitives
    • building a reusable ‘common sense’ knowledge base for other systems
    • sophisticated spatial and temporal reasoning
    • representing stories by temporal multimedia, which requires significant coordination
