Explore the use of speech, language, and vision in instructional dialogs for robots. Acquire terms not knowable a priori and learn objects and actions through innate perception and action abilities.
Far Reaching Research (FRR) Project
See, Hear, Do: Language and Robots
Jonathan Connell, Exploratory Computer Vision Group
Etienne Marcheret, Speech Algorithms & Engines Group
Sharath Pankanti (ECVG)
Josef Vopicka (Speech)
Challenge = Multi-modal instructional dialogs
• Use speech, language, and vision to learn objects & actions
• Innate perception abilities (objects / properties)
• Innate action capabilities (navigation / grasping)
• Easily acquire terms not knowable a priori
Example dialog (illustrates command following, verb learning, noun learning, and advice taking):
  "Round up my mug."
  "I don't know how to 'round up' your mug."
  "Walk around the house and look for it. When you find it bring it back to me."
  "I don't know what your 'mug' looks like."
  "It is like this <shows another mug> but sort of orange-ish."
  "OK … I could not find your mug."
  "Try looking on the table in the living room."
  "OK … Here it is!"
Language Learning & Understanding is an AAAI Grand Challenge
http://www.aaai.org/aitopics/pmwiki/pmwiki.php/AITopics/GrandChallenges#language
Eldercare as an application
• Example tasks:
  - Pick up dropped phone
  - Get blanket from another room
  - Bring me the book I was reading yesterday
• Large potential market
  - Many affluent societies have a demographic imbalance (Japan, EU, US)
  - Institutional care can be very expensive (to person, insurance, state)
• A little help can go a long way
  - Can be supplied immediately (no waiting list for admission)
  - Allows person to stay at home longer (generally easier & less expensive)
  - Boosts independence and feeling of control (psychological advantage)
• Note: We are not attempting to address the whole problem
  X Aggressive production cost containment
  X Robust self-recharging and stair traversal
  X Bathing and bathroom care, patient transfer, cooking
  X OSHA, ADA, FDA, FCC, UL or CE certification
Novel approach: Linguistically-guided robots
Use language as the core of the operating system, not something tacked on after the fact
• Interface
  - Much easier than programming (textual or graphical)
  - More natural for unskilled users
  - Less effort for "one-off" activities
• Interaction
  - Simple progress / error reporting ("I am entering the kitchen")
  - Easy to request missing information ("Please tell me where X is located.")
  - Clarification dialogs possible ("Which box did you want, red or blue?")
• Learning
  - Can direct attention to specific objects or areas (e.g. "this object")
  - Can focus learning on relevant properties (e.g. color, location)
  - Less trial and error since richer feedback (i.e. faster acquisition)
ELI the robot
• Power supply
  - 528 WH sealed lead-acid batteries
  - 28 lbs for balancing counterweight
  - Estimated 4-5 hr run-time
• Drive system (wheel-speed math sketched below)
  - Two wheel differential steer
  - Two 4 in rear casters (blue)
  - 47 in/sec (2.7 mph) top speed
  - Handles 10 deg slope, ½ in bumps
• Motorized lift
  - For arm & sensors (offset 27 in up)
  - Floor to 36 in (counter) range
  - 16 in/sec = 2.3 sec bottom to top
• Computation
  - Platform for quad-core GPU laptop
  - Single USB cable for interface
• Overall
  - About 65 lbs total weight
  - Stable +/- 10 degs in any direction
  - 15 in wide, 24 in long, 45-66 in tall
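A minimal sketch of the differential-steer drive math implied by the spec, assuming a hypothetical 15 in wheel track (borrowed from the quoted body width); the actual motor controller interface is not described on the slide.

```python
# Differential-drive sketch: convert a desired forward speed and turn rate
# into left/right wheel speeds, clamped to the robot's 47 in/sec maximum.
# The 15 in wheel track is an assumption, not a measured value.

TOP_SPEED = 47.0      # in/sec (2.7 mph), from the spec slide
WHEEL_TRACK = 15.0    # in, assumed equal to the robot's width

def wheel_speeds(v, omega):
    """v = forward speed (in/sec), omega = turn rate (rad/sec, CCW positive)."""
    left = v - omega * WHEEL_TRACK / 2.0
    right = v + omega * WHEEL_TRACK / 2.0
    # Scale both wheels together if either exceeds the top speed,
    # so the commanded arc shape is preserved.
    peak = max(abs(left), abs(right))
    if peak > TOP_SPEED:
        scale = TOP_SPEED / peak
        left, right = left * scale, right * scale
    return left, right

if __name__ == "__main__":
    print(wheel_speeds(40.0, 1.0))   # gentle left arc near top speed
```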
Joystick control video: Picking up a dropped object (eli_kitchen.wmv)
Speech interaction video: Far-field speech interpretation (eli_voice.wmv)
Detached Arm for dialog development
(Photo labels: camera, arm, OTC medications (Advil & Gaviscon))
• Hardware
  - Single color camera 25 in above surface
  - Arm = 3 positional DOF, Wrist = 3 angular DOF
  - Gripper augmented with compliant closure
  - Workspace = 2 ft wide, 1 ft deep, +8/-2 in high
• Software (inverse kinematics sketched below)
  - Serial control code optimized
  - Joint control via manual gamepad
  - Inverse kinematic solver
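A minimal sketch of the kind of inverse-kinematic solve such an arm needs, reduced to a two-link planar case; the link lengths and joint layout here are illustrative assumptions, not the actual ELI arm geometry.

```python
import math

# Two-link planar IK via the law of cosines: given a target (x, y) in the
# arm's plane, return shoulder and elbow angles. L1 and L2 are assumed
# link lengths (inches), not the real arm dimensions.

L1, L2 = 12.0, 10.0

def solve_ik(x, y, elbow_up=True):
    d2 = x * x + y * y
    d = math.sqrt(d2)
    if d > L1 + L2 or d < abs(L1 - L2):
        return None                      # target out of reach
    cos_elbow = (d2 - L1 * L1 - L2 * L2) / (2.0 * L1 * L2)
    elbow = math.acos(max(-1.0, min(1.0, cos_elbow)))
    if elbow_up:
        elbow = -elbow
    shoulder = math.atan2(y, x) - math.atan2(L2 * math.sin(elbow),
                                             L1 + L2 * math.cos(elbow))
    return shoulder, elbow

if __name__ == "__main__":
    print(solve_ik(15.0, 5.0))           # joint angles in radians, or None
```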
Speech manipulation video: Selecting and disambiguating objects (eli_table.wmv)
Dialog phenomena handled
• "Grab it." (1 object)
  <grabs object> - no confusion since only 1 choice for "it"
• "Grab it." (4 objects)
  "I'm confused. Which of the 4 things do you mean?" - knows a unique target is required
• "What color is the object on the left?" (4 objects)
  "It's blue." - understands positions & colors
• "Grab it." (4 objects)
  <grabs blue object> - uses "it" from previous interaction
• "Grab that object." (human points)
  <grabs object> - understands human gesture
• "Grab the white thing." (2 white objects)
  "Do you mean this one?" <robot points> - uses gesture to suggest alternative
• "No, the other one."
  <grabs other object> - uses "other" from previous interaction
• "Grab the green thing."
  "Sorry, that's too big for me." - sensitive to physical constraints
A sketch of this kind of reference resolution follows below.
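A minimal sketch of resolving referring expressions ("it", "the other one", a color word) against the currently visible objects; the object fields and the dialog-state variable are assumptions for illustration, not the project's actual data structures.

```python
# Resolve a referring expression against the visible objects, keeping the
# last successful choice as dialog state so "it" and "the other one" work.
# The dict fields (color, position, size) are assumed for illustration.

scene = [
    {"id": 1, "color": "blue",  "position": "left",   "size": 2.0},
    {"id": 2, "color": "white", "position": "middle", "size": 1.5},
    {"id": 3, "color": "white", "position": "right",  "size": 1.6},
]

last_referent = None   # set by the previous successful reference

def resolve(expr):
    global last_referent
    if expr == "it":
        candidates = [last_referent] if last_referent else list(scene)
    elif expr == "the other one":
        candidates = [o for o in scene if o is not last_referent]
    else:
        # treat any other expression as a color word, e.g. "white"
        candidates = [o for o in scene if o["color"] == expr]
    if len(candidates) == 1:
        last_referent = candidates[0]
        return candidates[0]
    if not candidates:
        return "I don't see anything like that."
    return f"I'm confused. Which of the {len(candidates)} things do you mean?"

print(resolve("white"))    # ambiguous: two white objects
print(resolve("blue"))     # unique: becomes the new "it"
print(resolve("it"))       # resolves to the blue object
```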
Noun Learning Scenario
• Features:
  - Automatically finds objects
  - Selects by position, size, color
  - Understands user pointing
  - Robot points for emphasis
  - Grabs selected object
  - Passes object to/from user
  - Adds new nouns to grammar
  - Builds visual models
  - Identifies objects from models
Video: eli_noun_sub.wmv
Multi-modal Dialog Script
• "Eli, what is the object on the left?"
  - No existing visual model matches object
  - "I don't know."
• "Eli, that is aspirin."
  - New word added to grammar
  - New visual model for object
  - "Okay. This <points> is aspirin."
• "Eli, this object <points> is Advil."
  - Word already known
  - New visual model for object
  - "Okay. That is Advil."
  - Model = size + shape + colors
• "Eli, how many Advil do you see?"
  - Uses existing visual model to find item(s)
  - "I see two."
  - Matching = nearest neighbor with weighted distance (sketched below):
    dist = Σ w[i] * | v[i] - m[i] |
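A minimal sketch of the nearest-neighbor matching the slide describes, using the quoted weighted absolute-difference distance dist = Σ w[i] * |v[i] - m[i]|; the particular feature layout (size, shape, color entries), the weights, and the threshold are assumptions for illustration.

```python
# Nearest-neighbor matching of an observed feature vector v against stored
# visual models m, using the slide's weighted L1 distance:
#   dist = sum_i  w[i] * |v[i] - m[i]|
# The feature layout, weights, and threshold are assumed for illustration.

WEIGHTS = [1.0, 2.0, 0.5, 0.5, 0.5]          # e.g. size, shape, three color bins

models = {
    "aspirin": [0.30, 0.80, 0.9, 0.9, 0.9],  # small whitish bottle
    "Advil":   [0.35, 0.75, 0.2, 0.3, 0.8],  # small bluish bottle
}

def weighted_dist(v, m, w=WEIGHTS):
    return sum(wi * abs(vi - mi) for wi, vi, mi in zip(w, v, m))

def identify(v, threshold=1.0):
    """Return the best-matching model name, or None if nothing is close enough."""
    name, dist = min(((n, weighted_dist(v, m)) for n, m in models.items()),
                     key=lambda pair: pair[1])
    return name if dist < threshold else None

print(identify([0.33, 0.77, 0.25, 0.3, 0.75]))   # -> 'Advil'
print(identify([0.90, 0.10, 0.10, 0.1, 0.10]))   # -> None ("I don't know.")
```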
Multi-modal Dialog Script (continued)
• "Eli, give me the Tylenol."
  - Uses existing visual model to find item(s)
  - <gets bottle> "Here you go."
  - Waits for user hand motion
  - <releases>
  - Waits for user hand motion
  - <regrabs bottle> "Thanks."
  - <replaces bottle>
• "Eli, where is the aspirin?"
  - Uses existing visual model to find item(s)
  - "Here." <points>
Collaboration with Tokyo Research Lab
(Architecture diagram: Eli Robot at Watson ↔ Brainy Response System at Tokyo; blocks include Vision, Objects, Archive, Visual models, ASR, Parser, Vocabulary, Network, Reasoning, Semantic memory, Lifelog, Action models, Talk, Kinematics, Sequencer, Retrieve; links labeled "context update" and "vetoes, recommendation")
• Principal researchers:
  - Michiharu Kudoh
  - Risa Nishiyama
• "BRAINS" project goal:
  - Make the robot respond appropriately, as if it understands social rules
Combined Demo
• Features:
  - Learns object names
  - Learns object appearances
  - Grabs and passes objects
  - Vetoes actions based on DB
  - Picks alternates using ontology
  - Checks for valid dose interval
  - Real-time cloud connection
Video: eli_bottles_sub.wmv
Combined Demo Script
(Diagram labels: "Alice", aspirin, NO, DB)
• "Eli, this <points> object is aspirin."
  - New word added to grammar
  - New visual model for object
  - "Okay. That is aspirin."
• "Eli, the object on the right is called Tums."
  - Word already known
  - New visual model for object
  - "Okay. This <points> is Tums."
• "Eli, give me some aspirin."
  - Uses existing visual model to find item(s)
  - Check against personal database
  - "But that will hurt your stomach."
Combined Demo Script (continued)
(Diagram labels: ontology "antacid" with Tums (present) and Rolaids (requested); lifelog history: 7:14 AM xxxxx, 8:39 AM zzzzz, 9:01 AM took Tylenol)
• "Eli, give me some Tylenol instead."
  - Uses existing visual model to find item(s)
  - <gets bottle> "Here you go."
  - Waits for user hand motion
  - <releases>
  - Waits for user hand motion
  - <regrabs bottle> "Thanks."
  - <replaces bottle>
  - Records dose in lifelog
• "Eli, give me some Rolaids."
  - No visual model for item
  - "I don't know what Rolaids looks like."
  - Ontology used to find available alternative(s)
  - "Do you want another antacid, Tums?"
• "Eli, just give me some Tylenol."
  - Uses existing visual model to find item(s)
  - Lifelog consulted for last dose
  - "You just had Tylenol."
The lifelog and ontology checks are sketched below.
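A minimal sketch of the lifelog dose check and ontology-based substitution shown in the script; the time format, the minimum dose interval, and the tiny ontology are assumptions for illustration, not the project's actual database.

```python
from datetime import datetime, timedelta

# Lifelog dose check and ontology lookup, sketched with assumed data.
# MIN_INTERVAL and the ontology entries are illustrative, not from the project.

MIN_INTERVAL = timedelta(hours=4)
lifelog = [(datetime(2012, 5, 1, 9, 1), "took Tylenol")]      # the 9:01 AM entry
ontology = {"Tums": "antacid", "Rolaids": "antacid", "Tylenol": "analgesic"}
visible = {"Tums", "Tylenol", "aspirin"}                       # items with visual models

def request(item, now):
    if item not in visible:
        # No visual model: offer an available item from the same ontology class.
        alts = [o for o in visible if ontology.get(o) == ontology.get(item)]
        if alts:
            return (f"I don't know what {item} looks like. "
                    f"Do you want another {ontology[item]}, {alts[0]}?")
        return f"I don't know what {item} looks like."
    last = [t for t, entry in lifelog if item in entry]
    if last and now - max(last) < MIN_INTERVAL:
        return f"You just had {item}."
    lifelog.append((now, f"took {item}"))
    return "Here you go."

now = datetime(2012, 5, 1, 9, 30)
print(request("Rolaids", now))   # suggests Tums via the antacid class
print(request("Tylenol", now))   # refused: last dose only 29 minutes ago
```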
Verb Learning Scenario
• Features:
  - Handles relative motion commands
  - Responds to incremental positioning
  - Learns action sequences
  - Applies new actions to other objects
Video: eli_verb_sub.wmv
Verb Learning Script
(Diagram: "poke" sequence being built from point 1.0, out 1.0, out -1.0)
• "Eli, poke the thing in the middle."
  - Resolves visual target based on position
  - No existing action sequence to link
  - New action sequence opened for input
  - "I don't know how to poke something."
• "Eli, point at it."
  - Resolves pronoun from previous selection
  - Moves relative to visual target
  - <points>
• "Eli, extend your hand."
  - Low level incremental move
  - <advances>
• "Eli, retract your hand."
  - Low level incremental move
  - <retreats>
Verb Learning Script (continued)
(Diagram: "poke" = point 1.0, out 1.0, out -1.0, stored in DB)
• "Eli, that is how you poke something."
  - Recognizes closing of action block
  - Links action sequence to word
  - "Okay. Now I know how to poke something."
• "Eli, poke the object on the left."
  - Resolves visual target based on position
  - Retrieves action sequence for verb and executes
  - <pokes>
• "Eli, poke the red object."
  - Resolves visual target based on color
  - Retrieves action sequence for verb and executes
  - <pokes>
• "Eli, poke the Tylenol."
  - Resolves visual target based on known object model
  - Retrieves action sequence for verb and executes
  - <pokes>
A sketch of recording and replaying such a sequence follows below.
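A minimal sketch of verb learning as it appears in the script: primitives executed while a teaching block is open are recorded, then the finished list is bound to the new verb and replayed on other targets. The primitive names ("point", "out") follow the slide diagram; everything else (the functions and the execute stub) is assumed for illustration.

```python
# Record primitive moves while a new verb is being taught, then bind the
# finished sequence to the word and replay it on demand. Primitive names
# mirror the slide diagram; the rest is an illustrative sketch.

verbs = {}          # learned verb -> list of (primitive, amount)
recording = None    # (verb, steps) while a teaching block is open

def execute(primitive, amount, target):
    print(f"  {primitive} {amount:+.1f} relative to {target}")

def do_primitive(primitive, amount, target):
    execute(primitive, amount, target)
    if recording is not None:
        recording[1].append((primitive, amount))

def do_verb(verb, target):
    global recording
    if verb not in verbs:
        print(f"I don't know how to {verb} something.")
        recording = (verb, [])              # open a block for teaching
        return
    for primitive, amount in verbs[verb]:   # replay the learned sequence
        execute(primitive, amount, target)

def close_block():
    global recording
    verb, steps = recording
    verbs[verb] = steps
    recording = None
    print(f"Okay. Now I know how to {verb} something.")

do_verb("poke", "the thing in the middle")   # unknown: starts recording
do_primitive("point", 1.0, "it")
do_primitive("out", 1.0, "it")
do_primitive("out", -1.0, "it")
close_block()                                # "poke" is now defined
do_verb("poke", "the red object")            # replays point / out / out
```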
Project Milestones
• Year 1: Establishing the Language Framework (2011)
  Table-top environment with off-the-shelf arm / cameras / mics
  - Visual detection & identification of objects
  - Visual servoing of arm to grasp objects
  - Speech-based naming of objects
  - Speech-based learning of motion routines
• Year 2: Extension to Application Scenario (2012)
  Port to mobile platform with on-board power & processing
  - Vision-based obstacle avoidance
  - Visual grounding for rooms / doors / furniture
  - Speech adaptation for different users & rooms
  - Speech-based place naming & fetch routines
Overcoming obstacles to widespread robotics
• Perception: Robots do not conceptualize the world as people do (e.g. what is an object?)
  - Focus on nouns using partial scene segmentation (sketched below)
  - Separate using depth boundaries and homogeneous regions
  - Recognize with interest points and bulk properties
• Programming: Hard to tell robots what to do short of C++ programming
  - Use speech and (constrained) natural language
  - Learn word associations to objects and places
  - Simply remember spatial paths and action procedures
• Cost: Robots are too expensive for generic activities or personal use
  - Substitute sensing and computation for precise mechanicals
  - Use cameras only, not (low volume) special-purpose sensors
  - Use graphics processors (GPU) instead of CPU when possible
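A minimal sketch of partial scene segmentation by depth boundaries, in the spirit of the perception bullets above; the synthetic depth image, the gradient threshold, and the region-size filter are assumptions for illustration.

```python
import numpy as np
from scipy import ndimage

# Segment a depth image into candidate objects: mark pixels where depth
# changes sharply as boundaries, then label the remaining smooth regions.
# The synthetic scene, threshold, and size filter are illustrative choices.

depth = np.full((60, 80), 100.0)        # flat background at 100 cm
depth[20:40, 10:30] = 70.0              # one box-shaped object
depth[25:45, 50:70] = 80.0              # a second object

gy, gx = np.gradient(depth)
boundary = np.hypot(gx, gy) > 5.0       # large depth jump = object edge

labels, count = ndimage.label(~boundary)
sizes = ndimage.sum(np.ones_like(depth), labels, range(1, count + 1))
objects = [i + 1 for i, s in enumerate(sizes) if 50 < s < depth.size * 0.5]
print(f"{len(objects)} candidate objects found")   # background filtered out by size
```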