Trials and Tribulations: Architectural Constraints on Modeling a Visuomotor Task within the Reinforcement Learning Paradigm
Subject of Investigation • How humans integrate visual object properties into their action policy when learning a novel visuomotor task. • BubblePop! • Problem: Too many possible questions… • Solution: Motivate behavioral research by looking at modeling difficulties. • Non-obvious crossroads in model design.
Approach • Since the task provides only a scalar performance signal, the model must use reinforcement learning. • Temporal-difference learning with backpropagation (TD-backprop; sketch below). • Start with an extremely simplified version of the task and add the complexity back once a successful model exists. • Analyze the representational and architectural constraints necessary for each model.
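The slide names temporal-difference learning with backpropagation; below is a minimal sketch of that combination, assuming a one-hidden-layer value network trained with the semi-gradient TD(0) rule. The class and function names (ValueNet, td_step) and all hyperparameters are illustrative, not the authors' implementation.

```python
# Sketch: TD(0) with a small backprop-trained value network (names and values are assumptions).
import numpy as np

class ValueNet:
    """One-hidden-layer network mapping a state vector to an expected-reward estimate."""
    def __init__(self, n_in, n_hidden=8, lr=0.05, rng=None):
        rng = rng or np.random.default_rng(0)
        self.W1 = rng.normal(0, 0.1, (n_hidden, n_in))
        self.b1 = np.zeros(n_hidden)
        self.W2 = rng.normal(0, 0.1, n_hidden)
        self.b2 = 0.0
        self.lr = lr

    def forward(self, x):
        self.x = x
        self.h = np.tanh(self.W1 @ x + self.b1)
        return self.W2 @ self.h + self.b2

    def backward(self, delta):
        # Backpropagate the scalar TD error: w += lr * delta * dV/dw (semi-gradient update).
        self.W2 += self.lr * delta * self.h
        self.b2 += self.lr * delta
        dh = delta * self.W2 * (1 - self.h ** 2)
        self.W1 += self.lr * np.outer(dh, self.x)
        self.b1 += self.lr * dh

def td_step(net, s, r, s_next, gamma=0.9, terminal=False):
    """One TD(0) update: delta = r + gamma * V(s') - V(s)."""
    v = net.forward(s)
    v_next = 0.0 if terminal else net.forward(s_next)
    delta = r + gamma * v_next - v
    net.forward(s)        # restore the cached activations for state s before backprop
    net.backward(delta)
    return delta
```

The scalar TD error plays the role of the scalar performance signal: it is the only quantity backpropagated through the network.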
First Steps: Dummy World • 5×5 grid world. • 4 possible actions: up, down, left, right. • 1 stationary target. • Starting locations of target and agent are randomly assigned. • Fixed reward upon reaching the target, at which point a new target is generated. • Epoch ends after a fixed number of steps.
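For concreteness, here is a minimal environment sketch matching the rules on this slide; the class name, reward value, and step limit are assumptions.

```python
# Sketch: the simplified "Dummy World" as described on the slide (illustrative names/values).
import numpy as np

class DummyWorld:
    ACTIONS = {0: (-1, 0), 1: (1, 0), 2: (0, -1), 3: (0, 1)}  # up, down, left, right

    def __init__(self, size=5, max_steps=50, reward=1.0, rng=None):
        self.size, self.max_steps, self.reward = size, max_steps, reward
        self.rng = rng or np.random.default_rng()
        self.reset()

    def reset(self):
        # Starting locations of agent and target are randomly assigned.
        self.agent = tuple(self.rng.integers(0, self.size, 2))
        self._new_target()
        self.steps = 0
        return self.agent, self.target

    def _new_target(self):
        self.target = tuple(self.rng.integers(0, self.size, 2))

    def step(self, action):
        dr, dc = self.ACTIONS[action]
        r = min(max(self.agent[0] + dr, 0), self.size - 1)
        c = min(max(self.agent[1] + dc, 0), self.size - 1)
        self.agent = (r, c)
        self.steps += 1
        rew = 0.0
        if self.agent == self.target:        # fixed reward, then a fresh target appears
            rew = self.reward
            self._new_target()
        done = self.steps >= self.max_steps  # epoch ends after a fixed number of steps
        return (self.agent, self.target), rew, done
```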
Dummy World Architectures • Network (figure): 1 expected-reward output unit, an 8-unit hidden layer, 25 input units for the grid, 4 action units, and context units (egocentric version only). • Grid input: the whole grid (allocentric) or agent-centered (egocentric).
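One plausible reading of the two input encodings is sketched below; the exact unit counts and coding scheme in the original figure may differ, so treat this as illustrative only.

```python
# Sketch: allocentric vs. egocentric input encodings (an assumed reading of the slide).
import numpy as np

def allocentric(agent, target, size=5):
    """Whole-grid view: one map over absolute grid cells, marking agent (-1) and target (+1)."""
    grid = np.zeros((size, size))
    grid[agent] = -1.0
    grid[target] = 1.0
    return grid.ravel()                              # 25 inputs for a 5x5 world

def egocentric(agent, target, size=5):
    """Agent-centered view: mark the target at its position relative to the agent.
    Covering every possible offset needs a (2*size - 1)**2 window, which is part of
    the representational cost of going egocentric."""
    window = np.zeros((2 * size - 1, 2 * size - 1))
    window[target[0] - agent[0] + size - 1,
           target[1] - agent[1] + size - 1] = 1.0
    return window.ravel()
```

Either encoding, concatenated with a one-hot code for the candidate action, could feed a network like the ValueNet sketch above (8 hidden units, 1 expected-reward output); the "context (ego only)" units presumably carry memory that the agent-centered view otherwise lacks.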
Building in symmetry • Current architectures learn each action independently. • But 'Up' is like 'Down', only mirrored: each action simply shifts the world in a different direction. • Exploit the symmetry: 1 action, 4 different (rotated) inputs. • "In which rotation of the world would you rather go 'up'?"
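One way to realize "1 action, 4 different inputs" is to rotate the agent-centered map so that a single shared 'up' network scores every candidate action. A hedged sketch, with the rotation-to-action mapping left as a bookkeeping convention:

```python
# Sketch: exploiting rotational symmetry -- evaluate every action as "up" by rotating
# the (square) egocentric input map. Names are illustrative.
import numpy as np

def rotated_inputs(ego_map_2d):
    """Return the four 90-degree rotations of the agent-centered map, one per action.
    Which rotation corresponds to which action is a convention to fix once."""
    return [np.rot90(ego_map_2d, k) for k in range(4)]

def choose_action(net, ego_map_2d, temperature=1.0, rng=None):
    """Score one shared 'up' network on each rotation; softmax over the four scores."""
    rng = rng or np.random.default_rng()
    scores = np.array([net.forward(rot.ravel()) for rot in rotated_inputs(ego_map_2d)])
    probs = np.exp((scores - scores.max()) / temperature)
    probs /= probs.sum()
    return rng.choice(4, p=probs)
```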
World scaling • Scaled the grid up to 10×10. • Not as unrealistic as one might think… (cf. tile coding). • Scaled the number of targets: performance differs from 1 to 2 targets, but not from 2 to many. • Confirmed the 'winning-est' (best-performing) representation. • Added memory.
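The tile-coding aside can be unpacked with a small sketch: coarse-coding a continuous 2D position onto a few overlapping 10×10 grids yields exactly the kind of grid-of-units input used here, so a 10×10 unit layer is not an unrealistic state representation. Tiling count and offsets below are arbitrary choices.

```python
# Sketch: simple 2D tile coding -- a continuous position activates one tile per tiling.
def tile_code(x, y, n_tilings=4, tiles_per_dim=10, lo=0.0, hi=1.0):
    """Return active-tile indices for a continuous (x, y) in [lo, hi)^2 (wrapping offsets)."""
    features = []
    width = (hi - lo) / tiles_per_dim
    for t in range(n_tilings):
        offset = t * width / n_tilings                     # each tiling is slightly shifted
        ix = int((x - lo + offset) / width) % tiles_per_dim
        iy = int((y - lo + offset) / width) % tiles_per_dim
        features.append(t * tiles_per_dim ** 2 + ix * tiles_per_dim + iy)
    return features   # a few active units out of n_tilings * tiles_per_dim**2
```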
No low-hanging fruit: The ripeness problem • Added a 'ripeness' dimension to the target and changed the reward function: if target.ripeness > 0.60, reward = 1; else reward = -2/3. • How the problem occurs: • At a high temperature the agent moves (and pops) randomly. • Random pops net zero reward on average (e.g., with uniform ripeness: 0.4 × 1 + 0.6 × (−2/3) = 0). • The temperature then lowers and the agent learns to ignore the target entirely.
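A quick check of the zero-expectation trap, assuming ripeness is uniform on [0, 1] (that distribution is an assumption; the reward values are from the slide):

```python
# Sketch: the ripeness reward function and why random popping earns ~zero on average.
import numpy as np

def pop_reward(ripeness):
    return 1.0 if ripeness > 0.60 else -2.0 / 3.0

rng = np.random.default_rng(0)
samples = [pop_reward(r) for r in rng.uniform(0, 1, 100_000)]
print(np.mean(samples))   # ~0: 0.4 * 1 + 0.6 * (-2/3) = 0
```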
A psychologically plausible solution • There is no feedback for 'almost ripe' pops, so how could we anneal a ripeness criterion directly? • Instead, anneal how much the agent cares about unripe pops. • That is, differentiate the internal and external reward functions (sketch below).
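A minimal sketch of the proposed separation, assuming a linear annealing schedule; the schedule, epoch count, and function names are illustrative, not the authors' implementation.

```python
# Sketch: anneal the internal reward while the external (task) reward stays fixed.
def external_reward(ripeness):
    """Task reward from the slide: +1 for ripe pops, -2/3 otherwise."""
    return 1.0 if ripeness > 0.60 else -2.0 / 3.0

def internal_reward(ripeness, epoch, anneal_epochs=500):
    """Early in learning, barely penalize unripe pops; gradually care more about them."""
    care = min(1.0, epoch / anneal_epochs)    # 0 -> 1 over training (assumed schedule)
    r = external_reward(ripeness)
    return r if r > 0 else care * r           # scale only the punishment term
```

The model would be trained on internal_reward while task performance is still scored by external_reward; early in learning, unripe pops cost little, so the agent keeps interacting with targets long enough to discover the ripeness contingency.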
Future directions • Investigate how the type of ripeness difficulty impacts computational demands. • Difficulty due to reward schedule vs. perceptual acuity vs. redundancy vs. conjunctiveness vs. ease of prediction. • How to handle the 'Feature Binding Problem' in this context. • Emergent binding through deep learning? • Just keep increasing complexity and see what problems crop up. • If the model reaches human-level performance without a hitch, that would be pretty good too.
Summary & Discussion • Egocentric representations pay off in this domain, even with the added memory cost. • In any domain with a single agent? • Symmetries in the action space can be exploited to greatly expedite learning. • Could there be a general mechanism for detecting such symmetries? • Difficult reward functions might be learned by annealing internal reward signals. • How could such annealing emerge from the model itself?