An Object-oriented Representation for Efficient Reinforcement Learning
Carlos Diuk, Andre Cohen and Michael L. Littman
Rutgers Laboratory for Real-Life Reinforcement Learning (RL³)
Department of Computer Science, Rutgers University (New Jersey, USA)
ICML 2008 – Helsinki, Finland
Motivation
How would YOU play this game? [screenshot of the Pitfall videogame]
What's in a state?
s1 -> a0 -> s5, s5 -> a2 -> s24, s24 -> a1 -> s1
In the classic view, a state is just a simple hash code that tells you whether you've been "there" before. But if we know that our agents interact spatially with objects, let's just tell them so: what we (the agent) can actually "see" is objects, interactions, and spatial relationships.
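To make the contrast concrete, here is a minimal sketch of the two views of state. All names are illustrative, not from the paper; touchE stands in for the touch relations introduced later in the slides.

```python
# 1) Classic "flat" view: a state is an opaque identifier, so all the
#    agent can ever ask is "have I been *there* before?"
def flat_state_id(screen_pixels: bytes) -> int:
    return hash(screen_pixels)  # yields s1, s5, s24, ... with no structure

# 2) Object-oriented view: the state exposes objects and attributes, so
#    spatial relationships become directly observable.
def touch_east(a: dict, b: dict) -> bool:
    # Toy adjacency test standing in for a touchE(., .) relation.
    return a["x"] + a["w"] == b["x"] and a["y"] == b["y"]

man = {"x": 32, "y": 120, "w": 8}
wall = {"x": 40, "y": 120, "w": 8}
print(touch_east(man, wall))  # True: the agent "sees" touchE(Man, Wall)
```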
What we did
• Grab ideas from Relational RL and come up with a representation that:
  • is suitable for a wide-enough range of domains
  • is tractable
  • provides opportunities for generalization
  • enables smart exploration
• Strike a balance between generality and tractability.
OO representation
• Problem defined by a set of objects and their attributes.
• Example: objects in Pitfall defined by a bounding box on a set of pixels, based on color:
  Man.<x,y>  Log.<x,y>  Hole.<x,y>  Ladder.<x,y>  Wall.<x,y>
• State is the union of all objects' attribute values (sketched below).
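A minimal sketch of this state representation, assuming the Pitfall classes from the slide (Man, Log, Hole, Ladder, Wall), each with <x,y> attributes; the helper names are illustrative:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Obj:
    cls: str   # object class: "Man", "Log", "Hole", "Ladder" or "Wall"
    x: int     # attributes extracted from the object's bounding box
    y: int

def oo_state(objects: list) -> tuple:
    # The state is the union of all objects' attribute values; here we
    # realize that union as a sorted tuple so equal states compare equal.
    return tuple(sorted((o.cls, o.x, o.y) for o in objects))

s = oo_state([Obj("Man", 40, 120), Obj("Ladder", 40, 120), Obj("Log", 90, 140)])
```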
OO representation
• For any given state s, there is a function c(s) that tells us which relations occur under s.
• Dynamics defined by preconditions and effects.
• Preconditions are conjunctions of terms:
  • Relations between objects: touchN/S/E/W(object_i, object_j), on(object_i, object_j)
  • Any (boolean) function on the attributes.
  • Any other function encoding prior knowledge.
• Actions have effects that determine how objects' attributes get modified.
• Example: condition on(Man, Ladder), action Up, effect Man.y = Man.y + 8.
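A hedged sketch of the slide's example rule: if on(Man, Ladder) holds and the action is Up, the effect is Man.y = Man.y + 8. The dictionary state and helper names are illustrative, not from the paper.

```python
def on(a: dict, b: dict) -> bool:
    # Toy version of the on(., .) relation: same position.
    return a["x"] == b["x"] and a["y"] == b["y"]

def c(state: dict) -> set:
    # c(s): the set of relations that occur under state s.
    rels = set()
    if on(state["Man"], state["Ladder"]):
        rels.add(("on", "Man", "Ladder"))
    return rels

def step(state: dict, action: str) -> dict:
    # The precondition (a conjunction of terms) gates the effect.
    if action == "Up" and ("on", "Man", "Ladder") in c(state):
        state["Man"]["y"] += 8   # the effect: an attribute update
    return state

s = {"Man": {"x": 4, "y": 0}, "Ladder": {"x": 4, "y": 0}}
step(s, "Up")   # Man.y is now 8
```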
DOORMax
• An algorithm for efficient learning of deterministic OO-MDPs.
• When objects interact and an effect is observed, DOORMax learns the conjunction of terms that enabled the effect (see the sketch after this list).
• Belongs to the R-Max family of algorithms: guides exploration to make objects interact.
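A hedged sketch of the core condition-learning step, assuming deterministic effects: whenever an effect fires for an action, the learner keeps the intersection of the term values seen on every such positive example, converging to the most specific conjunction consistent with all observations. Negated terms, multiple effect types, and failure conditions are handled by the full algorithm but omitted here.

```python
def update_condition(hypothesis, observed_terms):
    """hypothesis: dict mapping term -> bool (current conjunction), or None.
    observed_terms: term -> bool values in a state where the effect fired."""
    if hypothesis is None:
        return dict(observed_terms)   # first positive example: keep all terms
    # Drop every term whose value disagrees across positive examples.
    return {t: v for t, v in hypothesis.items() if observed_terms.get(t) == v}

h = None
h = update_condition(h, {"on(Man,Ladder)": True, "touchE(Man,Wall)": False})
h = update_condition(h, {"on(Man,Ladder)": True, "touchE(Man,Wall)": True})
print(h)  # {'on(Man,Ladder)': True} -- the learned enabling condition
```

Since each disagreement removes at least one term from the hypothesis, a condition over n terms stabilizes after O(n) positive examples, which is where the per-action bound on the next slide comes from.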
DOORMax Analysis
• Let n be the number of terms.
• Assume that:
  • The number of effects per action is bounded by a (small) constant m.
  • Each effect has a unique conjunctive condition.
• As long as effects are observed (that is, some effect occurs given an action a), DOORMax will learn the condition-effect pairs that determine the dynamics of a in O(nm): each condition is a conjunction over the n terms, learnable from O(n) positive examples, and there are at most m effects per action.
• There is a worst-case bound, when lots of no-effects are observed, of O(n^m).
Results
What about this game? [videogame demo screenshot]
Conclusions and future work
• OO-MDPs provide a natural way of modeling an interesting set of domains, while enabling generalization and smart exploration.
• DOORMax learns deterministic OO-MDPs, outperforming state-of-the-art algorithms for factored-state representations.
• DOORMax scales very nicely with respect to the size of the state space, as long as the transition dynamics between objects do not change.
• We do not have a provably efficient algorithm for stochastic OO-MDPs.
• We do not yet handle inheritance between classes of objects.