Explore the complexities of decision-making in a vast world of possibilities. Understand Markov Decision Processes and Reinforcement Learning to create optimal policies for intelligent agents. Discover the power of hierarchical structures and abstract actions in enhancing the learning process. Dive into the world of Partial Programs to harness the potential for building intelligent systems efficiently.
Life
• 100 years × 365 days × 24 hrs × 3600 seconds × 640 muscles × 10/second ≈ 20 trillion actions
• (Not to mention choosing brain activities!)
• The world has a very large, partially observable state space and an unknown model
• So, being intelligent is “provably” hard
• How on earth do we manage?
Hierarchical structure!! (among other things)
• Deeply nested behaviors give modularity
  • E.g., tongue controls for [t] are independent of everything else given the decision to say “tongue”
• High-level decisions give scale
  • E.g., “go to IJCAI 13” ≈ 3,000,000,000 actions
  • Look further ahead, reduce computation exponentially
[NIPS 97, ICML 99, NIPS 00, AAAI 02, ICML 03, IJCAI 05, ICAPS 07, ICAPS 08, ICAPS 10, IJCAI 11]
Reminder: MDPs
• Markov decision process (MDP) defined by:
  • Set of states S
  • Set of actions A
  • Transition model P(s' | s, a), Markovian, i.e., it depends only on s_t and not on the history
  • Reward function R(s, a, s')
  • Discount factor γ
• The optimal policy π* maximizes the value function V^π = E[ Σ_t γ^t R(s_t, π(s_t), s_{t+1}) ]
• Q(s, a) = Σ_{s'} P(s' | s, a) [ R(s, a, s') + γ V(s') ]
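For concreteness, here is a minimal tabular sketch of these definitions in Python; the class name and the dictionary-based representation of P and R are illustrative choices, not anything from the talk.

```python
class MDP:
    """Tabular MDP: states S, actions A, transition model P, rewards R, discount gamma."""
    def __init__(self, states, actions, P, R, gamma):
        self.states = states    # iterable of states S
        self.actions = actions  # iterable of actions A
        self.P = P              # P[(s, a)] is a dict {s2: probability}
        self.R = R              # R(s, a, s2) -> immediate reward
        self.gamma = gamma      # discount factor

    def q_from_v(self, V, s, a):
        """Q(s,a) = sum_{s'} P(s'|s,a) * (R(s,a,s') + gamma * V(s'))."""
        return sum(p * (self.R(s, a, s2) + self.gamma * V[s2])
                   for s2, p in self.P[(s, a)].items())
```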
Reminder: policy invariance
• MDP M with reward function R(s, a, s')
• For any potential function Φ(s), any policy that is optimal for M is also optimal for any version M' with reward function R'(s, a, s') = R(s, a, s') + γΦ(s') - Φ(s)
• So we can provide hints to speed up RL by shaping the reward function
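A potential-based shaping term is easy to add around any reward function; the sketch below just wraps R as in the formula above (the function names are illustrative).

```python
def shaped_reward(R, phi, gamma):
    """Return R'(s,a,s') = R(s,a,s') + gamma*phi(s') - phi(s) for a potential function phi.
    Optimal policies are preserved, so phi can encode hints that speed up RL."""
    def R_prime(s, a, s2):
        return R(s, a, s2) + gamma * phi(s2) - phi(s)
    return R_prime
```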
Reminder: Bellman equation
• Bellman equation for a fixed policy π:
  V^π(s) = Σ_{s'} P(s' | s, π(s)) [ R(s, π(s), s') + γ V^π(s') ]
• Bellman equation for the optimal value function:
  V(s) = max_a Σ_{s'} P(s' | s, a) [ R(s, a, s') + γ V(s') ]
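As a worked example of the optimal Bellman equation, a few lines of value iteration suffice; this assumes the illustrative MDP class sketched earlier.

```python
def value_iteration(mdp, tol=1e-6):
    """Iterate V(s) <- max_a sum_{s'} P(s'|s,a)(R(s,a,s') + gamma*V(s')) to convergence."""
    V = {s: 0.0 for s in mdp.states}
    while True:
        delta = 0.0
        for s in mdp.states:
            best = max(mdp.q_from_v(V, s, a) for a in mdp.actions)
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < tol:
            return V
```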
Reminder: Reinforcement learning
• Agent observes transitions and rewards: doing a in s leads to s' with reward r
• Temporal-difference (TD) learning:
  V(s) ← (1 - α) V(s) + α [ r + γ V(s') ]
• Q-learning:
  Q(s, a) ← (1 - α) Q(s, a) + α [ r + γ max_{a'} Q(s', a') ]
• SARSA (after observing the next action a'):
  Q(s, a) ← (1 - α) Q(s, a) + α [ r + γ Q(s', a') ]
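The three update rules above translate directly into tabular code; the sketch below assumes dictionary-valued V and Q tables and is purely illustrative.

```python
def td_update(V, s, r, s2, alpha, gamma):
    """TD(0): V(s) <- (1-alpha)V(s) + alpha[r + gamma V(s')]."""
    V[s] = (1 - alpha) * V[s] + alpha * (r + gamma * V[s2])

def q_learning_update(Q, s, a, r, s2, actions, alpha, gamma):
    """Q-learning: bootstrap from the greedy action in s'."""
    target = r + gamma * max(Q[(s2, a2)] for a2 in actions)
    Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * target

def sarsa_update(Q, s, a, r, s2, a2, alpha, gamma):
    """SARSA: bootstrap from the action a' actually taken in s'."""
    Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * (r + gamma * Q[(s2, a2)])
```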
Reminder: RL issues
• Credit assignment problem: rewards may arrive many steps after the decisions that earned them
• Generalization problem: far too many states to learn a value for each one separately
Abstract states
• First try: define an abstract MDP on abstract states -- “at home,” “at SFO,” “at IJCAI,” etc.
• Oops – lost the Markov property! The action outcome depends on the low-level state, which depends on the history
Abstract actions
• Forestier & Varaiya, 1978: supervisory control
  • An abstract action is a fixed policy for a region of the state space
  • “Choice states” are where the policy exits its region
• The decision process on choice states is semi-Markov
Semi-Markov decision processes
• Add a random duration N for action executions: P(s', N | s, a)
• Modified Bellman equation:
  V^π(s) = Σ_{s',N} P(s', N | s, π(s)) [ R(s, π(s), s', N) + γ^N V^π(s') ]
• Modified Q-learning with abstract actions:
  • For transitions within an abstract action π_a: r_c ← r_c + γ_c R(s, π_a(s), s') ; γ_c ← γ_c γ
  • For transitions that terminate an abstract action begun at s_0:
    Q(s_0, π_a) ← (1 - α) Q(s_0, π_a) + α [ r_c + γ_c max_{a'} Q(s', π_{a'}) ]
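A sketch of the bookkeeping this implies: while an abstract action runs, accumulate the discounted reward r_c and discount γ_c; when it terminates, apply one SMDP Q-update. All names here are illustrative.

```python
def accumulate(r_c, gamma_c, r, gamma):
    """Per primitive step inside an abstract action: r_c <- r_c + gamma_c*r ; gamma_c <- gamma_c*gamma."""
    return r_c + gamma_c * r, gamma_c * gamma

def smdp_q_update(Q, s0, a, r_c, gamma_c, s_end, abstract_actions, alpha):
    """On termination of abstract action a begun at s0, update toward r_c + gamma_c * max_a' Q(s_end, a')."""
    target = r_c + gamma_c * max(Q[(s_end, a2)] for a2 in abstract_actions)
    Q[(s0, a)] = (1 - alpha) * Q[(s0, a)] + alpha * target
```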
Hierarchical abstract machines
• Hierarchy of NDFAs; states come in 4 types:
  • Action states execute a physical action
  • Call states invoke a lower-level NDFA
  • Choice points nondeterministically select the next state
  • Stop states return control to the invoking NDFA
• NDFA internal transitions are dictated by the percept
Running example
• Peasants can move, pickup and dropoff
• Penalty for collision
• Cost-of-living each step
• Reward for dropping off resources
• Goal: gather 10 gold + 10 wood
• ≈ (3L)^n states s and 7^n primitive actions a (for n peasants)
Joint SMDP
• Physical states S, machine states Θ, joint state space Ω = S × Θ
• Joint MDP:
  • Transition model for the physical state (from the physical MDP) and the machine state (as defined by the HAM)
  • A(ω) is fixed except when θ is at a choice point
• Reduction to an SMDP: states are just those where θ is at a choice point
Hierarchical optimality
• Theorem: the optimal solution to the SMDP is hierarchically optimal for the original MDP, i.e., it maximizes value among all behaviors consistent with the HAM
Partial programs
• HAMs (and Dietterich’s MAXQ hierarchies and Sutton’s Options) are all partially constrained descriptions of behaviors with internal state
• A partial program is like an ordinary program in any programming language, with a nondeterministic choice operator added to allow for ignorance about what to do
• HAMs, MAXQs, and Options are trivially expressible as partial programs
• Standard MDP: loop forever { choose an action }
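The sketch below is one way to read “partial program”: ordinary control flow plus a choose operator whose resolution is supplied by a learned completion. The names (choose, completion, env) are assumptions for illustration, not the ALisp API.

```python
def choose(label, options, completion, omega):
    """Nondeterministic choice point: the learned completion picks among the options."""
    return completion(label, omega, options)

def standard_mdp_program(env, completion):
    """'Standard MDP: loop forever { choose an action }' written as a partial program."""
    omega = env.reset()
    while True:
        a = choose('act', env.actions, completion, omega)
        omega = env.step(a)
```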
RL and partial programs
[Figure: the learning algorithm receives states and rewards (s, r) from the environment and emits actions a, subject to the constraints of the partial program; its output is a completion of the program]
• Hierarchically optimal for all terminating programs
Single-threaded ALisp program
• Program state θ includes: program counter, call stack, global variables

(defun top ()
  (loop do
    (choose 'top-choice
      (gather-wood)
      (gather-gold))))

(defun gather-wood ()
  (with-choice 'forest-choice (dest *forest-list*)
    (nav dest)
    (action 'get-wood)
    (nav *base-loc*)
    (action 'dropoff)))

(defun gather-gold ()
  (with-choice 'mine-choice (dest *goldmine-list*)
    (nav dest)
    (action 'get-gold)
    (nav *base-loc*)
    (action 'dropoff)))

(defun nav (dest)
  (until (= (pos (get-state)) dest)
    (with-choice 'nav-choice (move '(N S E W NOOP))
      (action move))))
Q-functions
• Represent completions using a Q-function
• Joint state ω = [s, θ] (environment state + program state)
• MDP + partial program = SMDP over {ω}
• Q^π(ω, u) = “expected total reward if I make choice u in ω and follow completion π thereafter”
• Modified Q-learning [AR 02] finds the optimal completion
Internal state
• Availability of internal state (e.g., the goal stack) can greatly simplify value functions and policies
• E.g., while navigating to location (x, y), moving towards (x, y) is a good idea
• The natural local shaping potential (distance from the destination) is impossible to express in external terms
Temporal Q-decomposition
[Figure: a trajectory’s call stacks (Top; Top → GatherGold → Nav(Mine2); Top → GatherWood → Nav(Forest1)) with the reward stream split into the segments Qr, Qc, and Qe]
• Temporal decomposition Q = Qr + Qc + Qe, where
  • Qr(ω, u) = reward while doing u (may be many steps)
  • Qc(ω, u) = reward in the current subroutine after doing u
  • Qe(ω, u) = reward after the current subroutine
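At a choice point, the decomposed Q-value is just the sum of the learned components; a minimal sketch with dictionary-valued component tables (purely illustrative):

```python
def decomposed_q(Qr, Qc, Qe, omega, u):
    """Q(omega, u) = Qr(omega, u) + Qc(omega, u) + Qe(omega, u)."""
    return Qr[(omega, u)] + Qc[(omega, u)] + Qe[(omega, u)]

def best_choice(Qr, Qc, Qe, omega, choices):
    """Pick the choice that maximizes the summed components."""
    return max(choices, key=lambda u: decomposed_q(Qr, Qc, Qe, omega, u))
```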
State abstraction
• Temporal decomposition => state abstraction
  • E.g., while navigating, Qc is independent of gold reserves
• In general, local Q-components can depend on few variables => fast learning
Handling multiple effectors
[Figure: a multithreaded agent program with threads such as Get-gold, Get-wood, and (Defend-Base), each controlling one or more effectors]
• Multithreaded agent programs:
  • Threads = tasks
  • Each effector is assigned to a thread
  • Threads can be created/destroyed
  • Effectors can be reassigned
  • Effectors can be created/destroyed
An example single-threaded ALisp program

(defun top ()
  (loop do
    (choose 'top-choice
      (gather-gold)
      (gather-wood))))

(defun gather-wood ()
  (with-choice 'forest-choice (dest *forest-list*)
    (nav dest)
    (action 'get-wood)
    (nav *base-loc*)
    (action 'dropoff)))

An example Concurrent ALisp program

(defun top ()
  (loop do
    (until (my-effectors) (choose 'dummy))
    (setf peas (first (my-effectors)))
    (choose 'top-choice
      (spawn gather-wood peas)
      (spawn gather-gold peas))))

(defun gather-wood ()
  (with-choice 'forest-choice (dest *forest-list*)
    (nav dest)
    (action 'get-wood)
    (nav *base-loc*)
    (action 'dropoff)))

(defun gather-gold ()
  (with-choice 'mine-choice (dest *goldmine-list*)
    (nav dest)
    (action 'get-gold)
    (nav *base-loc*)
    (action 'dropoff)))

(defun nav (dest)
  (until (= (my-pos) dest)
    (with-choice 'nav-choice (move '(N S E W NOOP))
      (action move))))
Concurrent ALisp semantics
[Figure: two peasant threads at environment timesteps 26 and 27, each running its own nav/gather subroutine; at any moment a thread is running, paused, waiting for the joint action, or making a joint choice]
Concurrent ALisp semantics
• Threads execute independently until they hit a choice or action
• Wait until all threads are at a choice or action
• If all effectors have been assigned an action, do that joint action in the environment
• Otherwise, make a joint choice
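A rough sketch of this execution loop, under an assumed thread interface (run_to_sync, pending-action and choice-point accessors); it is not the Concurrent ALisp implementation, just the synchronization pattern described above.

```python
def concurrent_step(threads, env, completion):
    # Run every thread until it reaches its next action or choice point.
    for t in threads:
        t.run_to_sync()
    if all(t.has_pending_action() for t in threads):
        # Every effector has an assigned action: execute the joint action.
        joint_action = {e: a for t in threads for e, a in t.pending_actions().items()}
        env.do(joint_action)
    else:
        # Otherwise, the choosing threads make a joint choice via the completion.
        choosing = [t for t in threads if t.at_choice_point()]
        joint_choice = completion.select(env.state(), choosing)
        for t in choosing:
            t.resume_with(joint_choice[t])
```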
Q-functions
• To complete the partial program, at each choice state ω we need to specify choices for all choosing threads
• So Q(ω, u) is defined as before, except u is a joint choice
• Suitable SMDP Q-learning gives the optimal completion
[Figure: an example Q-function]
Problems with concurrent activities
• Temporal decomposition of the Q-function is lost
• No credit assignment among threads
  • Suppose peasant 1 drops off some gold at the base, while peasant 2 wanders aimlessly
  • Peasant 2 thinks he’s done very well!!
• Significantly slows learning as the number of peasants increases
Threadwise decomposition
• Idea: decompose the reward among threads (Russell & Zimdars, 2003)
  • E.g., rewards for thread j only when peasant j drops off resources or collides with other peasants
• Q_j^π(ω, u) = “expected total reward received by thread j if we make joint choice u and then do π”
• Threadwise Q-decomposition Q = Q1 + … + Qn
• Recursively distributed SARSA => global optimality
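A sketch of the resulting update: each thread j maintains its own component Q_j and updates it from its own reward r_j via SARSA, while the joint choice maximizes the sum of the components (all names illustrative).

```python
def threadwise_sarsa_update(Qs, omega, u, rewards, omega2, u2, alpha, gamma):
    """Qs[j] is thread j's component table; rewards[j] is thread j's share of the reward."""
    for j, Qj in enumerate(Qs):
        target = rewards[j] + gamma * Qj[(omega2, u2)]
        Qj[(omega, u)] = (1 - alpha) * Qj[(omega, u)] + alpha * target

def joint_choice(Qs, omega, choices):
    """Choose u maximizing the summed components Q(omega, u) = sum_j Qj(omega, u)."""
    return max(choices, key=lambda u: sum(Qj[(omega, u)] for Qj in Qs))
```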
Learning threadwise decomposition
[Figure: at an action state, the action a is executed and the environment reward r is decomposed into per-thread rewards r1, r2, r3 for the peasant threads]
Learning threadwise decomposition
[Figure: at a choice state ω, each peasant thread performs a Q-update on its own component Q1(ω,·), Q2(ω,·), Q3(ω,·); the components are summed and the joint choice is u = argmax of Q(ω,·)]
Threadwise and temporal decomposition
• Q_j = Q_{j,r} + Q_{j,c} + Q_{j,e}, where
  • Q_{j,r}(ω, u) = expected reward gained by thread j while doing u
  • Q_{j,c}(ω, u) = expected reward gained by thread j after u but before leaving the current subroutine
  • Q_{j,e}(ω, u) = expected reward gained by thread j after the current subroutine
Resource gathering with 15 peasants
[Figure: reward of the learnt policy vs. number of learning steps (×1000), comparing flat, undecomposed, threadwise, and threadwise + temporal learners]
Summary of Part I
• Structure in behavior seems essential for scaling up
• Partial programs:
  • Provide natural structural constraints on policies
  • Decompose value functions into simple components
  • Include internal state (e.g., “goals”) that further simplifies value functions, shaping rewards
• Concurrency:
  • Simplifies the description of multieffector behavior
  • Messes up temporal decomposition and credit assignment (but threadwise reward decomposition restores it)
Outline
• Hierarchical RL with abstract actions
• Hierarchical RL with partial programs
• Efficient hierarchical planning
Abstract Lookahead
[Figures: a chess lookahead tree (moves such as Rxa1, Kxa1, Qa4, Kc3, c2) and a character-level tree over strings ‘a’, ‘b’, ‘aa’, ‘ab’, ‘ba’, ‘bb’]
• k-step lookahead >> 1-step lookahead, e.g., chess
• k-step lookahead is no use if the steps are too small, e.g., the first k characters of a NIPS paper
• Abstract plans (high-level executions of partial programs) are shorter
  • Much shorter plans for given goals => exponential savings
  • Can look ahead much further into the future
High-level actions
[Figure: a refinement tree in which [Act] refines to [GoSFO, Act] or [GoOAK, Act], and [GoSFO, Act] refines further to [WalkToBART, BARTtoSFO, Act]]
• Start with a restricted, classical HTN form of partial program
• A high-level action (HLA):
  • Has a set of possible immediate refinements into sequences of actions, primitive or high-level
  • Each refinement may have a precondition on its use
• Executable plan space = all primitive refinements of Act
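One simple way to encode such an HLA library (the class and field names below are illustrative): each HLA carries a list of refinements, each a precondition plus a sequence of primitive or high-level steps.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple, Union

@dataclass
class HLA:
    name: str
    # Each refinement: (precondition on the current state, sequence of steps).
    refinements: List[Tuple[Callable[[object], bool], List["Step"]]] = field(default_factory=list)

Step = Union[str, HLA]  # a primitive action name or another HLA

# E.g., Act might refine to [GoSFO, Act] or [GoOAK, Act], and GoSFO to
# [WalkToBART, BARTtoSFO], mirroring the refinement tree sketched above.
```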
Lookahead with HLAs
• To do offline or online planning with HLAs, we need a model
• See the classical HTN planning literature – not!
• Drew McDermott (AI Magazine, 2000): “The semantics of hierarchical planning have never been clarified … no one has ever figured out how to reconcile the semantics of hierarchical plans with the semantics of primitive actions”
Downward refinement
• Downward refinement property (DRP): every apparently successful high-level plan has a successful primitive refinement
• “Holy Grail” of HTN planning: allows commitment to abstract plans without backtracking
• Bacchus and Yang, 1991: it is naïve to expect the DRP to hold in general
• MRW, 2007: Theorem: if the assertions made about HLAs are true, then the DRP always holds
• Problem: how to say true things about the effects of HLAs?
Angelic semantics for HLAs
[Figure: reachable sets in the state space, starting from s0 with goal g; the reachable set of the sequence h1 h2 contains g, so h1 h2 is a solution]
• Begin with the atomic state-space view: start in s0, goal state g
• The central idea is the reachable set of an HLA from each state
• When extended to sequences of actions, this allows proving that a plan can or cannot possibly reach the goal
• May seem related to nondeterminism
  • But the nondeterminism is angelic: the “uncertainty” will be resolved by the agent, not by an adversary or nature
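A sketch of the optimistic half of this idea: if each HLA is described by an (optimistic) reachable-set function, composing those descriptions along a high-level plan and checking whether the goal is reachable tells us whether the plan can possibly succeed. The reachable_sets mapping is an assumed input for illustration, not part of the talk’s notation.

```python
def plan_possibly_reaches(plan, s0, goal, reachable_sets):
    """plan: sequence of HLA names; reachable_sets[h](s) returns the set of states
    some refinement of h can reach from state s (an optimistic description)."""
    reachable = {s0}
    for h in plan:
        reachable = set().union(*(reachable_sets[h](s) for s in reachable))
        if not reachable:
            return False
    return goal in reachable
```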