Adaptive Sequential Decision Making with Self-Interested Agents David C. Parkes Division of Engineering and Applied Sciences Harvard University http://www.eecs.harvard.edu/econcs Wayne State University October 17, 2006
Context • Multiple agents • Self-interest • Private information about preferences, capabilities • Coordinated decision problem • social planner • auctioneer
This talk: Sequential Decision Making • Multiple time periods • Agent arrival and departure • Values for sequences of decisions • Learning by agents and the “center” • Example scenarios: • allocating computational/network resources • sponsored search • last-minute ticket auctions • bidding for shared cars, air-taxis,… • …
Markov Decision Process • [Diagram: states st, st+1, st+2 linked by actions at, with transition probabilities Pr(st+1 | at, st) and rewards r(at, st)] • + Self-interest
Online Mechanisms • M = (π, p), with decision policy πt : S → A and payment policy pt : S → Rn • Each period: agents report state/rewards; center picks an action and payments • Main question: what policies can be implemented in a game-theoretic equilibrium?
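To make the M = (π, p) interface concrete, here is a minimal Python sketch of one period of an online mechanism; the class name, the dictionary-style state report, and the placeholder second-price payment rule are illustrative assumptions, not the construction from the talk.

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class OnlineMechanism:
    """Illustrative interface for M = (pi, p): each period the center maps the
    reported state to a decision (pi_t : S -> A) and to payments (p_t : S -> R^n)."""
    history: List[tuple] = field(default_factory=list)

    def decision(self, reports: Dict[str, float]) -> str:
        """Placeholder pi_t: pick the agent reporting the highest value."""
        return max(reports, key=reports.get)

    def payments(self, reports: Dict[str, float], chosen: str) -> Dict[str, float]:
        """Placeholder p_t: charge the winner the second-highest report."""
        others = [v for a, v in reports.items() if a != chosen]
        return {a: (max(others) if a == chosen and others else 0.0) for a in reports}

    def step(self, reports: Dict[str, float]):
        chosen = self.decision(reports)
        pay = self.payments(reports, chosen)
        self.history.append((reports, chosen, pay))
        return chosen, pay

m = OnlineMechanism()
print(m.step({"A1": 3.0, "A2": 2.0}))  # ('A1', {'A1': 2.0, 'A2': 0.0})
```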
Outline • Multi-armed Bandits Problem [agent learning] • canonical, stylized learning problem from AI • introduce a multi-agent variation • provide a mechanism to bring optimal coordinated learning into an equilibrium • Dynamic auction problem [center learning] • resource allocation (e.g. WiFi) • dynamic arrival & departure of agents • provide a truthful, adaptive mechanism
Multi-Armed Bandit Problem • Multi-armed bandit (MAB) problem • n arms • Each arm has stationary uncertain reward process • Goal: implement a (Bayesian) optimal learning policy
Tractability: Gittins’ result • Theorem [Gittins & Jones 1974]: The complexity of computing an optimal joint policy for a collection of n Markov chains is linear in n. • There exist independent index functions such that the MC with the highest “Gittins index” at any given time should be activated. • Each index can be computed as the optimal value of a “restart-in-i” MDP, solved using an LP (Katehakis & Veinott ’87)
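As a concrete illustration of the Katehakis–Veinott construction, the sketch below computes a Gittins index for each state of a discounted Markov chain by value iteration on the “restart-in-i” MDP; the talk mentions the LP formulation, so the value-iteration solver (and the function names) are simplifying assumptions here.

```python
import numpy as np

def gittins_indices(P, r, beta=0.9, iters=2000):
    """Gittins index of every state of a Markov chain (transition matrix P,
    reward vector r, discount beta) via the restart-in-i MDP: in each state
    either continue the chain or restart as if in state i; the index of state
    i is (1 - beta) times the optimal value of that MDP evaluated at i."""
    n = len(r)
    indices = np.zeros(n)
    for i in range(n):
        V = np.zeros(n)
        for _ in range(iters):
            cont = r + beta * P @ V            # keep following the chain
            restart = r[i] + beta * P[i] @ V   # act as if restarting in state i
            V = np.maximum(cont, restart)
        indices[i] = (1 - beta) * V[i]
    return indices

# Toy two-state chain: state 0 pays 1, state 1 pays 0.
P = np.array([[0.7, 0.3], [0.4, 0.6]])
r = np.array([1.0, 0.0])
print(gittins_indices(P, r))   # the index of state 0 exceeds that of state 1
```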
Self-Interest + MABP • Multi-armed bandit (MAB) problem • n arms (arm == agent) • Each arm has a stationary uncertain reward process (privately observed) • Goal: a mechanism that implements a (Bayesian) optimal learning policy
[Diagram: arms/agents A1, A2, A3, each generating a privately observed reward stream]
Review: The Vickrey Auction • Rules: “sell to the highest bidder at the second-highest price” • How should you bid? Truthfully! (dominant-strategy equilibrium) • Example: Alice bids $10, Bob $8, Carol $6 → Alice wins for $8
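A minimal sketch of the second-price rule described above (the function name and tie handling are illustrative):

```python
def vickrey(bids):
    """Sell to the highest bidder at the second-highest price."""
    ranked = sorted(bids.items(), key=lambda kv: kv[1], reverse=True)
    winner = ranked[0][0]
    price = ranked[1][1] if len(ranked) > 1 else 0.0
    return winner, price

print(vickrey({"Alice": 10, "Bob": 8, "Carol": 6}))  # ('Alice', 8)
```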
First Idea: Vickrey Auction • Conjecture: agents will bid the Gittins index for their arm in each round. • Intuition?
Not truthful! • Agent 1 may know that the mean reward for arm 2 is smaller than agent 2’s current Gittins index. • Learning by agent 2 would decrease the price paid by agent 1 in the future ⇒ agent 1 should under-bid
Second Idea • At every time-step: • Each agent reports a claim about its Gittins index • Suppose b1 ≥ b2 ≥ … ≥ bn • Mechanism activates agent 1 • Agent 1 reports its reward, r1 • Mechanism pays r1 to every agent ≠ 1 • Theorem: Truthful reporting is a Markov-perfect equilibrium, and the mechanism implements optimal Bayesian learning.
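A minimal sketch of one period of this second idea, assuming each agent's reported index arrives as a number and the activated agent's realized reward is observed through a callback; the agent names and Bernoulli reward model in the usage line are illustrative.

```python
import random

def second_idea_round(reported_indices, observe_reward):
    """One period: activate the agent with the highest reported Gittins index,
    observe (via report) its realized reward, and pay that reward to every
    other agent."""
    winner = max(reported_indices, key=reported_indices.get)
    r = observe_reward(winner)                       # the winner's reported reward
    payments = {a: (0.0 if a == winner else r) for a in reported_indices}
    return winner, r, payments

# Toy usage: reported indices, with Bernoulli rewards standing in for the arms.
means = {"A1": 0.8, "A2": 0.5, "A3": 0.3}
print(second_idea_round(means, lambda a: float(random.random() < means[a])))
```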
Learning-Gittins VCG (CPS’06) • At every time-step: • Activate the agent with the highest bid. • Pay the reward received by the activated agent to all others. • Collect from every agent i the expected value agents ≠ i would receive without i in the system. • Sample hypothetical execution path(s), using no reported state information. • Theorem: Mechanism is truthful, system-optimal, ex ante IR, and ex ante strongly budget-balanced in MPE.
where X−i is the total expected value agents other than i would have received in this period if i weren’t there.
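The sketch below gives one highly simplified period of these transfers: it collapses the sampled hypothetical execution path to a single period and uses prior means to estimate X−i, so it should be read as an illustration of the accounting rather than the mechanism itself; all names are assumptions.

```python
import random

def learning_gittins_vcg_round(reported_indices, observe_reward, prior_mean):
    """One (simplified) period: activate the highest reported index, pay the
    realized reward to every other agent, and charge each agent i an estimate
    of X_-i -- the expected value the others would have received this period
    had i been absent -- computed from prior models only."""
    winner = max(reported_indices, key=reported_indices.get)
    r = observe_reward(winner)
    transfers = {}
    for i in reported_indices:
        others = [j for j in reported_indices if j != i]
        # Without i, the best remaining arm (by prior expectation) would be
        # activated, so the others' expected value this period is its prior mean.
        x_minus_i = max(prior_mean(j) for j in others) if others else 0.0
        transfers[i] = (0.0 if i == winner else r) - x_minus_i
    return winner, r, transfers

# Toy usage: Bernoulli arms whose prior means happen to equal the true means.
means = {"A1": 0.8, "A2": 0.5, "A3": 0.3}
print(learning_gittins_vcg_round(means, lambda a: float(random.random() < means[a]),
                                 means.get))
```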
Outline • Multi-armed Bandits Problem [agent learning] • canonical, stylized learning problem from AI • introduce a multi-agent variation • provide a mechanism to bring optimal coordinated learning into an equilibrium • Dynamic auction problem [center learning] • resource allocation (e.g. WiFi) • dynamic arrival & departure of agents • provide a truthful, adaptive mechanism that converges toward an optimal decision policy
[Diagram: agents A1 and A2, then A3, then A4 arrive and depart across states st, st+1, st+2, st+3] First question: what policies can be truthfully implemented in this environment, where agents can misreport private information?
Illustrative Example • Selling a single right to access WiFi in each period • Agent: (ai, di, wi) ⇒ value wi for an allocation in some t ∈ [ai, di] • Scenario: 9am: A1 (9, 11, $3), A2 (9, 11, $2); 10am: A3 (10, 11, $1) • Second-price: sell to A1 for $2, then to A2 for $1 • Manipulation? The naïve Vickrey approach fails!
Scenario: 9am: A1 (9, 11, $3), A2 (9, 11, $2); 10am: A3 (10, 11, $1) • Mechanism rule (NPS’02): greedy policy; collect the “critical value” payment, i.e. the smallest value an agent could bid and still be allocated ⇒ sell to A1, collect $1; sell to A2, collect $1. • Theorem: Truthful, and implements a 2-approximation allocation, assuming no early arrivals and no late departures.
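A small sketch reproducing this greedy/critical-value outcome on the 9am/10am scenario. Two simplifying assumptions: allocation happens in the 9am and 10am slots (so every bid faces competition before the 11am deadlines, matching the slide's “collect $1” payments), and the critical value is found by a one-cent grid search rather than computed in closed form.

```python
def greedy_with_critical_values(agents, periods):
    """agents: name -> (arrival, departure, value).  In each period, allocate
    the single WiFi right to the highest-value agent that is present and not
    yet served; charge each winner its critical value, found by lowering its
    bid on a one-cent grid until it would no longer be allocated."""
    def run(bids):
        served, alloc = set(), {}
        for t in periods:
            live = [a for a, (ar, d, _) in agents.items()
                    if ar <= t <= d and a not in served]
            if live:
                w = max(live, key=lambda a: bids[a])
                alloc[t] = w
                served.add(w)
        return alloc

    bids = {a: v for a, (_, _, v) in agents.items()}
    alloc, payments = run(bids), {}
    for w in set(alloc.values()):
        payments[w] = 0.0
        for cents in range(int(bids[w] * 100), -1, -1):
            if w not in run({**bids, w: cents / 100}).values():
                payments[w] = (cents + 1) / 100
                break
    return alloc, payments

# The 9am/10am scenario; two allocation periods before the 11am deadlines.
agents = {"A1": (9, 11, 3.0), "A2": (9, 11, 2.0), "A3": (10, 11, 1.0)}
print(greedy_with_critical_values(agents, [9, 10]))
# -> ({9: 'A1', 10: 'A2'}, {'A1': 1.0, 'A2': 1.0})
```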
Key Intuition: Monotonicity (HKMP’05) • Monotonic: πi(vi, v−i) = 1 ⇒ πi(v’i, v−i) = 1 for a higher bid w’i ≥ wi and a more relaxed interval [a’i, d’i] ⊇ [ai, di] • [Diagram: win/lose regions and critical prices p, p’ over time as the interval [a, d] relaxes to [a’, d’]]
Single-Valued Domains • Type θi = (ai, di, [ri, Li]) • Value ri for a decision kt ∈ Li, or kt ∈ Lj ⊇ Li • Examples: • “single-minded” online combinatorial auctions • WiFi allocation with fixed lengths of service • Monotonic: higher r, smaller L, earlier a, later d • Theorem: monotonicity is necessary and sufficient for truthfulness in SV domains.
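A tiny sketch of a single-valued type and the dominance relation behind this monotonicity condition (earlier arrival, later departure, higher value, weakly smaller interest set); the class and field names are illustrative.

```python
from dataclasses import dataclass
from typing import FrozenSet

@dataclass(frozen=True)
class SVType:
    """Single-valued type: value r for any decision that satisfies the
    interest set L, during the reported interval [a, d]."""
    a: int                 # arrival period
    d: int                 # departure period
    r: float               # value
    L: FrozenSet[str]      # interest set (e.g. the desired bundle)

def dominates(t1: SVType, t2: SVType) -> bool:
    """t1 is 'stronger' than t2: earlier arrival, later departure, higher
    value, and a weakly smaller interest set.  Monotonicity then requires:
    if t2 is allocated, so is t1 (all else equal)."""
    return (t1.a <= t2.a and t1.d >= t2.d and
            t1.r >= t2.r and t1.L <= t2.L)

print(dominates(SVType(9, 12, 3.0, frozenset({"wifi"})),
                SVType(9, 11, 2.0, frozenset({"wifi", "power"}))))  # True
```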
[Diagram: agents A1, A2, A3, A4 arriving and departing across states st, st+1, st+2, st+3] Second question: how to compute monotonic policies in stochastic, SV domains? How to allow learning (by the center)?
Basic Idea • [Timeline: epochs T0, T1, T2, T3, … with policies π0, π1, π2, π3] • Model-Based Reinforcement Learning • Update the model in each epoch • Planning: compute a new policy π0, π1, … • Collect critical-value payments • Key Components: 1. Ensure policies are monotonic 2. A method to compute critical-value payments 3. Careful updates to the model.
1. Planning: Sparse Sampling • Sparse sampling: build a depth-L sampled tree rooted at the current state; each node is a state, and each node’s children are obtained by sampling each action w times; back up value estimates to the root. • Monotonic? Not quite.
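A generic sparse-sampling planner in the style the slide describes (depth-L tree, w samples per action, back up to the root); it omits the cross-state sampling and ironing hooks added later, and the toy chain in the usage lines is purely illustrative.

```python
import random

def sparse_sampling(state, actions, sample_next, reward, depth, width, gamma=0.95):
    """Estimate Q(state, a) for each action by sampling `width` successor
    states per action via sample_next(s, a), recursing to `depth`, and
    backing the max estimate up to the root.  Returns (value, best_action)."""
    if depth == 0:
        return 0.0, None
    best_q, best_a = float("-inf"), None
    for a in actions:
        future = 0.0
        for _ in range(width):
            s2 = sample_next(state, a)
            v2, _ = sparse_sampling(s2, actions, sample_next, reward,
                                    depth - 1, width, gamma)
            future += v2
        q = reward(state, a) + gamma * future / width
        if q > best_q:
            best_q, best_a = q, a
    return best_q, best_a

# Toy usage: a noisy walk where "right" drifts toward higher-reward states.
actions = ["left", "right"]
sample_next = lambda s, a: s + (1 if a == "right" else -1) + random.choice([0, 0, 1])
reward = lambda s, a: float(s)
print(sparse_sampling(0, actions, sample_next, reward, depth=3, width=3))
```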
Achieving Monotonicity: Ironing • Assume a maximal patience Δ • Ironing: if sparse sampling allocates to (ai, di, ri, Li) in period t, then check whether it would also allocate to (ai, di + Δ, ri, Li) • NO: block the (ai, di, ri, Li) allocation • YES: allow the allocation • Also use “cross-state sampling” to be aware of ironing when planning.
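A sketch of the ironing test, assuming the maximal patience is some Δ and that `would_allocate` stands in for a call to the (sampled) planner; the fake planner in the usage line exists only to show an allocation being blocked.

```python
def ironed_allocation(agent_type, delta_max, would_allocate):
    """agent_type = (a, d, r, L).  Allocate only if the planner would also
    allocate to the same agent with its departure extended by delta_max;
    otherwise block the allocation, restoring monotonicity in the deadline."""
    a, d, r, L = agent_type
    if not would_allocate((a, d, r, L)):
        return False                        # the policy does not pick this type anyway
    more_patient = (a, d + delta_max, r, L)
    return would_allocate(more_patient)     # block unless the more patient type also wins

# Toy usage: a fake planner that (non-monotonically) refuses late deadlines.
fake_planner = lambda t: t[1] <= 12 and t[2] >= 1.0
print(ironed_allocation((9, 11, 3.0, "wifi"), 2, fake_planner))  # False: allocation blocked
```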
2. Computing Payments: Virtual Worlds • [Diagram: main world timeline t0, t1, t2, t3 with A1 winning at t0 and A2 winning at t1; virtual world VW1 replaces agent 1’s value with one just below its critical value vc(t0); virtual world VW2 replaces agent 2’s value with one just below vc(t1)] • Plus a method to compute the critical value vc(t) in any state st
3. Delayed Updates • [Timeline: epochs T0, T1, T2, T3, … with policies π0, π1, π2, π3] • Consider the critical payment for an agent with ai < T1 < di • Delayed updates: only include departed agents in the revised π1 • Ensures the policy is agent-independent
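A sketch of the delayed-update filter: only reports whose departure precedes the new epoch boundary feed the revised model and policy. The named tuple and field names are illustrative.

```python
from collections import namedtuple

Report = namedtuple("Report", "agent a d r")

def delayed_model_update(reported_types, epoch_start):
    """Return only the reports that may feed the revised policy for the new
    epoch: agents whose reported departure precedes the epoch boundary.  A
    still-present agent's report therefore never influences the policy used
    to price it, keeping the policy agent-independent."""
    return [t for t in reported_types if t.d < epoch_start]

reports = [Report("A1", 9, 11, 3.0), Report("A2", 9, 14, 2.0)]
print(delayed_model_update(reports, epoch_start=12))  # only A1 is used
```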
Complete procedure • In each period: • maintain the main world • maintain a virtual world without each active, allocated agent • For planning: • use ironing to cancel an action • cross-state sparse sampling to improve the policy • For pricing: • charge the minimal critical value across virtual worlds • Periodically: move to a new model (and policy) • only use departed types • Theorem: a truthful (DSE), adaptive policy for single-valued domains.
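Pulling the pieces together, a heavily simplified skeleton of one period: plan in the main world and in each agent's virtual world, and charge each allocated agent the minimum critical value across those worlds. Every helper here (the world dictionaries, `planner`, `critical_value`) is a stand-in, not the talk's actual implementation.

```python
def run_period(t, worlds, planner, critical_value):
    """One period of the adaptive-mechanism sketch.  `worlds` maps 'main' and
    each agent name to a world state; `planner(world)` returns the (ironed)
    list of allocated agents for that world, and `critical_value(world, a, t)`
    that agent's critical value there.  Each allocated agent is charged the
    minimum critical value across the main world and its own virtual world."""
    decisions = {name: planner(w) for name, w in worlds.items()}
    allocated = decisions["main"]
    payments = {a: min(critical_value(worlds["main"], a, t),
                       critical_value(worlds[a], a, t))
                for a in allocated if a in worlds}
    return decisions["main"], payments

# Toy usage with trivial stand-ins for the planner and pricing routine.
worlds = {"main": {"A1": 3.0, "A2": 2.0}, "A1": {"A2": 2.0}, "A2": {"A1": 3.0}}
planner = lambda w: [max(w, key=w.get)] if w else []
critical_value = lambda w, a, t: max([v for b, v in w.items() if b != a], default=0.0)
print(run_period(9, worlds, planner, critical_value))  # (['A1'], {'A1': 2.0})
```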
Future: Online CAs • Combinatorial auctions (CAs) are well studied and used in practice (e.g. procurement) • Challenge problem: Online CAs • Two-pronged approach: • computational (e.g. leveraging recent work in stochastic online combinatorial optimization by Pascal Van Hentenryck, Brown) • incentive considerations (e.g. finding appropriate relaxations of dominant-strategy truthfulness to the online domain)
Summary • Online mechanisms extend traditional mechanism design to consider dynamics (both exogenous, e.g. supply, and endogenous) • Opportunity for learning: • by agents: multi-agent MABP • demonstrated use of payments to bring optimal learning into an equilibrium • by the center: adaptive online auctions • demonstrated use of payments to bring expected-value-maximizing policies into an equilibrium • Exciting area. Lots of work still to do!
Thanks • Satinder Singh, Jonathan Bredin, Quang Duong, Mohammad Hajiaghayi, Adam Juda, Robert Kleinberg, Mohammad Mahdian, Chaki Ng, Dimah Yanovsky. • More information: www.eecs.harvard.edu/econcs