Hidden Markov Model Multiarm Bandits: A Methodology for Beam Scheduling in Multitarget Tracking Authors: Vikram Krishnamurthy & Robin Evans Presented by Shihao Ji Duke University Machine Learning Group June 10, 2005
Outline • Motivation • Overview • Multiarmed Bandits • HMM Multiarmed Bandits • Experimental Results
Motivation • An electronically scanned array (ESA) has only one steerable beam. • The coordinates of each target evolve according to a finite-state Markov chain. • Question: which single target should the tracker choose to observe at each time instant in order to optimize a specified cost function?
Multiarmed Bandits • The Model: One has N parallel projects, indexed i = 1, 2, …, N, and at each instant of discrete time one can work on only a single project. Let the state of project i at time k be denoted $x^i_k$. If one works on project i at time k, one pays an immediate expected cost of $c(x^i_k, i)$. The state changes to $x^i_{k+1}$ by a Markov transition rule (which may depend on i, but not on k), while the states of the projects one has not touched remain unchanged: $x^j_{k+1} = x^j_k$ for $j \neq i$. The problem is how to allocate one's effort over the projects sequentially in time so as to minimize the expected total discounted cost.
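To make the model concrete, here is a minimal simulation sketch of the bandit dynamics described above. The number of projects, the transition matrices, the costs, and the random engagement rule are all illustrative placeholders, not from the paper.

```python
# Minimal sketch of the classical (fully observed) multiarmed bandit dynamics,
# assuming N projects, each a finite-state Markov chain with transition matrix
# P[i] and per-state cost c[i]. All values below are toy placeholders.
import numpy as np

rng = np.random.default_rng(0)
N, S = 3, 4                                         # projects, states per project
P = [np.full((S, S), 1.0 / S) for _ in range(N)]    # placeholder transition matrices
c = [rng.uniform(0.0, 1.0, S) for _ in range(N)]    # placeholder per-state costs
beta = 0.9                                          # discount factor

x = [rng.integers(S) for _ in range(N)]             # current state of each project
total_cost = 0.0
for k in range(50):
    i = rng.integers(N)                             # project worked on (here: random)
    total_cost += (beta ** k) * c[i][x[i]]          # pay the engaged project's cost
    # only the engaged project moves; all other projects stay frozen
    x[i] = rng.choice(S, p=P[i][x[i]])

print("discounted cost of this sample path:", total_cost)
```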
Gittins Index • The simplest non-trivial problem, and a classic one. • It had no essential solution until the work of Gittins and his co-workers. • They proved that to each project i one can attach an index $\gamma^i(x^i_k)$ such that the optimal action at time k is to work on the project whose current index is smallest. The index is calculated by solving the problem of allocating one's effort optimally between project i and a standard project that yields a constant cost. • Gittins' result thus reduces the case of general N to the case N = 2.
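The "project i versus a standard project with constant cost" characterization can be illustrated directly: for a single fully observed project, the Gittins index of a state is the constant per-period cost lambda at which one is indifferent, in that state, between engaging the project and retiring to the standard project. The sketch below (function name, toy numbers, and tolerances are assumptions, not from the paper) finds that lambda by value iteration plus bisection.

```python
# Hedged sketch of the "calibration" view of the Gittins index for one fully
# observed project with transition matrix P, cost vector c, discount beta.
import numpy as np

def gittins_index(P, c, s0, beta=0.9, tol=1e-8):
    S = len(c)

    def values(lam):
        # V(x) = min{ lam / (1 - beta),  c(x) + beta * sum_x' P[x, x'] V(x') }
        V = np.zeros(S)
        for _ in range(2000):
            V_new = np.minimum(lam / (1 - beta), c + beta * P @ V)
            if np.max(np.abs(V_new - V)) < tol:
                break
            V = V_new
        return V

    lo, hi = float(c.min()), float(c.max())   # the index lies between the extreme costs
    while hi - lo > 1e-6:
        lam = 0.5 * (lo + hi)
        V = values(lam)
        cont = c[s0] + beta * (P[s0] @ V)     # cost of engaging the project at s0
        if cont < lam / (1 - beta):           # engaging beats retiring: index is lower
            hi = lam
        else:
            lo = lam
    return 0.5 * (lo + hi)

# toy 3-state example (illustrative numbers only)
P = np.array([[0.6, 0.3, 0.1],
              [0.2, 0.6, 0.2],
              [0.1, 0.3, 0.6]])
c = np.array([0.2, 0.5, 0.9])
print(gittins_index(P, c, s0=0))
```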
HMM Multiarmed Bandits • The "standard" multiarmed bandit problem involves a fully observed finite-state Markov chain and is simply an MDP with a rich structure. • In multitarget tracking, due to measurement noise at the sensor, the states are only partially observed. The multitarget tracking problem therefore needs to be formulated as a multiarmed bandit problem involving HMMs (with the HMM filter estimating the information state). • It can be solved by brute force as a POMDP, but that involves a much higher-dimensional (enormous) Markov chain. • The bandit assumption decouples the problem.
Bandit Assumption • The information state of the currently observed target p is updated by the HMM filter: $\pi^p_{k+1} = \dfrac{B^p(y^p_{k+1})\,(A^p)'\,\pi^p_k}{\mathbf{1}'\,B^p(y^p_{k+1})\,(A^p)'\,\pi^p_k}$, where $A^p$ is the transition matrix and $B^p(y)$ holds the observation likelihoods. • For the other P−1 unobserved targets, the information states are kept frozen: $\pi^q_{k+1} = \pi^q_k$ if target q is not observed.
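A minimal sketch of this update rule, assuming per-target transition matrices A[u], observation-likelihood matrices B[u] (entry [x, y] = P(observe y | state x)), and beliefs stored as probability vectors. The names are illustrative, not the paper's code.

```python
# Sketch of the bandit information-state update under the assumptions above.
import numpy as np

def hmm_filter(pi, A, B, y):
    """One HMM filter step for the target currently illuminated by the beam."""
    unnormalized = B[:, y] * (A.T @ pi)   # predict with A, correct with observation y
    return unnormalized / unnormalized.sum()

def update_beliefs(beliefs, u, A, B, y):
    """Bandit assumption: only target u is updated; all other beliefs stay frozen."""
    new_beliefs = list(beliefs)
    new_beliefs[u] = hmm_filter(beliefs[u], A[u], B[u], y)
    return new_beliefs
```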
Why Is It Valid? • Slow dynamics: slowly moving targets have an (approximate) bandit structure, i.e., transition matrices of the form $A^p = I + \epsilon^p Q^p$, where $\epsilon^p$ is small. • Decoupling approximation: without the bandit assumption, the optimal solution is intractable; the bandit model is perhaps the only reasonable approximation that leads to a computationally tractable solution. • Reinitialization: a compromise. Reinitialize the HMM multiarmed bandit at regular intervals with updated estimates from all targets.
Some Details • Finite-state Markov assumption: $x^p_k$ denotes the quantized distance of the pth target from the base station, and this distance evolves according to a finite-state Markov chain. • Cost structure: the cost typically depends on the distance of the pth target to the base station, i.e., targets close to the base station pose a greater threat and are given higher priority by the tracking algorithm. • Objective function: minimize the expected total discounted cost $J_\mu = E\big\{\sum_{k=0}^{\infty} \beta^k\, c(x^{u_k}_k, u_k)\big\}$ over scheduling policies, where $u_k$ is the target observed at time k and $0 \le \beta < 1$ is the discount factor.
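As a small illustration of the cost structure (shapes and numbers are assumptions, not from the paper): with the state quantized into distance bins, the instantaneous cost of observing target p is the expected per-bin cost under its information state.

```python
# Illustrative expected instantaneous cost for one target under its belief.
import numpy as np

c_p = np.array([1.0, 0.6, 0.3, 0.1])     # per-distance-bin cost encoding threat/priority
pi_p = np.array([0.1, 0.2, 0.3, 0.4])    # information state over the four distance bins
expected_cost = c_p @ pi_p               # c(pi_p, p) = c_p' pi_p
print(expected_cost)
```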
Optimal Solution • Under the bandit assumption, the optimal solution has an indexable (decoupled) rule; that is, the optimization can be decoupled into P independent optimization problems. • For each target p, there is a function (the Gittins index) $\gamma^p(\pi^p)$, computed by POMDP algorithms; see the next slides. • The optimal scheduling policy at time k is to steer the beam toward the target with the smallest Gittins index: $u_k = \arg\min_p \gamma^p(\pi^p_k)$.
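Once the per-target Gittins index functions have been computed offline, the online scheduling rule is just an argmin over targets. A sketch follows; gittins and beliefs are assumed containers, not the paper's interface.

```python
# Sketch of the decoupled scheduling rule, assuming gittins[p] maps the
# information state of target p to its Gittins index (computed offline).
import numpy as np

def schedule_beam(beliefs, gittins):
    """Steer the beam toward the target with the smallest Gittins index."""
    indices = [gittins[p](pi) for p, pi in enumerate(beliefs)]
    return int(np.argmin(indices))
```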
Gittins Index • For an arbitrary multiarmed bandit problem, the Gittins index can be calculated by solving an associated infinite-horizon discounted control problem called the "return-to-state" problem. • For target p, given the information state $\pi^p_k$ at time k, there are two actions: 1) Continue, which incurs the cost $c(\pi^p_k, p)$ and lets the information state evolve according to the HMM filter; 2) Restart, which moves the information state to a fixed state $\pi^0$, incurs the cost $c(\pi^0, p)$, and then evolves according to the HMM filter.
The Gittins index of information state $\pi$ of target p is obtained from the value $V^p(\pi, \pi)$ of the return-to-state problem, where $V^p(\cdot, \pi^0)$ satisfies the Bellman equation: $V^p(\pi, \pi^0) = \min\big\{\, c(\pi, p) + \beta \sum_y \sigma(\pi, y)\, V^p(T(\pi, y), \pi^0),\;\; c(\pi^0, p) + \beta \sum_y \sigma(\pi^0, y)\, V^p(T(\pi^0, y), \pi^0) \,\big\}$, with $T(\pi, y)$ the HMM filter update and $\sigma(\pi, y)$ the probability of observing y given information state $\pi$.
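As a rough illustration of this Bellman equation (not the paper's algorithm), the sketch below approximates the Gittins index of a two-state target by value iteration on a coarse belief grid; exact POMDP methods, such as those on the next slide, would replace the grid with alpha-vector representations. All matrices, grid sizes, and tolerances are toy assumptions.

```python
# Hedged sketch: approximate return-to-state value iteration on a belief grid.
import numpy as np

def hmm_step(pi, A, B, y):
    """One HMM filter update; returns the new belief and P(y | pi)."""
    un = B[:, y] * (A.T @ pi)      # assumes strictly positive likelihoods
    return un / un.sum(), un.sum()

def gittins_index_grid(pi0, A, B, c, beta=0.9, n_grid=51, n_iter=200):
    """Approximate Gittins index (up to scaling) of belief pi0 for a 2-state target."""
    _, M = B.shape
    # crude belief grid for a 2-state chain; higher dimensions need a simplex grid
    grid = np.array([[q, 1.0 - q] for q in np.linspace(0.0, 1.0, n_grid)])

    def nearest(pi):
        return int(np.argmin(np.abs(grid[:, 0] - pi[0])))

    V = np.zeros(n_grid)
    for _ in range(n_iter):
        def cost_to_go(b):
            # "continue" from belief b: c'b + beta * E_y[ V(T(b, y)) ]
            total = c @ b
            for y in range(M):
                nb, py = hmm_step(b, A, B, y)
                total += beta * py * V[nearest(nb)]
            return total
        restart = cost_to_go(pi0)                        # value of the restart action
        V = np.array([min(cost_to_go(pi), restart) for pi in grid])
    return V[nearest(pi0)]                               # value at the restart state

# toy 2-state example (illustrative numbers only)
A = np.array([[0.9, 0.1], [0.2, 0.8]])
B = np.array([[0.8, 0.2], [0.3, 0.7]])   # B[x, y] = P(observe y | state x)
c = np.array([0.2, 1.0])
print(gittins_index_grid(np.array([0.5, 0.5]), A, B, c))
```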
POMDP Solver • By defining new parameters (see Eq. 15 of the paper), the return-to-state problem can be written as a standard POMDP. • It can then be solved by any standard POMDP solver, such as Sondik's algorithm, the Witness algorithm, Incremental Pruning, or suboptimal (approximate) algorithms.