Control and Decision Making in Uncertain Multi-agent Hierarchical Systems: A Case Study in Learning and Approximate Dynamic Programming. PI Meeting, August 1st, 2002. Shankar Sastry, University of California, Berkeley.
Outline
• Hierarchical architecture for multiagent operations
• Confronting uncertainty
• Partial observation Markov games (POMgame)
• Model predictive techniques for dynamic replanning
Partial-observation Probabilistic Pursuit-Evasion Game (PEG) with 4 UGVs and 1 UAV: fully autonomous operation
Hierarchy in Berkeley Platform
[Figure: block diagram of the hierarchy. Blocks: Strategy Planner, Map Builder, Communications Network, Tactical Planner & Regulation (tactical planner, trajectory planner, regulation), Vehicle-level sensor fusion, UAV dynamics, UGV dynamics. Sensors: vision, ultrasonic altimeter, INS, GPS, actuator encoders. Signals: positions of targets, obstacles, and agents; desired agent actions; targets detected; obstacles detected; state of agents; inertial positions; height over terrain; actuator positions; linear acceleration; angular velocity; control signals. Environment: Terrain, Targets, exogenous disturbance. Callout: Uncertainty pervades every layer!]
POMGAME: Representing and Managing Uncertainty
• Uncertainty is introduced through various channels
  • Sensing -> unable to determine the current state of the world
  • Prediction -> unable to infer the future state of the world
  • Actuation -> unable to take the desired action to properly affect the state of the world
• Different types of uncertainty can be addressed by different approaches
  • Nondeterministic uncertainty: Robust Control
  • Probabilistic uncertainty: (Partially Observable) Markov Decision Processes
  • Adversarial uncertainty: Game Theory
Markov Games
• Framework for sequential multiagent interaction in a Markov environment
Policy for Markov Games
• The policy of agent i at time t is a mapping from the current state to a probability distribution over its action set.
• Agent i wants to maximize the expected infinite discounted sum of the rewards it gains by executing its policy from the current state, $E\left[\sum_{t=0}^{\infty} \gamma^{t} r^{i}_{t}\right]$, where $\gamma \in [0, 1)$ is the discount factor and $r^{i}_{t}$ is the reward received at time t.
• Performance measure: [equation missing in source]
• Every discounted Markov game has at least one stationary optimal policy, but not necessarily a deterministic one.
• Special case: Markov decision processes (MDPs)
  • Can be solved by dynamic programming
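The remark that an MDP can be solved by dynamic programming can be illustrated with value iteration. The following is a minimal sketch, not the method used in the talk: the function name `value_iteration`, the array layouts, and the two-state example MDP are all assumptions made for illustration.

```python
import numpy as np


def value_iteration(P, R, gamma=0.9, tol=1e-8, max_iter=10_000):
    """Value iteration for a finite MDP (illustrative layout, assumed here).

    P[a, s, t] is the probability of moving from state s to state t under
    action a; R[s, a] is the immediate reward for taking action a in state s.
    Returns the value function and a greedy deterministic policy.
    """
    n_states = P.shape[1]
    V = np.zeros(n_states)
    for _ in range(max_iter):
        # Bellman backup: Q[s, a] = R[s, a] + gamma * sum_t P[a, s, t] * V[t]
        Q = R + gamma * np.einsum("ast,t->sa", P, V)
        V_new = Q.max(axis=1)
        converged = np.max(np.abs(V_new - V)) < tol
        V = V_new
        if converged:
            break
    return V, Q.argmax(axis=1)


# Hypothetical MDP: action 0 stays put, action 1 switches state.
# Staying in state 1 pays reward 1; everything else pays 0.
P = np.array([[[1.0, 0.0], [0.0, 1.0]],
              [[0.0, 1.0], [1.0, 0.0]]])
R = np.array([[0.0, 0.0],
              [1.0, 0.0]])
V, policy = value_iteration(P, R, gamma=0.9)
```

In this toy problem the greedy policy switches out of state 0 and then stays in state 1, so the values approach 9 and 10 for a discount factor of 0.9.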
Policy for POMGames
• Agent i wants to receive at least [equation missing in source]
• Poorly understood: analysis exists only for very specially structured games, such as games with complete information on one side
• Special case: partially observable Markov decision processes (POMDPs)
Experimental Results: Pursuit-Evasion Games with 4 UGVs (Spring ’01)
Experimental Results: Pursuit-Evasion Games with 4 UGVs and 1 UAV (Spring ’01)
Pursuit-Evasion Game Experiment
• PEG with four UGVs
• Global-max pursuit policy
• Simulated camera view (radius 7.5 m with a 50-degree conic view)
• Maximum speeds: pursuer 0.3 m/s, evader 0.5 m/s
Experimental Results: Evaluation of Policies for Different Visibility
[Figure: capture time of the greedy and global-max policies for different visibility regions of the pursuers; 3 pursuers with trapezoidal or omni-directional view; randomly moving evader]
• The global-max policy performs better than the greedy policy, since the greedy policy selects movements based only on local considerations.
• Both policies perform better with the trapezoidal view, since the camera rotates fast enough to compensate for the narrow field of view.
Experimental Results: Evader’s Speed vs. Intelligence
[Figure: capture time for different speeds and levels of intelligence of the evader; 3 pursuers with trapezoidal view and global-max policy; max speed of pursuers: 0.3 m/s]
• A more intelligent evader increases the capture time.
• An intelligent evader is harder to capture at a higher speed.
• When the evader is only slightly faster than the pursuers, a fast random evader is captured sooner than a slower random evader.
Game-theoretic Policy Search Paradigm
• Solving very small games with partial information, or games with full information, is sometimes computationally tractable.
• Many interesting games, including pursuit-evasion, are large games with partial information, and finding optimal solutions is well outside the capability of current algorithms.
• An approximate solution is not necessarily bad. There might be simple policies with satisfactory performance -> choose a good policy from a restricted class of policies!
• We can find approximately optimal solutions from restricted classes, using sparse sampling and a provably convergent policy search algorithm.
Constructing a Policy Class
• Given a mission with specific goals, we
  • decompose the problem in terms of the functions that need to be achieved for success and the means that are available
  • analyze how a human team would solve the problem
  • determine a list of important factors that complicate task performance, such as safety or physical constraints
• Maximize aerial coverage
• Stay within communications range
• Penalize actions that lead an agent to a danger zone
• Maximize the explored region
• Minimize fuel usage, …
Policy Representation
• Quantize the above features and define a feature vector that consists of the estimates of the above quantities for each action, given the agents’ history.
• Estimate the ‘goodness’ of each action by combining the feature vector with a weighting vector that is to be learned. [equation missing in source]
• Choose the action that maximizes this estimate.
• Or choose a randomized action according to a distribution over these estimates, with a parameter that sets the degree of exploration. [equation missing in source]
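One common way to realize a policy of this shape is a linear score followed by either an argmax or a softmax draw. The slide's actual formulas did not survive, so the linear form, the softmax distribution, the `temperature` parameter standing in for the degree of exploration, and all names below are assumptions; this is a sketch of the idea, not the representation used in the talk.

```python
import numpy as np


def action_scores(features, w):
    """Score each action as a weighted sum of its features.

    features: array of shape (n_actions, n_features), one row per action.
    w: weighting vector of shape (n_features,), the quantity to be learned.
    The linear form is an assumption made for this sketch.
    """
    return np.asarray(features, dtype=float) @ np.asarray(w, dtype=float)


def greedy_action(features, w):
    """Deterministic choice: the action with the highest score."""
    return int(np.argmax(action_scores(features, w)))


def softmax_action_probs(features, w, temperature=1.0):
    """Randomized choice: softmax over scores (assumed distribution).

    A larger temperature spreads probability more evenly (more exploration);
    a smaller one concentrates it on the best-scoring action.
    """
    scaled = action_scores(features, w) / temperature
    scaled = scaled - scaled.max()  # subtract the max for numerical stability
    p = np.exp(scaled)
    return p / p.sum()


# Hypothetical feature values for three candidate actions and two features.
features = np.array([[1.0, 0.0],
                     [0.0, 1.0],
                     [0.5, 0.5]])
w = np.array([2.0, 1.0])
best = greedy_action(features, w)
probs = softmax_action_probs(features, w, temperature=1.0)
rng = np.random.default_rng(0)
sampled = int(rng.choice(len(probs), p=probs))
```

Keeping the scoring function separate from the selection rule means the same learned weights can drive either the deterministic or the randomized variant.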
Policy Search Paradigm
• Searching for optimal policies is very difficult, even though there might be simple policies with satisfactory performance.
• Choose a good policy from a restricted class of policies!
• Policy Search Problem
PEGASUS (Ng & Jordan, 2000)
• Given a POMDP $X$, and assuming a deterministic simulator, we can construct an equivalent POMDP $X'$ with deterministic transitions.
• For each policy $\pi \in \Pi$ for $X$, we can construct an equivalent policy $\pi' \in \Pi'$ for $X'$ such that they have the same value function, i.e. $V_X(\pi) = V_{X'}(\pi')$.
• It suffices for us to find a good policy for the transformed POMDP $X'$.
• The value function can be approximated by a deterministic function: $m_s$ samples are taken and reused to compute the value function for each candidate policy.
-> Then we can use standard optimization techniques to search for an approximately optimal policy.
Performance Guarantee & Scalability
• Theorem: [statement missing in source]
• We are guaranteed to have a policy whose value is close to the optimal value within the class $\Pi$.
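The sample-reuse idea can be sketched on a toy problem: draw the random numbers once, treat the simulator as a deterministic function of those numbers, and evaluate every candidate policy on the same draws. Everything concrete below is an assumption for illustration (the 1-D dynamics, the linear feedback policy with gain `theta`, the quadratic reward, the grid search, and all function names); it is not Ng and Jordan's code or the experiment from the talk.

```python
import numpy as np


def make_scenarios(m, horizon, seed=0):
    """Draw m scenarios of uniform random numbers once, for later reuse.

    Column 0 fixes the initial state; the remaining columns fix the noise.
    """
    rng = np.random.default_rng(seed)
    return rng.uniform(size=(m, horizon + 1))


def simulate(theta, u, gamma=0.95, sigma=0.1):
    """Deterministic rollout: for fixed u, the return depends only on theta."""
    x = 2.0 * u[0] - 1.0                        # initial state in [-1, 1]
    total = 0.0
    for t, u_t in enumerate(u[1:]):
        a = -theta * x                          # linear feedback policy
        x = x + a + sigma * (2.0 * u_t - 1.0)   # noise comes from the scenario
        total += gamma ** t * (-(x ** 2))       # reward: stay near the origin
    return total


def estimate_value(theta, scenarios):
    """Average return over the fixed scenarios (a deterministic function)."""
    return float(np.mean([simulate(theta, u) for u in scenarios]))


def policy_search(thetas, scenarios):
    """Pick the best gain from a finite candidate set by direct comparison."""
    values = [estimate_value(th, scenarios) for th in thetas]
    best = int(np.argmax(values))
    return float(thetas[best]), values[best]


scenarios = make_scenarios(m=100, horizon=20, seed=0)
best_theta, best_value = policy_search(np.linspace(0.0, 1.5, 16), scenarios)
```

Because the scenarios are fixed, evaluating the same gain twice returns exactly the same number, which is what lets an ordinary optimizer (here a plain grid search) compare candidates without sampling noise between evaluations.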
Acting under Partial Observations
• Computing the value function is very difficult under partial observations.
• Naïve approaches for dealing with partial observations:
  • State-free deterministic policy: a mapping from observation to action
    • Ignores partial observability (i.e., treats observations as if they were the states of the environment)
    • Finding an optimal mapping is NP-hard. Even the best policy can have very poor performance or can cause a trap.
  • State-free stochastic policy: a mapping from observation to a probability distribution over actions
    • Finding an optimal mapping is still NP-hard.
    • Agents still cannot learn from the reward or penalty received in the past.
Example: Abstraction of Pursuit-Evasion Game
• Consider a partial-observation stochastic pursuit-evasion game in a 2-D grid world, between (heterogeneous) teams of $n_e$ evaders and $n_p$ pursuers.
• At each time t, each evader and pursuer, at its respective location,
  • takes the observation over its visibility region
  • updates its belief state
  • chooses an action from its action set
• Goal: capture of the evader (pursuers), or survival (evaders)
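The belief-update step in the loop above can be sketched as a grid Bayes filter over the evader's position. The slide does not give the motion or sensor model, so the random-walk evader, the perfect "no detection" sensor, and the function names `predict` and `update_no_detection` are assumptions chosen to keep the example small.

```python
import numpy as np


def predict(belief):
    """Prediction step under an assumed random-walk evader.

    The evader stays or moves to one of its 4 neighbours, each with
    probability 1/5; a move that would leave the grid keeps it in place.
    Row index grows downward, column index grows to the right.
    """
    new = belief / 5.0                      # stay
    new[1:, :] += belief[:-1, :] / 5.0      # move down
    new[-1, :] += belief[-1, :] / 5.0       # blocked at the bottom edge
    new[:-1, :] += belief[1:, :] / 5.0      # move up
    new[0, :] += belief[0, :] / 5.0         # blocked at the top edge
    new[:, 1:] += belief[:, :-1] / 5.0      # move right
    new[:, -1] += belief[:, -1] / 5.0       # blocked at the right edge
    new[:, :-1] += belief[:, 1:] / 5.0      # move left
    new[:, 0] += belief[:, 0] / 5.0         # blocked at the left edge
    return new


def update_no_detection(belief, visible):
    """Correction step when no evader is seen inside the visibility region.

    visible: boolean mask of observed cells. A perfect sensor is assumed,
    so observed-empty cells get zero probability before renormalizing.
    """
    post = np.where(visible, 0.0, belief)
    total = post.sum()
    if total == 0.0:
        raise ValueError("observation is inconsistent with the belief")
    return post / total


# Hypothetical 5x5 grid with a uniform prior and a 2x2 visibility region.
belief = np.full((5, 5), 1.0 / 25.0)
visible = np.zeros((5, 5), dtype=bool)
visible[:2, :2] = True
belief = update_no_detection(predict(belief), visible)
```

A noisy sensor would replace the hard zeroing with a per-cell likelihood, but the predict-then-correct structure stays the same.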
Example: Policy Feature
• Maximize collective aerial coverage -> maximize the distance between agents, measured at the location each pursuer would reach by taking the candidate action from its current position [equation missing in source]
• Try to visit an unexplored region with a high probability of detecting an evader, using the position reached by the action that maximizes the evader map value along the frontier [equation missing in source]
Example: Policy Feature (Continued)
• Prioritize actions that are more compatible with the agents’ dynamics
• Policy representation
Benchmarking Experiments
• Performance of two pursuit policies compared in terms of capture time
• Experiment 1: two pursuers against an evader who moves greedily with respect to the pursuers’ locations
• Experiment 2: with the evader’s position at each step detected by the sensor network with only 10% accuracy, two optimized pursuers took 24.1 steps on average to capture the evader on a 30-by-30 grid, while the one-step greedy pursuers took over 146 steps.
Modeling RUAV Dynamics
[Figure: block diagram of the RUAV model. Blocks: Augmented Servodynamics, Aerodynamic Analysis, Tractable Nonlinear Model, Coordinate Transformation. Labels: servo inputs, throttle, longitudinal flapping, lateral flapping, main rotor collective pitch, tail rotor collective pitch, body velocities, angular rates, position, spatial velocities, angles, angular rates]
Benchmarking Trajectory
• Nonlinear, coupled dynamics are intrinsic characteristics of pirouette and nose-in circle trajectories.
• Example: a PD controller fails to achieve nose-in circle type trajectories.
Reinforcement Learning Policy Search Control Design
• Aerodynamics/kinematics generates a model to identify.
• Locally weighted Bayesian regression is used for nonlinear stochastic identification: we get the posterior distribution of the parameters, and can easily simulate the posterior predictive distribution to check the fit and robustness.
• A controller class is defined from the identification process and physical insights, and we apply a policy search algorithm.
• We obtain approximately optimal controller parameters by reinforcement learning, i.e., training using the flight data and the reward function.
• Considering the controller performance with a confidence interval of the identification process, we measure the safety and robustness of the control system.
Performance of RL Controller
[Figure panels: ascent & 360° x2 pirouette; manual vs. autonomous hover]
Toughest Maneuvers for Rotorcraft UAVs
[Figure panels: maneuver 1, maneuver 2, maneuver 3, pirouette; labels: nose-in during circling, heading kept the same]
• Any variation of the following maneuvers in the x-y direction
• Any combination of the following maneuvers
From PEG to More Realistic Battlefield Scenarios
• Adversarial attack
  • Reds do not just evade, but also attack -> Blues cannot blindly pursue reds.
• Unknown number/capability of adversary -> dynamic selection of the relevant red model from unstructured observations
• Deconfliction between layers and teams
• Increase the number of features -> diversify possible solutions when the uncertainty is high
Why General-sum Games? "All too often in OR dealing with military problems, war is viewed as a zero-sum two-person game with perfect information. Here I must state as forcibly as I know that war is not a zero-sum two-person game with perfect information. Anybody who sincerely believes it is a fool. Anybody who reaches conclusions based on such an assumption and then tries to peddle these conclusions without revealing the quicksand they are constructed on is a charlatan....There is, in short, an urgent need to develop positive-sum game theory and to urge the acceptance of its precepts upon our leaders throughout the world." Joseph H. Engel, Retiring Presidential Address to the Operations Research Society of America, October 1969
General-sum Games
• Depending on the cooperation between the players:
  • Noncooperative
  • Cooperative
• Depending on the least expected payoff that a player is willing to accept: Nash’s special/general bargaining solution
• By restricting the blue and red policy classes to be of finite size, we reduce the POMGame to a bimatrix game.
From POMGame to Bimatrix Game
• A bimatrix game usually has multiple Nash equilibria, with different values.
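The point about multiple equilibria with different values can be checked on a small bimatrix game. The sketch below enumerates only pure-strategy Nash equilibria by testing mutual best responses; the payoff matrices are a hypothetical coordination-style game, not payoffs from the pursuit-evasion experiments, and mixed equilibria are deliberately left out.

```python
import numpy as np


def pure_nash_equilibria(A, B):
    """Enumerate pure-strategy Nash equilibria of a bimatrix game.

    A[i, j] is the row player's (blue) payoff and B[i, j] the column
    player's (red) payoff when blue plays i and red plays j. A cell is an
    equilibrium when neither player gains by deviating unilaterally.
    Returns a list of (i, j, blue_payoff, red_payoff) tuples.
    """
    A = np.asarray(A, dtype=float)
    B = np.asarray(B, dtype=float)
    equilibria = []
    for i in range(A.shape[0]):
        for j in range(A.shape[1]):
            blue_best = A[i, j] >= A[:, j].max()   # best response to column j
            red_best = B[i, j] >= B[i, :].max()    # best response to row i
            if blue_best and red_best:
                equilibria.append((i, j, float(A[i, j]), float(B[i, j])))
    return equilibria


# Hypothetical payoffs: both players prefer to coordinate, but each
# prefers a different coordinated outcome.
A = [[2, 0],
     [0, 1]]
B = [[1, 0],
     [0, 2]]
equilibria = pure_nash_equilibria(A, B)
```

Here the two pure equilibria are (0, 0) with payoffs (2, 1) and (1, 1) with payoffs (1, 2), so which equilibrium is played changes what each side receives.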
Elucidating Adversarial Intention
• The model posterior distribution can be used to predict future observations or to select the model.
• Then the blue team can employ the policy such that [equation missing in source]
• Example implemented: tracking an unknown number of evaders with unknown dynamics using noisy sensors
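Selecting a red model from a posterior over models can be sketched as a sequential Bayes update. The two candidate models, their Gaussian observation likelihoods, the observed speeds, and the function names are all hypothetical; the slide does not specify how the posterior is computed.

```python
import numpy as np


def gaussian_loglik(obs, mean, std):
    """Log-density of a scalar observation under a Gaussian model."""
    return -0.5 * ((obs - mean) / std) ** 2 - np.log(std * np.sqrt(2.0 * np.pi))


def update_model_posterior(prior, log_likelihoods):
    """One Bayes step over a finite set of candidate models.

    posterior is proportional to prior * likelihood; the update is done in
    log space and shifted by the maximum to avoid underflow.
    """
    log_post = np.log(np.asarray(prior, dtype=float)) + np.asarray(log_likelihoods)
    log_post = log_post - log_post.max()
    post = np.exp(log_post)
    return post / post.sum()


# Hypothetical red models: a "slow" evader (mean speed 0.1 m/s) and a
# "fast" evader (mean speed 0.5 m/s), both observed with std 0.1 m/s.
model_means = np.array([0.1, 0.5])
posterior = np.array([0.5, 0.5])
for observed_speed in [0.45, 0.55, 0.50]:
    logliks = gaussian_loglik(observed_speed, model_means, 0.1)
    posterior = update_model_posterior(posterior, logliks)

selected_model = int(np.argmax(posterior))
```

After these three observations nearly all of the posterior mass sits on the fast model, and the blue team could then plan against that model.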
Dynamic Bayesian Model Selection
• Dynamic Bayesian model selection (DBMS) is a generalized model selection approach for time series data in which the number of components can vary with time.
• If K is the number of components at any instant and T is the length of the time series, then there are $O(2^{KT})$ possible models, which demands an efficient algorithm. For example, K = 3 and T = 10 already give $2^{30} \approx 10^{9}$ candidate models.
• The problem is formulated using Bayesian hierarchical modeling and solved using suitably adapted reversible jump MCMC methods.
DBMS: Graphical Representation
• a – Dirichlet prior
• A – transition matrix for $m_t$
• $d_t$ – Dirichlet prior
• $w_t$ – component weights
• $z_t$ – allocation variable
• F – transition dynamics
[Figure legend: estimated target position; observation; + true target trajectory]
Vision-based Landing of an Unmanned Aerial Vehicle Berkeley Researchers: Rene Vidal, Omid Shakernia, Shankar Sastry
What we have accomplished • Real-time motion estimation algorithms • Algorithms: Linear & Nonlinear two-view, Multi-view • Fully autonomous vision-based control/landing
Vision System Hardware
[Figure labels: UAV, pan/tilt camera, onboard computer]
• Ampro embedded Little Board PC
  • Pentium 233 MHz running Linux
  • Motion estimation, UAV high-level control
• Pan/Tilt/Zoom camera tracks target
• Motion estimation algorithms
  • Written in C++ using LAPACK
  • Estimate relative position and orientation at 30 Hz
• Sends control to navigation computer at 10 Hz