Security in Multiagent Systems by Policy Randomization Praveen Paruchuri, Milind Tambe, Fernando Ordonez (University of Southern California) Sarit Kraus (Bar-Ilan University, Israel and University of Maryland, College Park)
Motivation: The Prediction Game • A UAV (Unmanned Aerial Vehicle) • Flies between the 4 regions • Can you predict the UAV's flight pattern? • Pattern 1 • 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4, … • Pattern 2 • 1, 4, 3, 1, 1, 4, 2, 4, 2, 3, 4, 3, … (as generated by a 4-sided die) • Can you predict pattern 2 even if its first 100 numbers are given? • Randomization decreases predictability • Increases security
Problem Definition • Problem: Increase security by decreasing predictability for an agent/agent team acting in uncertain adversarial environments • Even if the policy is given to the adversary, it remains secure • Efficient algorithms for the reward/randomness tradeoff • Assumptions for the agent/agent team: • Adversary is unobservable • Adversary's actions, capabilities and payoffs are unknown • Assumptions for the adversary: • Knows the agents' plan/policy • Exploits action predictability • Can see the agent's state (or belief state)
Solution Technique • Technique developed: • Intentional policy randomization • MDP/POMDP framework • Sequential decision making • MDP: Markov Decision Process • POMDP: Partially Observable MDP • Increase security => solve a multi-criteria problem for the agents • Maximize action unpredictability (policy randomization) • Maintain reward above a threshold (quality constraints)
Domains • Scheduled activities at airports such as security checks, refueling, etc. • Observable by anyone • Randomization of schedules is helpful • UAV/UAV team patrolling a humanitarian mission • Adversary disrupts the mission: can disrupt food supplies, harm refugees, shoot down UAVs, etc. • Randomize the UAV patrol policy
My Contributions • Two main contributions • Single agent case: • Formulate as a non-linear program with an entropy-based metric • Convert to a linear program called BRLP (Binary Search for Randomization LP) • Randomize single agent policies with reward > threshold • Multi agent case: RDR (Rolling Down Randomization) • Randomized policies for decentralized POMDPs • Threshold on team reward
MDP based single agent case • An MDP is a tuple < S, A, P, R > • S: set of states • A: set of actions • P: transition function • R: reward function • Basic terms used: • x(s,a): expected number of times action a is taken in state s • Policy (as a function of the MDP flows): π(a|s) = x(s,a) / Σ_a' x(s,a') (see the background sketch below)
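For context on the flow variables, the standard max-reward LP over MDP flows (occupancy measures) that the later formulations appear to build on is sketched below in LaTeX; this is a textbook restatement with α_s the initial flow into state s, not copied from the paper:

\max_{x \ge 0} \; \sum_{s \in S} \sum_{a \in A} R(s,a)\, x(s,a)
\text{s.t.} \quad \sum_{a \in A} x(s,a) \;-\; \sum_{s' \in S} \sum_{a' \in A} P(s \mid s', a')\, x(s', a') \;=\; \alpha_s \qquad \forall s \in S
\pi(a \mid s) \;=\; \frac{x(s,a)}{\sum_{a'} x(s,a')}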
Entropy : Measure of randomness • Randomness or information content: entropy (Shannon 1948) • Entropy for an MDP • Additive entropy: add the entropies of each state (the policy π is a function of x) • Weighted entropy: weigh each state by its contribution to the total flow, where α_j is the initial flow of the system (reconstructed definitions sketched below)
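The entropy formulas on this slide did not survive extraction; the following is a reconstruction of what the two bullets appear to describe, written with the flow-based policy π_x(a|s) = x(s,a) / Σ_a' x(s,a'). Treat the normalization by Σ_j α_j as my reading of the slide rather than the paper's verbatim definition:

H_A(x) = - \sum_{s \in S} \sum_{a \in A} \pi_x(a \mid s) \log \pi_x(a \mid s) \qquad \text{(additive entropy)}
H_W(x) = - \sum_{s \in S} \frac{\sum_{a} x(s,a)}{\sum_{j} \alpha_j} \sum_{a \in A} \pi_x(a \mid s) \log \pi_x(a \mid s) \qquad \text{(weighted entropy)}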
Tradeoff : Reward vs Entropy • Non-linear program: maximize entropy with reward above a threshold (a sketch follows) • The objective (entropy) is non-linear • BRLP (Binary Search for Randomization LP): • A linear program • No explicit entropy calculation; entropy is treated as a function of the flows
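Putting the pieces together, the non-linear program referred to above plausibly has the following shape; E_min is my symbol for the reward threshold, and this is a sketch rather than the paper's exact statement:

\max_{x \ge 0} \; H_W(x)
\text{s.t.} \quad \sum_{a} x(s,a) \;-\; \sum_{s', a'} P(s \mid s', a')\, x(s', a') \;=\; \alpha_s \qquad \forall s \in S
\sum_{s} \sum_{a} R(s,a)\, x(s,a) \;\ge\; E_{\min}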
BRLP • Inputs: a high-entropy policy and a target reward (n% of the maximum reward) • Polynomial-time convergence • Monotonicity: entropy decreases or stays constant as the required reward increases • Control is through the parameter β (binary search over β; see the sketch below) • The input can be any high-entropy policy • One such input is the uniform policy • Equal probability for all actions in every state
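A minimal sketch in Python of the binary-search control loop, assuming only that the reward of the LP on the next slide decreases monotonically as β grows from 0 (deterministic, max reward) to 1 (the high-entropy input policy). The function solve_lp_reward is a hypothetical stand-in for solving that LP at a given β and returning its expected reward:

from typing import Callable

def brlp_binary_search(solve_lp_reward: Callable[[float], float],
                       target_reward: float,
                       tol: float = 1e-4) -> float:
    """Return the largest beta whose LP reward still meets the target.

    solve_lp_reward(beta) is assumed monotonically non-increasing in beta:
    beta = 0 recovers the deterministic max-reward policy,
    beta = 1 recovers the high-entropy input policy.
    """
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if solve_lp_reward(mid) >= target_reward:
            lo = mid   # reward still acceptable: push toward more randomness
        else:
            hi = mid   # reward too low: back off toward determinism
    return lo

# Toy usage with a made-up monotone reward curve (illustration only)
if __name__ == "__main__":
    fake_reward = lambda beta: 100.0 - 40.0 * beta   # hypothetical reward(beta)
    print(round(brlp_binary_search(fake_reward, target_reward=80.0), 3))   # ~0.5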
LP for Binary Search • Policy expressed as a function of β and the high-entropy input policy • Linear program (a reconstruction is sketched below)
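A plausible reconstruction of the LP(β) this slide refers to, under the assumption that β pulls the flows toward the high-entropy input policy \hat{\pi} (for the uniform input, \hat{\pi}(a|s) = 1/|A|); this is my reading of the slide, not a verbatim copy of the paper's program:

\max_{x \ge 0} \; \sum_{s} \sum_{a} R(s,a)\, x(s,a)
\text{s.t.} \quad \sum_{a} x(s,a) \;-\; \sum_{s', a'} P(s \mid s', a')\, x(s', a') \;=\; \alpha_s \qquad \forall s \in S
x(s,a) \;\ge\; \beta \, \hat{\pi}(a \mid s) \sum_{a'} x(s,a') \qquad \forall s, a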
BRLP in Action • Binary search on β starting at β = 0.5, between β = 1 (the max-entropy input policy) and β = 0 (the deterministic max-reward policy), until the target reward is reached
Results (averaged over 10 MDPs) • Highest entropy: the Expected Entropy method, with a 10% average gain over BRLP • Fastest: BRLP, with a 7-fold average speedup over the Expected Entropy method
Multi Agent Case: Problem • Maximize entropy for agent teams subject to a reward threshold • For the agent team: • Decentralized POMDP framework used • Agents know the initial joint belief state • No communication possible between agents • For the adversary: • Knows the agents' policy • Exploits action predictability • Can calculate the agents' belief state
RDR : Rolling Down Randomization • Input: • Best (local or global) deterministic policy • Percentage of reward loss allowed • d parameter: sets the number of turns each agent gets • Ex: d = 0.5 => number of steps = 1/d = 2 • Each agent gets one turn (for the 2-agent case) • A single agent MDP problem is solved at each step • For agent 1's turn: • Fix the policy of the other agent (agent 2) • Find a randomized policy that • Maximizes the joint entropy (w1 * Entropy(agent 1) + w2 * Entropy(agent 2)) • Maintains the joint reward above the threshold, which rolls down step by step (see the sketch below)
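As I read the algorithm (and consistent with the d = 0.5 example on the next slide), the subproblem at turn k has roughly this shape; ε is my symbol for the allowed fraction of reward loss and R_max for the reward of the input deterministic joint policy:

For k = 1, ..., 1/d, with the other agents' policies held fixed:
maximize   w1 * H(agent 1's policy) + w2 * H(agent 2's policy)
subject to   joint expected reward >= (1 - k * d * ε) * R_max
With d = 0.5 and ε = 20%, the thresholds are 90% of R_max at step 1 and 80% at step 2.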
RDR : d = .5 • Step 1, agent 1's turn: maximize the joint entropy subject to joint reward >= 90% of the maximum reward • Step 2, agent 2's turn: maximize the joint entropy subject to joint reward >= 80% of the maximum reward
Experimental Results : Reward Threshold vs Weighted Entropy (averaged over 10 instances)
Summary • Intentional randomization as the main focus • Single agent case: • BRLP algorithm introduced • Multi agent case: • RDR algorithm introduced • A multi-criteria problem is solved that • Maximizes entropy • Maintains reward > threshold
Thank You • Any comments/questions?
Difference between safety and security? • Security: the ability of the system to deal with threats that are intentionally caused by other intelligent agents and/or systems • Safety: a system's safety is its ability to deal with any other threats to its goals
Define Distributed POMDP • A Dec-POMDP is a tuple <S, A, P, Ω, O, R>, where • S: set of states • A: joint action set <a1, a2, …, an> • P: transition function • Ω: set of joint observations • O: observation function, the probability of a joint observation given the current state and the previous joint action; the agents' observations are independent of each other • R: immediate, joint reward • A Dec-MDP is a Dec-POMDP with the restriction that at each time step the agents' observations together uniquely determine the state (a type sketch follows)
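For readers who prefer code, a minimal Python container mirroring the tuple above; the field names and dictionary encodings are my own choices, just one possible representation:

from dataclasses import dataclass
from typing import Dict, List, Tuple

State = str
JointAction = Tuple[str, ...]        # <a1, a2, ..., an>
JointObservation = Tuple[str, ...]

@dataclass
class DecPOMDP:
    states: List[State]                                                    # S
    joint_actions: List[JointAction]                                       # A
    transition: Dict[Tuple[State, JointAction, State], float]              # P(s' | s, a)
    joint_observations: List[JointObservation]                             # Omega
    observation: Dict[Tuple[State, JointAction, JointObservation], float]  # O(o | current s, previous a)
    reward: Dict[Tuple[State, JointAction], float]                         # R(s, a)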
Counterexample : Entropy • Say the adversary shoots down the UAV by targeting its most probable action; call the probability of success the hit rate • Assume the UAV has 3 actions • Two possible probability distributions: • H(1/2, 1/2, 0) = 1 (log base 2) • H(1/2 - delta, 1/4 + delta, 1/4) ≈ 3/2 • Entropy ≈ 3/2, hit rate = 1/2 - delta • Entropy = 1, hit rate = 1/2 • Higher entropy but lower hit rate (a quick numerical check follows)
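A quick numerical check of the two distributions above, written as my own verification script with delta = 0.01 chosen arbitrarily as a small value:

import math

def entropy(p):
    """Shannon entropy in bits, skipping zero-probability entries."""
    return -sum(q * math.log2(q) for q in p if q > 0)

delta = 0.01
for dist in [(0.5, 0.5, 0.0), (0.5 - delta, 0.25 + delta, 0.25)]:
    # hit rate = probability of the most likely action (what the adversary targets)
    print(f"dist={dist}  entropy={entropy(dist):.3f}  hit_rate={max(dist):.3f}")
# Prints roughly: entropy 1.000 / hit rate 0.500, then entropy 1.510 / hit rate 0.490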
d-parameter & Comments on Results • Effect of the d-parameter (averaged over 10 instances): table of RDR average runtimes in seconds, with entropy in parentheses, for T = 2 • Conclusions: • Greater tolerance of reward loss => higher entropy • Reaching maximum entropy is harder than in the single agent case • Lower miscoordination cost implies higher entropy • A d parameter of 0.5 is good for practical purposes
Entropies • For the uniform policy: • 1 + 1/2 * 1 + 2 * 1/4 * 1 + 4 * 1/8 * 1 = 2.5 • If the policy is initially deterministic and then uniform: • 0 + 1 * 1 + 2 * 1/2 * 1 + 4 * 1/4 * 1 = 3 • Hence, uniform policies need not always be optimal