Information Theory of Decisions and Actions Naftali Tishby and Daniel Polani
Contents • Guiding Questions • Introduction
Guiding Questions • Q1. How are Fuster's perception-action cycle and Shannon's information theory related? How is this analogy related to reinforcement learning? • Q2. What is value-to-go? What is information-to-go? How do we trade off between these two terms? Give a formulation that captures this trade-off (hint: the free-energy principle). How can we find the optimal policy, i.e. the one minimizing its information-to-go under a constraint on the attained value-to-go? • Q3. Define entropy. Define the relative entropy, or Kullback-Leibler divergence. Define the Markov decision process (MDP). Define the value function of the MDP. How is the value function optimized? What is the Bellman equation, and how is it related to the MDP problem? What is the relationship between reinforcement learning, the MDP, and the Bellman equation? • Q4. Use a Bayesian network or graphical model (see the figure on page 12) to describe the perception-action cycle of an agent with sensors and memory. What are the characteristics of this agent?
Introduction • We want to understand how intelligent behaviour forms in living organisms and how to develop it for artificial agents. • The "cycle" view, such as the perception-action cycle, can help identify biases, incentives and constraints for the self-organized formation of intelligent processing in living organisms. • There are many ways to model the perception-action cycle quantitatively. • An information-theoretic treatment of the perception-action cycle makes it possible to compare scenarios with differing computational models. • The Markov decision process (MDP) framework solves the problem of finding the optimal policy, i.e. the one maximizing the reward achieved by the agent. • The goal of the paper is to marry the MDP formalism with an information-theoretic treatment of the processing cost required by the agent to attain a given level of performance.
Shannon's Information Theory • What is Shannon's information theory? • A branch of applied mathematics, electrical engineering, and computer science concerned with the quantification of information. • A key measure of information is the entropy, which is usually expressed as the average number of bits needed to store or communicate one symbol in a message. • Entropy quantifies the uncertainty involved in predicting the value of a random variable.
Shannon's Information Theory • Entropy and Information • Entropy of a random variable X: H(X) = -Σ_x p(x) log p(x) • The entropy is a measure of the uncertainty about the outcome of the random variable before it has been measured, or seen, and is a natural choice for this. • It attains its maximum for the uniform distribution, reflecting the state of maximal uncertainty.
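As a concrete illustration (not from the slides), here is a minimal Python sketch of the entropy of a discrete distribution; the function name and example numbers are ours:

```python
import numpy as np

def entropy(p, base=2):
    """Shannon entropy H(X) = -sum_x p(x) log p(x) of a discrete distribution."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                       # convention: 0 * log 0 = 0
    return -np.sum(p * np.log(p)) / np.log(base)

print(entropy([0.9, 0.1]))   # ~0.47 bits: a biased coin is fairly predictable
print(entropy([0.5, 0.5]))   # 1.0 bit: the uniform (fair) coin maximizes uncertainty
```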
Shannon's Information Theory • Conditional entropy of random variables X and Y: H(Y|X) = -Σ_{x,y} p(x,y) log p(y|x) → The conditional entropy measures the remaining uncertainty about Y if X is known.
Shannon's Information Theory • Joint entropy of random variables X and Y: H(X,Y) = -Σ_{x,y} p(x,y) log p(x,y) → The joint entropy is a measure of the uncertainty associated with a set of variables. • Mutual information between X and Y: I(X;Y) = H(Y) - H(Y|X) = Σ_{x,y} p(x,y) log [ p(x,y) / (p(x)p(y)) ] → The mutual information of two random variables is a quantity that measures their mutual dependence.
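To make these quantities concrete, here is a small sketch (ours) that computes the joint entropy, conditional entropy, and mutual information from an illustrative 2x2 joint probability table:

```python
import numpy as np

def entropy_bits(p):
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

# Illustrative joint distribution p(x, y): rows index x, columns index y.
p_xy = np.array([[0.30, 0.10],
                 [0.10, 0.50]])
p_x = p_xy.sum(axis=1)                     # marginal p(x)
p_y = p_xy.sum(axis=0)                     # marginal p(y)

H_xy = entropy_bits(p_xy)                  # joint entropy H(X,Y)
H_y_given_x = H_xy - entropy_bits(p_x)     # chain rule: H(Y|X) = H(X,Y) - H(X)
I_xy = entropy_bits(p_y) - H_y_given_x     # mutual information I(X;Y) = H(Y) - H(Y|X)

print(f"H(X,Y)={H_xy:.3f}  H(Y|X)={H_y_given_x:.3f}  I(X;Y)={I_xy:.3f} bits")
```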
Shannon's Information Theory • Relative entropy (Kullback-Leibler divergence): D_KL(p ‖ q) = Σ_x p(x) log [ p(x) / q(x) ] • The relative entropy is a measure of how much "compression" (or prediction, both in bits) could be gained if, instead of a hypothesized distribution q of X, the actual distribution p is utilized. • One has D_KL(p ‖ q) ≥ 0, with equality if and only if p(x) = q(x) everywhere. • The relative entropy can become infinite if, for an outcome that can occur with nonzero probability p(x) > 0, one assumes a probability q(x) = 0. • The mutual information between two variables X and Y can be expressed as I(X;Y) = D_KL( p(x,y) ‖ p(x)p(y) ).
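A short sketch (ours) of the relative entropy, including its behaviour when q rules out a possible outcome, and a check of the identity I(X;Y) = D_KL(p(x,y) ‖ p(x)p(y)) on the same joint table as above:

```python
import numpy as np

def kl_divergence(p, q):
    """D_KL(p || q) in bits; infinite if q(x) = 0 for some x with p(x) > 0."""
    p = np.asarray(p, dtype=float).ravel()
    q = np.asarray(q, dtype=float).ravel()
    support = p > 0
    if np.any(q[support] == 0):
        return np.inf
    return np.sum(p[support] * np.log2(p[support] / q[support]))

p_xy = np.array([[0.30, 0.10],
                 [0.10, 0.50]])
p_x, p_y = p_xy.sum(axis=1), p_xy.sum(axis=0)

# Mutual information as the divergence from the product of the marginals.
print(kl_divergence(p_xy, np.outer(p_x, p_y)))   # equals I(X;Y) computed above
print(kl_divergence([0.5, 0.5], [1.0, 0.0]))     # inf: q assigns 0 to a possible outcome
```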
Markov Decision Processes • MDP: Definition • Discrete time stochastic control process • Basic model for the interaction of an organism (or an artificial agent) with a stochastic environment • The core problem of MDPs is to find a “policy” for the decision maker • The goal is to choose a policy that will maximize some cumulative function of the random rewards, typically the expected discounted sum over a potentially infinite horizon.
Markov Decision Processes • Given a state set X, and for each state x an action set A(x), an MDP is specified by the tuple (X, A, P, R), defined for all x, x' ∈ X and a ∈ A(x): • P(x'|x, a): the probability that performing action a in state x will move the agent to state x' • R(x, a, x'): the expected reward for this particular transition
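As a hypothetical worked example (numbers ours, not from the paper), a two-state, two-action MDP can be stored as a transition tensor P[x, a, x'] and a reward tensor R[x, a, x']:

```python
import numpy as np

n_states, n_actions = 2, 2

# P[x, a, x']: probability of landing in state x' after taking action a in state x.
P = np.zeros((n_states, n_actions, n_states))
P[0, 0] = [0.90, 0.10]
P[0, 1] = [0.20, 0.80]
P[1, 0] = [0.70, 0.30]
P[1, 1] = [0.05, 0.95]

# R[x, a, x']: expected reward for the transition (x, a) -> x'.
R = np.zeros((n_states, n_actions, n_states))
R[:, :, 1] = 1.0            # in this toy example, reaching state 1 pays reward 1

assert np.allclose(P.sum(axis=2), 1.0)   # each (x, a) pair defines a distribution over x'
```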
Markov Decision Processes • Value function of the MDP and its optimization • A policy π(a|x) specifies an explicit probability to select action a if the agent is in state x • Total cumulated reward: the expected (possibly discounted) sum of rewards collected along the trajectory • Future expected cumulative reward value (Bellman equation): V^π(x) = Σ_a π(a|x) Σ_{x'} P(x'|x, a) [ R(x, a, x') + γ V^π(x') ], with discount factor 0 < γ ≤ 1 • Per-action value function Q, which is expanded from the value function V: Q^π(x, a) = Σ_{x'} P(x'|x, a) [ R(x, a, x') + γ V^π(x') ]
Markov Decision Processes • Bellman equation • A dynamic decision problem: choose actions to maximize the expected cumulative reward, under the constraint that the state evolves according to the transition probabilities P(x'|x, a) • Bellman's principle of optimality: an optimal policy has the property that whatever the initial state and initial decisions are, the remaining decisions must constitute an optimal policy with regard to the state resulting from the first decision • Bellman (optimality) equation: V*(x) = max_a Σ_{x'} P(x'|x, a) [ R(x, a, x') + γ V*(x') ]
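One standard way to solve the Bellman optimality equation is value iteration. The sketch below (ours) reuses the P and R tensors from the toy MDP above and a discount factor gamma:

```python
import numpy as np

def value_iteration(P, R, gamma=0.95, tol=1e-8):
    """Solve V*(x) = max_a sum_x' P(x'|x,a) [R(x,a,x') + gamma * V*(x')]."""
    n_states = P.shape[0]
    V = np.zeros(n_states)
    while True:
        # Q[x, a] = sum_x' P(x'|x,a) * (R(x,a,x') + gamma * V(x'))
        Q = np.einsum('xay,xay->xa', P, R + gamma * V[None, None, :])
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q.argmax(axis=1)     # optimal values and a greedy policy
        V = V_new

# V_opt, greedy_policy = value_iteration(P, R)   # using the tensors defined above
```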
Markov Decision Processes • Reinforcement learning • If the probabilities or rewards are unknown, the problem is one of reinforcement learning • For this purpose it is useful to define a further function Q(x, a), which corresponds to taking action a in state x and then continuing optimally: Q*(x, a) = Σ_{x'} P(x'|x, a) [ R(x, a, x') + γ max_{a'} Q*(x', a') ]
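When P and R are unknown, this Q-function can be estimated from sampled transitions. Below is a minimal tabular Q-learning sketch (ours); `env_step(x, a)` is an assumed environment interface that returns the next state and reward:

```python
import numpy as np

def q_learning(env_step, n_states, n_actions, episodes=500, steps=100,
               alpha=0.1, gamma=0.95, eps=0.1, seed=0):
    """Tabular Q-learning: estimate Q*(x, a) from samples, without knowing P or R."""
    rng = np.random.default_rng(seed)
    Q = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        x = rng.integers(n_states)
        for _ in range(steps):
            # epsilon-greedy exploration
            a = rng.integers(n_actions) if rng.random() < eps else int(Q[x].argmax())
            x_next, r = env_step(x, a)
            # TD update towards r + gamma * max_a' Q(x', a')
            Q[x, a] += alpha * (r + gamma * Q[x_next].max() - Q[x, a])
            x = x_next
    return Q
```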
Markov Decision Processes • Problem • The MDP framework is concerned with describing the task and with solving the problem of finding the optimal policy (value-to-go) • It is not concerned with the actual processing cost involved in carrying out the given policies (information-to-go)
Guiding Questions • Q1. How are Fuster's perception-action cycle and Shannon's information theory related? How is this analogy related to reinforcement learning? • Q3. Define entropy. Define the relative entropy, or Kullback-Leibler divergence. Define the Markov decision process (MDP). Define the value function of the MDP. How is the value function optimized? What is the Bellman equation, and how is it related to the MDP problem? What is the relationship between reinforcement learning, the MDP, and the Bellman equation?
Bayesian Network • Bayesian network of a general agent • W: world state • S: sensor of the agent • M: memory of the agent • A: action
Bayesian Network • Characteristics of the agent • The agent can be considered an all-knowing observer. • The agent has full access to the world state. • The memory of a reactive agent will be ignored. • Applying these simplifications to the graph reduces the perception-action cycle to a chain of world states and actions.
Value-to-go / Information-to-go • Value-to-go • The future expected reward in the course of a behaviour sequence towards a goal • Information-to-go • The cumulated information-processing cost or bandwidth required to specify the future decision and action sequence • Trade-off • From the point of view of its biological ramifications, the question is what optimal reward an organism can accumulate under given constraints on its informational bandwidth • How much reward the organism can accumulate vs. how much informational bandwidth it needs for that
Information-to-go • Formalism • The cumulated information-processing cost or bandwidth required to specify the future decision and action sequence • It is computed by specifying a given starting state x_t and initial action a_t and accumulating information-to-go into the open-ended future • Let p̂ be a fixed prior on the distribution of successive states and actions • Define now the process complexity as the Kullback-Leibler divergence between the actual distribution of states and actions after time t and the one assumed in the prior: ΔI_π(x_t, a_t) = D_KL[ p(x_{t+1}, a_{t+1}, ... | x_t, a_t) ‖ p̂(x_{t+1}, a_{t+1}, ...) ]
Information-to-go • Formalism • Since the process is Markovian, the joint distribution of future states and actions factorizes over time steps; the state transitions are given by p(x_{t+1}|x_t, a_t), and the action distributions are consistent with them via the policy π(a|x), which we assume constant over time for all t
Information-to-go • Formalism • The information-to-go satisfies a Bellman-type recursion:
ΔI_π(x_t, a_t) = Σ_{x_{t+1}} p(x_{t+1}|x_t, a_t) [ log ( p(x_{t+1}|x_t, a_t) / p̂(x_{t+1}) ) + Σ_{a_{t+1}} π(a_{t+1}|x_{t+1}) ( log ( π(a_{t+1}|x_{t+1}) / π̂(a_{t+1}) ) + ΔI_π(x_{t+1}, a_{t+1}) ) ]
with p̂ and π̂ the fixed priors on states and actions.
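To make the recursion concrete, here is a sketch (ours, not the paper's code) that evaluates the information-to-go of a fixed policy by backward recursion over a finite horizon, assuming uniform priors p̂ and π̂ and the P tensor from the toy MDP above:

```python
import numpy as np

def information_to_go(P, pi, horizon=20, p_hat=None, pi_hat=None):
    """Finite-horizon backward recursion for Delta I_pi(x, a), in bits.

    P[x, a, x']   : transition probabilities
    pi[x, a]      : policy pi(a|x)
    p_hat, pi_hat : fixed priors over states / actions (uniform if None)
    """
    n_states, n_actions, _ = P.shape
    p_hat = np.full(n_states, 1.0 / n_states) if p_hat is None else p_hat
    pi_hat = np.full(n_actions, 1.0 / n_actions) if pi_hat is None else pi_hat

    I = np.zeros((n_states, n_actions))
    for _ in range(horizon):
        with np.errstate(divide='ignore', invalid='ignore'):
            # Action term at each successor state x':
            # sum_a' pi(a'|x') [ log2 pi(a'|x')/pi_hat(a') + I(x', a') ]
            act_term = np.where(pi > 0,
                                pi * (np.log2(pi / pi_hat[None, :]) + I),
                                0.0).sum(axis=1)
            # State-information term log2 P(x'|x,a)/p_hat(x'), masked where P = 0.
            state_log = np.where(P > 0, np.log2(P / p_hat[None, None, :]), 0.0)
        I = np.einsum('xay,xay->xa', P, state_log) + P @ act_term
    return I

# Example: information-to-go of a uniformly random policy on the toy MDP.
# pi_uniform = np.full((2, 2), 0.5); print(information_to_go(P, pi_uniform))
```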
Calculating trade-off • Using the Lagrange method • The constrained optimization problem of finding minimal information-to-go at a given level of value-to-go can be turned into an unconstrained one. • Introduce a Lagrange multiplier β and define the Lagrangian F_π(x, a; β) = ΔI_π(x, a) − β Q_π(x, a). • This Lagrangian builds a link to the free-energy formalism: the information-to-go corresponds to the physical entropy, and the (negative) value-to-go corresponds to the energy of the system. • This provides additional justification for the minimization of the information-to-go under value-to-go constraints. • Minimization of the information-to-go identifies the least committed policy, in the sense that the future is the least informative.
Calculating trade-off • Using the Lagrange method • To find the optimal policy, minimize the free energy: F*(x, a; β) = min_π F_π(x, a; β), where the minimization ranges over all policies π(a|x) • This equation is resolved as follows
Calculating trade-off • Using the Lagrange method • Extending the above equation by a Lagrange term for the normalization of π(a|x), taking the gradient with respect to π(a|x), and setting the gradient to 0 provides a Boltzmann-like form for the optimal policy: π(a|x) ∝ π̂(a) exp( −F*(x, a; β) ), normalized by a partition function Z(x, β) = Σ_a π̂(a) exp( −F*(x, a; β) )
Calculating trade-off • Using the Lagrange method • Iterating the above system of self-consistent equations until convergence for every state produces an optimal policy.
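A sketch (ours, simplified, not the paper's exact equations) of this kind of self-consistent iteration: a backward sweep evaluates a free energy F(x, a) that combines the per-step information terms with −β times the reward, using the soft minimum that corresponds to the Boltzmann-like optimal policy, and the policy is then read off as π(a|x) ∝ π̂(a) exp(−F(x, a)). Uniform priors, a finite horizon, and the toy P and R tensors from above are assumed:

```python
import numpy as np

def free_energy_policy(P, R, beta=1.0, horizon=30, p_hat=None, pi_hat=None):
    """Backward free-energy sweep and Boltzmann-like policy (simplified sketch)."""
    n_states, n_actions, _ = P.shape
    p_hat = np.full(n_states, 1.0 / n_states) if p_hat is None else p_hat
    pi_hat = np.full(n_actions, 1.0 / n_actions) if pi_hat is None else pi_hat

    F = np.zeros((n_states, n_actions))
    for _ in range(horizon):
        # Soft minimum over next actions: -log sum_a' pi_hat(a') exp(-F(x', a')),
        # stabilized against overflow; this is the contribution of the optimal policy step.
        m = F.min(axis=1)
        soft_min = m - np.log(np.exp(-(F - m[:, None])) @ pi_hat)
        with np.errstate(divide='ignore', invalid='ignore'):
            state_info = np.where(P > 0, np.log(P / p_hat[None, None, :]), 0.0)
        # F(x, a) = E_x' [ state-information term - beta * reward + soft_min(x') ]
        F = np.einsum('xay,xay->xa', P, state_info - beta * R) + P @ soft_min

    # Policy: prior times exp(-F), normalized per state (row-wise).
    pi = pi_hat[None, :] * np.exp(-(F - F.min(axis=1, keepdims=True)))
    pi /= pi.sum(axis=1, keepdims=True)
    return pi, F

# pi_opt, F = free_energy_policy(P, R, beta=1.0)
# A larger beta weights value-to-go more heavily and pushes pi toward the greedy policy.
```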