  1. ECE-517: Reinforcement Learning in Artificial Intelligence – Lecture 1: Course Logistics, Introduction. August 23, 2012. Dr. Itamar Arel, College of Engineering, Electrical Engineering and Computer Science Department, The University of Tennessee. Fall 2012

  2. But first, a quick anonymous survey …

  3. Outline • Course logistics and requirements • Course roadmap & outline • Introduction

  4. Course Objectives • Introduce the concepts & principles governing reinforcement-based machine learning systems • Review fundamental theory • Markov Decision Processes (MDPs) • Dynamic Programming (DP) • Practical systems; the role of Neural Networks in Neuro-Dynamic Programming (NDP) • RL schemes (Q-Learning, TD-Learning, etc.) • Limitations of existing techniques and how they can be improved • Discuss software and hardware implementation considerations Long-term goal: to contribute to your understanding of the formalism, trends and challenges in constructing RL-based agents

  5. Course Prerequisites • A course on probability theory or background in probability theory is required • Matlab/C/C++ competency • A Matlab tutorial has been posted on the course website (under the schedule page) • Open mindedness & imagination …

  6. Course Assignments • 2 small projects • Main goal – provide students with basic hands-on experience in ML behavioral simulation and result interpretation • Analysis complemented by simulation • MATLAB programming oriented • Reports should include all background, explanations and results • 5 problem assignment sets • Will cover the majority of the topics discussed in class • Assignments should be handed in before the beginning of the class • Final project • Each student/group is assigned a topic • Project report & in-class presentation

  7. Sony AIBO Lab • Located at MK 630 • 6 Sony AIBO dog robots (3rd generation) • Local wireless network (for communicating with the dogs) • Code for lab project(s) will be written in Matlab • An interface has been prepared • Time slots should be coordinated with the Instructor & TA

  8. Textbooks & Reference Material • Lecture notes will be posted weekly on the course website (web.eecs.utk.edu/~itamar/courses/ECE-517) as well as … • Updated schedule • Assignment sets, sample codes • Grades • General announcements • Reading assignments will be posted on schedule page • Textbook: • R. Sutton and A. Barto, “Reinforcement Learning: An Introduction,” 1998. (available online!)

  9. Grading policy & office hours • 2 Small Projects – 25% (12.5 points each) • 5 Assignment sets – 25% (5 points each) • Midterm (in-class) – 20% • Final project – 30% • Instructor: Dr. Itamar Arel • Office Hours: T/Tr 2:00 – 3:00 PM (MK 608) • TA: Derek Rose (derek@utk.edu) – office @ MK 606 • Office hours: contact TA • My email: itamar@eecs.utk.edu Students are strongly encouraged to visit the course website (web.eecs.utk.edu/~itamar/courses/ECE-517) for announcements, lecture notes, updates etc.

  10. UTK Academic Honesty Statement An essential feature of the University of Tennessee, Knoxville, is a commitment to maintaining an atmosphere of intellectual integrity and academic honesty. As a student of the university, I pledge that I will neither knowingly give nor receive any inappropriate assistance in academic work, thus affirming my own personal commitment to honor and integrity. Bottom line: DO YOUR OWN WORK!

  11. Understanding and constructing intelligence • What is intelligence? How do we define/evaluate it? • How can we design adaptive systems that optimize their performance as time goes by? • What are the limitations of RL-based algorithms? • How can artificial Neural Networks help us scale intelligent systems? • In what ways can knowledge be efficiently represented? • This course is NOT about … • Robotics • Machine learning (in the general sense) • Legacy AI – symbolic reasoning, logic, etc. • Image/vision/signal processing • Control systems theory • Why the course name “RL in AI”?

  12. Course Outline & Roadmap • Introduction • Review of basic probability theory • Discrete-time/space probability theory • Discrete Markov Chains • Dynamic Programming • Markov Decision Processes (MDPs) • Partially Observable Markov Decision Processes (POMDPs) • Approximate Dynamic Programming (a.k.a. Reinforcement Learning) • Temporal Difference (TD) Learning, Planning • Midterm – Tuesday, Oct 9, 2012 • Neuro-Dynamic Programming • Feedforward & Recurrent Neural Networks • Neuro-dynamic RL architecture • Applications and case studies • Final project presentations – Nov 27 – Dec 4, 2012 A detailed schedule is posted on the course website

  13. Outline • Course logistics and requirements • Course outline & roadmap • Introduction

  14. What is Machine Learning? • Discipline focusing on computer algorithms that learn to perform “intelligent” tasks • Learning is based on observation of data • Generally: learning to do better in the future based on what has been observed/experienced in the past • ML is a core subarea of AI, which also intersects with physics, statistics, theoretical CS, etc. • Examples of “ML” Problems • Optical character recognition • Face detection • Spoken language understanding • Customer segmentation • Weather prediction, etc.

  15. Introduction • Why do we need good ML technology? • Human beings are lazy creatures … • Service robotics • $10B market in 2015 – Japan only! • Pattern recognition (speech, vision) • Data mining • Military applications • … many more • Many ML problems can be formulated as RL problems …

  16. Introduction (cont.) • Learning by interacting with our environment is probably the first idea that occurs to us when we think about the nature of learning • Humans have no direct teachers • We do have a direct sensorimotor connection to the environment • We learn as we go along • Interaction with the environment teaches us what “works” and what doesn’t • We construct a “model” of our environment This course explores a computational approach to learning from interaction with the environment

  17. What is Reinforcement Learning? Reinforcement learning is learning what to do – how to map situations to actions – in order to maximize a long-term objective function driven by rewards. • It is a form of unsupervised learning • Two key components at the core of RL: • Trial-and-error – adapting internal representation, based on experience, to improve future performance • Delayed reward – actions are produced so as to yield long-term (not just short-term) rewards • The “agent” must be able to: • Sense its environment • Produce actions that can affect the environment • Have a goal (“momentary” cost metric) relating to its state
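
A minimal MATLAB sketch of the sense/act/reward cycle just described (the toy environment, reward and all names are illustrative, not the course's lab code). It only shows the interaction loop itself; handling delayed rewards is what the rest of the course builds up to.

```matlab
% Toy sense-act-reward loop: the agent senses a state, acts on its current
% estimates, receives a reward, and adjusts those estimates from experience.
numStates = 4; numActions = 2; numSteps = 50;
Q = zeros(numStates, numActions);   % the agent's adaptable action-quality estimates
s = randi(numStates);               % initial sensed state
for t = 1:numSteps
    [~, a] = max(Q(s, :));                      % act on current knowledge
    sNext = randi(numStates);                   % toy environment: random next state
    r = double(sNext == numStates);             % toy reward: +1 for landing in the last state
    Q(s, a) = Q(s, a) + 0.1 * (r - Q(s, a));    % trial-and-error adjustment toward observed reward
    s = sNext;
end
disp(Q)   % estimates of immediate reward accumulated from experience
```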

  18. What is Reinforcement Learning? (cont.) • RL attempts to solve the Credit Assignment problem: what is the long-term impact of an action taken now? • Unique to RL systems – a major challenge in ML • Necessitates an accurate model of the environment being controlled/interacted with • Something animals and humans do very well, and computers do very poorly • We’ll spend most of the semester formalizing solutions to this problem • Philosophically • Computationally • Practically (implementation considerations)

  19. The Big Picture Artificial Intelligence ⊃ Machine Learning ⊃ Reinforcement Learning • Types of Machine Learning • Supervised Learning: learn from labeled examples • Unsupervised Learning: process unlabeled examples • Example: clustering data into groups • Reinforcement Learning: learn from interaction • Defined by the problem • Many approaches are possible (including evolutionary) • Here we will focus on a particular family of approaches • Autonomous learning

  20. Software vs. Hardware • Historically, ML has been on CS turf • Confinement to the Von Neumann architecture • Software limits scalability … • The human brain has ~10^11 processors operating at once • However, each runs at ~150 Hz • It’s the massive parallelism that gives it power • Even 256 processors is not “massive parallelism” • “Computer Engineering” perspective • FPGA devices (reconfigurable computing) • GPUs • ASIC prospects • UTK/MIL group focus

  21. Exploration vs. Exploitation • A fundamental trade-off in RL • Exploitation of what worked in the past (to yield high reward) • Exploration of new, alternative action paths so as to learn how to make better action selections in the future • The dilemma is that neither exploration nor exploitation can be pursued exclusively without failing at the task • On a stochastic task, each action must be tried many times to gain a reliable estimate of its expected reward • We will review mathematical methods proposed to address this basic issue
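
One common way of handling this trade-off is an epsilon-greedy rule. A minimal sketch follows; epsilon and the action-value estimates Q are illustrative numbers only.

```matlab
% Epsilon-greedy action selection: explore with small probability, otherwise exploit.
epsilon = 0.1;               % probability of exploring
Q = [0.2, 0.7, 0.4];         % current estimates of each action's expected reward
if rand < epsilon
    a = randi(numel(Q));     % explore: try an action at random
else
    [~, a] = max(Q);         % exploit: take the action that has looked best so far
end
fprintf('Selected action %d\n', a);
```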

  22. The Reinforcement Learning Framework

  23. Some Examples of RL • A master chess player • Planning – anticipating replies and counter-replies • Immediate, intuitive judgment • A mobile robot decides whether to enter a room or try to find its way back to a battery-charging station • Playing backgammon • Obviously, a strategy is necessary • Some luck involved (stochastic game) • In all cases, the agent tries to achieve a goal despite uncertainty about its environment • The effect of an action cannot be fully predicted • Experience allows the agent to improve its performance over time

  24. Origins of Reinforcement Learning • Artificial Intelligence • Control Theory (MDP) • Operations Research • Cognitive Science and Psychology • More recently, Neuroscience • RL has solid foundations and is a well-established research field

  25. Elements of Reinforcement Learning • Beyond the agent and the environment, we have the following four main elements in RL • Policy – defines the learning agent's way of behaving at any given time. Roughly speaking, a policy is a mapping from (perceived) states of the environment to actions to be taken when in those states. • Usually stochastic (adapts as you go along) • Enough to determine the agent’s behavior • Reward function – defines the goal in an RL problem. Roughly speaking, it maps each perceived state (or state-action pair) of the environment to a single number, a reward, indicating the intrinsic desirability of that state • The agent’s goal is to maximize the reward over time • May be stochastic • Drives the policy employed and its adaptation

  26. Elements of Reinforcement Learning (cont.) • Value function – whereas a reward function indicates what is good in an immediate sense, a value function specifies what is good in the long run. • Roughly speaking, the value of a state is the total amount of reward an agent can expect to accumulate over the future, starting from that state. • It allows the agent to look over the “horizon” • Actions are derived from value estimates, not rewards • We measure rewards, but we estimate and act upon values – this corresponds to strategic/long-term thinking • Intuitively a prerequisite for intelligence/intelligent control (plants vs. animals) • Obtaining a good value function is a key challenge in designing good RL systems
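
A small illustration of the reward-versus-value distinction: with a discounted sum of future rewards (the reward sequence and discount factor below are illustrative), a state whose immediate reward is zero can still be very valuable.

```matlab
% Discounted return along one possible future: value looks over the horizon.
rewards = [0, 0, 0, 1];      % immediate rewards along one possible future
gamma = 0.9;                 % discount factor
G = 0;
for k = numel(rewards):-1:1  % accumulate backwards: G = r_k + gamma * G
    G = rewards(k) + gamma * G;
end
fprintf('Immediate reward: %g   long-run (discounted) value estimate: %.3f\n', rewards(1), G);
```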

  27. Elements of Reinforcement Learning (cont.) • Model – an observable entity that mimics the behavior of the environment. • For example, given a state and action, the model might predict the resultant next state and next reward • As we will later discuss, predictability and auto-associative memory are key attributes of the mammalian brain • Models are used for planning – any way of deciding on a course of action by considering possible future scenarios before they actually occur • Note that RL can work (sometimes very well) with an incomplete model • We’ll go over a range of model platforms to achieve the above As a side note: RL is essentially an optimization problem. However, it is one of the many optimization problems that are extremely hard to solve optimally.
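
A minimal sketch of planning with a model: a hand-coded tabular model of a hypothetical three-state chain (all numbers illustrative) predicts the next state and reward, and a one-step lookahead picks the action with the best predicted outcome.

```matlab
% One-step lookahead planning with a tabular model of a tiny 3-state chain.
nextState = [2 1; 3 2; 3 3];   % nextState(s,a): predicted next state (a=1 "right", a=2 "left")
reward    = [0 0; 1 0; 0 0];   % reward(s,a): predicted reward (+1 for moving right out of state 2)
V = [0.0, 0.5, 1.0];           % current value estimates for the three states

s = 2;                         % plan from state 2 without actually acting
q = zeros(1, 2);
for a = 1:2
    q(a) = reward(s, a) + V(nextState(s, a));   % predicted reward plus value of predicted successor
end
[~, bestAction] = max(q);
fprintf('One-step lookahead from state %d picks action %d\n', s, bestAction);
```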

  28. An Extended Example: Tic-Tac-Toe • Consider the classical tic-tac-toe game, in which the winner is the player who places three marks in a row, horizontally, vertically or diagonally • Let’s assume: • We are playing against an imperfect player • Draws and losses are equally bad for us • Q: Can we design a player that will find imperfections in the opponent’s play and learn to maximize its chances of winning? • Classical machine learning schemes would never visit a state that has the potential to lead to a loss • We want to exploit the weaknesses of the opponent, so we may decide to visit a state that has the potential of leading to a loss

  29. An Extended Example: Tic-Tac-Toe (cont.) • Using dynamic programming (DP), we can compute an optimal solution for any opponent • However, we would need a specification of the opponent (e.g. its state-action probabilities) • Such information is usually unavailable to us • In RL we estimate this information from experience • We later apply DP, or other sequential decision-making schemes, based on the model we obtained by experience • A policy tells the agent how to make its next move based on the state of the board • Winning probabilities can then be derived once the opponent is known

  30. An Extended Example: Tic-Tac-Toe (cont.) • How do we solve this in RL … • Set up a table of numbers – one for each state of the game • This number will reflect the probability of winning from that particular state • This is treated as the state’s value, and the entire learned table denotes the value function • If V(a) > V(b), state a is preferred over state b • All states with three X’s in a row have a win probability of 1 • All states with three O’s in a row have a win probability of 0 • All other states are preset to probability 0.5 • When playing the game, we make the move that we predict will result in the state with the highest value (exploitation) • Occasionally, we choose randomly among the non-zero-valued states (exploratory moves)
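
A minimal sketch of such a value table and the move-selection rule. The board-state strings, candidate moves and exploration probability are illustrative placeholders, not a full tic-tac-toe implementation.

```matlab
% Value table: board string -> estimated probability of winning from that state.
V = containers.Map();
V('XXX......') = 1.0;          % three X's in a row: win probability 1
V('OOO......') = 0.0;          % three O's in a row: win probability 0
V('XO.......') = 0.5;          % every other state starts at 0.5

candidates = {'XXX......', 'XO.......'};     % states reachable by our next move
vals = cellfun(@(key) V(key), candidates);   % look up their current values
if rand < 0.1
    pick = randi(numel(candidates));         % occasional exploratory move
else
    [~, pick] = max(vals);                   % greedy move: highest estimated win probability
end
fprintf('Move to state %s\n', candidates{pick});
```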

  31. An Extended Example: Tic-Tac-Toe (cont.)

  32. An Extended Example: Tic-Tac-Toe (cont.) • While playing, we update the values of the states • The current value of the earlier state is adjusted to be closer to the value of the later state: V(s) ← V(s) + α[V(s’) – V(s)], where 0 < α < 1 is a learning parameter (step-size parameter), s is the state before the move and s’ is the state after the move • This update rule is an example of the Temporal-Difference Learning method • This method performs quite well – it converges to the optimal policy (for a fixed opponent) • Can be adjusted to allow for slowly-changing opponents
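
The same update rule in code, continuing the sketch after slide 30 (the two states and their values are illustrative): after a move takes us from state s to state sPrime, V(s) is nudged toward V(sPrime).

```matlab
% Temporal-difference backup of the earlier state toward the later state.
alpha = 0.1;                          % step-size parameter, 0 < alpha < 1
V = containers.Map();
V('XO.......') = 0.5;                 % estimated value of the state before the move
V('XOX......') = 0.8;                 % estimated value of the state after the move
s = 'XO.......'; sPrime = 'XOX......';
V(s) = V(s) + alpha * (V(sPrime) - V(s));   % V(s) <- V(s) + alpha*[V(s') - V(s)]
fprintf('Updated V(s) = %.3f\n', V(s));     % 0.5 moves toward 0.8, becoming 0.53
```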

  33. An Extended Example: Tic-Tac-Toe (cont.) • RL features encountered … • Emphasis on learning from interaction – in this case with the opponent • Clear goal – correct planning takes into account delayed rewards • For example, setting up traps for a shortsighted opponent • No model of the opponent exists a priori • Although this example is a good one, RL methods can … • be applied to infinite-horizon problems (with no terminal states) • be applied to cases where there is no external adversary (e.g. a “game against nature”) • Backgammon example – roughly 10^20 states, tackled using Neural Nets

  34. For next class … • Read chapter 1 in SB (up to and including section 1.5) • Read chapter 2 in SB
