INC 551 Artificial Intelligence

INC 551 Artificial Intelligence Lecture 9 Introduction to Machine Learning

What is machine learning (or computer learning)? ทางวัตถุประสงค์ คือการปรับตัวของ computer จาก ข้อมูลหนึ่งๆ ที่ป้อนเข้าไป ทางปฏิบัติ คือการหา function ที่เหมาะสมเพื่อ map input และ output

Definition of Learning A computer program is said to “learn” from experience E with respect to some class of tasks T and performance P, if its performance improves with experience E Tom Mitchell, 1997

To learn = To change parameters in the world model

Deliberative Agent Environment Agent Action Make Decision World Model Sense, Perceive How to create a world model that represents real world?

Car Model Throttle Amount (x) Speed (v)

Learning as function Mapping Find better function mapping Add +3 4 2 Performance error = 1 ปรับตัว Add +3 4.8 2 Performance error = 0.2

Learning Design Issues • Components of the performance to be learned • Feedback (supervised, reinforcement, unsupervised) • Representation(function, tree, neural net, state-action • model, genetic code)

Types of Learning • Supervised Learning • มีครูสอนบอกว่าอะไรดี อะไรไม่ดี เหมือนเรียนในห้อง • Reinforcement Learning • เรียนรู้แบบปฏิบัติไปเลย นักเรียนเลือกทำสิ่งที่อยากเรียนเอง • ครูคอยบอกว่าดีหรือไม่ดี • Unsupervised Learning • ไม่มีครูคอยบอกอะไรเลย นักศึกษาแยกแยะสิ่งดี ไม่ดี ออกเป็น 2 พวก • แต่ก็ยังไม่รู้ว่าอะไรดี ไม่ดี

Supervised Learning โดยทั่วไปจะมี data ที่เป็น Training set และ Test set Training set ใช้ในการเรียน, Test set ใช้ในการทดสอบ Data เหล่านี้จะบอกว่าอะไร เป็น Type A, B, C, … features Learner การเรียน Training set type features Learner Test set การใช้งาน type Answer

Graph Fitting Find function f that is consistent for all samples

Data Mapping x f(x) type 1 3.2 B 1.5 5.6 A 4 2.3 B 2 -3.1 B 7 4.4 A 5.5 1.1 B 5 4.2 A

1 2 3 ใช้ Least Mean Square Algorithm

Overfit Ockham’s razor principle “Prefer the simplest”

Least Mean Square Algorithm Let We can find the weight vector recursively using where n = current state μ= step size

MATLAB Example Match x,y pair x=[1 2 7 4 -3.2] y=[2.1 4.2 13.7 8.1 -6.5] Epoch 1

Neural Network Neuron model 1 brain = 100,000,000,000 neurons

Mathematical Model of Neuron

Activation Function Step function Sigmoid function

Network of Neurons • สามารถแบ่งเป็น 2 ชนิด • Single layer feed-forward network (perceptron) • Multilayer feed-forward network

Single layer Network x0 W0 y x1 W1 W2 x2 สมมติว่า activation function ไม่มี ซึ่งเป็น linear equation

ในกรณีที่ activation function เป็น step จะเหมือนเป็นเส้นตรงคอยแบ่งกลุ่ม ซึ่งเส้นตรงนี้จะแบ่งที่ไหน จะขึ้นกับค่า weight wj

Perceptron Algorithm มาจาก least-square method ใช้เพื่อปรับ weight ของ neuron ให้เหมาะสม α = learning rate

Multilayer Neural Network มี hidden layers เพิ่มเข้ามา Learn ด้วย Back-Propagation Algorithm

Back-Propagation Algorithm

Learning Progress ใช้ training set ป้อนเข้าไปหลายๆครั้ง แต่ละครั้งเรียก Epoch

Types of Learning • Supervised Learning • มีครูสอนบอกว่าอะไรดี อะไรไม่ดี เหมือนเรียนในห้อง • Reinforcement Learning • เรียนรู้แบบปฏิบัติไปเลย นักเรียนเลือกทำสิ่งที่อยากเรียนเอง • ครูคอยบอกว่าดีหรือไม่ดี • Unsupervised Learning • ไม่มีครูคอยบอกอะไรเลย นักศึกษาแยกแยะสิ่งดี ไม่ดี ออกเป็น 2 พวก • แต่ก็ยังไม่รู้ว่าอะไรดี ไม่ดี

Source: Reinforcement Learning: An Introduction Richard Sutton and Andrew Barto MIT Press, 2002

Supervised Learning Training Info = desired (target) outputs (features/class) Supervised Learning System Inputs Outputs Reinforcement Learning Evaluations (“rewards” / “penalties”) Environment RL System Inputs Outputs (“actions”)

Properties of RL • Learner is not told which actions to take • Trial-and-Error search • Possibility of delayed reward • Sacrifice short-term gains for greater long-term gains • The need to explore and exploit • Considers the whole problem of a goal-directed agent interacting with an uncertain environment

Model of RL Environment action state reward Agent Key components state, action, reward, and transition

State Action Next state Reward Agent: ฉันอยู่ที่ 134 เลือกทำการกระทำแบบที่ 12 Environment: เธอได้ 29 คะแนน และได้ไปอยู่ที่ 113 Agent: ฉันอยู่ที่ 113 เลือกทำการกระทำแบบที่ 8 Environment: เธอได้ 33 คะแนน และได้ไปอยู่ที่ 35 Agent: ฉันอยู่ที่ 35 เลือกทำการกระทำแบบที่ 8 Environment: เธอได้ 72 คะแนน และได้ไปอยู่ที่ 134 Agent: ฉันอยู่ที่ 134 เลือกทำการกระทำแบบที่ 4 Environment: เธอได้ 66 คะแนน และได้ไปอยู่ที่ 35 Agent: ฉันอยู่ที่ 35 เลือกทำการกระทำแบบที่ 2 Environment: เธอได้ 53 คะแนน และได้ไปอยู่ที่ 88 :

Example: Tic-Tac-Toe X X X X X O X X X O X X X O X X O X O O O X O O X O O O ... x x x ... ... ... x o o o x x } x’s move x ... ... ... ... ... } o’s move Assume an imperfect opponent: —he/she sometimes makes mistakes } x’s move x o x x o

1. Make a table with one entry per state: State V(s) – estimated probability of winning .5 ? .5 ? x . . . . . . 1 win x x x o o . . . . . . 0 loss x o o x o . . . . . . o o x 0 draw current state x x o x o o various possible next states * 2. Now play lots of games. To pick our moves, look ahead one step: Just pick the next state with the highest estimated prob. of winning — the largest V(s);

s s’

เหมือนกับ function mapping Table Generalizing Function State V State V s s s . . . s 1 2 3 N

Value Table Action State State State Value Table 1 dimension Action Value Table 2 dimension (V-table) (Q-table)

Examples of RL Implementations TD-Gammon Tesauro, 1992–1995 Action selection by 2–3 ply search Value TD error Start with a random network Play very many games against self Learn a value function from this simulated experience

Elavator Dispatching Crites and Barto, 1996 10 floors, 4 elevator cars STATES: button states; positions, directions, and motion states of cars; passengers in cars & in halls ACTIONS: stop at, or go by, next floor REWARDS: roughly, –1 per time step for each person waiting 22 Conservatively about 10 states

Issues in Reinforcement Learning • Trade-off between exploration and exploitation • ε – greedy • softmax • Algorithms to find the value function for delayed reward • Dynamic Programming • Monte Carlo • Temporal Difference

n-Armed Bandit Problem 1 2 3 4 5 Slot Machine Slot machine มีคันโยกอยู่หลายอันซึ่งให้รางวัลไม่เท่ากัน สมมติลองเล่นไปเรื่อยๆจนถึงจุดๆหนึ่ง ได้ข้อสรุปว่า เล่นคันโยก 1 26 ครั้ง ได้ รางวัล 4 baht/ครั้ง เล่นคันโยก 2 14 ครั้ง ได้ รางวัล 3 baht/ครั้ง เล่นคันโยก 3 10 ครั้ง ได้ รางวัล 2 baht/ครั้ง เล่นคันโยก 4 16 ครั้ง ได้ รางวัล 102 baht/ครั้ง

Exploration and Exploitation • จะมีปัญหา 2 อย่าง • จะลองเล่นคันโยกที่ 5 ต่อไปไหม • ค่าเฉลี่ยของรางวัลที่ผ่านมาเที่ยงตรงแค่ไหน เรียก Greedy Exploitation คือการใช้ในสิ่งที่เรียนมา คือเล่นอัน 4 ไปเรื่อยๆตลอด Exploration คือสำรวจต่อ โดยลองมากขึ้นในสิ่งที่ยังไม่เคยทำ Balance

ε-greedy Action Selection Greedy คือเลือกทางที่ให้ผลตอบแทนสูงสุด ε-greedy

Test: 10-armed Bandit Problem • n = 10 possible actions • Each is chosen randomly from a normal distribution: • each is also normal: • 1000 plays • repeat the whole thing 2000 times and average the results

Results

SoftMax Action Selection SoftMax จะเลือก greedy action ตามปริมาณของ reward Reward มาก ก็จะเลือก greedy action ด้วย probability สูง โดยคำนวณจาก Gibb-Boltzmann distribution “computational temperature”

INC 551 Artificial Intelligence