Spring 2011
Artificial Intelligence
COSC 40503
WEEK 2
Antonio Sanchez
Texas Christian University
Credit Assignment and Connectionism
• For all of its actions, the environment delivers one single composite feedback
• The automaton must distribute the reward or punishment among the network, generating a credit assignment model
• In this way the automaton generates an adequate internal pattern of behavior
• This is what we call Learning
• There are many methodologies to model such behavior
• Here we shall use CLS (Collective Learning Systems)
So Far…
Artificial Intelligence deals with knowledge and learning
Artificial learning is obtained by
• Traversing knowledge bases (rule-based and logic programming)
• Artificial selection (genetic and evolutionary algorithms)
• Adaptive methods (connectionism and feedback)
Adaptive behavior studies have their roots in Pavlov's studies of animal conditioning
Two recurring concepts: Feedback and Connectionism
• Perceptron (Rosenblatt, Selfridge)
• Learning Automata (Tsetlin, Narendra, Barto)
• Neural Networks (Rumelhart, McClelland)
• Collective Learning (Samuel, Michie, Bock)
• Cybernetic Loop (Wiener, Rosenblueth, Ashby)
• RP Policies (Thathachar, Viswanathan, Fu)
• Backpropagation (Sejnowski, Hopfield)
• Algedonic Loop (Beer, Bock)
Credit Assignment
Interacting with the Environment
CLS Formalization
• CLS = [ AUTOMATA, MA ]
• Where AUTOMATA = { I, O, STM, A }
• I : a vector of possible entries or stimuli
• O : a vector of possible responses or actions
• STM : the transition matrix, which stores for each stimulus Ii the probability Pij of choosing response Oj
• A : an algedonic (punishment / reward) algorithm that modifies the individual Pij according to the compensation policy of the automaton; it is precisely this algorithm that represents learning
• MA : the environment, which emits a series of stimuli I and evaluates the responses O of the AUTOMATA; that evaluation determines the values that algorithm A applies to the Pij in the matrix STM
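As a minimal sketch of the STM component described above, the matrix can be held as a two-dimensional array with one row per stimulus and one column per response. The class name `StmInit`, the method `uniformStm`, the boolean `legal` mask, and the choice of a uniform starting distribution are all assumptions made for illustration; the slides do not prescribe how the initial probabilities are set.

```java
final class StmInit {
    /**
     * Builds an STM in which each stimulus row spreads its probability
     * evenly over the responses marked as legal (assumed initialization).
     */
    static double[][] uniformStm(boolean[][] legal) {
        double[][] stm = new double[legal.length][];
        for (int i = 0; i < legal.length; i++) {
            stm[i] = new double[legal[i].length];
            int n = 0;                        // number of legal responses in row i
            for (boolean ok : legal[i]) {
                if (ok) n++;
            }
            for (int j = 0; j < legal[i].length; j++) {
                // Illegal responses keep probability 0
                stm[i][j] = (legal[i][j] && n > 0) ? 1.0 / n : 0.0;
            }
        }
        return stm;
    }
}
```

A row with three legal responses out of four starts with 1/3 on each legal response and 0 on the illegal one, so every non-empty row sums to 1.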
CLS mapping
[Figure: STM rows for a game, labeled "possible moves" (description) and "best moves" (prescription); probabilities shown: 1/2 1/4 0 0 1/4 0 0 3/4 1/8 0 1/8. Other labels: Initial method, Possible moves, Second turn, Other player's turn.]
• Selection method: looking at the options, the CLS selects its move and gives the board to the other player
• After the other player selects a move, the CLS takes a second move, and so on until the game is over
• Compensation method: after the game is over and the winner is determined, the compensation method modifies the probabilities, and the STM becomes more prescriptive (knowledge) rather than just descriptive (information)
CLS Pseudo Code

import java.awt.*;
import java.applet.Applet;

// entries, outputs, ent_max, times_max, seed, beta and the helpers
// clear(), random(), evalua(), otherPlay() are assumed to be defined elsewhere
public class cls extends Applet {
    // STM[i][0] holds the number of possible plays for situation i;
    // STM[i][j], j = 1..outputs, holds the probability of play j
    float[][] STM = new float[entries][outputs + 1];
    int[][] LOG = new int[ent_max + 1][2];   // (situation, play) for each turn
    String game, mode;
    int turn, turnmx, i, j, k, situation, play, times;

    // Basic Loop
    public void init() {
        for (times = 1; times <= times_max; times = times + 1) {
            clear(LOG);
            game = "playing"; situation = 0; turn = 0;
            while (game.equals("playing")) {
                turn = turn + 1;
                play = selection(STM, situation, mode);
                LOG[turn][0] = situation;
                LOG[turn][1] = play;
                game = evalua(situation);
                if (game.equals("playing")) {
                    situation = otherPlay(play);
                    game = evalua(situation);
                }
            }
            turnmx = turn;
            compensation(STM, LOG, game);
        }
    }

    // Selection Process (mode: "random" or "max")
    int selection(float[][] STM, int situation, String mode) {
        int selectionMade = 0;
        float cumulative, number;
        if (mode.equals("max")) {
            int max = 1;
            for (j = 1; j <= outputs; j = j + 1)
                if (STM[situation][j] > STM[situation][max]) max = j;
            selectionMade = max;
        } else {
            number = random(seed);
            cumulative = 0; j = 0;
            // Walk the cumulative distribution until it passes the random number
            while (selectionMade == 0 && j < outputs) {
                j = j + 1;
                cumulative = cumulative + STM[situation][j];
                if (cumulative > number) selectionMade = j;
            }
        }
        return selectionMade;
    }

    // Procedure to modify probabilities in STM
    void compensation(float[][] STM, int[][] LOG, String game) {
        float reward, punish, normal, nplays;
        for (turn = 1; turn <= turnmx; turn = turn + 1) {
            i = LOG[turn][0]; k = LOG[turn][1];
            nplays = STM[i][0];   // possible plays in situation i
            if (game.equals("Won")) {
                // On reward, increase the probability of the selected play
                reward = beta * (1 - STM[i][k]);
                STM[i][k] = STM[i][k] + reward;
                normal = reward / (nplays - 1);
                for (j = 1; j <= outputs; j = j + 1)
                    if (j != k && STM[i][j] != 0) STM[i][j] = STM[i][j] - normal;
            } else {
                // On punishment, reduce the probability of the selected play
                punish = beta / 2 * STM[i][k];
                STM[i][k] = STM[i][k] - punish;
                normal = punish / (nplays - 1);
                for (j = 1; j <= outputs; j = j + 1)
                    if (j != k && STM[i][j] != 0) STM[i][j] = STM[i][j] + normal;
            }
        }
    }
}
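Algedonic compensation
In case of a Reward (with 0 < β < 1)
For the selection i → k in the STM (the selected play):
$\mathrm{STM}_{i,k}(t+1) = \mathrm{STM}_{i,k}(t) + \beta\,\bigl(1 - \mathrm{STM}_{i,k}(t)\bigr)$
For the other transitions i → j, with j ≠ k:
$\mathrm{STM}_{i,j}(t+1) = \mathrm{STM}_{i,j}(t) - \beta\,\bigl(1 - \mathrm{STM}_{i,k}(t)\bigr)/(n-1)$
In case of a Punishment (with 0 < β < 1)
For the selection i → k in the STM (the selected play):
$\mathrm{STM}_{i,k}(t+1) = \mathrm{STM}_{i,k}(t) - \beta\,\mathrm{STM}_{i,k}(t)$
For the other transitions i → j, with j ≠ k:
$\mathrm{STM}_{i,j}(t+1) = \mathrm{STM}_{i,j}(t) + \beta\,\mathrm{STM}_{i,k}(t)/(n-1)$
Here n is the number of possible plays in situation i.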
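The update rules above can be sketched in Java as follows. This is an illustration under stated assumptions, not the course's implementation: rows are zero-based `double[]` arrays, impossible plays are marked by a probability of 0, and n is counted from the non-zero entries of the row (the pseudo code stores it in column 0 instead). The names `LinearCompensation` and `update` are hypothetical. The sketch follows the formulas literally, so a large reward can push a small probability below zero; it does not guard against that case.

```java
final class LinearCompensation {
    /**
     * Applies the linear reward or punishment update to row i of the STM
     * after response k was selected. beta is the compensation rate, 0 < beta < 1.
     */
    static void update(double[][] stm, int i, int k, double beta, boolean reward) {
        double[] row = stm[i];
        int n = 0;                                // possible plays in this row (assumed: non-zero entries)
        for (double p : row) {
            if (p > 0.0) n++;
        }
        if (n < 2) return;                        // nothing to redistribute with a single possible play
        // Change applied to the selected play: +beta*(1-p) on reward, -beta*p on punishment
        double delta = reward ? beta * (1.0 - row[k]) : -beta * row[k];
        row[k] += delta;
        // The other possible plays share the opposite change equally, keeping the row sum at 1
        for (int j = 0; j < row.length; j++) {
            if (j != k && row[j] > 0.0) {
                row[j] -= delta / (n - 1);
            }
        }
    }
}
```

Starting from a row of four equal probabilities (0.25 each), a reward with beta = 0.2 on the second play raises it to 0.4 and lowers each of the others to 0.2. To reproduce the halved punishment used in the pseudo code, the caller can pass beta / 2 when punishing.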
CLS Non-linear Compensation
• Think about the case of a reward/punishment (R/P) in my everyday life. How much do I listen to an R/P? It depends on:
   • Who is giving it to me
   • What my expectation is
   • The recent evaluations I have had
• We can take such concerns into account, for example:
   Rewards more at the beginning
      STM[i][k] = STM[i][k] + β*(1-STM[i][k])*(1-STM[i][k])
   Rewards more at the end
      STM[i][k] = STM[i][k] + β*(1-STM[i][k])*STM[i][k]
   Punishes more at the beginning
      STM[i][k] = STM[i][k] - β/2*STM[i][k]*(1-STM[i][k])
   Punishes more at the end
      STM[i][k] = STM[i][k] - β/2*STM[i][k]*STM[i][k]
• The domain of β is 0 < β < 1
• A value of 0 causes no learning, while a value of 1 saturates the STM, driving it to one selection only
• A reward/inaction scheme is achieved by using β = 0 when punishing
• When punishing, using β/2 reduces the chance of wrongly updating probabilities

Boltzmann entropy
As a measurement of order, entropy can be defined as
$S = -\sum_{i,j} \mathrm{STM}_{i,j} \log_2 \mathrm{STM}_{i,j}$
Using entropy, we can check how well organized the STM is: the lower the value of the entropy, the more uneven the probabilities in the STM are.
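The entropy measure above can be computed with a short routine. The following is a sketch that assumes the STM is a zero-based `double[][]` and that entries equal to 0 contribute nothing to the sum (the usual convention that 0·log 0 = 0); the names `StmEntropy` and `entropy` are hypothetical.

```java
final class StmEntropy {
    /** Returns S = -sum over all i,j of p * log2(p), skipping entries with p = 0. */
    static double entropy(double[][] stm) {
        double s = 0.0;
        for (double[] row : stm) {
            for (double p : row) {
                if (p > 0.0) {
                    s -= p * (Math.log(p) / Math.log(2.0));   // log base 2 via natural logs
                }
            }
        }
        return s;
    }
}
```

A row with two equally likely plays contributes 1 bit, while a row that has converged to a single play contributes 0, which matches the reading that lower entropy means more uneven probabilities.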
CLS Selection Schemes
Since we are storing probabilities, the values in each row should always sum to 1. The selection process is done using a cumulative distribution:
   number = random(seed); cumulative = 0; selectionMade = 0; j = 0;
   while ( selectionMade == 0 ) {
      j = j + 1;
      cumulative = cumulative + STM[situation][j];
      if ( cumulative > number ) selectionMade = j;
   }
However, in the expectation of a good result, the selection can use the maximum:
   max = 1;
   for ( j = 1; j <= outputs; j = j + 1 )
      if ( STM[situation][j] > STM[situation][max] ) max = j;
   selectionMade = max;
With this alternative the use of probabilities is no longer necessary, and a simple histogram calculation can be used to tally the selections made in each row of the STM.
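Both selection schemes can be written as small self-contained Java methods. This is a sketch under assumptions: rows are zero-based `double[]` arrays, `java.util.Random` stands in for the slide's `random(seed)`, and the class and method names are hypothetical. The fallback to the last possible play is an addition that protects against rounding error when the row sums to slightly less than 1.

```java
import java.util.Random;

final class StmSelection {
    /** Cumulative-distribution selection: returns column j with probability row[j]. */
    static int selectRandom(double[] row, Random rng) {
        double number = rng.nextDouble();     // uniform in [0, 1)
        double cumulative = 0.0;
        int last = -1;                        // last column with non-zero probability
        for (int j = 0; j < row.length; j++) {
            if (row[j] <= 0.0) continue;      // impossible plays are never selected
            cumulative += row[j];
            last = j;
            if (cumulative > number) return j;
        }
        return last;                          // rounding fallback; -1 if the row has no possible play
    }

    /** Maximum selection: returns the first column holding the largest value. */
    static int selectMax(double[] row) {
        int max = 0;
        for (int j = 1; j < row.length; j++) {
            if (row[j] > row[max]) max = j;
        }
        return max;
    }
}
```

Because `selectMax` only compares values within a row, it works equally well on raw tallies, which is the histogram alternative mentioned above.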
STM in time
[Figure: the STM (inputs i by output options j) shown at successive points along a Δ time axis]
• The initial STM is a descriptive matrix of the possible actions in the game, mostly the rules of the game (information)
• As time goes by the STM becomes more prescriptive of the game, showing the right moves (knowledge)
• In reality it is not only Δ time but also Δ information that reduces the entropy and guides the STM towards the right answers; this is the reason why information is also called negentropy
Knowledge Representation (basic)
• In addition to the transitions in a game, it is also necessary to represent other aspects of the game, such as:
   • The board or state of the game
   • The value of each board state
• This becomes an important part of knowledge representation
A board could be represented as ‘----0----’ or ‘--X-0----’
Yet it pays to represent equal boards with the same notation ‘--X-0----’.
If we use only arrays, the number of boards makes the amount of information to represent quite big. As a matter of fact, this is where data structures first began to be used instead of arrays, and the concept of the linked list was developed in languages such as IPL, Lisp, and Snobol. Why?
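One way to avoid reserving an array slot for every conceivable board is to allocate STM rows only for boards that actually occur, keyed by the string notation. The sketch below illustrates that idea with a hash map; the class `BoardIndex` and its methods are hypothetical, and the slides do not say that the course uses this structure.

```java
import java.util.HashMap;
import java.util.Map;

final class BoardIndex {
    // Maps a board string such as "--X-0----" to its STM row number
    private final Map<String, Integer> rows = new HashMap<>();

    /** Returns the row for a board, allocating the next free row the first time it is seen. */
    int rowFor(String board) {
        Integer row = rows.get(board);
        if (row == null) {
            row = rows.size();        // rows are numbered 0, 1, 2, ... in order of first appearance
            rows.put(board, row);
        }
        return row;
    }

    /** Number of distinct boards seen so far. */
    int size() {
        return rows.size();
    }
}
```

Because equal boards share the same string, they map to the same row, so whatever is learned for a board is reused every time that board comes up again.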
Behavior in time
[Figure: Δ performance (up to 100 %) plotted against Δ time. Phases along the curve: learning, stable, change of rules ==> relearning, back to stability]
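Learning versus Preprogrammed Knowledge
[Figure: Δ performance (up to 100 %) plotted against Δ time for a learning system and a preprogrammed one]
• Labels on the learning curve: slow learning, below the preprogrammed; however, it will relearn; back to stability
• Label on the preprogrammed curve: it will remain limited until it is reprogrammed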