Randomized Strategies and Temporal Difference Learning in Poker • Michael Oder • April 4, 2002 • Advisor: Dr. David Mutchler
Overview • Perfect vs. Imperfect Information Games • Poker as Imperfect Information Game • Randomization • Neural Nets and Temporal Difference • Experiments • Conclusions • Ideas for Further Study
Perfect vs. Imperfect Information • World-class AI agents exist for many popular games • Checkers • Chess • Othello • These are games of perfect information • All relevant information is available to each player • A comparable understanding of imperfect information games would be a breakthrough
Poker as an Imperfect Information Game • Other players’ hands affect how much will be won or lost. However, each player lacks this vital information. • There are non-deterministic aspects (the deal of the cards) as well
Enter Loki • One of the most successful computer poker players yet created • Produced at the University of Alberta by Jonathan Schaeffer et al. • Employs a randomized strategy • Makes the player less predictable • Allows for bluffing
Probability Triples • At any point in a poker game, a player has 3 choices • Bet/Raise • Check/Call • Fold • Assign a probability to each possible move • A single move is now a probability triple • Problem: associate a payoff with the hand, betting history, and triple (the move selected); see the sketch below
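A minimal sketch (in Python, not from the original slides) of what a probability triple looks like in code; the names and values below are illustrative. The learning problem is then to map (hand, betting history, triple) to an expected payoff.

```python
import random

# A probability triple assigns a probability to each of the three legal moves.
# These move names and probabilities are illustrative, not Loki's actual values.
triple = {"bet_raise": 0.2, "check_call": 0.7, "fold": 0.1}

def choose_move(triple):
    """Sample a single move according to the probability triple."""
    moves, probs = zip(*triple.items())
    return random.choices(moves, weights=probs, k=1)[0]

# The function to be learned: expected payoff given (hand, history, triple).
print(choose_move(triple))
```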
Neural Nets • One promising way to learn such functions is with a neural network • Neural Networks consist of connected neurons • Each connection has a weight • Input game state, output a prediction of payoff • Train by modifying weights • Weights are modified by an amount proportional to learning rate
Neural Net Example • [Figure: network with inputs for the hand, the betting history, and the probability triple; outputs P(2), P(1), P(-1), P(-2), the predicted probabilities of each possible payoff]
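A rough sketch of the kind of network the figure describes, assuming one hidden layer and arbitrary layer sizes (the thesis's actual architecture and input encodings are not specified here): the inputs encode the hand, history, and triple, the four outputs estimate the probabilities of the payoffs +2, +1, -1, -2, and each training step changes the weights by an amount proportional to the learning rate.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed sizes: encoded (hand, history, triple) input, one hidden layer, 4 payoff outputs.
n_in, n_hidden, n_out = 10, 8, 4
W1 = rng.normal(scale=0.1, size=(n_hidden, n_in))
W2 = rng.normal(scale=0.1, size=(n_out, n_hidden))

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x):
    """Predict P(payoff = +2), P(+1), P(-1), P(-2) from an encoded game state."""
    h = sigmoid(W1 @ x)
    return sigmoid(W2 @ h), h

def train_step(x, target, lr=0.01):
    """One backpropagation step; the weight change is proportional to the learning rate."""
    global W1, W2
    y, h = forward(x)
    err = target - y                          # output error
    d_out = err * y * (1 - y)                 # sigmoid derivative at the outputs
    d_hid = (W2.T @ d_out) * h * (1 - h)      # error propagated back to the hidden layer
    W2 += lr * np.outer(d_out, h)
    W1 += lr * np.outer(d_hid, x)
```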
Temporal Difference • The most common way to train a multi-layer neural net is backpropagation • Relies on simple input-output pairs • Problem: the correct answer must be known right away in order to train the net • Solution: Temporal Difference (TD) learning • The TD(λ) algorithm was developed by Richard Sutton
Temporal Difference (cont’d) • Trains predictions over the course of a game, across many time steps • Tries to make each prediction closer to the prediction at the next time step • [Figure: sequence of predictions P1 → P2 → P3 → P4 → P5]
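A hedged illustration of the TD(λ) idea described above, using a simple linear predictor rather than the thesis's neural net: each prediction is moved toward the prediction at the next time step (or toward the final payoff at the end of the game), with an eligibility trace spreading the correction back to earlier states.

```python
import numpy as np

def td_lambda_episode(features, final_payoff, w, alpha=0.01, lam=0.7):
    """
    One game's worth of TD(lambda) updates for a linear predictor V(x) = w . x.
    `features` is the sequence of encoded game states seen during the game;
    `final_payoff` is the payoff observed when the game ends.
    (The linear predictor and parameter names are illustrative assumptions.)
    """
    trace = np.zeros_like(w)
    for t, x in enumerate(features):
        v = w @ x
        # Target: the next prediction, or the true payoff at the last step.
        v_next = w @ features[t + 1] if t + 1 < len(features) else final_payoff
        delta = v_next - v                 # temporal-difference error
        trace = lam * trace + x            # eligibility trace over earlier states
        w = w + alpha * delta * trace      # nudge each prediction toward the next one
    return w
```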
University of Mauritius Group • TD Poker program produced by a group supervised by Dr. Mutchler • Provides an environment for playing poker variants and testing agents
Simple Poker Game • Experiments were conducted on an extremely simple variant of poker • The deck consists of the 2, 3, and 4 of Hearts • Each player gets one card • One round of betting • The player with the highest card wins the pot • Goal: get the net to produce accurate payoff values as outputs
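A small sketch of this simplified game, assuming a one-unit ante and a one-unit bet (the slides do not state the stakes); it returns the second player's net payoff for one hand.

```python
import random

CARDS = [2, 3, 4]          # the 2, 3, and 4 of Hearts
ANTE, BET = 1, 1           # assumed stakes, not stated on the slides

def play_hand(second_player_calls):
    """
    One hand of the simplified game: both players ante and get one card,
    the first player always bets, and the second player calls or folds.
    Returns the second player's net payoff.
    """
    first, second = random.sample(CARDS, 2)
    if not second_player_calls:
        return -ANTE                       # fold: lose the ante
    if second > first:
        return ANTE + BET                  # call and win the pot
    return -(ANTE + BET)                   # call and lose
```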
Early Results • Started by pitting a neural net player against a random one • Results were inconsistent • Problem: inappropriate value for the learning rate • Too low: outputs never approach the true payoffs • Too high: outputs fluctuate between too high and too low
Experiment Set I • Conjecture: learning should occur with a very small learning rate over many games • Learning rate = 0.01 • Train for 50,000 games • The net trains only when its card is the 4 • First player always bets; second player is tested • Two choices • call 80%, fold 20% -> avg. payoff = 1.4 • call 20%, fold 80% -> avg. payoff = -0.4 • Want the predicted payoffs to settle on these average values (arithmetic sketched below)
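The quoted averages are consistent with the assumed one-unit ante and one-unit bet: calling while holding the 4 always wins +2 (the opponent's ante plus bet), and folding loses the ante (-1). A quick check:

```python
def expected_payoff(p_call, win=2, fold_loss=-1):
    """Expected payoff of a mixed call/fold strategy when holding the winning 4."""
    return p_call * win + (1 - p_call) * fold_loss

print(expected_payoff(0.8))   # 0.8*2 + 0.2*(-1) = 1.4
print(expected_payoff(0.2))   # 0.2*2 + 0.8*(-1) = -0.4
```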
Results • 3 out of 10 trials came within 0.1 of the correct result for the highest payoff • 2 out of 10 trials came within 0.1 of the correct result for the lowest payoff • None of the trials came within 0.1 of the correct result for both • The results were in the correct order in only half of the trials
More Distributions • Repeated the experiment with six choices instead of two • call 100% -> avg. payoff = 2.0 • call 80%, fold 20% -> avg. payoff = 1.4 • call 60%, fold 40% -> avg. payoff = 0.8 • call 40%, fold 60% -> avg. payoff = 0.2 • call 20%, fold 80% -> avg. payoff = -0.4 • fold 100% -> avg. payoff = -1.0 • Using more distributions did help the program learn to rank the distributions’ values correctly • All six distributions were ranked correctly 7 out of 10 times (by chance, a correct ordering occurs with probability 1/6! = 1/720 ≈ 0.14% in any one trial)
Output Encoding • Distributions are ranked correctly, but many output values are still inaccurate • This seems to be largely caused by the encoding of the outputs • The network has four outputs, each representing the probability of a specific payoff • This encoding does not scale to games with more possible payoffs, and all four outputs must be accurate to predict the payoff well
Relative Payoff Encoding • Replace the four outputs with a single number • The number represents the payoff relative to the highest payoff possible: P = 0.5 + (winnings / total possible) • Total possible winnings are determined at the beginning of the game (the sum of the other players’ holdings) • Repeated the previous experiments using this encoding
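A minimal sketch of this encoding as stated on the slide; the decode step is an added assumption for reading an estimated payoff back out of the net's single output.

```python
def encode_payoff(winnings, total_possible):
    """Map winnings to a single target value relative to the largest possible win."""
    return 0.5 + winnings / total_possible

def decode_payoff(p, total_possible):
    """Invert the encoding to recover an estimated payoff from the net's output."""
    return (p - 0.5) * total_possible
```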
Results (Experiment Set 2) • Payoff predictions were generally more accurate using this encoding • 5 out of 10 trials got the exact payoff (0.502) for the best distribution choice when six choices were available • Most trials produced a very close value for the payoff associated with one of the distributions • However, no trial was significantly close on multiple probability distributions
Observations/Conclusions • A neural net player can learn strategies based on probability • Payoff is successfully learned as a function of the betting action • Consistency is still a problem • Trouble learning correct payoffs for more than one distribution
Further Study • Issues of expandability • Coding for multiple-round history • Can previous learning be extended? • Variable learning rate • Study distribution choices • Sample some bad distribution choices • Test against a variety of other players