Learning to Price Airline Seats Under Competition 7th Annual INFORMS Revenue Management and Pricing Conference Barcelona, Spain Thursday, 28th June 2007 Presenter: Andrew Collins a.j.collins@soton.ac.uk Supervisor: Prof Lyn Thomas
Overview • Motivation • Reinforcement Learning • Methodology • Model • Results • Conclusions
Motivation • Game Theory project manager, 2001-2004: Defence Science and Technology Laboratories (Dstl), UK • Frustration with Game Theory: • Difficulty in deriving feasible solutions • Difficulty in validating results (due to simplifications) • Dependency on input variables • Speed and memory issues when running models • Research at University of Southampton: “To demonstrate Game Theory as a practical analytical modelling technique for usage within the OR community” Applications of Game Theory in Defence Project - A. Collins, F. Pullum, L. Kenyon (2003) [Dstl - Unclassified]
Learning in Games • Fictitious play: Brown (1951); Fudenberg and Levine (1998) • Evolutionary: Weibull (1995); Replicator Dynamics • Neural Networks: just a statistical process in the limit; Neal (1996) • Reinforcement Learning: association with psychology; Pavlov (1927); Rescorla and Wagner (1972); convergence: Collins and Leslie (2005) Theory of Learning in Games - Drew Fudenberg and David Levine (1998)
Reinforcement Learning Introduction
Reinforcement Learning (RL) [Diagram: the agent-environment loop, in which the agent takes action “a”, receives reward “r”, and observes state “s” from the environment] A.K.A. ‘Neuro-Dynamic Programming’ or ‘Approximate Dynamic Programming’. Agents/players reinforce their world-view from interaction with the environment. Neuro-Dynamic Programming - Dimitri Bertsekas and John Tsitsiklis (1996)
Types • Players store information about each state-action pair (called the Q-value) • They use this information to select an action when at that state • They update this information according to the rule: Q(a, s) = (1 - α)·Q(a, s) + α·(reward(a, s) + U(next s)) where ‘U’ depends on the RL type used: • Monte Carlo (MC): U = R(next s), the observed return ‘R’ after the current state • Q-Learning (QL): U = max_a Q(a, next s) • SARSA (SA): U = Q(next a, next s) QL and SA use the current Q-value estimates of succeeding states (called bootstrapping). Reinforcement Learning - Richard Sutton and Andrew Barto (1998)
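A minimal Python sketch of this tabular update, covering the three variants, is given below. The dict-based Q-table, the example price actions, and the parameter values are illustrative assumptions rather than the implementation used in the talk.

```python
# Sketch of the tabular backup Q(a, s) <- (1 - alpha)*Q(a, s) + alpha*(reward + U(next s)).
# The dict-based Q-table and the names below are illustrative, not the talk's actual code.

ACTIONS = [8, 9, 10]  # example price actions (assumed for illustration)

def rl_update(q, s, a, reward, next_s, next_a, observed_return,
              alpha=0.1, method="SARSA"):
    """One update of q[(state, action)] using the chosen RL variant."""
    if method == "MC":        # Monte Carlo: U is the observed return after the current state
        u = observed_return
    elif method == "QL":      # Q-learning: bootstrap with the best estimate at the next state
        u = max(q.get((next_s, b), 0.0) for b in ACTIONS)
    else:                     # SARSA: bootstrap with the action actually taken next
        u = q.get((next_s, next_a), 0.0)
    q[(s, a)] = (1 - alpha) * q.get((s, a), 0.0) + alpha * (reward + u)
```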
Issues How do players select their actions? • Exploration vs. exploitation • Boltzmann action selection (a.k.a. Softmax), with temperature τ • Similar to Logit models Leads to the Nash distribution: • Unique • Approaches a Nash Equilibrium as τ → 0 Uncoupled games • Hart and Mas-Colell (2003, 2006) Stochastic Uncoupled Dynamics and Nash Equilibrium - Sergiu Hart and Andreu Mas-Colell (2006)
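For concreteness, here is a rough sketch of Boltzmann (softmax) action selection over the same illustrative Q-table from the previous sketch. The temperature parameter plays the role of τ: large values give near-random exploration, small values approach greedy play.

```python
import math
import random

def boltzmann_action(q, state, actions, tau):
    """Pick an action with probability proportional to exp(Q(a, s) / tau)."""
    qs = [q.get((state, a), 0.0) for a in actions]
    m = max(qs)                                   # shift by the max for numerical stability
    weights = [math.exp((v - m) / tau) for v in qs]
    r = random.uniform(0.0, sum(weights))
    cumulative = 0.0
    for a, w in zip(actions, weights):
        cumulative += w
        if r <= cumulative:
            return a
    return actions[-1]                            # guard against floating-point round-off
```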
Methodology Introduction
Methodology • Construct a simple AIRLINE pricing model • Dynamic Pricing of Airline Tickets with Competition - Currie, Cheng and Smith (2005) • Reinforcement Learning Approach to Airline Seat Allocation - Gosavi, Bandla and Das (2002) • Analyse the optimal solution for the model • Find the optimal solution using Dynamic Programming • Deduce generalisations from these results • Run various Reinforcement Learning (RL) models • Compare to the ‘optimal’ solutions • Prove RL converges using Stochastic Approximation Tools for Thinking: Modelling in Management Science - Mike Pidd (1996)
Airline Pricing Model: Flow Diagram [Diagram: Dynamic Programming uses backward induction to compute the optimal policy; Reinforcement Learning repeatedly generates episodes, the players learn about the environment, and the policy is updated; the resulting policies are then compared]
Airline Pricing Model Introduction to an Airline Pricing Model
Airline Pricing Model The game consists of two competing airline firms. The firms are ‘P1’ and ‘P2’. • Each firm is selling seats for a single leg flight • Both flights are identical • Firms attract customers with their prices A separate model is used for customer demand. The Theory and Practice of Revenue Management - K. Talluri and G. van Ryzin (2004)
Simple Airline Pricing Model [Diagram: P1 sets its price, then P2 sets its price (end of Round 1); P1 changes its price, then P2 changes its price (end of Round 2); the flights leave; customers buy at the lowest price]
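A minimal sketch of one play-through of this game is given below. The price grid, the seat capacity, and the rule that a single customer per round buys from the cheaper firm with a seat left are illustrative assumptions, not the exact parameters of the model in the talk.

```python
import random

PRICES = [8, 9, 10]   # assumed price grid for illustration

def play_episode(policy1, policy2, seats=2, rounds=2):
    """policy1/policy2 map an observable state to a price; returns (revenue1, revenue2)."""
    revenue = [0, 0]
    remaining = [seats, seats]
    prices = [None, None]
    for rnd in range(rounds):
        state1 = (rnd, tuple(remaining), tuple(prices))
        prices[0] = policy1(state1)                     # P1 sets (or changes) its price
        state2 = (rnd, tuple(remaining), tuple(prices))
        prices[1] = policy2(state2)                     # P2 responds after seeing P1's price
        # one customer buys from the cheaper firm that still has a seat
        for firm in sorted(range(2), key=lambda i: prices[i]):
            if remaining[firm] > 0:
                remaining[firm] -= 1
                revenue[firm] += prices[firm]
                break
    return tuple(revenue)

# usage: both firms price at random
r1, r2 = play_episode(lambda s: random.choice(PRICES),
                      lambda s: random.choice(PRICES))
```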
Airline Pricing Model Solution Example for the Simple Airline Model
Solution Example [Game-tree figure]
Solution Example [Game-tree figure] Player ‘1’ can now attempt to attract one or both of the remaining customers. However, player ‘2’ still has a chance to undercut to gain the last customer.
Solution Example [Game-tree figures: remaining steps of the worked example]
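The optimal play illustrated above is found by backward induction over the game tree. A small self-contained sketch of that computation follows, under the same illustrative assumptions as the earlier model sketch (with prices re-chosen from scratch each round); it is not the exact dynamic programme used in the talk.

```python
from functools import lru_cache

PRICE_GRID = (8, 9, 10)   # assumed price grid for illustration

def sell(p1, p2, seats1, seats2):
    """One customer buys from the cheaper firm with a seat left; ties go to P1."""
    for firm, price in sorted([(0, p1), (1, p2)], key=lambda t: (t[1], t[0])):
        if (seats1, seats2)[firm] > 0:
            rev = [0, 0]
            rev[firm] = price
            new = [seats1, seats2]
            new[firm] -= 1
            return tuple(rev), tuple(new)
    return (0, 0), (seats1, seats2)

@lru_cache(maxsize=None)
def optimal_value(rnd, seats1, seats2, rounds=2):
    """Revenues (P1, P2) under optimal sequential play from this state."""
    if rnd == rounds:
        return (0, 0)
    best_for_p1 = None
    for p1 in PRICE_GRID:                      # P1 anticipates P2's best response
        best_reply = None
        for p2 in PRICE_GRID:
            (r1, r2), (s1, s2) = sell(p1, p2, seats1, seats2)
            c1, c2 = optimal_value(rnd + 1, s1, s2, rounds)
            outcome = (r1 + c1, r2 + c2)
            if best_reply is None or outcome[1] > best_reply[1]:
                best_reply = outcome
        if best_for_p1 is None or best_reply[0] > best_for_p1[0]:
            best_for_p1 = best_reply
    return best_for_p1

print(optimal_value(0, 2, 2))   # optimal-play revenues from the start of the game
```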
Comparison Using metrics to compare policies
Comparisons Once we have learned a policy, how do we compare policies? • Q-values or action probabilities • Difficulty in weighting states What I really care about is return, so: • Compare the return from each path • Curse of dimensionality • Produce the Return Probability Distribution (RPD) of the different policies played against some standard policies (see the sketch below): • Nash distribution, Nash equilibrium, myopic play, random play, etc. • Would need to compare ALL possibilities to be sure of convergence
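As a sketch of the RPD idea, the following builds on the play_episode() and PRICES names from the earlier model sketch and estimates player 1's return distribution by Monte Carlo against a fixed standard opponent; the episode count is an arbitrary illustrative choice.

```python
from collections import Counter
import random

def estimate_rpd(policy, opponent, episodes=10_000):
    """Return a dict mapping observed return -> empirical probability for player 1."""
    counts = Counter(play_episode(policy, opponent)[0] for _ in range(episodes))
    return {ret: n / episodes for ret, n in counts.items()}

# usage: RPD of random play against random play
random_policy = lambda s: random.choice(PRICES)
print(estimate_rpd(random_policy, random_policy))
```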
Nash Equilibrium: derived from play of (5, 10, 9, 8). The BLUE bars are for P2 and the RED bars for P1.
Nash Distribution: τ = 0.0020. The differences are so small that you cannot notice them here.
Nash Distribution: τ = 0.0050. We can now see a slight change in the distribution.
Nash Distribution: τ = 0.0100. Notice that there is more variation for P2 than for P1.
Nash Distribution: τ = 0.0200. Notice that P1 is observing some very bad results.
Nash Distribution: τ = 0.2000. We almost get random play (see next slide).
Random Play: notice that the expected rewards are even, as the order of the players does not matter.
Metrics If a policy is very similar to another policy, we would expect to see similar RPDs from both policies when played against the standard policies. How do we compare RPDs? • The L1-metric is meaningless here… • Hellinger, Kolmogorov-Smirnov, Gini, Information value, Separation, Total Variation, Mean, Chi-squared… (two of these are sketched below) On Choosing and Bounding Probability Metrics - Alison Gibbs and Francis Su (2002)
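Two of the listed metrics, total variation and Hellinger distance, are sketched below for discrete RPDs stored as dicts mapping return to probability (the format produced by the earlier estimate_rpd sketch).

```python
import math

def total_variation(p, q):
    """Total variation distance: half the L1 distance between the two distributions."""
    support = set(p) | set(q)
    return 0.5 * sum(abs(p.get(x, 0.0) - q.get(x, 0.0)) for x in support)

def hellinger(p, q):
    """Hellinger distance between two discrete distributions (0 = identical, 1 = disjoint)."""
    support = set(p) | set(q)
    s = sum((math.sqrt(p.get(x, 0.0)) - math.sqrt(q.get(x, 0.0))) ** 2 for x in support)
    return math.sqrt(s / 2.0)
```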
Example Metric Results Metric comparison of the RPDs of: 1) Nash Equilibrium policy vs. SARSA-learnt policy 2) Nash Equilibrium policy vs. Nash Equilibrium policy. Greedy action selection was used for calculating the RPD. The x-axis is a log scale of episodes; 10M episodes were run in total.
Reinforcement Learning Model Results
Tau Variation Results compare the learning policy’s RPD to the corresponding Nash Distribution policy’s RPD. MC seems to improve as exploration increases. Why not increase exploration?
Other Issues 1) Stability • Excess exploration implies instability • Higher dependency on the most recent observation implies instability 2) Computing • Batch runs: 100 x 10M episodes • 2.2 GHz, 4 GB RAM • Time considerations: 23 hrs • Memory requirements: 300 MB 3) Curse of Dimensionality • Wish to increase the number of rounds 4) Customer Behaviour • Wish to change customer behaviour (e.g. multiple customers, Logit models) Simulation-Based Optimization - Abhijit Gosavi (2003)
Conclusions A simple airline pricing model can lead to some interesting results. Understanding the meaning of these results might give insight into real-world pricing policy. Trying to solve this model with the RL algorithms reveals interesting behaviour: • Curse of Dimensionality • Stability The SARSA RL method outperforms the other methods for certain exploration levels.
Questions? Andrew Collins a.j.collins@soton.ac.uk