Learning to Price Airline Seats Under Competition

Presentation Transcript


  1. Learning to Price Airline Seats Under Competition 7th Annual INFORMS Revenue Management and Pricing Conference Barcelona, Spain Thursday, 28th June 2007 Presenter: Andrew Collins a.j.collins@soton.ac.uk Supervisor: Prof Lyn Thomas

  2. Overview • Motivation • Reinforcement Learning • Methodology • Model • Results • Conclusions

  3. Motivation • Game Theory project manager, 2001-2004, Defence Science and Technology Laboratories (Dstl), UK • Frustration with Game Theory: difficulty with deriving feasible solutions; difficulty in validating results (due to simplifications); dependency on input variables; speed and memory issues of running models • Research at University of Southampton: “To demonstrate Game Theory as a practical analytical modelling technique for usage within the OR community” Applications of Game Theory in Defence Project - A. Collins, F. Pullum, L. Kenyon (2003) [Dstl - Unclassified]

  4. Learning in Games: • Brown’s Fictitious Play (1951) • Fudenberg and Levine (1998) Evolutionary: • Weibull (1995) • Replicator Dynamics Neural Networks: • Just a statistical process in the limit • Neal (1996) Reinforcement Learning: • Association with psychology • Pavlov (1927) • Rescorla and Wagner (1972) • Convergence • Collins and Leslie (2005) Theory of Learning in Games - Drew Fudenberg and David Levine (1998)

  5. Reinforcement Learning Introduction

  6. Reinforcement Learning (RL) A.K.A. ‘Neuro-Dynamic Programming’ or ‘Approximate Dynamic Programming’. [Diagram: the agent, in state “s”, takes action “a” and receives reward “r” from the environment.] Agents/players reinforce their world-view from interaction with the environment. Neuro-Dynamic Programming - Dimitri Bertsekas and John Tsitsiklis (1996)
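
A minimal sketch of this agent-environment loop (the `env`/`agent` interfaces are illustrative assumptions, not something defined in the talk):

```python
def run_episode(env, agent):
    """One episode of agent-environment interaction (illustrative interfaces)."""
    state = env.reset()
    total_reward, done = 0.0, False
    while not done:
        action = agent.select_action(state)              # agent picks action "a" in state "s"
        next_state, reward, done = env.step(action)      # environment returns reward "r" and the next state
        agent.update(state, action, reward, next_state)  # agent reinforces its world-view
        total_reward += reward
        state = next_state
    return total_reward
```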

  7. Types • Players store information about each state-action pair (called the Q-value) • They use this information to select an action when at that state • They update this information according to a rule: Q(a, s) ← (1 - α)·Q(a, s) + α·(reward(a, s) + U(next s)) ‘U’ depends on the RL type used; it usually involves either the observed return ‘R’ after the current state, or current Q-value estimates of succeeding states (called bootstrapping): Monte Carlo (MC): U = R(next s); Q-Learning (QL): U = max_a Q(a, next s); SARSA (SA): U = Q(next a, next s). Reinforcement Learning - Richard Sutton and Andrew Barto (1998)
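
As a rough illustration of this update rule and the three choices of ‘U’ (the Q-table layout and function names are my own assumptions, not from the slides):

```python
from collections import defaultdict

def make_q_table():
    """Q[state][action] -> estimated value, initialised to zero."""
    return defaultdict(lambda: defaultdict(float))

def q_update(Q, s, a, reward, next_s, next_a, observed_return, alpha, method="SA"):
    """Q(a, s) <- (1 - alpha) * Q(a, s) + alpha * (reward + U(next s)).

    U depends on the RL type:
      MC -> observed return R from the next state onwards
      QL -> max over actions of Q(., next s)  (bootstrapping)
      SA -> Q(next a, next s)                 (SARSA, bootstrapping)
    """
    if method == "MC":
        U = observed_return
    elif method == "QL":
        U = max(Q[next_s].values(), default=0.0)
    else:  # "SA"
        U = Q[next_s][next_a]
    Q[s][a] = (1 - alpha) * Q[s][a] + alpha * (reward + U)
```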

  8. Issues How do players select their actions? • Exploration vs. exploitation • Boltzmann Action Selection (a.k.a. Softmax) • Similar to Logit Models Leads to Nash distribution: • Unique • → Nash Equilibrium as τ → 0 Uncoupled Games • Hart and Mas-Colell (2003, 2006) Stochastic Uncoupled Dynamics and Nash Equilibrium - Sergiu Hart and Andreu Mas-Colell (2006)
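
A small sketch of Boltzmann (softmax) action selection with temperature τ, reusing the nested Q-table from the previous sketch (purely illustrative; the talk's exact parameterisation may differ):

```python
import math
import random

def boltzmann_action(Q, state, actions, tau):
    """Softmax/Boltzmann selection: P(a) is proportional to exp(Q(a, s) / tau).
    A large tau gives near-random exploration; as tau -> 0 play becomes greedy,
    and the corresponding Nash distribution approaches a Nash equilibrium."""
    prefs = [Q[state][a] / tau for a in actions]
    m = max(prefs)                                    # subtract the max for numerical stability
    weights = [math.exp(p - m) for p in prefs]
    return random.choices(actions, weights=weights, k=1)[0]
```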

  9. Methodology Introduction

  10. Methodology • Construct a simple AIRLINE pricing model • Dynamic Pricing of Airline Tickets with Competition - Currie, Cheng and Smith (2005) • Reinforcement Learning Approach to Airline Seat Allocation - Gosavi, Bandla and Das (2002) • Analyse the optimal solution for the model • Find optimal solution using Dynamic Programming (sketched below) • Deduce generalisation from these results • Run various Reinforcement Learning (RL) models • Compare to ‘optimal’ solutions • Prove RL converges using Stochastic Approximation Tools for Thinking: Modelling in Management Science - Mike Pidd (1996)
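
As a rough sketch of the ‘optimal solution via Dynamic Programming’ step, here is a generic finite-horizon backward induction; the `states`/`actions_of`/`transition`/`reward` interfaces are assumptions, and the real game alternates between two players rather than optimising for one:

```python
def backward_induction(states, actions_of, transition, reward, horizon):
    """Work backwards from the final round, recording the best action and value
    at each (round, state) pair. All interfaces here are illustrative."""
    V = {s: 0.0 for s in states}              # value once every round is over
    policy = {}
    for t in reversed(range(horizon)):
        V_prev, V = V, {}
        for s in states:
            best_a, best_v = None, float("-inf")
            for a in actions_of(s, t):
                v = reward(s, a, t) + V_prev[transition(s, a, t)]
                if v > best_v:
                    best_a, best_v = a, v
            policy[(t, s)] = best_a
            V[s] = best_v
    return policy, V
```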

  11. Airline Pricing Model: Flow Diagram. [Flow diagram: Dynamic Programming uses backward induction to produce the optimal policy; Reinforcement Learning repeatedly generates episodes, the players learn about the environment, and the policy is updated; the two resulting policies are then compared.]

  12. Airline Pricing Model Introduction to an Airline Pricing Model

  13. Airline Pricing Model The game consists of two competing airline firms, ‘P1’ and ‘P2’. • Each firm is selling seats for a single-leg flight • Both flights are identical • Firms attract customers with their prices A separate model is used for customer demand. The Theory and Practice of Revenue Management - K. Talluri and G. van Ryzin (2004)

  14. Simple Airline Pricing Model. [Flow diagram: round 1 - P1 sets its price, P2 sets its price; round 2 - P1 price change, P2 price change; the customer buys at the lowest price; the flights then leave.]
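
A toy rendering of this two-round timeline (the seat counts, function names, and the “cheapest seller wins” demand rule are illustrative assumptions, not the authors' exact model):

```python
def play_game(policy1, policy2, seats1=2, seats2=2, customers_per_round=1):
    """Two rounds: both firms set prices, customers buy the cheapest available seat,
    then both firms may change prices before the flights leave."""
    seats = {"P1": seats1, "P2": seats2}
    revenue = {"P1": 0.0, "P2": 0.0}
    prices = {"P1": None, "P2": None}
    for rnd in (1, 2):                            # round 1: set prices; round 2: price changes
        prices["P1"] = policy1(rnd, seats, prices)
        prices["P2"] = policy2(rnd, seats, prices)
        for _ in range(customers_per_round):      # each customer goes to the lowest price
            cheaper = min(("P1", "P2"), key=lambda firm: prices[firm])
            if seats[cheaper] > 0:
                seats[cheaper] -= 1
                revenue[cheaper] += prices[cheaper]
    return revenue                                # flights leave after round 2
```

A reinforcement-learning player would supply `policy1` (or `policy2`) as a function of the round and the observable state, and receive its revenue as the episode's return.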

  15. Airline Pricing Model Solution Example for the Simple Airline Model

  16. Solution Example [Game-tree diagram: the players’ alternating price choices and the resulting payoffs.]

  17. Solution Example Player ‘1’ can now attempt to attract one or both of the remaining customers. However, player ‘2’ still has a chance to undercut to gain the last customer.

  18. Solution Example [Game-tree diagram continued: further price choices and payoffs.]

  19. Solution Example (continued)

  20. Comparison Using metrics to compare policies

  21. Comparisons Once we have learned a policy, how do we compare policies? • Q-values or action probabilities • Difficulty in weighting states What I really care about is return, so: • Compare return from each path • Curse of dimensionality • Produce the Return Probability Distribution (RPD) of the different policies played against some standard policies: • Nash distribution, Nash equilibrium, myopic play, random play, etc. • Would need to compare ALL possibilities to be sure of convergence
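
A sketch of how an RPD might be estimated empirically (the `play_episode` simulator and the episode count are placeholders, not the authors' procedure):

```python
from collections import Counter

def return_distribution(policy, opponent, play_episode, n_episodes=10_000):
    """Estimate the Return Probability Distribution (RPD) of `policy` by playing
    many episodes against a fixed standard opponent (Nash, myopic, random, ...)."""
    counts = Counter(play_episode(policy, opponent) for _ in range(n_episodes))
    total = sum(counts.values())
    return {ret: n / total for ret, n in sorted(counts.items())}
```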

  22. Nash Equilibrium: Derived from play of (5, 10, 9, 8). The BLUE bars are for P2 and the RED bars for P1.

  23. Nash Distribution: τ = 0.0020. The differences are so small that you will not notice them here.

  24. Nash Distribution: τ = 0.0050. We can now see a slight change in the distribution.

  25. Nash Distribution: τ = 0.0100. Notice there is more variation for P2 than for P1.

  26. Nash Distribution: τ = 0.0200. Notice that P1 is observing some very bad results.

  27. Nash Distribution: τ = 0.2000. We almost get random play (see next slide).

  28. Random Play: Notice that the expected rewards are even, since the order of the players does not matter.

  29. Metrics If a policy is very similar to another policy, we would expect to see similar RPDs from both policies when played against the standard policies. How do we compare RPDs? • L1-metric is meaningless… • Hellinger, Kolmogorov-Smirnov, Gini, Information value, Separation, Total Variation, Mean, Chi-squared… On Choosing and Bounding Probability Metrics - Alison Gibbs and Francis Su (2002)
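
For concreteness, two of the listed metrics computed on discrete RPDs stored as `{return: probability}` dictionaries (a minimal sketch, not the exact formulation used in the study):

```python
import math

def total_variation(p, q):
    """Total variation distance: half the L1 distance between the two distributions."""
    support = set(p) | set(q)
    return 0.5 * sum(abs(p.get(x, 0.0) - q.get(x, 0.0)) for x in support)

def hellinger(p, q):
    """Hellinger distance between two discrete distributions."""
    support = set(p) | set(q)
    s = sum((math.sqrt(p.get(x, 0.0)) - math.sqrt(q.get(x, 0.0))) ** 2 for x in support)
    return math.sqrt(s / 2.0)
```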

  30. Example Metric Results Metric comparison of the RPDs of: 1) Nash Equilibrium policy vs. SARSA-learnt policy; 2) Nash Equilibrium policy vs. Nash Equilibrium policy. Greedy action selection was used for calculating the RPDs. The x-axis is a log scale of episodes; 10M episodes were run in total.

  31. Reinforcement Learning Model Results

  32. Tau Variation Results compare the learning policy’s RPD to the corresponding Nash Distribution policy’s RPD. MC seems to improve as exploration increases. Why not increase exploration?

  33. Other Issues 1) Stability • Excess exploration implies instability • Higher dependency on the most recent observation implies instability 2) Computing • Batch runs: 100 x 10M episodes • 2.2 GHz, 4 GB RAM • Time considerations: 23 hrs • Memory requirements: 300 MB 3) Curse of Dimensionality • Wish to increase the number of rounds 4) Customer Behaviour • Wish to change customer behaviour (e.g. multiple customers, Logit models) Simulation-Based Optimization - Abhijit Gosavi (2003)

  34. Conclusions A simple airline pricing model can lead to some interesting results, and understanding the meaning of these results might give insight into real-world pricing policy. Trying to solve this model with the RL algorithm reveals interesting behaviour: • Curse of Dimensionality • Stability The SARSA RL method outperforms the other methods for certain exploration levels.

  35. Questions? Andrew Collins a.j.collins@soton.ac.uk
