
Machine Learning at Orbitz

Learn how Orbitz established a machine learning team to enhance the customer experience through hotel sort optimization, cache efficiency, and personalized search results; how the team overcame data-access challenges with a Hadoop and Hive infrastructure to support data-driven hotel cache optimization; and why traffic partitioning and TTL optimization matter for improved search performance.


Presentation Transcript


  1. Machine Learning at Orbitz. Robert Lancaster and Jonathan Seidman. Strata 2011, February 2, 2011.

  2. Launched: 2001, Chicago, IL

  3. Why Start the Machine Learning Team at Orbitz?
  • The team was created in 2009 with the goal of applying machine learning techniques to improve the customer experience. For example:
  • Hotel sort optimization: Can we improve the ranking of hotel search results to show consumers hotels that more closely match their preferences?
  • Cache optimization: Can we intelligently cache hotel rates to optimize the performance of hotel searches?
  • Personalization/segmentation: Can we show targeted search results to specific consumer segments?

  4. Data Challenges
  • The team immediately faced challenges getting access to data:
  • Performing the required analysis requires access to large amounts of data on user interaction with the site.
  • This data is available in web analytics logs, but the required fields were not available in our data warehouse because of size considerations.
  • Even worse, we had no archive of the data beyond several days.
  • Size constraints aside, it takes considerable time and effort to get new data added to the data warehouse.

  5. New Data Infrastructure to Address These Challenges
  • Hadoop addresses these challenges by:
  • Providing long-term storage of the entire raw dataset without placing constraints on how that data is processed.
  • Allowing us to immediately take advantage of new web analytics data added to the site.
  • Providing a platform for efficient analysis of data, as well as preparation of data for input to external processes for further analysis.
  • Hive was added to the infrastructure to provide structure over the prepared data, facilitating ad-hoc queries and selection of specific data sets for analysis.
  • Data stored in Hive not only supports the machine learning efforts, but also gives analysts metrics not available through other sources.
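As a rough illustration of the log preparation this infrastructure enables, here is a Hadoop Streaming style mapper in Python that pulls a few hotel-search fields out of raw web analytics records so they can be loaded into a tab-delimited Hive table. The log format, field names, and delimiter are assumptions for illustration, not Orbitz's actual schema.

```python
#!/usr/bin/env python
# Hadoop Streaming mapper sketch: extract hotel-search fields from raw web
# analytics logs for loading into a tab-delimited Hive table.
# The input format and field names here are hypothetical.
import sys

WANTED_FIELDS = ("session_id", "hotel_id", "search_city", "checkin", "checkout")

def parse_line(line):
    """Parse a 'key=value&key=value' analytics record into a dict (assumed format)."""
    record = {}
    for pair in line.strip().split("&"):
        if "=" in pair:
            key, value = pair.split("=", 1)
            record[key] = value
    return record

def main():
    for line in sys.stdin:
        record = parse_line(line)
        if "hotel_id" not in record:
            continue  # skip non-hotel-search events
        # Emit tab-separated columns matching the (assumed) Hive table layout.
        print("\t".join(record.get(field, "") for field in WANTED_FIELDS))

if __name__ == "__main__":
    main()
```

A script like this would typically be run with the Hadoop Streaming jar (-mapper) and its output pointed at an HDFS directory backing a Hive external table.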

  6. New Data Infrastructure – Cont’d
  • Hadoop and Hive are now being used by the machine learning team to:
  • Extract data from logs for hotel sort and cache optimization analyses.
  • Distribute complex cross-validation and performance evaluation operations (see the sketch below).
  • Extract data for clustering.
  • Hadoop and Hive have also gained rapid adoption in the organization beyond the machine learning team: evaluating page download performance, searching production logs, keyword analysis, etc.
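The slide above mentions distributing cross-validation. The sketch below shows the kind of fold assignment that makes this embarrassingly parallel: each record is hashed deterministically into one of k folds, so each fold's train/evaluate step can run as an independent task. The hashing scheme and record keys are illustrative assumptions, not the team's actual setup.

```python
# Sketch of deterministic fold assignment for distributed cross-validation:
# each record key maps to one of k folds, so the k train/evaluate jobs can
# run as independent tasks. Details are illustrative.
import hashlib

K_FOLDS = 5

def fold_for(record_key, k=K_FOLDS):
    """Deterministically map a record key (e.g. a session id) to a fold index."""
    digest = hashlib.md5(record_key.encode("utf-8")).hexdigest()
    return int(digest, 16) % k

def split(records, k=K_FOLDS):
    """Yield (train, test) lists for each fold; fine for a local sanity check."""
    folds = [[] for _ in range(k)]
    for key, features in records:
        folds[fold_for(key, k)].append((key, features))
    for i in range(k):
        test = folds[i]
        train = [r for j in range(k) for r in folds[j] if j != i]
        yield train, test
```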

  7. Use Case – Hotel Cache Optimization
  Overview:
  Search methodology:
  • Subset of total properties in a location (1 page at a time).
  • Get “just enough” information to present to consumers.
  Caching:
  • Reduces impact to suppliers (maintain “look-to-book” ratio).
  • Reduces latency.
  • Increases “coverage.”
  Optimization Goal: Improve the customer experience (reduce latency, increase coverage) when searching for hotel rates while controlling impact on suppliers (maintain look-to-book).
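A minimal sketch of the caching idea described above, assuming a simple in-memory store keyed by hotel and stay dates: cached rates are served until their TTL expires, which reduces latency and avoids supplier calls (protecting look-to-book). The names and structure are illustrative, not the production cache.

```python
# Minimal TTL cache sketch for hotel rates, keyed by (hotel_id, checkin, checkout).
# Serving from cache avoids a supplier call and cuts latency; structure is illustrative.
import time

class RateCache:
    def __init__(self, ttl_seconds=3600):
        self.ttl = ttl_seconds
        self.entries = {}  # key -> (rate, expiry_timestamp)

    def get(self, key):
        entry = self.entries.get(key)
        if entry is None:
            return None
        rate, expires_at = entry
        if time.time() > expires_at:
            del self.entries[key]  # expired: force a fresh supplier request
            return None
        return rate

    def put(self, key, rate):
        self.entries[key] = (rate, time.time() + self.ttl)
```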

  8. Hotel Cache Optimization – Early Attempts
  Early approaches were well intentioned, but were not driven by analysis of the available data. For example:
  Theory: A high amount of thrashing leads to eviction of more useful cache entries.
  Attempted Solution: Increase cache size.
  Result: No increase in measured coverage.
  Problem: No actual analysis of the required cache size.
  Theory: Locally managed inventory represents “free” information and can be requested without limit to improve coverage.
  Attempted Solution: Don’t cache locally managed inventory; increase the amount of local inventory requested with each user search.
  Result: No increase in measured coverage.
  Problem: Locally managed inventory doesn’t represent a large percentage of total inventory and is already highly preferenced.

  9. Hotel Cache Optimization – Data-Driven Approaches
  • Traffic partitioning: Identify the subset of traffic that is most efficient and optimize that subset through prefetching and increased bursting.
  • TTL optimization: Use historic logs of availability and rate-change information to predict the volatility of hotel rates and optimize cache TTL.
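A hedged sketch of the TTL-optimization idea: estimate each hotel's rate volatility from historic rate-change logs and give stable hotels a longer cache TTL. The volatility measure (rate changes per observation) and the TTL bounds are illustrative assumptions, not the values used at Orbitz.

```python
# Sketch of TTL optimization: hotels whose rates rarely change in historic logs
# get a longer cache TTL, volatile hotels a shorter one. The volatility measure
# and TTL bounds below are illustrative assumptions.
from collections import defaultdict

MIN_TTL, MAX_TTL = 15 * 60, 24 * 3600  # 15 minutes to 24 hours

def estimate_volatility(observations):
    """observations: iterable of (hotel_id, rate) snapshots ordered by time."""
    last_rate = {}
    changes = defaultdict(int)
    seen = defaultdict(int)
    for hotel_id, rate in observations:
        seen[hotel_id] += 1
        if hotel_id in last_rate and rate != last_rate[hotel_id]:
            changes[hotel_id] += 1
        last_rate[hotel_id] = rate
    # Fraction of observed transitions where the rate changed, per hotel.
    return {h: changes[h] / max(seen[h] - 1, 1) for h in seen}

def ttl_for(volatility):
    """Map a change frequency in [0, 1] to a TTL: stable rates cache longer."""
    return int(MAX_TTL - volatility * (MAX_TTL - MIN_TTL))
```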

  10. Hotel Cache Optimization – Traffic Distribution
  A small number of queries (3%) accounts for more than a third of search volume.
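The skew described above can be measured with a cumulative-share calculation like the one sketched below: sort distinct queries by frequency and compute what fraction of total search volume the top slice covers. The input format (a normalized query key mapped to its search count) is an assumption for illustration.

```python
# Sketch of measuring traffic skew: what share of total searches do the top
# fraction of distinct queries account for? Input is an assumed mapping from a
# normalized query (e.g. city + dates + occupancy) to its search count.
def share_of_top_queries(query_counts, top_fraction=0.03):
    counts = sorted(query_counts.values(), reverse=True)
    total = sum(counts)
    top_n = max(1, int(len(counts) * top_fraction))
    return sum(counts[:top_n]) / total

# Toy example: a few queries dominate the volume.
toy = {"CHI|0305|0307|2": 900, "NYC|0311|0312|1": 600, "LAX|0401|0403|2": 40,
       "MIA|0415|0418|2": 30, "SEA|0420|0421|1": 20}
print(share_of_top_queries(toy, top_fraction=0.2))
```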

  11. Optimize Hotel Cache – Traffic Partitioning
  • Evaluate possible mechanisms for determining the most frequent queries.
  • Favor mechanisms that give a high search/query ratio for the greatest percentage of search volume.
  • Test the stability of each mechanism across multiple time periods.
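The evaluation described above boils down to computing, for each candidate partitioning mechanism, the ratio of searches covered to distinct queries selected, per time period, so the ratio's stability can be checked. The sketch below assumes a mechanism is simply a predicate over normalized query keys; the keys and the example mechanism are hypothetical.

```python
# Sketch of comparing partitioning mechanisms: for each candidate (a predicate
# over query keys), compute searches covered per distinct query selected,
# separately per time period so stability can be checked.
from collections import defaultdict

def search_query_ratio(records, mechanism):
    """records: iterable of (period, query_key, search_count). Returns {period: ratio}."""
    searches = defaultdict(int)
    queries = defaultdict(set)
    for period, query_key, search_count in records:
        if mechanism(query_key):
            searches[period] += search_count
            queries[period].add(query_key)
    return {p: searches[p] / len(queries[p]) for p in searches}

# Hypothetical mechanism: select queries for a fixed list of high-volume cities.
TOP_CITIES = {"CHI", "NYC", "LAS"}
by_top_city = lambda query_key: query_key.split("|")[0] in TOP_CITIES
```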

  12. Conclusions and Lessons Learned
  • Start with a manageable problem (ease of measuring success, availability of data, etc.).
  • Avoid thinking of the machine learning team as an R&D organization.
  • Instead, foster machine learning approaches throughout the organization:
  • Embed resources on actual feature teams.
  • Machine learning study groups, etc.
