Machine Learning at Orbitz. Robert Lancaster and Jonathan Seidman. Strata 2011, February 2, 2011. Launched: 2001, Chicago, IL.
Why Start the Machine Learning Team at Orbitz?
• The team was created in 2009 with the goal of applying machine learning techniques to improve the customer experience.
• For example:
  • Hotel sort optimization: how can we improve the ranking of hotel search results to show consumers hotels that more closely match their preferences?
  • Cache optimization: can we intelligently cache hotel rates to optimize the performance of hotel searches?
  • Personalization/segmentation: can we show targeted search results to specific consumer segments?
Data Challenges
• The team immediately faced challenges getting access to data:
  • The required analysis needs large amounts of data on user interaction with the site.
  • This data is available in web analytics logs, but the required fields were not available in our data warehouse because of size considerations.
  • Even worse, we had no archive of the data beyond several days.
  • Size constraints aside, it takes considerable time and effort to get new data added to the data warehouse.
New Data Infrastructure to Address These Challenges
• Hadoop provides a solution to these challenges by:
  • Providing long-term storage of the entire raw dataset without placing constraints on how that data is processed.
  • Allowing us to immediately take advantage of new web analytics data added to the site.
  • Providing a platform for efficient analysis of data, as well as preparation of data for input to external processes for further analysis.
• Hive was added to the infrastructure to provide structure over the prepared data, facilitating ad-hoc queries and selection of specific data sets for analysis.
• Data stored in Hive not only supports machine learning efforts, but also provides analysts with metrics not available through other sources.
New Data Infrastructure – Cont’d
• Hadoop and Hive are now being used by the machine learning team to:
  • Extract data from logs for hotel sort and cache optimization analyses.
  • Distribute complex cross-validation and performance evaluation operations.
  • Extract data for clustering.
• Hadoop and Hive have also gained rapid adoption in the organization beyond the machine learning team: evaluating page download performance, searching production logs, keyword analysis, etc.
Use Case – Hotel Cache Optimization
Overview:
Search methodology:
• Search a subset of the total properties in a location (1 page at a time).
• Get “just enough” information to present to consumers.
Caching:
• Reduces impact on suppliers (maintains the “look-to-book” ratio).
• Reduces latency.
• Increases “coverage.”
Optimization goal: Improve the customer experience (reduce latency, increase coverage) when searching for hotel rates while controlling the impact on suppliers (maintain look-to-book).
Hotel Cache Optimization – Early Attempts
Early approaches were well intended but were not driven by analysis of the available data. For example:
• Theory: A high amount of thrashing leads to eviction of more useful cache entries.
  • Attempted solution: Increase cache size.
  • Result: No increase in measured coverage.
  • Problem: No actual analysis of the required cache size.
• Theory: Locally managed inventory represents “free” information and can be requested without limit to improve coverage.
  • Attempted solution: Don’t cache locally managed inventory. Increase the amount of local inventory requested with each user search.
  • Result: No increase in measured coverage.
  • Problem: Locally managed inventory doesn’t represent a large percentage of total inventory and is already highly preferenced.
Hotel Cache Optimization – Data-Driven Approaches
• Traffic partitioning: Identify the subset of traffic that is most efficient and optimize that subset through prefetching and increased bursting.
• TTL optimization: Use historic logs of availability and rate change information to predict the volatility of hotel rates and optimize cache TTL.
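The TTL optimization idea can be illustrated with a minimal sketch. It assumes a hypothetical log of (hotel_id, timestamp, rate) observations; the function names, the median-based heuristic, the safety factor, and the TTL bounds are all illustrative assumptions and not the method described in the talk.

```python
from collections import defaultdict
from statistics import median


def change_intervals(observations):
    """Per hotel, list the seconds elapsed between consecutive rate changes.

    observations: iterable of (hotel_id, timestamp_seconds, rate) tuples.
    This log format is an assumption made for the sketch.
    """
    by_hotel = defaultdict(list)
    for hotel_id, ts, rate in observations:
        by_hotel[hotel_id].append((ts, rate))

    intervals = {}
    for hotel_id, points in by_hotel.items():
        points.sort()  # order observations by timestamp
        gaps = []
        last_change_ts, last_rate = points[0]
        for ts, rate in points[1:]:
            if rate != last_rate:
                gaps.append(ts - last_change_ts)
                last_change_ts, last_rate = ts, rate
        intervals[hotel_id] = gaps
    return intervals


def suggest_ttl(intervals, safety=0.5, min_ttl=60, max_ttl=86400):
    """Suggest a cache TTL per hotel from its observed rate volatility.

    Heuristic (an assumption): a fraction of the median time between
    changes, clamped to [min_ttl, max_ttl]. Hotels with no observed
    change receive max_ttl.
    """
    ttls = {}
    for hotel_id, gaps in intervals.items():
        ttl = int(median(gaps) * safety) if gaps else max_ttl
        ttls[hotel_id] = max(min_ttl, min(max_ttl, ttl))
    return ttls


# Hypothetical sample log: hotel A changes rate every 20 minutes, B never.
log = [
    ("A", 0, 100), ("A", 600, 100), ("A", 1200, 110), ("A", 2400, 120),
    ("B", 0, 80), ("B", 3600, 80),
]
ttl_by_hotel = suggest_ttl(change_intervals(log))
```

In this sketch, volatile hotels get short TTLs so stale rates are evicted quickly, while stable hotels stay cached longer, which is one plausible way to trade coverage against supplier load.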
Hotel Cache Optimization – Traffic Distribution
A small number of queries (3%) make up more than a third of search volume.
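A concentration figure like this can be computed from a search log with a few lines of code. The following is a minimal sketch, assuming each search has already been reduced to a hashable query key (for example a location string); the function name and the sample data are hypothetical.

```python
from collections import Counter


def volume_share_of_top_queries(searches, top_fraction):
    """Share of total search volume produced by the most frequent queries.

    searches: iterable of query keys, one entry per search performed.
    top_fraction: fraction of distinct queries to include (0.03 for 3%).
    """
    counts = Counter(searches)
    if not counts:
        raise ValueError("no searches supplied")
    ranked = sorted(counts.values(), reverse=True)
    # Always include at least one query so tiny samples still work.
    top_n = max(1, round(len(ranked) * top_fraction))
    return sum(ranked[:top_n]) / sum(ranked)


# Hypothetical sample: 4 distinct queries, 10 searches in total.
sample = ["chicago"] * 6 + ["nyc"] * 2 + ["austin", "boise"]
share = volume_share_of_top_queries(sample, 0.25)  # top 1 of 4 queries
```

Here the single most frequent query accounts for 6 of the 10 searches in the sample.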
Optimize Hotel Cache – Traffic Partitioning
• Evaluate possible mechanisms for determining the most frequent queries.
• Favor mechanisms that give a high search/query ratio for the greatest percentage of search volume.
• Test for stability of the mechanism across multiple time periods.
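The evaluation criteria above can be sketched in code. This is a hedged illustration only: it interprets "search/query ratio" as searches covered per distinct selected query, uses a simple top-k frequency rule as the candidate mechanism, and measures stability as Jaccard overlap between periods. All of these choices, and every name below, are assumptions rather than the approach taken in the talk.

```python
from collections import Counter


def top_queries(searches, k):
    """Candidate mechanism (assumed): pick the k most frequent queries."""
    return {query for query, _ in Counter(searches).most_common(k)}


def evaluate_partition(selected, searches):
    """Score a selected query set against a list of searches."""
    if not selected or not searches:
        raise ValueError("selected and searches must be non-empty")
    counts = Counter(searches)
    covered = sum(counts[q] for q in selected)  # absent queries count as 0
    return {
        "volume_share": covered / len(searches),
        "search_query_ratio": covered / len(selected),
    }


def stability(selected_a, selected_b):
    """Jaccard overlap of the query sets chosen in two time periods."""
    union = selected_a | selected_b
    return len(selected_a & selected_b) / len(union) if union else 1.0


# Hypothetical logs for two periods.
week1 = ["chicago"] * 5 + ["nyc"] * 3 + ["austin", "boise"]
week2 = ["chicago"] * 4 + ["nyc"] * 4 + ["denver"] * 2

chosen = top_queries(week1, 2)               # select on one period
scores = evaluate_partition(chosen, week2)   # evaluate on the next
overlap = stability(chosen, top_queries(week2, 2))
```

Selecting on one period and scoring on the next is what makes the stability check meaningful: a mechanism that only looks good on the data it was tuned on would show a low overlap and a low volume share here.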
Conclusions and Lessons Learned
• Start with a manageable problem (ease of measuring success, availability of data, etc.).
• Avoid thinking of the machine learning team as an R&D organization.
• Instead, foster machine learning approaches throughout the organization:
  • Embed resources on actual feature teams.
  • Run machine learning study groups, etc.