From The Lab to the Factory: Building A Production Machine Learning Infrastructure. Josh Wills, Senior Director of Data Science, Cloudera
A Shift In Perspective
Analytics in the Lab • Question-driven • Interactive • Ad-hoc, post-hoc • Fixed data • Focus on speed and flexibility • Output is embedded into a report or in-database scoring engine
Analytics in the Factory • Metric-driven • Automated • Systematic • Fluid data • Focus on transparency and reliability • Output is a production system that makes customer-facing decisions
From the Lab to the Factory: First Steps
Introducing Gertrude • Multivariate Testing • Define and explore a space of parameters • Overlapping Experiments • Tang et al. (2010) • Runs multiple independent experiments on every request
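The idea of running multiple independent experiments on every request can be sketched in a few lines. This is a minimal illustration of layered diversion in the spirit of Tang et al. (2010), not Gertrude's actual API: the layer names, experiment names, bucket ranges, and the `bucket` and `divert` functions are all hypothetical. Each layer hashes the request's unit ID with its own salt, so assignments in one layer are statistically independent of assignments in another.

```python
import hashlib

NUM_BUCKETS = 1000


def bucket(unit_id: str, salt: str, num_buckets: int = NUM_BUCKETS) -> int:
    """Map a unit (e.g. a user or cookie ID) to a stable bucket for one layer."""
    digest = hashlib.sha256(f"{salt}:{unit_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_buckets


# Hypothetical experiment space: each layer splits the buckets [0, 1000)
# among its experiments as half-open ranges (name, start, end).
LAYERS = {
    "ranking": [("rank_control", 0, 500), ("rank_new_model", 500, 1000)],
    "ui": [("ui_control", 0, 900), ("ui_big_button", 900, 1000)],
}


def divert(unit_id: str, layers=LAYERS) -> dict:
    """Return the experiment chosen in every layer for this unit."""
    assignment = {}
    for layer_name, experiments in layers.items():
        # Salting with the layer name decorrelates the layers from each other.
        b = bucket(unit_id, salt=layer_name)
        for name, start, end in experiments:
            if start <= b < end:
                assignment[layer_name] = name
                break
    return assignment
```

Calling `divert("user-42")` returns one experiment per layer, and the same unit ID always lands in the same experiments, which keeps the experience consistent across requests.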
Simple Conditional Logic • Declare experiment flags in compiled code • Settings that can vary per request • Create a config file that contains simple rules for calculating flag values and rules for experiment diversion
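The split between flags declared in code and rules kept in a config file might look like the sketch below. It assumes a JSON config and a hypothetical rule shape (`flag`, `when_experiment`, `value`); Gertrude's real config format and flag API are not shown in these slides, so every name here is illustrative only.

```python
import json

# Flags declared in code, with their default values (hypothetical names).
DECLARED_FLAGS = {"num_results": 10, "use_new_ranker": False}

# Hypothetical config file contents: simple rules that override a flag
# when the request has been diverted into a given experiment.
CONFIG_TEXT = """
{
  "rules": [
    {"flag": "num_results", "when_experiment": "more_results", "value": 20},
    {"flag": "use_new_ranker", "when_experiment": "rank_new_model", "value": true}
  ]
}
"""


def flag_values(config: dict, active_experiments: set, declared=DECLARED_FLAGS) -> dict:
    """Compute per-request flag values: defaults first, then matching rules."""
    values = dict(declared)
    for rule in config["rules"]:
        name = rule["flag"]
        if name not in declared:
            # A rule may only set a flag that the code actually declares.
            raise KeyError(f"rule refers to undeclared flag: {name}")
        if rule["when_experiment"] in active_experiments:
            values[name] = rule["value"]
    return values
```

With this shape, application code only reads flag values; which requests see which values is decided entirely by the config.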
Separate Data Push from Code Push • Validate config files and push updates to servers • ZooKeeper via Curator • File-based • Servers pick up new configs, load them, and update experiment space and flag value calculations
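A rough sketch of the file-based variant: the server validates a new config before swapping it in, so a bad push never replaces a working experiment space. The `ConfigHolder` class, its methods, and the JSON rule shape are assumptions made for illustration and are not Gertrude's API. A ZooKeeper-backed version would presumably call the same `reload` logic from a watch callback instead of reading a local file.

```python
import json
import threading


class ConfigHolder:
    """Holds the live config and replaces it only with validated updates."""

    def __init__(self, path: str, declared_flags):
        self._path = path
        self._declared = set(declared_flags)
        self._lock = threading.Lock()
        self._config = {"rules": []}  # start with an empty experiment space

    def validate(self, config) -> None:
        """Raise ValueError if the config is malformed or names unknown flags."""
        if not isinstance(config, dict) or not isinstance(config.get("rules"), list):
            raise ValueError("config must be an object with a 'rules' list")
        for rule in config["rules"]:
            if not isinstance(rule, dict) or rule.get("flag") not in self._declared:
                raise ValueError(f"invalid rule: {rule!r}")

    def reload(self) -> bool:
        """Load and validate the file; swap it in only if it is valid."""
        try:
            with open(self._path, encoding="utf-8") as f:
                candidate = json.load(f)  # JSONDecodeError is a ValueError
            self.validate(candidate)
        except (OSError, ValueError):
            return False  # keep serving the previous config
        with self._lock:
            self._config = candidate
        return True

    def current(self) -> dict:
        with self._lock:
            return self._config
```

A server could call `reload()` on a timer or when the file's modification time changes; request handlers only ever call `current()`.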
A Few Links I Love • http://research.google.com/pubs/pub36500.html • The original paper on the overlapping experiments infrastructure at Google • http://www.exp-platform.com/ • Collection of all of Microsoft’s papers and presentations on their experimentation platform • http://www.deaneckles.com/blog/596_lossy-better-than-lossless-in-online-bootstrapping/ • Dean Eckles on his paper about bootstrapped confidence intervals with multiple dependencies
Josh Wills, Director of Data Science, Cloudera @josh_wills Thank you!