Online Machine Learning with Distributed In-memory Clusters

Presentation Transcript


  1. Online Machine Learning with Distributed In-memory Clusters • Arshak Navruzyan, VP Product • Argyle Data • Acknowledgements • Nina Zumel, PhD – Win-Vector LLC

  2. Contents • Business use-cases • DevOps / Continuous Deployment • Quality of Service (QoS) • Quality of Experience (QoE) • Machine learning methods for anomaly detection • Nearest Neighbor (NN) • Isolation Forest (iForest) • Random Forests (RF)

  3. Ecommerce operator use case • Large distributed search farm logs usage data into an in-memory grid • Mixed measurements • Continuous: query latency, resource utilization, etc. • Categorical: client IP, server IP, code version, etc. • ~1-3 TB of search logs every 12 hours • Find the anomalies • Data isn’t easy to characterize at this size • Anomalies span multiple variables (combinations of server, code version, latency) • No labeled data is available • Even a modest false-positive rate at this scale produces a flood of alerts • Very few ML methods operate at this scale

  4. In-memory grid provides a live view of the data • Distributed SQL store • Read/write optimized • Automatic placement (sharding) • Fast ingest (a million inserts per second) • Fast aggregations (a billion rows per second) • Holds ~30 days of data online • Each insert also sets a time-to-live (TTL) • Simple monitoring tool • D3.js-based horizon graphs • Socket.io / Node.js
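To make the ingest path concrete, here is a minimal sketch in Python, assuming a hypothetical search_logs table, a generic DB-API style connection, and an expires_at column standing in for the grid's TTL mechanism (none of these names come from the slides):

```python
# Hypothetical ingest sketch: the table, columns, and TTL handling are illustrative
# only; the actual grid schema and client API are not shown in the slides.
ROW_TTL_SECONDS = 30 * 24 * 3600  # hold ~30 days of data online

def ingest(conn, record):
    """Insert one search-log record; the insert itself sets the row's expiry."""
    conn.execute(
        "INSERT INTO search_logs "
        "(ts, client_ip, server_ip, code_version, query_latency_ms, expires_at) "
        "VALUES (?, ?, ?, ?, ?, ?)",
        (
            record["ts"],
            record["client_ip"],
            record["server_ip"],
            record["code_version"],
            record["query_latency_ms"],
            record["ts"] + ROW_TTL_SECONDS,  # insert also sets the time-to-live
        ),
    )
```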

  5. Isn’t there an open source project that does this? • Etsy’s continuous deployment problem • 1.5B page views, 1M users, $117M in goods sold • 250+ committers, everyone deploys! • 30+ deploys to production a day • ~8 commits per deploy • How do you NOT break production?

  6. How do you monitor an environment like Etsy’s? • Usual suspects: IT monitoring tools like Ganglia, Nagios • But … “Not all things that break throw errors” • Etsy’s Kale • StatsD - StatsD::increment(“foo.bar”) • Skyline – real-time anomaly detection system • Oculus – anomaly correlation system using dynamic time warping
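For reference, the same instrumentation with the Python statsd client might look like this (assuming a StatsD daemon listening on localhost:8125; "foo.bar" is the slide's example, the timing metric name is made up):

```python
# Instrumentation sketch using the Python statsd package
# (assumes a StatsD daemon is listening on localhost:8125).
import statsd

stats = statsd.StatsClient("localhost", 8125)
stats.incr("foo.bar")               # counter, like StatsD::increment("foo.bar")
stats.timing("search.latency", 42)  # timing metric, in milliseconds (hypothetical name)
```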

  7. Skyline & Oculus

  8. Did they solve the problem? • Skyline’s anomaly detection • Basic principle: “A metric is anomalous if its latest datapoint is over three standard deviations above the moving average” • Ensemble of methods: trailing average, median absolute deviation, Grubbs’ test, standard deviation from the moving average, least squares (3 sigma), Kolmogorov-Smirnov test, etc. • Results get better with the ensemble technique but are still very noisy • Non-normal distributions • Spike influence • Periodicity / seasonality
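As a point of comparison, the basic rule is simple enough to sketch in a few lines of Python (a minimal illustration, not Skyline's actual code):

```python
# A minimal sketch of the basic rule: flag the latest datapoint if it is more than
# three standard deviations above the moving average of a trailing window.
import statistics

def three_sigma_anomaly(series, window=60):
    """series: chronological list of floats; True if the newest point is anomalous."""
    history = series[-window - 1:-1]      # trailing window, excluding the newest point
    if len(history) < 2:
        return False                      # not enough history to say anything
    moving_avg = statistics.mean(history)
    stdev = statistics.pstdev(history)
    return series[-1] > moving_avg + 3 * stdev
```

The drawbacks listed above follow directly from this rule: it assumes a roughly normal distribution, a single spike inflates the moving average and standard deviation, and seasonal patterns look like anomalies.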

  9. Machine learning based anomaly detection • “Benchmarking algorithms for detecting anomalies in large datasets” – Uriel Carrasquilla (2010)

  10. Nearest-neighbor approach • [Figure: scatter plot with normal regions N1 and N2, isolated anomalies o1 and o2, and an anomalous region O3] • N1 and N2 are regions of normal behaviour • Points o1 and o2 are anomalies • Points in region O3 are anomalies • Advantages • No need for assumptions about the data distribution • No need for pre-labelled anomalies • Supports categorical as well as continuous variables • Drawbacks • Computationally expensive – quadratic in data volume (every point has to be compared to every other point) • Existing implementations are batch-oriented • “Anomaly Detection: A Tutorial” – A. Banerjee, V. Chandola, V. Kumar, J. Srivastava (SIAM International Conference on Data Mining 2008)
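A minimal version of the nearest-neighbour score described above: each point's anomaly score is its distance to its k-th nearest neighbour, so points in sparse regions (like o1, o2, and O3) score high. This is the naive quadratic form, for illustration only:

```python
# Naive k-NN anomaly scoring: a point's score is the distance to its k-th nearest
# neighbour. Every point is compared to every other point, hence quadratic cost.
import math

def knn_scores(points, k=5):
    """points: list of equal-length numeric tuples; returns one score per point."""
    scores = []
    for i, p in enumerate(points):
        dists = sorted(
            math.dist(p, q) for j, q in enumerate(points) if j != i
        )
        scores.append(dists[k - 1] if len(dists) >= k else float("inf"))
    return scores
```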

  11. Avoiding the quadratic computation time of NN • Bay & Schwabacher, SIGKDD 2003 • “Mining Distance-Based Outliers in Near Linear Time with Randomization and a Simple Pruning Rule” • Only anomalies end up being compared to nearly every point in the data set • If anomalies are rare, most points get compared to only a small constant number of points • Challenges that remain • A batch learner won’t work at our scale • How would this work in a sharded environment? • Linear time is still long for “Big Data”
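A rough sketch of the randomization-plus-pruning idea (not the actual ORCA code): scan the data in random order, keep the top-n anomalies seen so far, and abandon a candidate as soon as its running k-nearest-neighbour distance drops below the score of the weakest point in the current top-n.

```python
# Sketch of Bay & Schwabacher's pruning rule; illustrative, not the published code.
import heapq, math, random

def top_anomalies(points, k=5, n=10):
    """points: list of numeric tuples; returns the n highest-scoring (score, point) pairs."""
    data = list(points)
    random.shuffle(data)                     # randomize the scan order
    top = []                                 # min-heap of (score, point): current top-n anomalies
    cutoff = 0.0                             # score of the weakest current top-n anomaly
    for p in data:
        neighbours = []                      # negated max-heap of the k smallest distances so far
        pruned = False
        for q in data:
            if q is p:
                continue
            d = math.dist(p, q)
            if len(neighbours) < k:
                heapq.heappush(neighbours, -d)
            elif d < -neighbours[0]:
                heapq.heapreplace(neighbours, -d)
            # The running k-th-NN distance can only shrink; prune once it falls below the cutoff.
            if len(neighbours) == k and -neighbours[0] < cutoff:
                pruned = True
                break
        if not pruned and neighbours:
            score = -neighbours[0]           # distance to the k-th nearest neighbour found
            heapq.heappush(top, (score, p))
            if len(top) > n:
                heapq.heappop(top)           # drop the weakest candidate
            if len(top) == n:
                cutoff = top[0][0]
    return sorted(top, reverse=True)
```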

  12. Nearest-neighbor as an online learner • ORCA (Bay & Schwabacher’s algorithm) concepts can be extended to online learning • Randomly pull a small sample of the data that is representative of a time period • Compare the streaming data (latest observations) to the representative set • The representative set moves with time (sliding window) so that noon is compared to typical noon (this addresses periodicity) • Other enhancements • Every node needs the full reference set • A distributed cache moves the reference set between nodes • Locality-sensitive hashing
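A sketch of how the online variant might look, keeping a small per-hour reference sample that drifts with time so that noon is scored against typical noon (the class name, sampling scheme, and parameters are illustrative, not from the talk):

```python
# Illustrative streaming nearest-neighbour scorer with a drifting reference sample.
import collections, math, random

class SlidingReferenceSet:
    """Keeps a small random reference sample per hour of day and scores new points against it."""

    def __init__(self, max_size=2000, k=5):
        self.max_size = max_size
        self.k = k
        self.samples = collections.defaultdict(list)      # hour of day -> list of points

    def add(self, hour, point):
        """Fold an observation into the reference sample for its hour."""
        sample = self.samples[hour]
        if len(sample) < self.max_size:
            sample.append(point)
        else:
            # Replace a random element so the sample keeps drifting toward recent data.
            sample[random.randrange(self.max_size)] = point

    def score(self, hour, point):
        """Distance to the k-th nearest neighbour within this hour's reference sample."""
        sample = self.samples[hour]
        if len(sample) < self.k:
            return 0.0                                     # not enough history yet
        dists = sorted(math.dist(point, q) for q in sample)
        return dists[self.k - 1]
```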

  13. Random partitioning approach • “Isolation Forest” – Liu, Ting & Zhou, IEEE ICDM (2008) • When building a tree, anomalies are likely to be isolated close to the root, whereas normal points end up deeper in the tree structure • No need to profile normal data points • No distance or density measures • Gaps • No support for categorical attributes • Lacks explanatory power • No obvious way to turn it into an online method
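For comparison with the batch setting, here is a usage sketch with scikit-learn's IsolationForest (the library and the toy data are additions of this write-up; the slides do not name an implementation):

```python
# Batch isolation-forest sketch: points isolated near the root of random trees
# (short average path length) are scored as anomalies.
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
normal = rng.normal(loc=0.0, scale=1.0, size=(1000, 2))   # dense "normal" cluster
outliers = rng.uniform(low=-8, high=8, size=(20, 2))      # scattered anomalies
X = np.vstack([normal, outliers])

forest = IsolationForest(n_estimators=100, contamination=0.02, random_state=0)
labels = forest.fit_predict(X)          # -1 = anomaly, 1 = normal
scores = forest.decision_function(X)    # lower scores are more anomalous
print((labels == -1).sum(), "points flagged")
```

Note that, as the slide says, categorical attributes (client IP, code version) would have to be encoded numerically first, and the model gives little explanation of why a point was isolated early.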

  14. Ensemble of trees as an online learner • Random Forest (RF) basics • Build an ensemble of trees, each on a random selection (with replacement) of observations (in-bag/out-of-bag) • At each split, consider a random subset of variables (typically about the square root of the number of attributes) • “Online Random Forests” (ORF) algorithm of Saffari et al., ICCV-OLCV 2009 • How do you perform bagging online? (Oza, “Online Bagging and Boosting”) • The number of times a tree sees a given observation under batch bagging is well approximated by a Poisson(λ = 1) draw • How do you grow random trees on-the-fly? • Continuously measure the information gain (or Gini impurity) of a potential split
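A minimal sketch of the online-bagging step from Oza that ORF builds on: each tree sees each arriving observation k ~ Poisson(λ = 1) times, approximating the bootstrap. The tree's update(x, y) method below is a placeholder for an incremental split / information-gain update, not a real API:

```python
# Online bagging sketch (Oza): Poisson(1) resampling replaces the batch bootstrap.
import numpy as np

rng = np.random.default_rng(0)

class OnlineBaggingForest:
    """Wraps incrementally updatable trees; tree.update(x, y) is a hypothetical API."""

    def __init__(self, trees):
        self.trees = trees

    def partial_fit(self, x, y):
        """Feed one new observation to the ensemble."""
        for tree in self.trees:
            k = rng.poisson(1.0)      # how many times this tree "sees" the observation
            for _ in range(k):
                tree.update(x, y)     # placeholder: grow/refine the tree incrementally
```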

  15. Offline vs. Online Training

  16. To summarize • Machine learning can produce more accurate anomaly detection than the simple statistical approach • Nearest Neighbor and Random Forests can be adapted into online learners • In-memory distributed processing can improve the performance of such algorithms • Supervised methods for classification and regression will work similarly (we think …)

  17. Thank you! www.linkedin.com/in/arshak
