
Handling Streaming Data in Spotify Using the Cloud

Learn how Spotify handles streaming data using the cloud, including event delivery, reliable queues, ETL, and more.


Presentation Transcript


  1. Handling Streaming Data in Spotify Using the Cloud Igor Maravić <igor@spotify.com> Software Engineer Neville Li <neville@spotify.com> Software Engineer

  2. Current Event Delivery System

  3. Current event delivery system — [diagram: Clients in any data centre send events through a Gateway to a Syslog Producer; Syslog Brokers, Syslog Consumers, Groupers, Realtime Brokers and an ETL job run in the Hadoop data centre, supported by Service Discovery, a Liveness Monitor, a Checkpoint Monitor and ACKs]

  4. Complex — [same diagram as slide 3, highlighting the number of moving parts]

  5. Stateless — [same diagram as slide 3, highlighting the stateless components]

  6. Delivered data growth

  7. Redesigning Event Delivery

  8. Redesigning event delivery — [diagram: Clients in any data centre send events through a Gateway to an Event Delivery Service; a File Tailer reads events from Syslog and publishes them to a Reliable Persistent Queue, from which an ETL job loads them into Hadoop in the Hadoop data centre]

  9. Same API — [same diagram as slide 8]

  10. Dedicated event streams — [same diagram as slide 8]

  11. Persistence — [same diagram as slide 8]

  12. Keep it simple — [same diagram as slide 8]

  13. Choosing Reliable Persistent Queue

  14. Kafka 0.8

  15. Event delivery with Kafka 0.8 — [diagram: Clients → Gateway → Event Delivery Service → Syslog → File Tailer → Kafka Brokers in any data centre; Mirror Makers replicate to Brokers in the Hadoop data centre, where Camus (ETL) loads events into Hadoop]

  16. Event delivery with Kafka 0.8 — [same diagram as slide 15]

  17. Cloud Pub/Sub

  18. 2M QPS published to Pub/Sub — [chart: publish rate scaling from 0/s up to 2M events/s]

  19. Event delivery with Cloud Pub/Sub — [diagram: Clients → Gateway → Event Delivery Service → Syslog → File Tailer in any data centre publish to Cloud Pub/Sub; an ETL job moves events from Cloud Pub/Sub to Cloud Storage and into Hadoop in the Hadoop data centre]

  20. ETL

  21. Event time based hourly buckets — [diagram: hourly buckets 2016-03-21 23H, 2016-03-22 00H, 01H, 02H, 03H, 04H]

  22. Incremental bucket fill — [same buckets as slide 21, filled incrementally as events arrive]

  23. Bucket completeness — [same buckets as slide 21]

  24. Late data handling — [same buckets as slide 21]
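Slides 21–24 describe grouping events into hourly buckets by *event* time, filling buckets incrementally, tracking completeness, and handling data that arrives late. A minimal sketch of that bucketing logic in plain Scala (the `Event` type and the bucket-key format are illustrative assumptions, not from the talk):

```scala
import java.time.{Instant, ZoneOffset}
import java.time.format.DateTimeFormatter
import scala.collection.mutable

// Hypothetical event record: an id plus its event-time timestamp in millis.
case class Event(id: String, eventTimeMillis: Long)

object HourlyBuckets {
  private val fmt =
    DateTimeFormatter.ofPattern("yyyy-MM-dd HH'H'").withZone(ZoneOffset.UTC)

  // Bucket key is derived from *event* time, not processing time,
  // e.g. "2016-03-22 02H" as on the slides.
  def bucketKey(e: Event): String =
    fmt.format(Instant.ofEpochMilli(e.eventTimeMillis))

  // Incremental fill: a late event simply lands in the bucket its
  // event time belongs to, even if that bucket was filled earlier.
  def fill(buckets: mutable.Map[String, Vector[Event]], e: Event): Unit = {
    val key = bucketKey(e)
    buckets.update(key, buckets.getOrElse(key, Vector.empty) :+ e)
  }
}
```

Because buckets are keyed by event time, "late data handling" falls out of the same code path as normal fills; the harder part (which the Completionist addresses later in the talk) is deciding when a bucket can be declared complete.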

  25. Experimentation with Dataflow

  26. ETL as a set of micro-services — [diagram: a Consumer reads from Cloud Pub/Sub and writes to Cloud Storage; a Completionist tracks bucket completeness; a Deduper writes deduplicated data to Cloud Storage, from which it is loaded into Hadoop in the Hadoop data centre]

  27. Consumer — [same diagram as slide 26]

  28. Completionist — [same diagram as slide 26]

  29. Deduper — [same diagram as slide 26]

  30. Where are we right now? — [same diagram as slide 26]
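The Deduper exists because Cloud Pub/Sub (like most reliable queues) delivers at least once, so duplicate events are expected downstream. A minimal dedup-by-id sketch in plain Scala (the `id` field and the keep-first policy are assumptions; the talk does not specify the implementation):

```scala
// Hypothetical delivered event; real events would carry a unique id
// assigned upstream (e.g. at the gateway).
case class Delivered(id: String, payload: String)

object Deduper {
  // Keep the first occurrence of each id, preserving input order.
  // HashSet.add returns false for ids already seen, so filter drops them.
  def dedupe(events: Seq[Delivered]): Seq[Delivered] = {
    val seen = scala.collection.mutable.HashSet.empty[String]
    events.filter(e => seen.add(e.id))
  }
}
```

A production deduper over hourly buckets would key the seen-set per bucket and persist it, but the core idea is the same.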

  31. Scio — A Scala API for Google Cloud Dataflow

  32. Origin Story • Scalding and Spark • ML, recommendations, analytics • 50+ users, 400+ unique jobs • Growing rapidly

  33. Moving to Google Cloud • Early 2015 - Dataflow Scala hack project

  34. Why not Scalding on GCE • Pros • Community: Twitter, eBay, Etsy, Stripe, LinkedIn, … • Stable and proven

  35. Why not Scalding on GCE • Cons • Hadoop cluster operations • Multi-tenancy: resource contention and utilization • No streaming mode (Summingbird?)

  36. Why not Spark on GCE • Pros • Batch, streaming, interactive and SQL • MLlib, GraphX • Scala, Python, and R support • Zeppelin, spark-notebook, Hue

  37. Why not Spark on GCE • Cons • Hard to tune and scale • Cluster lifecycle management

  38. Why Dataflow with Scala • Dataflow • Hosted solution, no operations • Ecosystem: GCS, BigQuery, PubSub, Bigtable, … • Unified batch and streaming model

  39. Why Dataflow with Scala • Scala • High level DSL: easy transition for developers • Reusable and composable code via FP • Numerical libraries: Breeze, Algebird
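Slide 39's point about reusable, composable code via FP can be illustrated with plain Scala function composition (a generic sketch, not Scio's actual API): each pipeline stage is an ordinary function over collections, so stages can be unit-tested in isolation and recombined freely.

```scala
object Composable {
  // Each stage is an ordinary function; the names are illustrative.
  val tokenize: String => Seq[String] =
    _.split("[^a-zA-Z']+").toSeq.filter(_.nonEmpty)

  val countWords: Seq[String] => Map[String, Int] =
    _.groupBy(identity).map { case (w, ws) => (w, ws.size) }

  // Stages compose into a reusable pipeline with andThen.
  val wordCount: String => Map[String, Int] = tokenize andThen countWords
}
```

The same style carries over to Scio, where transforms over `SCollection`s compose the same way.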

  40. Scio • Ecclesiastical Latin IPA: /ˈʃi.o/, [ˈʃiː.o], [ˈʃi.i̯o] • Verb: I can, know, understand, have knowledge

  41. Core API similar to spark-core • Some ideas from Scalding • github.com/spotify/scio

  42. WordCount • Almost identical to the Spark version:

      val sc = ScioContext()
      sc.textFile("shakespeare.txt")
        .flatMap(_.split("[^a-zA-Z']+").filter(_.nonEmpty))
        .countByValue()
        .saveAsTextFile("wordcount.txt")

  43. PageRank

      def pageRank(in: SCollection[(String, String)]) = {
        val links = in.groupByKey()
        var ranks = links.mapValues(_ => 1.0)
        for (i <- 1 to 10) {
          val contribs = links.join(ranks).values
            .flatMap { case (urls, rank) =>
              val size = urls.size
              urls.map((_, rank / size))
            }
          ranks = contribs.sumByKey.mapValues((1 - 0.85) + 0.85 * _)
        }
        ranks
      }

  44. Spotify Running • 60 million tracks • 30m users * 10 tempo buckets * 25 tracks • Audio: tempo, energy, time signature ... • Metadata: genres, categories, … • Latent vectors from collaborative filtering

  45. Personalized new releases • Pre-computed weekly on Hadoop (on-premise cluster) • 100GB recommendations from HDFS to Bigtable in US+EU • 250GB Bloom filters from Bigtable to HDFS • 200 LOC
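Slide 45 mentions shipping 250GB of Bloom filters from Bigtable to HDFS. As a refresher on the data structure, here is a tiny Bloom filter sketch in plain Scala — the sizes, double-hashing scheme, and key format are illustrative assumptions, not Spotify's actual implementation:

```scala
import java.util.BitSet
import scala.util.hashing.MurmurHash3

// Minimal Bloom filter: k hash probes into an m-bit array.
// Can return false positives, never false negatives — a good fit
// for compactly answering "has this user already seen this track?".
class Bloom(m: Int, k: Int) {
  private val bits = new BitSet(m)

  // Double hashing: derive k probe positions from two murmur hashes.
  private def probes(item: String): Seq[Int] = {
    val h1 = MurmurHash3.stringHash(item, 0)
    val h2 = MurmurHash3.stringHash(item, h1)
    (0 until k).map(i => Math.floorMod(h1 + i * h2, m))
  }

  def add(item: String): Unit = probes(item).foreach(i => bits.set(i))
  def mightContain(item: String): Boolean = probes(item).forall(i => bits.get(i))
}
```

Bit arrays like this serialize compactly, which is presumably what makes shipping them between Bigtable and HDFS practical at this scale.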
