Learn how Spotify handles streaming data using the cloud, including event delivery, reliable queues, ETL, and more.
Handling Streaming Data in Spotify Using the Cloud • Igor Maravić <igor@spotify.com>, Software Engineer • Neville Li <neville@spotify.com>, Software Engineer
Current event delivery system [Diagram labels, spanning any data centre and the Hadoop data centre: Client, Gateway, Service Discovery, Liveness Monitor, Checkpoint Monitor, Syslog, Syslog Producer, ACK Brokers, Brokers, Syslog Consumers, Groupers, Realtime Brokers, ETL job, Hadoop]
Complex [same diagram]
Stateless [same diagram]
Redesigning event delivery [Diagram labels, spanning any data centre and the Hadoop data centre: Client, Gateway, Syslog, File Tailer, Event Delivery Service, Reliable Persistent Queue, ETL, Hadoop]
Same API [same diagram]
Dedicated event streams [same diagram]
Persistence [same diagram]
Keep it simple [same diagram]
Event delivery with Kafka 0.8 [Diagram labels, spanning any data centre and the Hadoop data centre: Client, Gateway, Syslog, File Tailer, Event Delivery Service, Brokers, Mirror Makers, Brokers, Camus (ETL), Hadoop]
2M QPS published to Pub/Sub [Chart: publish rate, y-axis ticks 0/s, 500k/s, 1M/s, 1.5M/s, 2M/s]
Event delivery with Cloud Pub/Sub [Diagram labels, spanning any data centre and the Hadoop data centre: Client, Gateway, Syslog, File Tailer, Event Delivery Service, Cloud Pub/Sub, ETL, Cloud Storage, Hadoop]
Event-time-based hourly buckets [Diagram: hourly buckets 2016-03-21 23H, 2016-03-22 00H, 2016-03-22 01H, 2016-03-22 02H, 2016-03-22 03H, 2016-03-22 04H]
Incremental bucket fill [same buckets]
Bucket completeness [same buckets]
Late data handling [same buckets]
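The bucketing idea on these slides can be sketched in plain Scala. This is only an illustration under stated assumptions, not the deck's implementation: the `Event` shape, the `HourlyBuckets` object, and the rule that an event is "late" once its bucket has been marked complete are all hypothetical. Only the hourly, event-time-based bucket labels come from the slides.

```scala
import java.time.{Instant, ZoneOffset}
import java.time.format.DateTimeFormatter

object HourlyBuckets {
  // Hypothetical event shape: only the event-time timestamp matters here.
  final case class Event(id: String, eventTimeMillis: Long)

  // Label format mirrors the slide labels, e.g. "2016-03-21 23H" (UTC assumed).
  private val fmt =
    DateTimeFormatter.ofPattern("yyyy-MM-dd HH'H'").withZone(ZoneOffset.UTC)

  // The bucket is derived from when the event happened, not when it arrived.
  def bucketOf(e: Event): String =
    fmt.format(Instant.ofEpochMilli(e.eventTimeMillis))

  // Incremental fill: each incoming batch is appended to the buckets it maps to.
  def fill(buckets: Map[String, Vector[Event]],
           batch: Seq[Event]): Map[String, Vector[Event]] =
    batch.foldLeft(buckets) { (acc, e) =>
      val key = bucketOf(e)
      acc.updated(key, acc.getOrElse(key, Vector.empty) :+ e)
    }

  // Assumption: an event is late if its bucket was already marked complete.
  // Returns (onTime, late).
  def partitionLate(complete: Set[String],
                    batch: Seq[Event]): (Seq[Event], Seq[Event]) =
    batch.partition(e => !complete.contains(bucketOf(e)))
}
```

Because the key is computed from event time, an event that arrives hours after it was produced still maps to its original hour, which is why completeness and late-data handling need separate treatment.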
ETL as a set of micro-services [Diagram labels: Cloud Pub/Sub, Consumer, Cloud Storage, Completionist, Deduper, Cloud Storage, Hadoop (Hadoop data centre)]
Consumer [same diagram]
Completionist [same diagram]
Deduper [same diagram]
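The slides do not say how the Deduper works. As a rough sketch of what a deduplication step could look like, assuming every event carries a unique identifier (the `Event` shape and the `Dedup` object below are hypothetical), duplicates can be dropped by keeping the first occurrence of each id. Cloud Pub/Sub delivers messages at least once, so downstream consumers generally have to tolerate repeats.

```scala
import scala.collection.mutable

object Dedup {
  // Hypothetical event shape; the slides do not specify one.
  final case class Event(id: String, payload: String)

  // Keep the first occurrence of each id and preserve arrival order.
  def dedupe(events: Seq[Event]): Vector[Event] = {
    val seen = mutable.HashSet.empty[String]
    // HashSet.add returns false when the id was already present.
    events.iterator.filter(e => seen.add(e.id)).toVector
  }
}
```

An in-memory set only works when one bucket fits on a single machine; a production job would more likely group by id in a distributed pipeline.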
Where are we right now? [Same ETL micro-services diagram: Cloud Pub/Sub, Consumer, Cloud Storage, Completionist, Deduper, Hadoop]
Origin Story • Scalding and Spark • ML, recommendations, analytics • 50+ users, 400+ unique jobs • Growing rapidly
Moving to Google Cloud • Early 2015: Dataflow Scala hack project
Why not Scalding on GCE • Pros • Community: Twitter, eBay, Etsy, Stripe, LinkedIn, … • Stable and proven
Why not Scalding on GCE • Cons • Hadoop cluster operations • Multi-tenancy: resource contention and utilization • No streaming mode (Summingbird?)
Why not Spark on GCE • Pros • Batch, streaming, interactive and SQL • MLlib, GraphX • Scala, Python, and R support • Zeppelin, spark-notebook, Hue
Why not Spark on GCE • Cons • Hard to tune and scale • Cluster lifecycle management
Why Dataflow with Scala • Dataflow • Hosted solution, no operations • Ecosystem: GCS, BigQuery, PubSub, Bigtable, … • Unified batch and streaming model
Why Dataflow with Scala • Scala • High-level DSL: easy transition for developers • Reusable and composable code via FP • Numerical libraries: Breeze, Algebird
Scio • Ecclesiastical Latin, IPA: /ˈʃi.o/, [ˈʃiː.o], [ˈʃi.i̯o] • Verb: I can, know, understand, have knowledge.
Core API similar to spark-core • Some ideas from Scalding • github.com/spotify/scio
WordCount • Almost identical to Spark version
    import com.spotify.scio._

    val sc = ScioContext()
    sc.textFile("shakespeare.txt")
      .flatMap(_.split("[^a-zA-Z']+").filter(_.nonEmpty))
      .countByValue()
      .saveAsTextFile("wordcount.txt")
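To make the result of the pipeline above concrete, here is a minimal sketch of the same tokenise-and-count logic on plain Scala collections. It does not use Scio or Dataflow; the `LocalWordCount` object is a hypothetical stand-in, and `groupBy` plays the role that `countByValue` plays on the slide.

```scala
object LocalWordCount {
  // Same tokenisation as the slide: split on runs of non-letter characters
  // (apostrophes are kept) and drop empty tokens.
  def wordCount(lines: Seq[String]): Map[String, Long] =
    lines
      .flatMap(_.split("[^a-zA-Z']+").filter(_.nonEmpty))
      .groupBy(identity)
      .map { case (word, occurrences) => (word, occurrences.size.toLong) }
}
```

For example, `LocalWordCount.wordCount(Seq("to be or not to be"))` counts `to` and `be` twice each and `or` and `not` once each. Counting is case-sensitive, as it is in the slide's version.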
PageRank
    import com.spotify.scio.values.SCollection

    def pageRank(in: SCollection[(String, String)]) = {
      val links = in.groupByKey()
      var ranks = links.mapValues(_ => 1.0)
      for (i <- 1 to 10) {
        val contribs = links.join(ranks).values
          .flatMap { case (urls, rank) =>
            val size = urls.size
            urls.map((_, rank / size))
          }
        ranks = contribs.sumByKey.mapValues((1 - 0.85) + 0.85 * _)
      }
      ranks
    }
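The iteration in the PageRank slide can be traced on a small in-memory graph. The sketch below mirrors its steps (group links by source, join with ranks, spread each rank evenly over the outgoing links, then apply the 0.85 damping factor) using only the Scala standard library. `LocalPageRank` and the `iterations` parameter are hypothetical names for illustration, not part of Scio.

```scala
object LocalPageRank {
  def pageRank(edges: Seq[(String, String)],
               iterations: Int = 10): Map[String, Double] = {
    // Equivalent of groupByKey: source page -> outgoing link targets.
    val links: Map[String, Seq[String]] =
      edges.groupBy(_._1).map { case (src, es) => (src, es.map(_._2)) }

    // Every page with outgoing links starts at rank 1.0.
    var ranks: Map[String, Double] = links.map { case (src, _) => (src, 1.0) }

    for (_ <- 1 to iterations) {
      // Inner join of links and ranks; each page splits its rank evenly.
      val contribs: Seq[(String, Double)] = links.toSeq.flatMap {
        case (src, urls) =>
          ranks.get(src).toSeq.flatMap { rank =>
            urls.map(url => (url, rank / urls.size))
          }
      }
      // Equivalent of sumByKey followed by the damping step.
      ranks = contribs.groupBy(_._1).map { case (url, cs) =>
        (url, (1 - 0.85) + 0.85 * cs.map(_._2).sum)
      }
    }
    ranks
  }
}
```

As in the slide's version, the join is an inner join, so a page with no incoming links drops out of `ranks` after the first iteration and stops contributing.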
Spotify Running • 60 million tracks • 30m users * 10 tempo buckets * 25 tracks • Audio: tempo, energy, time signature ... • Metadata: genres, categories, … • Latent vectors from collaborative filtering
Personalized new releases • Pre-computed weekly on Hadoop (on-premise cluster) • 100GB recommendations from HDFS to Bigtable in US+EU • 250GB Bloom filters from Bigtable to HDFS • 200 LOC