Handling Streaming Data in Spotify Using the Cloud

Igor Maravić <igor@spotify.com>, Software Engineer
Neville Li <neville@spotify.com>, Software Engineer

Current Event Delivery System

Current event delivery system
[Diagram: Client, Gateway, Syslog, Syslog Producer, Brokers, Groupers, Realtime Brokers, Syslog Consumers, ETL job, Hadoop, with Service Discovery, Liveness Monitor, Checkpoint Monitor, and ACK; spans any data centre and the Hadoop data centre]

Complex
[Same diagram]

Stateless
[Same diagram]

Delivered data growth

Redesigning Event Delivery

Redesigning event delivery
[Diagram: Client, Gateway, Syslog, File Tailer, Event Delivery Service, Reliable Persistent Queue, ETL, Hadoop; spans any data centre and the Hadoop data centre]

Same API
[Same diagram]

Dedicated event streams
[Same diagram]

Persistence
[Same diagram]

Keep it simple
[Same diagram]

Choosing Reliable Persistent Queue

Kafka 0.8

Event delivery with Kafka 0.8
[Diagram: Client, Gateway, Syslog, File Tailer, Event Delivery Service, Brokers, Mirror Makers, Brokers, Camus (ETL), Hadoop; spans any data centre and the Hadoop data centre]

Cloud Pub/Sub

2M QPS published to Pub/Sub
[Chart: publish rate, y-axis from 0/s to 2M/s]

Event delivery with Cloud Pub/Sub
[Diagram: Client, Gateway, Syslog, File Tailer, Event Delivery Service, Cloud Pub/Sub, ETL, Cloud Storage, Hadoop; spans any data centre and the Hadoop data centre]

ETL

Event time based hourly buckets
[Timeline of hourly buckets: 2016-03-21 23H, 2016-03-22 00H, 2016-03-22 01H, 2016-03-22 02H, 2016-03-22 03H, 2016-03-22 04H]

Incremental bucket fill
[Same timeline]

Bucket completeness
[Same timeline]

Late data handling
[Same timeline]
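
The bucketing idea on these slides can be sketched in plain Scala. This is a minimal illustration, not the ETL job from the talk: the `Event` case class and the `HourlyBuckets` helpers are hypothetical names, UTC is assumed for bucket boundaries, and treating an event as late when its bucket is already in a set of completed buckets is an assumption made for the example.

```scala
import java.time.{Instant, ZoneOffset}
import java.time.format.DateTimeFormatter

// Hypothetical event shape: an id plus the time the event happened,
// which may differ from the time it arrived.
final case class Event(id: String, eventTime: Instant)

object HourlyBuckets {
  // Labels follow the slide style, e.g. "2016-03-22 02H" (UTC assumed).
  private val fmt =
    DateTimeFormatter.ofPattern("yyyy-MM-dd HH'H'").withZone(ZoneOffset.UTC)

  // The bucket is derived from event time, not arrival time.
  def bucketOf(e: Event): String = fmt.format(e.eventTime)

  // Group a batch of events into their hourly buckets.
  def fill(events: Seq[Event]): Map[String, Seq[Event]] =
    events.groupBy(bucketOf)

  // Assumption: an event is late if its bucket was already marked complete.
  def isLate(e: Event, completed: Set[String]): Boolean =
    completed.contains(bucketOf(e))
}
```

Calling `HourlyBuckets.fill` on successive batches and merging the results per label would model the incremental fill; what to do with an event flagged by `isLate` is left open here because the slides do not spell it out.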

Experimentation with Dataflow

ETL as a set of micro-services
[Diagram: Cloud Pub/Sub, Consumer, Cloud Storage, Completionist, Deduper, Cloud Storage, Hadoop in the Hadoop data centre]

Consumer
[Same diagram]

Completionist
[Same diagram]

Deduper
[Same diagram]
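
The slides name the Deduper stage without describing how it works. Since Cloud Pub/Sub delivers messages at least once by default, duplicates are possible, and one common way to remove them is sketched below. This is an assumption for illustration only: the `Record` type, its `id` field, and the `Dedupe` object are hypothetical and not taken from the talk.

```scala
import scala.collection.mutable

// Hypothetical record: `id` is assumed to be a unique message identifier.
final case class Record(id: String, payload: String)

object Dedupe {
  // Keep the first occurrence of each id, preserving input order.
  def dedupe(records: Seq[Record]): Seq[Record] = {
    val seen = mutable.HashSet.empty[String]
    // HashSet.add returns true only when the id was not already present.
    records.filter(r => seen.add(r.id))
  }
}
```

Running this per hourly bucket keeps the set of seen ids bounded by the size of one bucket.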

Where are we right now?
[Diagram: same ETL micro-services pipeline as above]

Scio
A Scala API for Google Cloud Dataflow

Origin Story
• Scalding and Spark
• ML, recommendations, analytics
• 50+ users, 400+ unique jobs
• Growing rapidly

Moving to Google Cloud
Early 2015: Dataflow Scala hack project

Why not Scalding on GCE
• Pros
  • Community: Twitter, eBay, Etsy, Stripe, LinkedIn, …
  • Stable and proven

Why not Scalding on GCE
• Cons
  • Hadoop cluster operations
  • Multi-tenancy: resource contention and utilization
  • No streaming mode (Summingbird?)

Why not Spark on GCE
• Pros
  • Batch, streaming, interactive and SQL
  • MLlib, GraphX
  • Scala, Python, and R support
  • Zeppelin, spark-notebook, Hue

Why not Spark on GCE
• Cons
  • Hard to tune and scale
  • Cluster lifecycle management

Why Dataflow with Scala
• Dataflow
  • Hosted solution, no operations
  • Ecosystem: GCS, BigQuery, PubSub, Bigtable, …
  • Unified batch and streaming model

Why Dataflow with Scala
• Scala
  • High level DSL: easy transition for developers
  • Reusable and composable code via FP
  • Numerical libraries: Breeze, Algebird

Scio
Ecclesiastical Latin IPA: /ˈʃi.o/, [ˈʃiː.o], [ˈʃi.i̯o]
Verb: I can, know, understand, have knowledge.

Core API similar to spark-core
Some ideas from scalding
github.com/spotify/scio

WordCount
• Almost identical to Spark version

    val sc = ScioContext()
    sc.textFile("shakespeare.txt")
      .flatMap(_.split("[^a-zA-Z']+").filter(_.nonEmpty))
      .countByValue()
      .saveAsTextFile("wordcount.txt")

PageRank

    def pageRank(in: SCollection[(String, String)]) = {
      val links = in.groupByKey()
      var ranks = links.mapValues(_ => 1.0)
      for (i <- 1 to 10) {
        val contribs = links.join(ranks).values
          .flatMap { case (urls, rank) =>
            val size = urls.size
            urls.map((_, rank / size))
          }
        ranks = contribs.sumByKey.mapValues((1 - 0.85) + 0.85 * _)
      }
      ranks
    }
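
To see what each iteration of the loop above computes, the same update can be run on in-memory Scala collections. This is a sketch on plain `Map`s and `Seq`s rather than Scio's `SCollection`; the `LocalPageRank` object and the `iterations` parameter are introduced here for illustration and are not part of Scio.

```scala
object LocalPageRank {
  // edges: (source, destination) pairs, mirroring the (String, String) input above.
  def pageRank(edges: Seq[(String, String)], iterations: Int = 10): Map[String, Double] = {
    // Equivalent of groupByKey: source page -> pages it links to.
    val links: Map[String, Seq[String]] =
      edges.groupBy(_._1).map { case (src, es) => src -> es.map(_._2) }
    // Every page with outgoing links starts at rank 1.0.
    var ranks: Map[String, Double] = links.map { case (src, _) => src -> 1.0 }
    for (_ <- 1 to iterations) {
      // Inner join of links and ranks: each page splits its rank evenly
      // among the pages it links to.
      val contribs: Seq[(String, Double)] = links.toSeq.flatMap { case (src, urls) =>
        ranks.get(src).toSeq.flatMap(rank => urls.map(u => (u, rank / urls.size)))
      }
      // Sum contributions per page, then apply the 0.85 damping factor.
      ranks = contribs.groupBy(_._1).map { case (url, cs) =>
        url -> ((1 - 0.85) + 0.85 * cs.map(_._2).sum)
      }
    }
    ranks
  }
}
```

For example, `LocalPageRank.pageRank(Seq("a" -> "b", "b" -> "a"))` keeps both pages at a rank of about 1.0, since each page passes its whole rank to the other on every iteration.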

Spotify Running
• 60 million tracks
• 30m users * 10 tempo buckets * 25 tracks
• Audio: tempo, energy, time signature ...
• Metadata: genres, categories, …
• Latent vectors from collaborative filtering

Personalized new releases
• Pre-computed weekly on Hadoop (on-premise cluster)
• 100GB recommendations from HDFS to Bigtable in US+EU
• 250GB Bloom filters from Bigtable to HDFS
• 200 LOC
