How Conviva used Spark to Speed Up Video Analytics by 25x Dilip Antony Joseph (@DilipAntony)
Conviva monitors and optimizes online video for premium content providers. We see 10s of millions of streams every day.
Conviva data processing architecture
[Architecture diagram: Video Player data feeds live data processing for the Monitoring Dashboard; Hadoop holds historical data for Reports; Spark is used for ad-hoc analysis]
Group By queries dominate our workload
    SELECT videoName, COUNT(1)
    FROM summaries
    WHERE date='2011_12_12' AND customer='XYZ'
    GROUP BY videoName;
• 10s of metrics, 10s of group bys • Hive scans data again and again from HDFS ⇒ Slow • Conviva GeoReport took ~24 hours using Hive
Group By queries can be easily written in Spark
    // Load session summaries from HDFS and keep only the fields of interest
    val sessions = sparkContext.sequenceFile[SessionSummary, NullWritable](
        pathToSessionSummaryOnHdfs,
        classOf[SessionSummary], classOf[NullWritable])
      .flatMap { case (summary, _) => summary.fieldsOfInterest }

    // Filter down to the desired day and cache the sessions in memory
    val cachedSessions = sessions.filter(
        whereConditionToFilterSessionsForTheDesiredDay).cache

    val mapFn : SessionSummary => (String, Long) = { s => (s.videoName, 1) }
    val reduceFn : (Long, Long) => Long = { (a, b) => a + b }

    val results = cachedSessions.map(mapFn).reduceByKey(reduceFn).collectAsMap
Spark is blazing fast! • Spark keeps sessions of interest in RAM • Repeated group by queries are very fast • Spark-based GeoReport runs in 45 minutes (compared to 24 hours with Hive)
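To illustrate why repeated group bys are cheap once the data is cached, here is a minimal sketch reusing the cachedSessions RDD from the earlier slide; the field names customer and bufferingTime are assumptions for illustration, not part of the original code. Each additional query like this runs against the in-memory data instead of rescanning HDFS.

    // A second group by over the same cached RDD: no HDFS rescan needed.
    // customer and bufferingTime are hypothetical SessionSummary fields.
    val bufferingByCustomer = cachedSessions
      .map { s => (s.customer, s.bufferingTime) }
      .reduceByKey(_ + _)
      .collectAsMap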
Spark queries require more code, but are not too hard to write • Writing queries in Scala – there is a learning curve • Type safety offered by Scala is a great boon • Code completion via the Eclipse Scala plugin • Complex queries are easier to write in Scala than in Hive (see the sketch below) • Cascading IF()s in Hive
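As a hedged sketch of the "cascading IF()s" point: a derived column that would be written as nested IF(cond, x, IF(...)) expressions in HiveQL becomes an ordinary Scala expression. The qualityBucket helper and the bufferingRatio field below are hypothetical names, used only to show the shape of such a query against the cached sessions.

    // Bucketing a session by an assumed bufferingRatio field.
    val qualityBucket: SessionSummary => String = { s =>
      if (s.bufferingRatio < 0.01) "good"
      else if (s.bufferingRatio < 0.05) "ok"
      else "bad"
    }

    // Count sessions per bucket, reusing the cached RDD from earlier.
    val sessionsPerBucket =
      cachedSessions.map { s => (qualityBucket(s), 1L) }
        .reduceByKey(_ + _)
        .collectAsMap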
Challenges in using Spark • Learning Scala • Always on the bleeding edge – getting dependencies right • More tools required
Spark @ Conviva today • Using Spark for about 1 year • 30% of our reports use Spark, rest use Hive • Analytics portal with canned Spark/Hive jobs • More projects in progress • Anomaly detection • Interactive console to debug video quality issues • Near real-time analysis and decision making using Spark • Blog Entry: http://www.conviva.com/blog/engineering/using-spark-and-hive-to-process-bigdata-at-conviva
We are Hiring! jobs@conviva.com