
How Conviva used Spark to Speed Up Video Analytics by 25x
Dilip Antony Joseph (@DilipAntony)


Presentation Transcript


  1. How Conviva used Spark to Speed Up Video Analytics by 25x (Dilip Antony Joseph, @DilipAntony)

  2. Conviva monitors and optimizes online video for premium content providers. We see 10s of millions of streams every day.

  3. Conviva data processing architecture (diagram): Video Player → Live data processing → Monitoring Dashboard; Video Player → Hadoop (historical data) → Reports; Spark → Ad-hoc analysis

  4. Group By queries dominate our workload

    SELECT videoName, COUNT(1)
    FROM summaries
    WHERE date='2011_12_12' AND customer='XYZ'
    GROUP BY videoName;

  • 10s of metrics, 10s of group bys
  • Hive scans the data again and again from HDFS → Slow (sketched below)
  • The Conviva GeoReport took ~24 hours using Hive
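
  To make the cost concrete, here is a minimal sketch of the workload pattern, not Conviva's actual code: runHiveQuery and the extra dimensions are hypothetical. Each metric/dimension pair is its own Hive query, and each query rescans the full day of data from HDFS.

    // Hypothetical helper standing in for whatever submits a HiveQL query.
    def runHiveQuery(query: String): Unit = ???

    // Illustrative dimensions; the deck only mentions videoName explicitly.
    val dimensions = Seq("videoName", "city", "isp")

    for (dim <- dimensions) {
      // Every iteration is a separate Hive job that rescans the day from HDFS.
      runHiveQuery(
        s"SELECT $dim, COUNT(1) FROM summaries " +
        s"WHERE date='2011_12_12' AND customer='XYZ' GROUP BY $dim")
    }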

  5. Group By queries can be easily written in Spark

    // Read the day's session summaries from a Hadoop SequenceFile on HDFS.
    // The summary is the key; the value is an unused NullWritable.
    val sessions = sparkContext.sequenceFile[SessionSummary, NullWritable](
        pathToSessionSummaryOnHdfs, classOf[SessionSummary], classOf[NullWritable])
      .flatMap { case (summary, _) => summary.fieldsOfInterest }

    // Keep only the sessions for the desired day, and cache them in RAM.
    val cachedSessions = sessions.filter(
        whereConditionToFilterSessionsForTheDesiredDay).cache

    // Word-count-style group by: emit (videoName, 1) pairs, sum per key.
    val mapFn: SessionSummary => (String, Long) = { s => (s.videoName, 1L) }
    val reduceFn: (Long, Long) => Long = { (a, b) => a + b }

    val results = cachedSessions.map(mapFn).reduceByKey(reduceFn).collectAsMap
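
  A note on the design: the .cache call above is what enables the speedup claimed on the next slide. It marks the filtered RDD to be kept in memory once it is first computed, so subsequent jobs over the same sessions read from RAM instead of repeating the HDFS scan and deserialization that Hive pays for every query.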

  6. Spark is blazing fast!
  • Spark keeps the sessions of interest in RAM
  • Repeated group by queries are very fast (see the sketch below)
  • The Spark-based GeoReport runs in 45 minutes (compared to 24 hours with Hive)
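
  A minimal sketch of such a repeated query, reusing cachedSessions from slide 5. The city field is an assumption; the deck does not list SessionSummary's fields.

    // Runs against the in-memory copy of the filtered sessions: no HDFS scan.
    val viewsByCity = cachedSessions
      .map(s => (s.city, 1L))   // `city` is an assumed field on the summary
      .reduceByKey(_ + _)
      .collectAsMap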

  7. Spark queries require more code, but are not too hard to write
  • Writing queries in Scala: there is a learning curve
  • The type-safety offered by Scala is a great boon
  • Code completion via the Eclipse Scala plugin
  • Complex queries are easier to write in Scala than in Hive, which needs cascading IF()s (see the sketch below)
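
  A hedged illustration of the cascading-IF() point; the bufferingRatio field and the thresholds are invented for the example.

    // HiveQL forces nested conditionals:
    //   IF(bufferingRatio < 0.01, 'good', IF(bufferingRatio < 0.05, 'fair', 'poor'))
    // Scala expresses the same bucketing as a type-checked match expression:
    def qualityBucket(bufferingRatio: Double): String = bufferingRatio match {
      case r if r < 0.01 => "good"
      case r if r < 0.05 => "fair"
      case _             => "poor"
    }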

  8. Challenges in using Spark
  • Learning Scala
  • Always on the bleeding edge: getting dependencies right (see the sketch below)
  • More tools required
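
  As an example of the dependency churn: a build.sbt from this era might have pinned Spark roughly as below. The coordinates and versions are assumptions for illustration; pre-Apache Spark releases changed group IDs and supported Scala versions from release to release.

    // Illustrative only: exact artifact coordinates varied release to release.
    scalaVersion := "2.9.2"
    libraryDependencies += "org.spark-project" % "spark-core_2.9.2" % "0.6.0"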

  9. Spark @ Conviva today
  • Using Spark for about 1 year
  • 30% of our reports use Spark; the rest use Hive
  • Analytics portal with canned Spark/Hive jobs
  • More projects in progress:
    • Anomaly detection
    • Interactive console to debug video quality issues
    • Near real-time analysis and decision making using Spark
  • Blog entry: http://www.conviva.com/blog/engineering/using-spark-and-hive-to-process-bigdata-at-conviva

  10. We are hiring! jobs@conviva.com
