An overview of Hulu’s metrics platform Tristan Reid tristan.reid@hulu.com Prasan Samtani prasan.samtani@hulu.com
What we do • Streaming video service • > 5.5 million subscribers • > 20 million unique visitors/month • > 1 billion ads/month
It all begins with beacons • Living room device (Roku, Xbox, etc.) • Mobile device (Android, iPhone, etc.) • Web (hulu.com) • Beacon collection service
What’s in a beacon 80 2013-04-01 00:00:00 /v3/playback/start? bitrate=650 &cdn=Akamai &channel=Anime &client=Explorer &computerguid=EA8FA1000232B8F6986C3E0BE55E9333 &contentid=5003673 …
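A beacon like the one above is an HTTP request line with a timestamp in front, so it can be taken apart with standard URL parsing. The following is a minimal Python sketch, not Hulu's parser: the function name `parse_beacon`, the field layout, and the treatment of `/v3/` as a version prefix are assumptions, and the meaning of the leading `80` is not stated on the slide.

```python
from datetime import datetime
from urllib.parse import urlsplit, parse_qsl


def parse_beacon(line):
    """Split one beacon log line into leading field, timestamp, type and parameters."""
    # Assumed layout: <leading field> <date> <time> <request path and query string>
    leading, date, time, request = line.split(" ", 3)
    url = urlsplit(request)
    return {
        "leading_field": leading,  # meaning not stated on the slide
        "timestamp": datetime.strptime(f"{date} {time}", "%Y-%m-%d %H:%M:%S"),
        # "/v3/playback/start" -> "playback/start" (assumes a version prefix)
        "beacon_type": url.path.split("/", 2)[2],
        "params": dict(parse_qsl(url.query)),
    }


sample = ("80 2013-04-01 00:00:00 /v3/playback/start?"
          "bitrate=650&cdn=Akamai&channel=Anime&contentid=5003673")
beacon = parse_beacon(sample)
print(beacon["beacon_type"], beacon["params"]["cdn"])
```

All parameter values come back as strings; converting `bitrate` to a number would be a separate, schema-aware step.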
Reporting platform (RP2) Find Metrics & Dimensions Design and execute reports
The pipeline • Devices • Beacon collection service • LogCollector/Flume • HDFS • Monitoring (metstat) • MapReduce jobs/JobScheduler • Developers • Hive • Reporting (RP2) • Harpy – continuous aggregation • RDBMS • Business
Log Collection • Devices • Load balancer • Log Collection machines #1 … #11 • HDFS • Files bucketed by beacon type and partitioned by hour
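Bucketing by beacon type and partitioning by hour can be pictured as a function from a beacon to a directory. This is only an illustrative Python sketch: the `/beacons` root, the directory naming, and the `bucket_path` helper are hypothetical, since the talk does not show the actual HDFS layout.

```python
from datetime import datetime


def bucket_path(beacon_type, timestamp, root="/beacons"):
    """Return a directory for one beacon: bucketed by type, partitioned by hour."""
    # The root and the directory naming scheme are hypothetical
    bucket = beacon_type.replace("/", "_")
    return f"{root}/{bucket}/{timestamp:%Y/%m/%d/%H}"


print(bucket_path("playback/start", datetime(2013, 4, 1, 0, 15)))
```

With a layout of this kind, a job that needs one hour of one beacon type reads a single directory instead of scanning every log.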
If a program manipulates a large amount of data, it does so in a small number of ways - Alan Perlis
The BeaconSpec compiler Java MapReduce code that can run on the cluster Definitions of beacons and base-facts Beaconspec compiler
What does our language look like?
basefact playback_watched_uniques from playback/(position|end) {
    dimension harpyhour.id as hourid;
    dimension computerguid as computerguid;
    dimension userid as userid;
    required dimension video.id as video_id;
    required dimension contentPartner.id as content_partner_id;
    …
    dimension siteSessionId.chosen as site_session_id;
    dimension facebook.isfacebookconnected as is_facebook_connected;
    fact sum(watched.out) as watched;
}
FAQ: Why didn’t we just use Pig?
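The basefact definition above reads as a group-by over the listed dimensions with a sum over one fact. The Python sketch below illustrates that reading only; the real compiler emits Java MapReduce code, and the names `DIMENSIONS`, `REQUIRED`, and `aggregate`, as well as the rule that a beacon missing a `required` dimension is skipped, are assumptions rather than documented BeaconSpec semantics.

```python
from collections import defaultdict

# Hypothetical, simplified stand-in for a basefact: (beacon field, output name)
DIMENSIONS = [("computerguid", "computerguid"), ("userid", "userid"), ("video.id", "video_id")]
REQUIRED = {"video.id"}
FACT_FIELD, FACT_NAME = "watched.out", "watched"


def aggregate(beacons):
    """Group beacons by their dimension values and sum the fact within each group."""
    totals = defaultdict(int)
    for beacon in beacons:
        # Assumption: a beacon missing a required dimension is skipped
        if any(field not in beacon for field in REQUIRED):
            continue
        key = tuple(beacon.get(field) for field, _ in DIMENSIONS)
        totals[key] += int(beacon.get(FACT_FIELD, 0))
    names = [name for _, name in DIMENSIONS]
    return [dict(zip(names, key), **{FACT_NAME: total}) for key, total in totals.items()]


beacons = [
    {"computerguid": "A", "userid": "1", "video.id": "10", "watched.out": "30"},
    {"computerguid": "A", "userid": "1", "video.id": "10", "watched.out": "12"},
    {"computerguid": "B", "userid": "2", "watched.out": "5"},  # no video.id, skipped
]
rows = aggregate(beacons)
print(rows)
```

In a MapReduce setting the dimension tuple would play the role of the shuffle key and the sum would run in the reducer.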
The superior [program] cultivates itself so as to give rest to [programmers] - Confucius, the Way of the Superior Man
Scheduling jobs Outside world MapReduce job MapReduce job MapReduce job JobMonitor JobMonitor JobMonitor JobScheduler Interface JobScheduler Logmanager databases Checks databases for jobs that are ready to run and whether dependencies are met
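The readiness check described on this slide can be sketched as a filter over a dependency table. This is a hedged Python illustration, not the JobScheduler's code: the talk keeps job state in databases, while the sketch uses an in-memory dictionary, and `ready_jobs` and the job names other than `playback_watched_uniques` are hypothetical.

```python
def ready_jobs(jobs, completed):
    """Return names of jobs that have not run yet and whose dependencies are all complete."""
    return sorted(
        name
        for name, deps in jobs.items()
        if name not in completed and all(dep in completed for dep in deps)
    )


# Hypothetical job table mapping each job to the jobs it depends on
jobs = {
    "playback_watched_uniques": ["collect_playback_logs"],
    "collect_playback_logs": [],
    "daily_rollup": ["playback_watched_uniques"],
}
print(ready_jobs(jobs, completed=set()))
print(ready_jobs(jobs, completed={"collect_playback_logs"}))
```

Calling the function again after each job finishes is enough to walk the whole dependency graph in a valid order.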
JobScheduler technology • The actor model of concurrency • Communication through async messaging • Completely encapsulated state
Message passing Actor creation Central idea: Treat local objects as if they are distributed, as opposed to treating distributed objects as if they are local
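The three properties listed for the actor model (asynchronous messages, encapsulated state, and actors created on demand) can be shown with nothing more than a thread and a queue. The talk does not show the JobScheduler's code or name its actor library, so this Python sketch is a generic illustration; `CounterActor` and its message names are invented for the example.

```python
import queue
import threading


class CounterActor:
    """An actor: private state, a mailbox, and one thread that handles messages in order."""

    def __init__(self):
        self._mailbox = queue.Queue()
        self._count = 0  # only ever touched by the actor's own thread
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def send(self, message, reply_to=None):
        """Asynchronous send: enqueue the message and return immediately."""
        self._mailbox.put((message, reply_to))

    def _run(self):
        while True:
            message, reply_to = self._mailbox.get()
            if message == "increment":
                self._count += 1
            elif message == "get" and reply_to is not None:
                reply_to.put(self._count)  # replies are messages too
            elif message == "stop":
                break

    def join(self):
        self._thread.join()


actor = CounterActor()
for _ in range(3):
    actor.send("increment")
replies = queue.Queue()
actor.send("get", reply_to=replies)
result = replies.get(timeout=5)
actor.send("stop")
actor.join()
print(result)
```

Because callers can only reach the counter through `send`, the same calling code would still work if the actor lived in another process or on another machine, which is the point of treating local objects as if they were distributed.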
Harpy – continuous aggregations Harpy Metadata Queue Processor Hive DataSync Publishing HDFS NFS Holding Sweeper Agg Scheduler HoldingDB Output DBs
RP2 • Reporting Portal for pulling Metrics + Dimensions • Quick ‘Demo’
Let’s reexamine the pipeline • Devices • Beacon collection service • LogCollector/Flume • HDFS • Monitoring (metstat) • MapReduce jobs/JobScheduler • Developers • Hive • Reporting (RP2) • Harpy – continuous aggregation • RDBMS • Business
Metstat • Python Django app • Tasks on Celery + RabbitMQ • jQuery • Tracks status, status changes and statistics • Gets data directly from various sources (databases, HDFS)
FAQ: Why didn’t we just use Pig? • Dataflow language – runs on Hadoop • Pig philosophy • (Taken from the Apache website) • Pigs eat anything • Pigs live anywhere • Pigs are domestic animals • Pigs fly Beaconspec
REGISTER ./tutorial.jar;
raw = LOAD 'excite.log' USING PigStorage('\t') AS (user, time, query);
clean1 = FILTER raw BY org.apache.pig.tutorial.NonURLDetector(query);
clean2 = FOREACH clean1 GENERATE user, time, org.apache.pig.tutorial.ToLower(query) as query;
Beware of the Turing tar-pit where everything is possible but nothing of interest is easy - Alan Perlis
Beaconspec
FAQ: What is open sourced? • Slickint – database interface generation for Scala • github.com/zenbowman/slickint • Local filesystem caching for Hadoop • github.com/ZenBowman/luna