Building Web Analytics on Hadoop at CBS Interactive Michael Sun michael.sun@cbsinteractive.com Big Data Workshop, Boston 03/10/2012
Brands and Websites of CBS Interactive (samples): ENTERTAINMENT • GAMES & MOVIES • SPORTS • TECH, BIZ & NEWS • MUSIC
CBSi Scale • Top 20 global web property¹ • 235M worldwide monthly unique users • Hadoop cluster size: • Current workers: 40 nodes (260 TB) • This month: add 24 nodes, 800 TB total • Next quarter: ~80 nodes, ~1 PB • DW peak processing: >500M events/day globally, doubling next quarter (ad logs) ¹ Source: comScore, March 2011
Web Analytics Processing • Collect web logs for web metrics analysis • Web logs track clicks, page views, downloads, streaming video events, ad events, etc. • Provide internal metrics for website monitoring • A/B testing • Biller apps, external reporting • Ad event tracking to support sales • Provide data services • Support marketing by providing data for data mining • User-centric datastore (stay tuned) • Optimize user experience
Sample raw web log line (Apache access log, tracking-pixel request):

2105595680218152960 125.83.8.253 - - [07/Mar/2012:16:00:00 +0000] GET /clear/c.gif?ts=1331136009989&sid=115&ld=www.xcar.com.cn&ldc=a4887174-530c-40df-8002-e06c199ba81a&xrq=fid%3D523%26page%3D10&brflv=10.3.183&brwinsz=1680x840&brscrsz=1680x1050&brlang=zh-CN&tcset=utf8&im=dwjs&xref=http%3A%2F%2Fwww.xcar.com.cn%2Fbbs%2Fforumdisplay.php&srcurl=http%3A%2F%2Fwww.xcar.com.cn%2Fbbs%2Fforumdisplay.php%3Ffid%3D523%26page%3D11&title=%E5%B8%95%E6%9D%B0%E7%BD%97%E8%AE%BA%E5%9D%9B_%E5%B8%95%E6%9D%B0%E7%BD%97%E7%A4%BE%E5%8C%BA_%E5%B8%95%E6%9D%B0%E7%BD%97%E8%BD%A6%E5%8F%8B%E4%BC%9A_PAJERO%E8%AE%BA%E5%9D%9B_XCAR%20%E7%88%B1%E5%8D%A1%E6%B1%BD%E8%BD%A6%E4%BF%B1%E4%B9%90%E9%83%A8 HTTP/1.1 200 42 clgf=Cg+5E02cT/eWAAAAo0Y http://www.xcar.com.cn/bbs/forumdisplay.php?fid=523&page=11 Mozilla/5.0 (Windows NT 5.1) AppleWebKit/535.1 (KHTML, like Gecko) Chrome/14.0.802.30 Safari/535.1 SE 2.X MetaSr 1.0 -

Corresponding schema definition in the Python ETL framework:

schemas.append(Schema((  # schemas[0]
    SchemaField('web_event_id', 'int', nullable=False, signed=True, bits=64),
    SchemaField('ip_address', 'string', nullable=False, maxlen=15, io_encoding='ascii'),
    SchemaField('empty1', 'string', nullable=False, maxlen=5, io_encoding='ascii'),
    SchemaField('empty2', 'string', nullable=True, maxlen=5, io_encoding='ascii'),
    SchemaField('req_date', 'string', nullable=True, maxlen=30, io_encoding='ascii'),
    SchemaField('request', 'string', nullable=True, maxlen=2000, on_range_error='truncate', io_encoding='ascii'),
    SchemaField('http_status', 'int', nullable=True, signed=True),
    SchemaField('bytes_sent', 'int', nullable=True, signed=True),
    SchemaField('cookie', 'string', nullable=True, maxlen=100, on_range_error='truncate', io_encoding='utf-8'),
    SchemaField('referrer', 'string', nullable=True, maxlen=1000, on_range_error='truncate', io_encoding='utf-8'),
    SchemaField('user_agent', 'string', nullable=True, maxlen=2000, on_range_error='truncate', io_encoding='utf-8'),
    SchemaField('is_clear_gif_mask', 'int', nullable=False, on_null='default', on_type_error='default', signed=True, bits=2),
)))
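For illustration only, a minimal sketch of the field semantics the schema above encodes (nullability, max length with truncate-on-overflow). This is plain Python approximating the behavior, not the actual Lumberjack API:

# Hypothetical illustration of SchemaField semantics; not the actual
# Lumberjack implementation.
def apply_field(value, nullable=True, maxlen=None, on_range_error='raise'):
    """Validate/coerce one parsed log field as the schema describes."""
    if value is None or value == '':
        if not nullable:
            raise ValueError('null in non-nullable field')
        return None
    if maxlen is not None and len(value) > maxlen:
        if on_range_error == 'truncate':
            return value[:maxlen]  # e.g. 'request' is capped at 2000 chars
        raise ValueError('value exceeds maxlen')
    return value

# Example: the referrer field allows nulls and truncates beyond 1000 chars.
referrer = apply_field('http://www.xcar.com.cn/bbs/forumdisplay.php?fid=523&page=11',
                       nullable=True, maxlen=1000, on_range_error='truncate')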
Modernize the Platform • Web log processing on a proprietary platform had hit its limits • The code base was 10 years old • The vendor no longer supported the version we used • Not fault-tolerant • Upgrading to the newer version was not cost-effective • Data volume keeps increasing • 300+ web sites • Video tracking growing the fastest • Need to support new business initiatives • Use open source systems as much as possible
Hadoop to the Rescue / Research • Open source: scalable data processing framework based on MapReduce • Processes petabytes of data using the Hadoop Distributed File System (HDFS) • High throughput • Fault-tolerant • Distributed computing model • Based on a functional programming model • MapReduce (M|S|R) • Execution engine • Used as a cluster for ETL • Collect data (distributed harvester) • Analyze data (M/R, streaming + scripting + R, Pig/Hive; see the streaming sketch below) • Archive data (distributed archive)
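As a flavor of the "streaming + scripting" approach, below is a minimal Hadoop Streaming job in Python that counts page views per site. The stdin/stdout contract is exactly what Hadoop Streaming provides; the tab-delimited field layout (site id in column 3) and the paths in the comment are hypothetical:

#!/usr/bin/env python
# pageviews.py -- minimal Hadoop Streaming job: page views per site.
# Run the same file as mapper and reducer (paths hypothetical):
#   hadoop jar hadoop-streaming.jar \
#     -input /weblogs/2012/03/07 -output /out/pageviews \
#     -mapper 'pageviews.py map' -reducer 'pageviews.py reduce'
import sys

def mapper():
    for line in sys.stdin:
        fields = line.rstrip('\n').split('\t')
        if len(fields) > 3:
            print('%s\t1' % fields[3])  # emit (site_id, 1)

def reducer():
    # Hadoop sorts mapper output by key, so equal keys arrive contiguously.
    current, count = None, 0
    for line in sys.stdin:
        key, value = line.rstrip('\n').split('\t', 1)
        if key != current:
            if current is not None:
                print('%s\t%d' % (current, count))
            current, count = key, 0
        count += int(value)
    if current is not None:
        print('%s\t%d' % (current, count))

if __name__ == '__main__':
    if len(sys.argv) > 1 and sys.argv[1] == 'reduce':
        reducer()
    else:
        mapper()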
The Plan • Build web log collection (codename Fido) • Apache web logs piped to cronolog • Hourly M/R collector job to (one collection step is sketched below): • gzip hourly log files & checksum • scp from web servers to Hadoop datanodes • Put on HDFS • Build a Python ETL framework (codename Lumberjack) • Based on stdin/stdout streaming, one process/one thread • Can run stand-alone or on Hadoop • Pipeline • Filter • Schema • Build web log processing with Lumberjack • Parse • Sessionize • Lookup • Format data / load to DB
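A minimal sketch of one hourly Fido collection step. Only the gzip/checksum/scp/HDFS-put sequence comes from the plan above; the hostnames, paths, and helper name are hypothetical:

#!/usr/bin/env python
# Hypothetical sketch of one Fido collection step, run on a Hadoop datanode.
import hashlib
import subprocess

def collect_hourly_log(web_host, log_path, hdfs_dir):
    gz_path = log_path + '.gz'
    # 1. Compress the rotated hourly log (written by cronolog) on the web server.
    subprocess.check_call(['ssh', web_host, 'gzip', log_path])
    # 2. Copy the compressed file from the web server to this datanode.
    subprocess.check_call(['scp', '%s:%s' % (web_host, gz_path), '/tmp/'])
    local = '/tmp/' + gz_path.split('/')[-1]
    # 3. Record an MD5 checksum so integrity can be verified downstream.
    with open(local, 'rb') as f:
        md5 = hashlib.md5(f.read()).hexdigest()
    print('collected %s md5=%s' % (local, md5))
    # 4. Put the file on HDFS for the downstream MapReduce jobs.
    subprocess.check_call(['hadoop', 'fs', '-put', local, hdfs_dir])

collect_hourly_log('web01.example.com', '/var/log/apache/access.2012030716', '/data/weblogs/raw')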
[Architecture diagram: Apache logs from the sites are distributed by Fido onto HDFS; Python-ETL, MapReduce, and Hive run on the Hadoop cluster, which also ingests external data sources and CMS systems; results feed the DW database, web metrics, billers, and data mining.]
Web Log Processing with Hadoop Streaming and Python-ETL • Parse web logs • IAB filtering and checking • Parse user agents by regex • IP range lookup • Product key lookup, etc. • Sessionization (sketched below) • Prepare-sessionize • Sessionize • Filter-unpack • Process huge dimensions (URL/page title) • Load facts • Format load data / load data into DB
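A minimal sessionize step, written as a streaming reducer. It assumes prepare-sessionize has sorted events by (cookie, timestamp), and it starts a new session after 30 minutes of inactivity; the 30-minute timeout is a common web-analytics convention assumed here, not a figure from the deck:

#!/usr/bin/env python
# Hypothetical sessionize reducer: input is tab-delimited
# "cookie\ttimestamp\t...", already sorted by (cookie, timestamp) by the
# prepare-sessionize step / Hadoop shuffle.
import sys

SESSION_GAP = 30 * 60  # assumed 30-minute inactivity timeout, in seconds

def sessionize(lines):
    prev_cookie, prev_ts, session_id = None, None, 0
    for line in lines:
        cookie, ts, rest = line.rstrip('\n').split('\t', 2)
        ts = int(ts)
        if cookie != prev_cookie:
            session_id += 1        # new visitor -> new session
        elif ts - prev_ts > SESSION_GAP:
            session_id += 1        # same visitor after a long gap -> new session
        print('%s\t%d\t%d\t%s' % (cookie, session_id, ts, rest))
        prev_cookie, prev_ts = cookie, ts

if __name__ == '__main__':
    sessionize(sys.stdin)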
Benefits to Ops • Cut processing time to reach SLA, saving 6 hours • Running 2 years in production without any big issues • Withstood a 50%-per-year data volume increase • The architecture made it easy to add new processing logic • Robust and fault-tolerant • Five dead datanodes, jobs still ran OK • Upgraded the JVM on a few datanodes while jobs were running • Reprocessed old data while processing the current day's data
Conclusions I – Create a Tool Appropriate to the Job if nothing existing has what you want • The Python ETL framework and Hadoop Streaming together can do complex, big-volume ETL work • Python ETL framework • Home-grown, under review for open-source release • Rich functionality via Python • Extensible • NLS support • Runs on top of another platform, e.g. Hadoop, which supplies: • Distribution/parallelism • Sorting • Aggregation • (The filter-pipeline pattern it is built on is sketched below.)
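To make the stdin/stdout filter idea concrete, here is a sketch of the pattern only; the class and method names are illustrative, not the actual Lumberjack Pipeline/Filter API:

#!/usr/bin/env python
# Conceptual sketch of a stdin/stdout filter pipeline (names hypothetical).
import sys

class Filter(object):
    """One stage: consumes an iterator of records, yields records."""
    def process(self, records):
        raise NotImplementedError

class ParseTabs(Filter):
    def process(self, records):
        for line in records:
            yield line.rstrip('\n').split('\t')

class DropBots(Filter):
    def process(self, records):
        for fields in records:
            if 'bot' not in fields[-1].lower():  # crude user-agent check
                yield fields

class Pipeline(object):
    """Chains filters lazily; since each process reads stdin and writes
    stdout, the same code runs stand-alone (shell pipes) or as a mapper or
    reducer under Hadoop Streaming."""
    def __init__(self, *filters):
        self.filters = filters
    def run(self, stream):
        for f in self.filters:
            stream = f.process(stream)
        for fields in stream:
            print('\t'.join(fields))

if __name__ == '__main__':
    Pipeline(ParseTabs(), DropBots()).run(sys.stdin)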
Conclusions II – Power and Flexibility for Processing Big Data • Hadoop: scale and computing horsepower • Robustness • Fault tolerance • Scalability • Significant reduction of processing time to reach SLA • Cost-effective • Commodity hardware • Free software • Currently: building multi-tenant Hadoop clusters using the Fair Scheduler
Questions? michael.sun@cbsinteractive.com • Follow up on Lumberjack: lumberjack@cbsinteractive.com