Hadoop in the Wild

Hadoop in the Wild CMSC 491/691 Hadoop-Based Distributed Computing Spring 2014 Adam Shook

Agenda • Check out some use cases • Discuss some architectures

Use Cases

Common Use Cases • Log Processing • Image Identification • Extract Transform Load • Recommendation Engines • Time-Series Storage and Processing • Building Search Indexes • Long-Term Archive • Audit Logging

Non-Use Cases • Data processing handled by one large server • ACID Transactions

A Bank • Problem • Need to analyze customer activity across multiple products to predict credit risk • Acquired a number of banks • Solution • Set up a single Hadoop cluster with data from multiple EDWs • Bank added new sources of customer service data to get a clear picture of a customer’s financial situation

A Mobile Carrier • Problem • Why are our customers terminating their service contracts? • Solution • Combined transactional and event data with social network data • Combined coverage maps with account data

An Online Dating Service • Problem • Used survey, demographic, and web activity data to build a picture of customers • Customers wanted better recommendations • Algorithms improved and number of users grew • Solution • Moved data and analysis to Hadoop • Able to size system to meet needs of customers

Ad Targeting • Problem • Advertising is a special kind of recommendation • Need to select best ad for a particular visitor, but each advertiser is paying to have its ad seen • Solution • Collect stream of user activity with continuous analysis • Build sophisticated models of user behavior

POS Transaction Analysis • Problem • Retailers able to collect much more data in stores and online • EDWs do not generally support sophisticated analysis to provide better forecasting • Solution • Loaded 20 years of sales transactions and used Hive to do same analysis as before • Now able to use new algorithms with new data sets

Sensor Data • Problem • Volume of sensor data from every generator across multiple grids is enormous • Clear picture depends on real-time and forensic analysis • Solution • Capture and store all streaming sensor data • Built continuous analysis system to watch performance of generators

Threat Analysis • Problem • How do we detect threats and fraudulent activity in an online world? • Solution • Use of HBase to store virus signatures • Use of MapReduce to compare spam or malware • Lambda Architecture
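A minimal sketch of the signature-lookup idea on the Threat Analysis slide, under stated assumptions: a plain Python dict stands in for an HBase table keyed by a SHA-256 digest, and `map_scan` plays the role of a map step. The table contents, the signature name `Example.Trojan`, the sample payloads, and the function names are all hypothetical; the slide does not say how signatures are keyed or compared.

```python
import hashlib


def sha256_hex(payload: bytes) -> str:
    """Return the hex SHA-256 digest used here as the lookup key."""
    return hashlib.sha256(payload).hexdigest()


# Stand-in for an HBase table (assumption): row key = digest, value = signature name.
signature_table = {sha256_hex(b"malicious-payload"): "Example.Trojan"}


def map_scan(file_name, payload, table):
    """Map step: emit (file_name, signature) when the payload's digest is known."""
    name = table.get(sha256_hex(payload))
    if name is not None:
        yield file_name, name


# Hypothetical input samples; in a real job these would be input splits.
samples = {"a.bin": b"harmless", "b.bin": b"malicious-payload"}
hits = [
    hit
    for file_name, payload in samples.items()
    for hit in map_scan(file_name, payload, signature_table)
]
```

Exact-digest matching only catches byte-identical samples; it is shown because it makes the key-value lookup pattern easy to see, not because it is how any particular product works.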

Trade Surveillance • Problem • Difficult to monitor trades for compliance, and impossible to catch rogue traders • Solution • Store trade data and trading party data • Continuously monitor activity and build connections • Provides cheap storage for law-required auditing

Search • Problem • Indexing stuff is pretty easy, until you go and index the Internet • User preferences make it harder • Solution • MapReduce was designed for indexing • Online retailers depend on search for users finding and buying products
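To make the indexing claim concrete, here is a rough sketch of an inverted index expressed as a map step and a reduce step, simulated in plain Python rather than run on Hadoop. The function names and the toy corpus are illustrative assumptions, and the shuffle that the framework would perform is imitated with a dictionary.

```python
from collections import defaultdict


def map_phase(doc_id, text):
    """Map: emit a (term, doc_id) pair for every distinct word in one document."""
    for term in set(text.lower().split()):
        yield term, doc_id


def reduce_phase(term, doc_ids):
    """Reduce: collapse all document ids for a term into one sorted posting list."""
    return term, sorted(set(doc_ids))


def build_index(docs):
    """Run map, a simulated shuffle (group by term), and reduce over all documents."""
    grouped = defaultdict(list)
    for doc_id, text in docs.items():
        for term, emitted_id in map_phase(doc_id, text):
            grouped[term].append(emitted_id)
    return dict(reduce_phase(term, ids) for term, ids in grouped.items())


# Hypothetical toy corpus.
docs = {
    "d1": "hadoop stores data",
    "d2": "hadoop indexes data",
}
index = build_index(docs)
```

Looking up `index["hadoop"]` returns the documents that contain the term, which is the basic operation a search front end needs.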

Data Sandbox • Problem • ??? • Solution • Simple storage mechanism with diverse tools for data analysis and exploration

Architectures

Building your Data Lake


Lambda Architecture • [Diagram] New Data Stream • Batch layer (Hadoop): All Data, Precompute Views, batch recompute • Serving layer: batch views and real-time views (QFD 1, QFD 2, ... QFD N; HDFS/SQL, Apache HBase), read by Query • Speed layer (Storm): Process Stream, Increment Views, real-time increment
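The three layers can be sketched in a few lines, assuming a page-view counting workload that is not in the slides. Here `batch_recompute` stands in for the batch layer, `SpeedLayer` for the stream processor, and `query` for the serving-side merge; all of these names and the sample events are hypothetical, and in-memory counters replace HDFS, HBase, and Storm.

```python
from collections import Counter


def batch_recompute(all_data):
    """Batch layer: recompute the view from scratch over the full dataset."""
    return Counter(event["page"] for event in all_data)


class SpeedLayer:
    """Speed layer: update a real-time view incrementally, one event at a time."""

    def __init__(self):
        self.view = Counter()

    def process(self, event):
        self.view[event["page"]] += 1

    def reset(self):
        # Called once a new batch view has absorbed these events.
        self.view.clear()


def query(page, batch_view, realtime_view):
    """Serving side: answer a query by merging the batch and real-time views."""
    return batch_view[page] + realtime_view[page]


# Hypothetical data: events already in the master dataset, then newer arrivals.
master = [{"page": "/home"}, {"page": "/home"}, {"page": "/about"}]
batch_view = batch_recompute(master)

speed = SpeedLayer()
for event in [{"page": "/home"}, {"page": "/jobs"}]:
    speed.process(event)

home_views = query("/home", batch_view, speed.view)
jobs_views = query("/jobs", batch_view, speed.view)
```

The design point is that the batch view is always rebuilt from raw data, so any error in the incremental path is corrected at the next batch run, while the speed layer only has to cover the events that arrived since then.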

Facebook • EDW (Oracle) was unable to scale and perform • Investigated small Hadoop system • Engineers loved it • Began developing Hive

Facebook • Time-series summaries • Ad hoc jobs over historical data • Long-term archival store for logs • Look up log events by specific attributes

Facebook Architecture

Facebook Messaging • Needed a short set of temporal data • A growing set of data that is rarely accessed • HBase fit their needs better than other open-source technologies

Twitter Architecture

LinkedIn Architecture

LinkedIn Applications

LinkedIn Future • MapReduce is not suited for large graph processing • Batch-oriented nature is not suited for “breaking news”

References • Hadoop: The Definitive Guide, Chapter 16.2 • http://www.slideshare.net/s_shah/the-big-data-ecosystem-at-linkedin-23512853 • http://www.slideshare.net/Hadoop_Summit/hadoop-hardware-twitter-size-does-matter • http://www.forbes.com/sites/edddumbill/2014/01/14/the-data-lake-dream/ • http://www.slideshare.net/brocknoland/common-and-unique-use-cases-for-apache-hadoop • http://blog.cloudera.com/wp-content/uploads/2011/03/ten_common_hadoopable_problems_final.pdf
