Pragmatic Big Data Architectures in the Cloud: A Developer's Perspective

Fabiane Bizinella Nardon (@fabianenardon), Fernando Babadopulos (@babadopulos)

Big Data and Us

How big is BIG?

Redis Mahout HDFS HBase Hive Hadoop Pig Cascading Crunch Cassandra MongoDB MySQL

Disruptive Apps! Big Data + Cloud

Nothing will have a bigger impact on your Big Data application performance than making your own code faster.
You have unlimited resources in the cloud. But the costs are unlimited too.
When applying Big Data technologies, make sure your data is really big.

Nothing will have a bigger impact on your Big Data application performance than making your own code faster

Input:
u=0C010003 - http://www.tailtarget.com/home/ - 179.203.156.194
u=12070002 - http://cnn.com/news - 189.19.123.161
u=00AD0e12 - http://www.tailtarget.com/about/ - 187.74.232.127

Output:
tailtarget.com - 2
cnn.com - 1
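The per-site count above can be checked locally before any cluster is involved. The following is a minimal sketch in plain Java, assuming every line has the shape `u=<id> - <url> - <ip>`; the class and method names are hypothetical and not part of the talk. Note that `getHost()` returns the full host (`www.tailtarget.com`), whereas the slide abbreviates it to `tailtarget.com`.

```java
import java.net.URI;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper: counts page views per host on a single machine.
final class HostCountSketch {

    // Extracts the host from a line shaped like "u=<id> - <url> - <ip>".
    static String hostOf(String line) {
        String[] parts = line.split(" ");   // parts[2] is the URL
        return URI.create(parts[2]).getHost();
    }

    // Same aggregation the MapReduce job performs, done in memory.
    static Map<String, Integer> countByHost(List<String> lines) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : lines) {
            counts.merge(hostOf(line), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> log = List.of(
            "u=0C010003 - http://www.tailtarget.com/home/ - 179.203.156.194",
            "u=12070002 - http://cnn.com/news - 189.19.123.161",
            "u=00AD0e12 - http://www.tailtarget.com/about/ - 187.74.232.127");
        System.out.println(countByHost(log)); // {cnn.com=1, www.tailtarget.com=2}
    }
}
```

If the data fits comfortably in a loop like this, it is probably not big enough to justify a cluster.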

MapReduce: Data -> Map (parallelism) -> Reduce -> New Data

public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable one = new IntWritable(1);

    // Emits (host, 1) for each log line "u=<id> - <url> - <ip>".
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String line = value.toString();
        String[] parts = line.split(" ");
        String url = new URL(parts[2]).getHost();
        context.write(new Text(url), one);
    }
}

public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
    // Sums the counts emitted for each host.
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int count = 0;
        for (IntWritable value : values) {
            count = count + value.get();
        }
        context.write(key, new IntWritable(count));
    }
}

Map phase (per HDFS chunk): Chunk -> Record Reader -> Map -> Combine -> Local Storage
Reduce phase: Copy -> Sort -> Reduce

public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int count = 0;
        for (IntWritable value : values) {
            count = count + value.get();
        }
        context.write(key, new IntWritable(count));
    }
}

// The reducer doubles as the combiner, so counts are summed on the map side first.
job.setMapperClass(Mapp.class);
job.setCombinerClass(Reduce.class);
job.setReducerClass(Reduce.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
job.submit();

Naive Implementation vs. With Combiner

public static class Mapp extends Mapper<LongWritable, Text, Text, IntWritable> {
    // Reused across calls to avoid allocating a new Text per record.
    private Text url = new Text();
    private static final IntWritable one = new IntWritable(1);

    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer st = new StringTokenizer(value.toString(), " ");
        st.nextToken(); // user id
        st.nextToken(); // "-"
        url.set(new URL(st.nextToken()).getHost());
        context.write(url, one);
    }
}

public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
    // Reused output value.
    private IntWritable counter = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int count = 0;
        for (IntWritable value : values) {
            count = count + value.get();
        }
        counter.set(count);
        context.write(key, counter);
    }
}

With Combiner vs. Optimized

public static class Mapp extends Mapper<LongWritable, Text, Text, IntWritable> {
    // Per-mapper counts, flushed once in cleanup().
    private Map<String, Integer> items = new HashMap<String, Integer>();
    private Text outKey = new Text();
    private IntWritable outValue = new IntWritable();

    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer st = new StringTokenizer(value.toString(), " ");
        st.nextToken(); // user id
        st.nextToken(); // "-"
        String page = new URL(st.nextToken()).getHost();
        Integer count = items.get(page);
        if (count == null) {
            items.put(page, 1);
        } else {
            items.put(page, count + 1);
        }
    }

    // Emits one (host, count) pair per distinct host seen by this mapper.
    public void cleanup(Context context) throws IOException, InterruptedException {
        for (Entry<String, Integer> item : items.entrySet()) {
            outKey.set(item.getKey());
            outValue.set(item.getValue());
            context.write(outKey, outValue);
        }
    }
}

Optimized vs. Pre-Combined

The bottleneck usually is caused by the amount of data going across the network
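To make the network argument concrete, the sketch below compares how many records would leave a single mapper with and without local pre-combining. It is plain Java with no Hadoop dependency; the class and method names are hypothetical, and record count is used here only as a rough proxy for shuffle volume.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration of map-side aggregation.
final class ShuffleVolumeSketch {

    // Naive mapper: one (host, 1) record leaves the mapper per input line.
    static int recordsEmittedNaive(List<String> hosts) {
        return hosts.size();
    }

    // Pre-combined mapper: counts locally, then emits one (host, count)
    // record per distinct host.
    static Map<String, Integer> preCombine(List<String> hosts) {
        Map<String, Integer> counts = new HashMap<>();
        for (String host : hosts) {
            counts.merge(host, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> hosts = List.of("www.tailtarget.com", "cnn.com", "www.tailtarget.com");
        System.out.println("naive records: " + recordsEmittedNaive(hosts));          // 3
        System.out.println("pre-combined records: " + preCombine(hosts).size());     // 2
    }
}
```

The gap grows with the number of repeated keys per mapper, which is why pre-combining tends to pay off on skewed data such as page views per site.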

Make sure your architecture is fault tolerant. If Map and Reduce have to start over all the time, you won’t get any work done.

Hadoop/HDFS do not work well with small files. *We’re talking Big Data, remember?
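One common workaround is to merge many small files into fewer large ones before loading them into HDFS. The sketch below does this on a local directory with plain `java.nio`; the class name and the assumption that every input file ends with a newline are mine, not the talk's. Hadoop also ships input formats such as `CombineFileInputFormat` for this purpose, which may be a better fit when the files are already in HDFS.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper: concatenates the files of a directory into one file.
final class SmallFileMerger {

    // Appends every regular file in inputDir (sorted by path) to target.
    // Assumes target lives outside inputDir and each file ends with '\n'.
    // Returns the number of files merged.
    static int merge(Path inputDir, Path target) throws IOException {
        List<Path> parts;
        try (Stream<Path> files = Files.list(inputDir)) {
            parts = files.filter(Files::isRegularFile)
                         .sorted()
                         .collect(Collectors.toList());
        }
        try (OutputStream out = Files.newOutputStream(target)) {
            for (Path part : parts) {
                Files.copy(part, out); // streams the file without loading it in memory
            }
        }
        return parts.size();
    }
}
```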

MapReduce pipeline: Data -> Map -> Reduce

Processing Pipelines

u=0C010003 - http://www.tailtarget.com/home/ - 179.203.156.194
u=12070002 - http://cnn.com/news - 189.19.123.161
u=00AD0e12 - http://www.tailtarget.com/about/ - 187.74.232.127

u=0C010003 - http://www.tailtarget.com/home/
u=12070002 - http://cnn.com/news
u=00AD0e12 - http://www.tailtarget.com/about/

http://www.tailtarget.com/home/
http://cnn.com/news
http://www.tailtarget.com/about/

u=0C010003 - Technology
u=12070002 - News
u=00AD0e12 - Technology

http://www.tailtarget.com/home/ - Technology
http://cnn.com/news - News
http://www.tailtarget.com/about/ - Technology

MapReduce Pipelines: Managing, Optimizing, Chaining
Redis Mahout HDFS HBase Hive Hadoop Pig Cascading Crunch Cassandra MongoDB MySQL

Crunch Pipeline
Diagram: Data Source, PCollection*, DoFn 1 and DoFn 2 applied via parallelDo() on Hadoop Node 1 and Hadoop Node 2 (HDFS), PTable*, Write, Data Target.
* PCollection, PTable or PGroupedTable

MapReduce Pipelines

u=0C010003 - http://www.tailtarget.com/home/ - 179.203.156.194
u=12070002 - http://cnn.com/news - 189.19.123.161
u=00AD0e12 - http://www.tailtarget.com/about/ - 187.74.232.127

u=0C010003 - http://www.tailtarget.com/home/
u=12070002 - http://cnn.com/news
u=00AD0e12 - http://www.tailtarget.com/about/

http://www.tailtarget.com/home/
http://cnn.com/news
http://www.tailtarget.com/about/

u=0C010003 - Technology
u=12070002 - News
u=00AD0e12 - Technology

http://www.tailtarget.com/home/ - Technology
http://cnn.com/news - News
http://www.tailtarget.com/about/ - Technology

Pipeline Architecture

1
u=0C010003 - http://www.tailtarget.com/home/ - 179.203.156.194
u=12070002 - http://cnn.com/news - 189.19.123.161
u=00AD0e12 - http://www.tailtarget.com/about/ - 187.74.232.127

2, 3
http://www.tailtarget.com/home/
http://cnn.com/news
http://www.tailtarget.com/about/

u=0C010003 - http://www.tailtarget.com/home/
u=12070002 - http://cnn.com/news
u=00AD0e12 - http://www.tailtarget.com/about/

4, 5
http://www.tailtarget.com/home/ - Technology
http://cnn.com/news - News
http://www.tailtarget.com/about/ - Technology

Merge

6
u=0C010003 - Technology
u=12070002 - News
u=00AD0e12 - Technology

Pipeline Architecture

1
u=0C010003 - http://www.tailtarget.com/home/ - 179.203.156.194
u=12070002 - http://cnn.com/news - 189.19.123.161
u=00AD0e12 - http://www.tailtarget.com/about/ - 187.74.232.127

2
{http://www.tailtarget.com/home/, [u=0C010003]}
{http://cnn.com/news, [u=12070002]}
{http://www.tailtarget.com/about/, [u=00AD0e12]}

3
{http://www.tailtarget.com/home/, [u=0C010003], Technology}
{http://cnn.com/news, [u=12070002], News}
{http://www.tailtarget.com/about/, [u=00AD0e12], Technology}

4
u=0C010003 - Technology
u=12070002 - News
u=00AD0e12 - Technology
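The four-step layout can be sketched in plain Java as follows: group users by URL, classify each distinct URL once, then fan the category back out to its users. This is an in-memory illustration under my own assumptions, not the Crunch pipeline from the talk; the class name is hypothetical and the classifier passed in `main` is a stand-in for whatever real categorization service is used.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Hypothetical single-pass version of the pipeline.
final class SinglePassPipelineSketch {

    // Step 2: {url, [users]} built from lines shaped "u=<id> - <url> - <ip>".
    static Map<String, List<String>> usersByUrl(List<String> lines) {
        Map<String, List<String>> grouped = new LinkedHashMap<>();
        for (String line : lines) {
            String[] parts = line.split(" - "); // [user, url, ip]
            grouped.computeIfAbsent(parts[1], k -> new ArrayList<>()).add(parts[0]);
        }
        return grouped;
    }

    // Steps 3 and 4: classify each distinct URL once, emit "user - category".
    static List<String> userCategories(List<String> lines,
                                       Function<String, String> classifier) {
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, List<String>> e : usersByUrl(lines).entrySet()) {
            String category = classifier.apply(e.getKey());
            for (String user : e.getValue()) {
                out.add(user + " - " + category);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> log = List.of(
            "u=0C010003 - http://www.tailtarget.com/home/ - 179.203.156.194",
            "u=12070002 - http://cnn.com/news - 189.19.123.161",
            "u=00AD0e12 - http://www.tailtarget.com/about/ - 187.74.232.127");
        // Stand-in classifier, for illustration only.
        Function<String, String> classifier =
            url -> url.contains("cnn.com") ? "News" : "Technology";
        userCategories(log, classifier).forEach(System.out::println);
    }
}
```

Because the users travel with their URL, no separate merge step is needed and the raw data is read only once.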

If you read data from disk, do as much as you can with it in a single pass.

You have unlimited resources in the cloud. But the costs are unlimited too.

If you want to do magic, a flexible service is the key

Amazon EC2 (costs for 1 year of a Large instance)
ON-DEMAND: $0 upfront, $0.32 / hour, $2,803.20 per year
RESERVED: $1,427 upfront, $0.104 / hour, $2,338.04 per year
SPOT: $0 upfront, $0.042 / hour, ≅ $367.92 per year
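The yearly figures follow from upfront fee plus hourly rate times 8,760 hours (24 x 365). A small sketch of that arithmetic, using only the prices quoted on the slide; the class and method names are hypothetical, and the spot price is treated as constant although in practice it varies.

```java
import java.util.Locale;

// Hypothetical calculator for the one-year cost of a single instance.
final class InstanceCostSketch {
    static final int HOURS_PER_YEAR = 24 * 365; // 8,760 hours

    // Upfront fee plus hourly rate for a full year of continuous use.
    static double yearlyCost(double upfront, double hourlyRate) {
        return upfront + hourlyRate * HOURS_PER_YEAR;
    }

    // Formats an amount with two decimals, independent of the system locale.
    static String format(double amount) {
        return String.format(Locale.ROOT, "%.2f", amount);
    }

    public static void main(String[] args) {
        System.out.println("on-demand: " + format(yearlyCost(0, 0.32)));     // 2803.20
        System.out.println("reserved:  " + format(yearlyCost(1427, 0.104))); // 2338.04
        System.out.println("spot:      " + format(yearlyCost(0, 0.042)));    // 367.92
    }
}
```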

Savings
Peaks of 85 servers per day
60 servers running full time (42 spot)
≈ 62% savings in 1 year

Choosing the instance type well makes a huge difference. Make sure you’re monitoring the price variation over time, and use this information for future purchases.

How to use spot instances
• Prefer shared-nothing architectures
• Choose instances in different zones
• Mix with on-demand instances
• Remember: you can lose your machine at any time

Auto Scaling: and how not to go crazy with it? #BeLazy

Execute your actions based on data
