Pragmatic Big Data Architectures in the Cloud: a Developer's Perspective Fabiane Bizinella Nardon (@fabianenardon) Fernando Babadopulos (@babadopulos)
How big is BIG?
Redis Mahout HDFS HBase Hive Hadoop Pig Cascading Crunch Cassandra MongoDB MySQL
Disruptive Apps! Big Data + Cloud
Nothing will have a bigger impact on your Big Data application performance than making your own code faster.
You have unlimited resources in the cloud, but the costs are unlimited too.
When applying Big Data technologies, make sure your data is really big.
Nothing will have a bigger impact on your Big Data application performance than making your own code faster
Input:
u=0C010003 - http://www.tailtarget.com/home/ - 179.203.156.194
u=12070002 - http://cnn.com/news - 189.19.123.161
u=00AD0e12 - http://www.tailtarget.com/about/ - 187.74.232.127
Output:
tailtarget.com - 2
cnn.com - 1
MapReduce: Data -> Map (parallelism) -> Reduce -> New Data
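import java.io.IOException;
import java.net.URL;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Emits (host, 1) for every log line of the form "user - url - ip".
public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable one = new IntWritable(1);

    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String line = value.toString();
        String[] parts = line.split(" ");
        String url = new URL(parts[2]).getHost(); // parts[2] is the URL field
        context.write(new Text(url), one);
    }
}

// Sums the counts emitted for each host.
public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int count = 0;
        for (IntWritable value : values) {
            count = count + value.get();
        }
        context.write(key, new IntWritable(count));
    }
}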
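The map and reduce steps above can be tried without a cluster. What follows is only a minimal plain-Java sketch, assuming the "user - url - ip" log format of the sample lines. The class name LocalHostCount and its methods are hypothetical, the grouping loop merely imitates what Hadoop's shuffle does, and java.net.URI stands in for the URL class used on the slide.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical local stand-in for the Hadoop job: no cluster, no HDFS.
class LocalHostCount {
    // "Map" step: turn one log line into a (host, 1) pair.
    static Map.Entry<String, Integer> map(String line) {
        String[] parts = line.split(" ");          // parts[2] is the URL field
        String host = URI.create(parts[2]).getHost();
        return Map.entry(host, 1);
    }

    // "Reduce" step: sum all the counts collected for one host.
    static int reduce(List<Integer> values) {
        int count = 0;
        for (int value : values) {
            count += value;
        }
        return count;
    }

    // Imitates the shuffle: group mapped pairs by key, then reduce each group.
    static Map<String, Integer> run(List<String> lines) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String line : lines) {
            Map.Entry<String, Integer> pair = map(line);
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> group : grouped.entrySet()) {
            result.put(group.getKey(), reduce(group.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> lines = List.of(
                "u=0C010003 - http://www.tailtarget.com/home/ - 179.203.156.194",
                "u=12070002 - http://cnn.com/news - 189.19.123.161",
                "u=00AD0e12 - http://www.tailtarget.com/about/ - 187.74.232.127");
        System.out.println(run(lines)); // {cnn.com=1, www.tailtarget.com=2}
    }
}
```

Note that getHost() keeps the "www." prefix, so the key here is www.tailtarget.com rather than the shortened tailtarget.com shown in the sample output.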
Map side (one per HDFS chunk): Chunk -> Record Reader -> Map -> Combine -> Local Storage
Reduce side: Copy -> Sort -> Reduce
public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int count = 0;
        for (IntWritable value : values) {
            count = count + value.get();
        }
        context.write(key, new IntWritable(count));
    }
}

// Summing is associative, so the reducer can also run as the combiner.
job.setMapperClass(Mapp.class);
job.setCombinerClass(Reduce.class);
job.setReducerClass(Reduce.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
job.submit();
[Chart: Naive Implementation vs. With Combiner]
import java.util.StringTokenizer;

public static class Mapp extends Mapper<LongWritable, Text, Text, IntWritable> {
    // Reused across calls to avoid allocating a new Text per record.
    private Text url = new Text();
    private static final IntWritable one = new IntWritable(1);

    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer st = new StringTokenizer(value.toString(), " ");
        st.nextToken(); // skip the user id
        st.nextToken(); // skip the "-" separator
        url.set(new URL(st.nextToken()).getHost());
        context.write(url, one);
    }
}

public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
    // Reused across calls to avoid allocating a new IntWritable per key.
    private IntWritable counter = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int count = 0;
        for (IntWritable value : values) {
            count = count + value.get();
        }
        counter.set(count);
        context.write(key, counter);
    }
}
[Chart: With Combiner vs. Optimized]
import java.util.HashMap;
import java.util.Map;
import java.util.Map.Entry;

public static class Mapp extends Mapper<LongWritable, Text, Text, IntWritable> {
    // Per-task counts, held in memory until the task finishes.
    private Map<String, Integer> items = new HashMap<String, Integer>();
    private Text key = new Text();
    private IntWritable value = new IntWritable();

    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer st = new StringTokenizer(value.toString(), " ");
        st.nextToken(); // skip the user id
        st.nextToken(); // skip the "-" separator
        String page = new URL(st.nextToken()).getHost();
        Integer count = items.get(page);
        if (count == null) {
            items.put(page, 1);
        } else {
            items.put(page, count + 1);
        }
    }

    // Emits one (host, count) pair per distinct host seen by this task.
    @Override
    public void cleanup(Context context) throws IOException, InterruptedException {
        for (Entry<String, Integer> item : items.entrySet()) {
            key.set(item.getKey());
            value.set(item.getValue());
            context.write(key, value);
        }
    }
}
[Chart: Optimized vs. Pre-Combined]
The bottleneck is usually caused by the amount of data going across the network
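To make the network argument concrete, here is a rough plain-Java sketch that only counts how many records a mapper would hand to the shuffle, with and without aggregating per input split first. The class ShuffleVolume, its methods, and the toy splits are hypothetical; real savings depend on how often keys repeat within a split.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model: each inner list holds the hosts seen by one map task.
class ShuffleVolume {
    // Naive mapper: emits one (host, 1) record per input line.
    static int naiveRecords(List<List<String>> splits) {
        int emitted = 0;
        for (List<String> split : splits) {
            emitted += split.size();
        }
        return emitted;
    }

    // Pre-combining mapper: emits one (host, count) record per distinct host per split.
    static int preCombinedRecords(List<List<String>> splits) {
        int emitted = 0;
        for (List<String> split : splits) {
            Map<String, Integer> counts = new HashMap<>();
            for (String host : split) {
                counts.merge(host, 1, Integer::sum);
            }
            emitted += counts.size();
        }
        return emitted;
    }

    public static void main(String[] args) {
        List<List<String>> splits = List.of(
                List.of("tailtarget.com", "tailtarget.com", "cnn.com", "tailtarget.com"),
                List.of("cnn.com", "cnn.com", "cnn.com"));
        System.out.println("naive: " + naiveRecords(splits));              // naive: 7
        System.out.println("pre-combined: " + preCombinedRecords(splits)); // pre-combined: 3
    }
}
```

The final totals per host are identical in both cases; only the number of records that must be copied to the reducers changes.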
Make sure your architecture is fault tolerant. If Map and Reduce have to start over all the time, you won’t get any work done.
Hadoop/HDFS do not work well with small files. *We’re talking Big Data, remember?
MapReduce PIPELINE: Data -> Map -> Reduce
Processing Pipelines

u=0C010003 - http://www.tailtarget.com/home/ - 179.203.156.194
u=12070002 - http://cnn.com/news - 189.19.123.161
u=00AD0e12 - http://www.tailtarget.com/about/ - 187.74.232.127

u=0C010003 - http://www.tailtarget.com/home/
u=12070002 - http://cnn.com/news
u=00AD0e12 - http://www.tailtarget.com/about/

http://www.tailtarget.com/home/
http://cnn.com/news
http://www.tailtarget.com/about/

u=0C010003 - Technology
u=12070002 - News
u=00AD0e12 - Technology

http://www.tailtarget.com/home/ - Technology
http://cnn.com/news - News
http://www.tailtarget.com/about/ - Technology
MapReduce Pipelines: Managing, Optimizing, Chaining (Redis, Mahout, HDFS, HBase, Hive, Hadoop, Pig, Cascading, Crunch, Cassandra, MongoDB, MySQL)
Crunch Pipeline: Data Source -> DoFn 1 -> DoFn 2 -> Write -> Data Target(s)
DoFns are applied with parallelDo(); copies of the same DoFn run in parallel on Hadoop Node 1 and Hadoop Node 2, each over HDFS.
• Data between stages is a PCollection, PTable or PGroupedTable
Pipeline Architecture (six steps, with a merge)

1:
u=0C010003 - http://www.tailtarget.com/home/ - 179.203.156.194
u=12070002 - http://cnn.com/news - 189.19.123.161
u=00AD0e12 - http://www.tailtarget.com/about/ - 187.74.232.127

2, 3:
http://www.tailtarget.com/home/
http://cnn.com/news
http://www.tailtarget.com/about/

u=0C010003 - http://www.tailtarget.com/home/
u=12070002 - http://cnn.com/news
u=00AD0e12 - http://www.tailtarget.com/about/

4, 5:
http://www.tailtarget.com/home/ - Technology
http://cnn.com/news - News
http://www.tailtarget.com/about/ - Technology

Merge

6:
u=0C010003 - Technology
u=12070002 - News
u=00AD0e12 - Technology
Pipeline Architecture (four steps, no merge)

1:
u=0C010003 - http://www.tailtarget.com/home/ - 179.203.156.194
u=12070002 - http://cnn.com/news - 189.19.123.161
u=00AD0e12 - http://www.tailtarget.com/about/ - 187.74.232.127

2:
{http://www.tailtarget.com/home/, [u=0C010003]}
{http://cnn.com/news, [u=12070002]}
{http://www.tailtarget.com/about/, [u=00AD0e12]}

3:
{http://www.tailtarget.com/home/, [u=0C010003], Technology}
{http://cnn.com/news, [u=12070002], News}
{http://www.tailtarget.com/about/, [u=00AD0e12], Technology}

4:
u=0C010003 - Technology
u=12070002 - News
u=00AD0e12 - Technology
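The four-step flow above (group users by URL, classify each URL once, then emit user and category) could be sketched in plain Java as follows. This is an illustration under assumptions, not the pipeline used in the talk: the class CategoryPipeline, its methods, and the CATEGORY_BY_HOST lookup are hypothetical, and a hard-coded map stands in for whatever classifier really assigns categories.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class CategoryPipeline {
    // Hypothetical lookup standing in for a real page classifier.
    static final Map<String, String> CATEGORY_BY_HOST =
            Map.of("www.tailtarget.com", "Technology", "cnn.com", "News");

    // Step 2: group the users by the URL they visited.
    static Map<String, List<String>> usersByUrl(List<String> logLines) {
        Map<String, List<String>> grouped = new LinkedHashMap<>();
        for (String line : logLines) {
            String[] parts = line.split(" - "); // user, url, ip
            grouped.computeIfAbsent(parts[1], k -> new ArrayList<>()).add(parts[0]);
        }
        return grouped;
    }

    // Step 3: classify a URL; called once per distinct URL, not once per log line.
    static String categorize(String url) {
        return CATEGORY_BY_HOST.getOrDefault(URI.create(url).getHost(), "Unknown");
    }

    // Step 4: emit "user - category" for every user attached to a classified URL.
    static List<String> userCategories(List<String> logLines) {
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, List<String>> entry : usersByUrl(logLines).entrySet()) {
            String category = categorize(entry.getKey());
            for (String user : entry.getValue()) {
                out.add(user + " - " + category);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> log = List.of(
                "u=0C010003 - http://www.tailtarget.com/home/ - 179.203.156.194",
                "u=12070002 - http://cnn.com/news - 189.19.123.161",
                "u=00AD0e12 - http://www.tailtarget.com/about/ - 187.74.232.127");
        userCategories(log).forEach(System.out::println);
    }
}
```

Carrying the user list along with each URL is what removes the separate merge step: the category is attached to a record that already knows its users.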
You have unlimited resources in the cloud. But the costs are unlimited too.
Amazon EC2: costs for 1 year of a Large instance

Pricing     Upfront   Hourly           1-year total
ON-DEMAND   $0        $0.32 / hour     $2,803.20
RESERVED    $1,427    $0.104 / hour    $2,338.04
SPOT        $0        $0.042 / hour    ≅ $367.92
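The yearly totals follow from the upfront fee plus the hourly rate times 8,760 hours (24 x 365). A small sketch of that arithmetic, using only the prices quoted on the slide; the class and method names are hypothetical, and the spot rate is treated as constant even though real spot prices fluctuate.

```java
import java.util.Locale;

class Ec2YearlyCost {
    static final int HOURS_PER_YEAR = 24 * 365; // 8760, ignoring leap years

    // Total cost of keeping one instance running for a full year.
    static double yearlyCost(double upfront, double hourlyRate) {
        return upfront + hourlyRate * HOURS_PER_YEAR;
    }

    public static void main(String[] args) {
        // Prices are the ones quoted on the slide, not current AWS prices.
        System.out.println(String.format(Locale.US, "on-demand: %.2f", yearlyCost(0, 0.32)));
        System.out.println(String.format(Locale.US, "reserved:  %.2f", yearlyCost(1427, 0.104)));
        System.out.println(String.format(Locale.US, "spot:      %.2f", yearlyCost(0, 0.042)));
    }
}
```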
Savings: peaks of 85 servers per day; 60 servers running full time (42 of them spot); ≈ 62% savings in 1 year
Choosing the instance type well makes a huge difference. Monitor how prices vary over time and use this information for future purchases.
How to use spot instances
• Prefer shared-nothing architectures
• Choose instances in different zones
• Mix with on-demand instances
• Remember: you can lose your machine at any time
Auto Scaling: and how not to go crazy with it? #BeLazy