Pragmatic Big Data Architectures in the Cloud : a Developer's Perspective

Fabiane Bizinella Nardon (@fabianenardon), Fernando Babadopulos (@babadopulos)

Presentation Transcript


  1. Pragmatic Big Data Architectures in the Cloud: a Developer's Perspective Fabiane Bizinella Nardon (@fabianenardon) Fernando Babadopulos (@babadopulos)

  2. Big Data and Us

  3. Big Data and Us

  4. How big is BIG?

  5. Redis Mahout HDFS HBase Hive Hadoop Pig Cascading Crunch Cassandra MongoDB MySQL

  6. Redis Mahout HDFS HBase Hive Hadoop Pig Cascading Crunch Cassandra MongoDB MySQL

  7. Disruptive Apps! Big Data + Cloud

  8. Nothing will have a bigger impact on your Big Data application performance than making your own code faster. You have unlimited resources in the cloud, but the costs are unlimited too. When applying Big Data technologies, make sure your data is really big.

  9. Nothing will have a bigger impact on your Big Data application performance than making your own code faster

  10. Input log:
      u=0C010003 - http://www.tailtarget.com/home/ - 179.203.156.194
      u=12070002 - http://cnn.com/news - 189.19.123.161
      u=00AD0e12 - http://www.tailtarget.com/about/ - 187.74.232.127
      Hits per site:
      tailtarget.com - 2
      cnn.com - 1

  11. Data MapReduce Map Parallelism Reduce New Data

  12. public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
          private static final IntWritable one = new IntWritable(1);

          // Emits (host, 1) for every log line of the form "user - url - ip".
          @Override
          public void map(LongWritable key, Text value, Context context)
                  throws IOException, InterruptedException {
              String line = value.toString();
              String[] parts = line.split(" ");
              String url = new URL(parts[2]).getHost();
              context.write(new Text(url), one);
          }
      }

      public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
          // Sums the counts received for one host.
          @Override
          public void reduce(Text key, Iterable<IntWritable> values, Context context)
                  throws IOException, InterruptedException {
              int count = 0;
              for (IntWritable value : values) {
                  count = count + value.get();
              }
              context.write(key, new IntWritable(count));
          }
      }

  13. Map side: HDFS chunk 1 → Record Reader → Map → Combine → local storage; HDFS chunk 2 → Record Reader → Map → Combine → local storage. Reduce side: Copy → Sort → Reduce.

  14. public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
          @Override
          public void reduce(Text key, Iterable<IntWritable> values, Context context)
                  throws IOException, InterruptedException {
              int count = 0;
              for (IntWritable value : values) {
                  count = count + value.get();
              }
              context.write(key, new IntWritable(count));
          }
      }

      // Summing is associative, so the reducer can also run as the combiner.
      job.setMapperClass(Mapp.class);
      job.setCombinerClass(Reduce.class);
      job.setReducerClass(Reduce.class);
      job.setOutputKeyClass(Text.class);
      job.setOutputValueClass(IntWritable.class);
      job.submit();

  15. Naive Implementation With Combiner

  16. public static class Mapp extends Mapper<LongWritable, Text, Text, IntWritable> {
          // Reused across calls to avoid allocating a new Text per record.
          private Text url = new Text();
          private static final IntWritable one = new IntWritable(1);

          @Override
          public void map(LongWritable key, Text value, Context context)
                  throws IOException, InterruptedException {
              StringTokenizer st = new StringTokenizer(value.toString(), " ");
              st.nextToken(); // user id
              st.nextToken(); // "-" separator
              url.set(new URL(st.nextToken()).getHost());
              context.write(url, one);
          }
      }

      public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
          // Reused across calls to avoid allocating a new IntWritable per key.
          private IntWritable counter = new IntWritable();

          @Override
          public void reduce(Text key, Iterable<IntWritable> values, Context context)
                  throws IOException, InterruptedException {
              int count = 0;
              for (IntWritable value : values) {
                  count = count + value.get();
              }
              counter.set(count);
              context.write(key, counter);
          }
      }

  17. With Combiner Optimized

  18. public static class Mapp extends Mapper<LongWritable, Text, Text, IntWritable> {
          // Per-task counts, held in memory until cleanup().
          private Map<String, Integer> items = new HashMap<String, Integer>();
          private Text key = new Text();
          private IntWritable value = new IntWritable();

          // Only aggregates; nothing is written to the context here.
          @Override
          public void map(LongWritable key, Text value, Context context)
                  throws IOException, InterruptedException {
              StringTokenizer st = new StringTokenizer(value.toString(), " ");
              st.nextToken(); // user id
              st.nextToken(); // "-" separator
              String page = new URL(st.nextToken()).getHost();
              Integer count = items.get(page);
              if (count == null) {
                  items.put(page, 1);
              } else {
                  items.put(page, count + 1);
              }
          }

          // Runs once per map task: emits one record per distinct host.
          @Override
          public void cleanup(Context context) throws IOException, InterruptedException {
              for (Entry<String, Integer> item : items.entrySet()) {
                  key.set(item.getKey());
                  value.set(item.getValue());
                  context.write(key, value);
              }
          }
      }

  19. Optimized Pre-Combined

  20. The bottleneck usually is caused by the amount of data going across the network
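A rough way to see why the pre-combined mapper helps is to count the records that one map task would hand to the shuffle. The sketch below is plain Java with no Hadoop dependency; the class and method names (`EmitCountSketch`, `naiveEmit`, `preCombinedEmit`) are made up for illustration, and it uses `java.net.URI` instead of the `URL` class from the slides to extract the host.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Illustration only: records emitted by one map task, naive vs. pre-combined. */
final class EmitCountSketch {

    /** Extracts the host from a line shaped like "user - url - ip". */
    static String hostOf(String line) {
        String[] parts = line.split(" ");
        return URI.create(parts[2]).getHost();
    }

    /** Naive mapper: emits one (host, 1) pair per input line. */
    static List<Map.Entry<String, Integer>> naiveEmit(List<String> lines) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String line : lines) {
            out.add(Map.entry(hostOf(line), 1));
        }
        return out;
    }

    /** Pre-combined mapper: aggregates in memory, emits once per distinct host. */
    static Map<String, Integer> preCombinedEmit(List<String> lines) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String line : lines) {
            counts.merge(hostOf(line), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> lines = List.of(
            "u=0C010003 - http://www.tailtarget.com/home/ - 179.203.156.194",
            "u=12070002 - http://cnn.com/news - 189.19.123.161",
            "u=00AD0e12 - http://www.tailtarget.com/about/ - 187.74.232.127");
        System.out.println("naive records:        " + naiveEmit(lines).size());
        System.out.println("pre-combined records: " + preCombinedEmit(lines).size());
    }
}
```

The naive variant emits as many records as there are input lines, while the pre-combined variant emits one per distinct host, so the gap grows with the number of repeated hosts in each chunk. The trade-off is that the in-memory map has to fit in the task's heap.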

  21. Make sure your architecture is fault tolerant. If Map and Reduce have to start over all the time, you won’t get any work done.

  22. Hadoop/HDFS do not work well with small files. *We’re talking Big Data, remember?
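One way to get a feel for the small-files problem is to estimate how many input splits (and therefore map tasks) the same volume of data produces. The sketch below uses a deliberately simplified model: every file yields at least one split, larger files yield one split per block, and the 128 MB block size is an assumption rather than something stated in the slides. Real Hadoop input formats apply extra rules (for example a slop factor, or combining small files into one split), so treat the numbers as an order-of-magnitude illustration. `SplitCountSketch` and its methods are hypothetical names.

```java
/** Illustration only: simplified input-split count for a set of files. */
final class SplitCountSketch {

    /** Assumed HDFS block size of 128 MB (not taken from the slides). */
    static final long BLOCK_BYTES = 128L * 1024 * 1024;

    /** One split per block, and never fewer than one split per file. */
    static long splitsFor(long fileBytes, long blockBytes) {
        return Math.max(1, (fileBytes + blockBytes - 1) / blockBytes);
    }

    /** Total splits for many files of identical size. */
    static long splitsForMany(long fileCount, long fileBytes, long blockBytes) {
        return fileCount * splitsFor(fileBytes, blockBytes);
    }

    public static void main(String[] args) {
        long smallFile = 100L * 1024;   // 100 KB
        long fileCount = 10_000;
        long total = smallFile * fileCount;
        System.out.println("10,000 small files: "
            + splitsForMany(fileCount, smallFile, BLOCK_BYTES) + " splits");
        System.out.println("same data in one file: "
            + splitsFor(total, BLOCK_BYTES) + " splits");
    }
}
```

Under this model the same bytes cost thousands of task start-ups when stored as small files, and only a handful when stored as one large file.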

  23. Data Map MapReduce PIPELINE Reduce

  24. Processing Pipelines
      u=0C010003 - http://www.tailtarget.com/home/ - 179.203.156.194
      u=12070002 - http://cnn.com/news - 189.19.123.161
      u=00AD0e12 - http://www.tailtarget.com/about/ - 187.74.232.127

      u=0C010003 - http://www.tailtarget.com/home/
      u=12070002 - http://cnn.com/news
      u=00AD0e12 - http://www.tailtarget.com/about/

      http://www.tailtarget.com/home/
      http://cnn.com/news
      http://www.tailtarget.com/about/

      u=0C010003 - Technology
      u=12070002 - News
      u=00AD0e12 - Technology

      http://www.tailtarget.com/home/ - Technology
      http://cnn.com/news - News
      http://www.tailtarget.com/about/ - Technology

  25. MapReduce Pipelines Redis Mahout Managing HDFS HBase Hive Hadoop Pig Optimizing Cascading Crunch Chaining Cassandra MongoDB MySQL

  26. Crunch Pipeline PTable* PCollection* Data Source Data Target Write DoFn 2 DoFn 1 Hadoop Node 1 Hadoop Node 2 HDFS HDFS • PCollection, PTable or PGroupedTable

  27. Crunch Pipeline PTable* PCollection* Data Source Data Target Write DoFn 2 DoFn 1 parallelDo() Hadoop Node 1 Hadoop Node 2 HDFS HDFS • PCollection, PTable or PGroupedTable

  28. Crunch Pipeline PTable* PCollection* Data Source Data Target Write DoFn 1 DoFn 2 DoFn 1 DoFn 1 parallelDo() Hadoop Node 1 Hadoop Node 2 HDFS HDFS • PCollection, PTable or PGroupedTable

  29. Crunch Pipeline PTable* PCollection* Data Source Data Target Data Target Write DoFn 2 DoFn 1 DoFn 1 DoFn 1 parallelDo() Hadoop Node 1 Hadoop Node 2 HDFS HDFS • PCollection, PTable or PGroupedTable

  30. MapReduce Pipelines
      u=0C010003 - http://www.tailtarget.com/home/ - 179.203.156.194
      u=12070002 - http://cnn.com/news - 189.19.123.161
      u=00AD0e12 - http://www.tailtarget.com/about/ - 187.74.232.127

      u=0C010003 - http://www.tailtarget.com/home/
      u=12070002 - http://cnn.com/news
      u=00AD0e12 - http://www.tailtarget.com/about/

      http://www.tailtarget.com/home/
      http://cnn.com/news
      http://www.tailtarget.com/about/

      u=0C010003 - Technology
      u=12070002 - News
      u=00AD0e12 - Technology

      http://www.tailtarget.com/home/ - Technology
      http://cnn.com/news - News
      http://www.tailtarget.com/about/ - Technology

  31. Pipeline Architecture
      1
        u=0C010003 - http://www.tailtarget.com/home/ - 179.203.156.194
        u=12070002 - http://cnn.com/news - 189.19.123.161
        u=00AD0e12 - http://www.tailtarget.com/about/ - 187.74.232.127
      2 3
        http://www.tailtarget.com/home/
        http://cnn.com/news
        http://www.tailtarget.com/about/

        u=0C010003 - http://www.tailtarget.com/home/
        u=12070002 - http://cnn.com/news
        u=00AD0e12 - http://www.tailtarget.com/about/
      4 5
        http://www.tailtarget.com/home/ - Technology
        http://cnn.com/news - News
        http://www.tailtarget.com/about/ - Technology
      Merge
      6
        u=0C010003 - Technology
        u=12070002 - News
        u=00AD0e12 - Technology

  32. Pipeline Architecture
      1
        u=0C010003 - http://www.tailtarget.com/home/ - 179.203.156.194
        u=12070002 - http://cnn.com/news - 189.19.123.161
        u=00AD0e12 - http://www.tailtarget.com/about/ - 187.74.232.127
      2
        {http://www.tailtarget.com/home/, [u=0C010003]}
        {http://cnn.com/news, [u=12070002]}
        {http://www.tailtarget.com/about/, [u=00AD0e12]}
      3
        {http://www.tailtarget.com/home/, [u=0C010003], Technology}
        {http://cnn.com/news, [u=12070002], News}
        {http://www.tailtarget.com/about/, [u=00AD0e12], Technology}
      4
        u=0C010003 - Technology
        u=12070002 - News
        u=00AD0e12 - Technology
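The four stages of this second architecture can be sketched in plain Java to show the data shapes involved: group users by URL, classify each URL once, then fan the category back out to the users. This is not Crunch code and not the authors' implementation; `FusedPipelineSketch`, its methods, and the `classifier` function are hypothetical stand-ins, and the whole thing runs in memory on a single machine.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

/** Illustration only: the single-pass pipeline as in-memory transformations. */
final class FusedPipelineSketch {

    /** Stage 2: group user ids by URL so each URL is classified only once. */
    static Map<String, List<String>> usersByUrl(List<String> logLines) {
        Map<String, List<String>> grouped = new LinkedHashMap<>();
        for (String line : logLines) {
            String[] parts = line.split(" - "); // user, url, ip
            grouped.computeIfAbsent(parts[1], k -> new ArrayList<>()).add(parts[0]);
        }
        return grouped;
    }

    /**
     * Stages 3 and 4: classify each URL once, then emit user -> category.
     * Simplification: a user seen on several URLs keeps only the last category.
     */
    static Map<String, String> categoryByUser(List<String> logLines,
                                              Function<String, String> classifier) {
        Map<String, String> result = new LinkedHashMap<>();
        for (Map.Entry<String, List<String>> entry : usersByUrl(logLines).entrySet()) {
            String category = classifier.apply(entry.getKey());
            for (String user : entry.getValue()) {
                result.put(user, category);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> lines = List.of(
            "u=0C010003 - http://www.tailtarget.com/home/ - 179.203.156.194",
            "u=12070002 - http://cnn.com/news - 189.19.123.161",
            "u=00AD0e12 - http://www.tailtarget.com/about/ - 187.74.232.127");
        // Hypothetical classifier; a real one would inspect the page content.
        Function<String, String> classifier =
            url -> url.contains("cnn.com") ? "News" : "Technology";
        categoryByUser(lines, classifier)
            .forEach((user, category) -> System.out.println(user + " - " + category));
    }
}
```

Carrying the user list along with each URL avoids writing intermediate datasets and merging them later, which is the point of doing as much as possible with data once it has been read.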

  33. If you read data from disk, do as much as you can with it

  34. You have unlimited resources in the cloud. But the costs are unlimited too.

  35. If you want to do magic, a flexible service is the key

  36. Amazon EC2 (costs for 1 year of a Large instance)
      Pricing model   Upfront   Hourly rate     1-year total
      ON-DEMAND       $0        $0.32 / hour    $2,803.20
      RESERVED        $1427     $0.104 / hour   $2,338.04
      SPOT            $0        $0.042 / hour   ≅ $367.92
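The yearly totals on this slide are consistent with the upfront fee plus the hourly rate times 8,760 hours (24 × 365). The sketch below reproduces that arithmetic; `YearlyCostSketch` is a hypothetical name, the prices are the ones quoted on the slide, and the spot figure is only approximate because the spot rate varies over time.

```java
/** Illustration only: one-year cost of a single instance per pricing model. */
final class YearlyCostSketch {

    static final int HOURS_PER_YEAR = 24 * 365; // 8,760 hours

    /** Upfront fee plus hourly rate for an instance running all year. */
    static double yearlyCost(double upfront, double hourlyRate) {
        return upfront + hourlyRate * HOURS_PER_YEAR;
    }

    public static void main(String[] args) {
        System.out.printf("on-demand: %.2f%n", yearlyCost(0, 0.32));
        System.out.printf("reserved:  %.2f%n", yearlyCost(1427, 0.104));
        System.out.printf("spot:      %.2f%n", yearlyCost(0, 0.042));
    }
}
```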

  37. Savings: peaks of 85 servers per day; 60 servers running full time (42 spot); ≈ 62% savings in 1 year

  38. Choosing the instance type well makes a huge difference. Make sure you’re monitoring the price variation over time and use this information for future purchases.

  39. How to use spot instances: prefer shared-nothing architectures; choose instances in different zones; mix with on-demand instances. Remember: you can lose your machine at any time.

  40. Auto Scaling: and how not to go crazy with it? #BeLazy

  41. Execute your actions based on data
