1 / 36

Jimmy Lin The iSchool University of Maryland Sunday, May 31, 2009

Data-Intensive Text Processing with MapReduce. (Bonus session). Tutorial at 2009 North American Chapter of the Association for Computational Linguistics―Human Language Technologies Conference (NAACL HLT 2009). Jimmy Lin The iSchool University of Maryland Sunday, May 31, 2009. Chris Dyer

Download Presentation

Jimmy Lin The iSchool University of Maryland Sunday, May 31, 2009

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Data-Intensive Text Processing with MapReduce (Bonus session) Tutorial at 2009 North American Chapter of the Association for Computational Linguistics―Human Language Technologies Conference (NAACL HLT 2009) Jimmy LinThe iSchool University of Maryland Sunday, May 31, 2009 Chris Dyer Department of Linguistics University of Maryland This work is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 United StatesSee http://creativecommons.org/licenses/by-nc-sa/3.0/us/ for details

  2. Agenda • Hadoop “nuts and bolts” • “Hello World” Hadoop example(distributed word count) • Running Hadoop in “standalone” mode • Running Hadoop on EC2 • Open-source Hadoop ecosystem • Exercises and “office hours”

  3. Hadoop “nuts and bolts”

  4. Source: http://davidzinger.wordpress.com/2007/05/page/2/

  5. Hadoop Zen • Don’t get frustrated (take a deep breath)… • Remember this when you experience those W$*#T@F! moments • This is bleeding edge technology: • Lots of bugs • Stability issues • Even lost data • To upgrade or not to upgrade (damned either way)? • Poor documentation (or none) • But… Hadoop is the path to data nirvana?

  6. Cloud9 • Library used for teaching cloud computing courses at Maryland • Demos, sample code, etc. • Computing conditional probabilities • Pairs vs. stripes • Complex data types • Boilerplate code for working various IR collections • Dog food for research • Open source, anonymous svn access

  7. Master node Client JobTracker TaskTracker TaskTracker TaskTracker Slave node Slave node Slave node

  8. From Theory to Practice 1. Scp data to cluster 2. Move data into HDFS 3. Develop code locally 4. Submit MapReduce job 4a. Go back to Step 3 Hadoop Cluster You 5. Move data out of HDFS 6. Scp data from cluster

  9. Data Types in Hadoop Writable Defines a de/serialization protocol. Every data type in Hadoop is a Writable. WritableComprable Defines a sort order. All keys must be of this type (but not values). Concrete classes for different data types. IntWritableLongWritable Text …

  10. Complex Data Types in Hadoop • How do you implement complex data types? • The easiest way: • Encoded it as Text, e.g., (a, b) = “a:b” • Use regular expressions to parse and extract data • Works, but pretty hack-ish • The hard way: • Define a custom implementation of WritableComprable • Must implement: readFields, write, compareTo • Computationally efficient, but slow for rapid prototyping • Alternatives: • Cloud9 offers two other choices: Tuple and JSON • Plus, a number of frequently-used data types

  11. Input file (on HDFS) InputSplit InputFormat RecordReader Mapper Partitioner Reducer RecordWriter OutputFormat Output file (on HDFS)

  12. What version should I use?

  13. “Hello World” Hadoop example

  14. Hadoop in “standalone” mode

  15. Hadoop in EC2

  16. From Theory to Practice 1. Scp data to cluster 2. Move data into HDFS 3. Develop code locally 4. Submit MapReduce job 4a. Go back to Step 3 Hadoop Cluster You 5. Move data out of HDFS 6. Scp data from cluster

  17. On Amazon: With EC2 0. Allocate Hadoop cluster 1. Scp data to cluster 2. Move data into HDFS EC2 3. Develop code locally 4. Submit MapReduce job 4a. Go back to Step 3 Your Hadoop Cluster You 5. Move data out of HDFS 6. Scp data from cluster 7. Clean up! Uh oh. Where did the data go?

  18. On Amazon: EC2 and S3 Copy from S3 to HDFS S3(Persistent Store) EC2(The Cloud) Your Hadoop Cluster Copy from HFDS to S3

  19. Open-source Hadoop ecosystem

  20. Hadoop/HDFS

  21. Hadoop streaming

  22. HDFS/FUSE

  23. EC2/S3/EBS

  24. EMR

  25. Pig

  26. HBase

  27. Hypertable

  28. Hive

  29. Mahout

  30. Cassandra

  31. Dryad

  32. CUDA

  33. CELL

  34. Beware of toys!

  35. Exercises

  36. Questions? Comments? Thanks to the organizations who support our work:

More Related