
Hadoop (MapReduce) in the Wild —— Our current understandings & uses of Hadoop


Presentation Transcript


  1. Hadoop (MapReduce) in the Wild —— Our current understandings & uses of Hadoop. Le Zhao, Changkuk Yoo, Mark Hoy, Jamie Callan. Presenter: Le Zhao. 2008-05-08

  2. The REAP Project
  • REAP is an intelligent tutor for English language learning
  • Intelligent tutors often use student models to generate individualized instruction for each student
  • REAP cannot generate texts, but it can recognize texts
  • POS tagging, NE tagging, search, categorization, …
  • E.g., this Indri-style structured query filters candidate texts by text quality, reading level, and document length:
  • #filreq( #band( #greater( textquality 85 ) #greater( readinglevel 2 ) #less( readinglevel 9 ) #less( doclength 1001 ) ) #combine( advocate evidence destroy propose … acceptance ) )
  • So, REAP needs a very large database of annotated texts
  • The previous approach to collecting texts didn't scale well
  • Gathered ~1 M high-quality docs in ~1 year
  • Typical yield rate <1% for docs fitting the tutoring criteria

  3. Tasks Done on Hadoop
  • Web crawling of 200 million Web documents
  • Two web crawls of 100 million web pages each
  • Text annotation and categorization of Web pages
  • Part-of-speech, named-entity, sentence breaking
  • Reading level, text quality, topic classification
  • Output: {6 TB} + {42 GB} of offset annotations
  • Filtering documents according to text quality
  • Output: 6 million high-quality HTML documents (114 GB)
  • Generating graph structure & PageRank
  • Class project

  4. Getting Started With Hadoop Quickly
  Hadoop Streaming has been our most important tool for porting legacy tasks and tools to Hadoop
  • Runs any program that reads STDIN and writes STDOUT
  • No need to recompile or relink with Hadoop libraries
  • For 1-file/record streaming, not the most efficient implementation
  • Poor data locality
  • But very efficient in human time
  • A day or two to get something running on 100 nodes (a minimal mapper sketch follows)
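For concreteness, here is a minimal sketch of such a streaming mapper. This is a reconstruction, not code from the talk; the jar location, the HDFS paths, and the old-style -jobconf flag are all era- and version-dependent.

    #!/usr/bin/perl
    # Minimal Hadoop Streaming mapper sketch (a reconstruction, not code from the talk).
    # A 0.16-era launch looked roughly like this (jar and HDFS paths are hypothetical):
    #   hadoop jar contrib/streaming/hadoop-streaming.jar \
    #     -input /user/you/in -output /user/you/out \
    #     -mapper map.pl -file map.pl \
    #     -jobconf mapred.reduce.tasks=0
    use strict;
    use warnings;

    while (my $line = <STDIN>) {   # Streaming feeds one input record per line
        chomp $line;
        # ... any legacy per-record processing goes here ...
        print "$line\n";           # lines written to STDOUT become output records
    }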

  5. Hadoop in the Wild: Trick #1
  • Q: My annotator takes file input, not STDIN
  • Solution: still Hadoop Streaming
  • Prepare a list of filenames
  • Distribute the filenames instead of the file contents
  • map.pl (see the sketch after this list)
  • takes one filename
  • downloads the file from HDFS (Hadoop Distributed File System)
  • applies the annotator
  • uploads the resulting files to HDFS
  • No reducer needed
  • Can port any data-distributive program onto Hadoop in a day
  • Efficient enough for computation-intensive tasks
  • Even with low data locality
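A hedged sketch of such a map.pl follows; "./annotator" is a hypothetical stand-in for any legacy file-in/file-out tool, and "hadoop dfs" is the 0.16-era spelling of the HDFS shell (newer versions use "hadoop fs").

    #!/usr/bin/perl
    # map.pl -- Trick #1 sketch: input records are HDFS file names, not file contents.
    # "./annotator" is a hypothetical stand-in for any legacy file-in/file-out tool.
    use strict;
    use warnings;
    use File::Basename;

    while (my $path = <STDIN>) {
        chomp $path;
        my $local = basename($path);
        # 1. download the file from HDFS ("hadoop dfs" in the 0.16 era; "hadoop fs" later)
        system("hadoop", "dfs", "-get", $path, $local) == 0 or next;
        # 2. apply the annotator (file in, file out)
        system("./annotator", $local, "$local.out") == 0 or next;
        # 3. upload the result back to HDFS
        system("hadoop", "dfs", "-put", "$local.out", "$path.out");
        unlink($local, "$local.out");   # keep the task's scratch directory clean
    }

Because each map record is just a short filename, the map input is tiny; the real data moves through explicit HDFS gets and puts, which is why data locality is poor but human time is saved.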

  6. Trick #2
  • Q: My annotator is a directory of programs, but Hadoop Streaming only accepts files.
  • Solution: still Hadoop Streaming
  • Make a tarball of your directory of programs
  • map.pl then extracts the tarball & launches the program (sketch below)
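A sketch of that map.pl, with hypothetical names (tools.tar.gz, ./tools/run.sh); the tarball is shipped alongside the script, e.g. with Streaming's -file option (some versions of that era could also unpack archives automatically via -cacheArchive).

    #!/usr/bin/perl
    # map.pl -- Trick #2 sketch (hypothetical names: tools.tar.gz, ./tools/run.sh).
    # Ship the tarball with the job, e.g.:  -mapper map.pl -file map.pl -file tools.tar.gz
    use strict;
    use warnings;

    # Extract once into the task's working directory ...
    system("tar", "xzf", "tools.tar.gz") == 0
        or die "cannot extract tools.tar.gz: $?";
    # ... then hand STDIN to the extracted program; its STDOUT becomes the map output.
    exec("./tools/run.sh") or die "cannot launch ./tools/run.sh: $!";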

  7. Trick #3
  • Q: Hadoop programs run on backend nodes and are difficult to debug
  • Use STDERR for debugging (example below)
  • Also, if using HOD (Hadoop on Demand) to manage the cluster:
  • View STDERR through the Web monitoring interface
  • See the time spent on each Map/Reduce task
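As an illustration: anything a streaming task writes to STDERR lands in its task log, which the web UI exposes; Streaming versions of that era also treated special reporter: lines on STDERR as status and counter updates. A sketch (the counter group/name are hypothetical):

    #!/usr/bin/perl
    # Sketch: debugging a streaming mapper through STDERR.
    use strict;
    use warnings;

    my $n = 0;
    while (my $line = <STDIN>) {
        if ($n % 10000 == 0) {
            print STDERR "DEBUG: at record $n\n";                  # plain debug text, visible in the task log
            print STDERR "reporter:status:processed $n records\n"; # special form: updates the task status in the UI
        }
        print STDERR "reporter:counter:MyJob,Records,1\n";         # special form: increments a job counter
        $n++;
        print $line;   # pass the record through unchanged
    }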

  8. Pitfall #1: It's All a Matter of Balance
  • For higher performance, it is important to have the right balance between Map & Reduce tasks
  • The default number of Map/Reduce processes per node is 2
  • But some multicore / multiprocessor nodes can easily handle more (e.g., 6 on M45)
  • There is no good way to determine the right balance, except by parameter sweeps (example config below)
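In that era the per-node slot counts were hadoop-site.xml properties, so a parameter sweep just re-runs the job under different settings; the values below are hypothetical starting points, not recommendations from the talk.

    <!-- hadoop-site.xml (0.x era): raise the per-node task slots from the default of 2 -->
    <property>
      <name>mapred.tasktracker.map.tasks.maximum</name>
      <value>6</value>   <!-- hypothetical value, e.g. for an M45-like multicore node -->
    </property>
    <property>
      <name>mapred.tasktracker.reduce.tasks.maximum</name>
      <value>4</value>   <!-- hypothetical value -->
    </property>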

  9. Pitfall #2: Things Die, No Idea Why
  • Fault tolerance and diagnosis
  • If a Reduce task becomes unresponsive, it is killed
  • E.g., if it is overwhelmed with work
  • E.g., if its sort task is overwhelmed with work
  • Diagnosing the cause of an unresponsive Reduce process is not always easy
  • Sometimes solved by increasing the number of reducers (see the sketch below for keeping a slow reducer alive)
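One hedged mitigation from that era: a task counts as unresponsive when it reports no progress for mapred.task.timeout (10 minutes by default), so a slow reducer can be kept alive by periodically reporting status, assuming a Streaming version that understands reporter: lines on STDERR.

    #!/usr/bin/perl
    # Sketch: keep a slow streaming reducer from being killed as unresponsive.
    # Assumes the Streaming reporter: convention; the 60s interval is arbitrary.
    use strict;
    use warnings;

    my ($count, $last) = (0, time());
    while (my $line = <STDIN>) {
        # ... expensive per-record work goes here ...
        $count++;
        if (time() - $last >= 60) {   # heartbeat at most once a minute
            print STDERR "reporter:status:alive, $count records done\n";  # counts as progress
            $last = time();
        }
    }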

  10. Unsolved Problems
  • Monitoring the cluster for diagnostics
  • CPU, network, disk I/O, swap, etc.
  • There is a Simon web interface, but it is not working for us
  • HOD (or Torque?) does not allow scheduling and prioritizing jobs
  • Reduce happens on a few nodes; the waiting time for the other, idle nodes can be long
  • Shuffle & sort is opaque
  • Yet another black box

  11. Thanks! Comments? Ideas?
