MapReduce Programming and Cluster Accessing Instructions
Gang Luo, Sept. 2, 2010
Dataflow
(K1, V1) -> map -> (K2, V2) -> shuffle and sort -> (K2, List<V2>) -> reduce -> (K3, V3)
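The dataflow above can be sketched in plain Python to make the key/value types concrete. This is only an in-memory illustration, not Hadoop code: the function `run_mapreduce` and the word-count mapper and reducer are hypothetical names chosen for this example.

```python
from collections import defaultdict

def run_mapreduce(records, mapper, reducer):
    """Simulate the MapReduce dataflow in memory (illustration only)."""
    # Map: each (K1, V1) record yields zero or more (K2, V2) pairs.
    grouped = defaultdict(list)
    for k1, v1 in records:
        for k2, v2 in mapper(k1, v1):
            # Shuffle: collect all V2 values that share the same K2.
            grouped[k2].append(v2)
    # Reduce: each (K2, List<V2>) group yields zero or more (K3, V3) pairs.
    output = []
    for k2 in sorted(grouped):
        output.extend(reducer(k2, grouped[k2]))
    return output

def word_mapper(line_number, line):
    # (K1, V1) = (line number, line text); emits (word, 1).
    for word in line.split():
        yield word, 1

def count_reducer(word, counts):
    # (K2, List<V2>) = (word, [1, 1, ...]); emits (word, total).
    yield word, sum(counts)

lines = [(0, "a b"), (1, "b c")]
word_counts = run_mapreduce(lines, word_mapper, count_reducer)
print(word_counts)
```

On a real cluster the map and reduce calls run in parallel on different nodes and the shuffle moves data over the network; the sequential loop here only mirrors the order of the stages.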
A Query Example
Table1
SELECT Year, MAX(Temperature) FROM Table1 WHERE AirQuality IN (0, 1, 4, 5, 9) GROUP BY Year
Implementation in MapReduce
• Map (selection + projection): (1998, 87, 2, …) -> (1998, 87)
• Reduce (aggregation, MAX): (1998, [87, 94, 84, 87, 78]) -> (1998, 94)
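As a rough sketch of how the query splits into the two phases, the plain-Python version below filters and projects in the map step and takes the maximum in the reduce step. It is not the course's Hadoop code; the record layout (dictionaries with `year`, `temperature`, and `air_quality` fields) and the sample rows are assumptions made up for illustration.

```python
from collections import defaultdict

# Quality codes accepted by the WHERE clause of the query.
GOOD_QUALITY = {0, 1, 4, 5, 9}

def map_record(record):
    # Selection: keep only rows with an accepted air-quality code.
    # Projection: emit just (year, temperature).
    if record["air_quality"] in GOOD_QUALITY:
        yield record["year"], record["temperature"]

def reduce_max(year, temperatures):
    # Aggregation: MAX over all temperatures grouped under one year.
    return year, max(temperatures)

def max_temperature_by_year(records):
    grouped = defaultdict(list)  # stands in for the shuffle stage
    for record in records:
        for year, temperature in map_record(record):
            grouped[year].append(temperature)
    return [reduce_max(year, temps) for year, temps in sorted(grouped.items())]

# Hypothetical sample rows; the real table has more columns.
sample = [
    {"year": 1998, "temperature": 87, "air_quality": 1},
    {"year": 1998, "temperature": 94, "air_quality": 0},
    {"year": 1998, "temperature": 84, "air_quality": 4},
    {"year": 1998, "temperature": 99, "air_quality": 2},  # filtered out
    {"year": 1999, "temperature": 70, "air_quality": 9},
]
max_by_year = max_temperature_by_year(sample)
print(max_by_year)
```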
Think more!
• What if we want to get the average temperature for a year?
• What if you are only interested in the temperature in Durham? (Assume the station ID at Durham is 212.)
You may want to change the code a little to answer a different query.
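One possible way to approach both questions, sketched in plain Python rather than Hadoop: the reduce step computes an average instead of a maximum, and the map step optionally filters on the station ID. The record fields (`station_id`, `year`, `temperature`, `air_quality`) and the sample rows are assumed for illustration, and keeping the air-quality filter for these variants is also an assumption.

```python
from collections import defaultdict

GOOD_QUALITY = {0, 1, 4, 5, 9}
DURHAM_STATION_ID = 212  # station ID given on the slide

def map_record(record, station_id=None):
    # Selection on air quality, and on station when one is requested.
    if record["air_quality"] not in GOOD_QUALITY:
        return
    if station_id is not None and record["station_id"] != station_id:
        return
    yield record["year"], record["temperature"]

def reduce_average(year, temperatures):
    # Aggregation: AVG instead of MAX.
    return year, sum(temperatures) / len(temperatures)

def average_temperature_by_year(records, station_id=None):
    grouped = defaultdict(list)  # stands in for the shuffle stage
    for record in records:
        for year, temperature in map_record(record, station_id):
            grouped[year].append(temperature)
    return [reduce_average(y, temps) for y, temps in sorted(grouped.items())]

# Hypothetical sample rows.
sample = [
    {"station_id": 212, "year": 1998, "temperature": 80, "air_quality": 1},
    {"station_id": 212, "year": 1998, "temperature": 90, "air_quality": 0},
    {"station_id": 300, "year": 1998, "temperature": 70, "air_quality": 5},
    {"station_id": 212, "year": 1998, "temperature": 200, "air_quality": 2},
]
all_stations = average_temperature_by_year(sample)
durham_only = average_temperature_by_year(sample, DURHAM_STATION_ID)
print(all_stations, durham_only)
```

Unlike MAX, an average of partial averages is not the overall average in general, so if partial aggregation were added before the reduce step it would need to pass (sum, count) pairs rather than averages.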
Hadoop Cluster
• Master node: hadoop21.cs.duke.edu
• Slave nodes: hadoop22.cs.duke.edu – hadoop36.cs.duke.edu
• Online job tracker*: hadoop21.cs.duke.edu:50030
• Online HDFS info*: hadoop21.cs.duke.edu:50070
*You cannot access these pages from outside the CS trusted network. Solutions:
1. ssh to any node and use lynx.
2. Open an "ssh -D port" connection to any node and set it as the SOCKS proxy in your browser.
Now, let's see how to compile and run a MapReduce job on the cluster. What I will be showing you is covered by the instructions on the course website: http://www.cs.duke.edu/courses/fall10/cps216/Project/cluster_instruction