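Large-Scale Data Processing / Cloud Computing Lecture 2 – "Hello World" in Hadoop. Peng Bo (彭波), School of Electronics Engineering and Computer Science, Peking University 7/3/2014 http://net.pku.edu.cn/~course/cs402/. Jimmy Lin University of Maryland. SEWMGroup.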
This work is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 United States license. See http://creativecommons.org/licenses/by-nc-sa/3.0/us/ for details
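CodeLab1 • Difficulties encountered • Not familiar with Java! • Setting up the development and runtime environment? (Eclipse, Hadoop) • Code in the guide fails to compile? • Errors at runtime? • ......
The code given in the PDF doesn't seem to work, but the code you get by clicking "source code here" does... uh... though the results I got are different from those in the PDF... • The method setInputPath(Path) is undefined for the type JobConf WordCount/src WordCount.java line 21 1404272734726 310 I don't know why. • It won't compile, please help • FileInputPath cannot be resolved, FileOutputPath cannot be resolved. What is going on here? • Exception in thread "main" java.io.IOException: Cannot run program "chmod": CreateProcess error=2, ????????? This is the error I got when running it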
Historical background • The C programming language • early 1970s • UNIX • The C++ programming language • early 1980s • object-oriented • a wide variety of application programming • The Java programming language • early 1990s • originally for consumer electronic devices • enterprise application development
Java SDK • Software Development Kit • a group of command-line tools and packages that you will need to write and run Java programs • base classes (Library)
Working with the SDK • Factorial • input: a value as a command-line argument • output: factorial of that number OR exception • Java Specification • every Java source code file must have the exact same name as the class that is defined inside of it
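The Factorial exercise above could be sketched roughly as follows. This is a minimal illustration, not the course's reference solution; the usage message and the overflow limit of 20 are assumptions added here.

```java
// Factorial.java: prints the factorial of the number given on the command line.
class Factorial {
    // Computes n! iteratively. Rejects negative input and values whose
    // factorial would not fit in a long (an assumption of this sketch).
    static long factorial(int n) {
        if (n < 0) {
            throw new IllegalArgumentException("n must be non-negative: " + n);
        }
        if (n > 20) {
            throw new IllegalArgumentException("n! overflows long for n > 20: " + n);
        }
        long result = 1;
        for (int i = 2; i <= n; i++) {
            result *= i;
        }
        return result;
    }

    public static void main(String[] args) {
        if (args.length != 1) {
            System.err.println("usage: java Factorial <non-negative integer>");
            return;
        }
        // args[0] is the first argument, not the program name.
        // Integer.parseInt throws NumberFormatException on non-numeric input.
        int n = Integer.parseInt(args[0]);
        System.out.println(n + "! = " + factorial(n));
    }
}
```

Compile with `javac Factorial.java` and run with `java Factorial 5`. If the class is declared `public`, the file must be named `Factorial.java`.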
Primitive data types • char • 16 bits • Unicode character set • escape sequences
Primitive data types • integer types • signed • exact size
Primitive data types • The floating-point types • IEEE 754 floating-point values
Primitive data types • The boolean type • true, false
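A short sketch of the primitive types listed on the preceding slides. The class and method names are hypothetical; the sizes come from the standard `SIZE` constants on the wrapper classes.

```java
class PrimitiveDemo {
    // Bit widths of char, byte, short, int, and long, in that order.
    static int[] bitWidths() {
        return new int[] {Character.SIZE, Byte.SIZE, Short.SIZE, Integer.SIZE, Long.SIZE};
    }

    // Signed integer arithmetic wraps around on overflow instead of failing.
    static boolean intOverflowWraps() {
        return Integer.MAX_VALUE + 1 == Integer.MIN_VALUE;
    }

    public static void main(String[] args) {
        char letter = 'A';        // a 16-bit Unicode code unit
        char tab = '\t';          // escape sequence
        char omega = '\u03A9';    // Unicode escape (Greek capital omega)
        double third = 1.0 / 3.0; // IEEE 754 double precision
        boolean done = false;     // boolean holds only true or false

        System.out.println(java.util.Arrays.toString(bitWidths()));
        System.out.println((int) letter + " " + (int) tab + " " + (int) omega);
        System.out.println(third + " " + done + " " + intOverflowWraps());
    }
}
```

Unlike C and C++, these sizes are fixed by the language and do not depend on the platform.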
Operators • + is overloaded • If you use the + operator with a String and another operand that is not a String, the other operand is converted into a String
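A small sketch of how `+` behaves with a String operand. The class and method names are hypothetical. Evaluation is left to right, so parentheses matter.

```java
class ConcatDemo {
    // "sum: " + 1 yields a String first, then 2 is appended as text.
    static String leftToRight() {
        return "sum: " + 1 + 2;
    }

    // The parentheses force integer addition before the conversion to String.
    static String grouped() {
        return "sum: " + (1 + 2);
    }

    public static void main(String[] args) {
        System.out.println(leftToRight()); // sum: 12
        System.out.println(grouped());     // sum: 3
    }
}
```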
C/C++ functions versus Java methods • In Java terminology, functions are called methods. • Methods can only be declared as members of a class; you can't define a method outside of a Java class
Arrays • objects, so they are created using the new operator • scores.length • the bracket characters ([ ]) that are used to indicate arrays are bound to the array type, not the array name • java.lang.ArrayIndexOutOfBoundsException
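The array points above can be illustrated with a short sketch. The `scores` name follows the slide; the class name and values are made up for illustration.

```java
class ArrayDemo {
    // Sums an array using its length field (a field, not a method call).
    static int sum(int[] scores) {
        int total = 0;
        for (int i = 0; i < scores.length; i++) {
            total += scores[i];
        }
        return total;
    }

    public static void main(String[] args) {
        // The brackets belong to the type: scores is an "int[]".
        int[] scores = new int[3];
        scores[0] = 90;
        scores[1] = 85;
        scores[2] = 70;
        System.out.println(sum(scores));

        // Every access is bounds-checked at run time.
        try {
            scores[3] = 100;
        } catch (ArrayIndexOutOfBoundsException e) {
            System.out.println("index 3 is out of bounds for length " + scores.length);
        }
    }
}
```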
Strings • objects of the String class • String objects are immutable • same string literals • String class has a rich interface
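A sketch of immutability and shared string literals, using hypothetical names. Identical literals refer to the same object, while `new String(...)` always creates a distinct one, which is why `equals()` is the right way to compare contents.

```java
class StringDemo {
    // Identical string literals are interned, so both names refer to one object.
    static boolean literalsShareObject() {
        String a = "hadoop";
        String b = "hadoop";
        return a == b;
    }

    // new String(...) creates a separate object with equal contents.
    static boolean copyIsDistinctButEqual() {
        String a = "hadoop";
        String c = new String("hadoop");
        return a != c && a.equals(c);
    }

    public static void main(String[] args) {
        String s = "hadoop";
        String upper = s.toUpperCase(); // returns a new String; s is unchanged
        System.out.println(s + " " + upper);
        System.out.println(literalsShareObject());
        System.out.println(copyIsDistinctButEqual());
    }
}
```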
The main() method • a strict naming convention • first element in the array is the first argument, not the name of the program.
Other differences • Pointers: • Java references are pointers to Java objects • cannot be incremented or decremented • no address-of operator • Global variables • no way to declare global variables (or methods) • no struct, union, or typedef (enum was added in Java 5) • Freely placed methods • Garbage collection • no malloc() and free()
Defining a Java class • Each member must have its own public or private modifier • You don't use semicolons (;) after the closing braces in class and method definitions. • The main() method is a member of the class • You call the constructor using the new keyword
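A minimal class definition showing these points. The `Counter` class is a hypothetical example, not taken from the course material.

```java
class Counter {
    // Each member carries its own access modifier.
    private int count;

    // Constructor: same name as the class, no return type.
    public Counter(int start) {
        count = start;
    }

    public void increment() {
        count++;
    }

    public int getCount() {
        return count;
    }

    // main() is an ordinary static member of the class.
    public static void main(String[] args) {
        Counter c = new Counter(10); // the constructor is invoked through new
        c.increment();
        System.out.println(c.getCount()); // 11
    }
} // no semicolon after the closing brace
```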
access modifiers • public • private • protected • package access
Inheritance • extends • super()
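A short sketch of `extends` and `super()`, with hypothetical `Shape` and `Square` classes.

```java
class Shape {
    private final String name;

    Shape(String name) {
        this.name = name;
    }

    // Subclasses override this to supply their own area.
    double area() {
        return 0.0;
    }

    String describe() {
        return name + " with area " + area();
    }
}

class Square extends Shape {
    private final double side;

    Square(double side) {
        super("square"); // must be the first statement: runs Shape's constructor
        this.side = side;
    }

    @Override
    double area() {
        return side * side;
    }
}

class InheritanceDemo {
    public static void main(String[] args) {
        Shape s = new Square(2.0);        // a Square can be used wherever a Shape is expected
        System.out.println(s.describe()); // square with area 4.0
    }
}
```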
The Object class • All Java classes are ultimately subclasses of class Object • a centrally rooted class hierarchy • usage • toString() • a data structure defined to take objects of class Object can hold any Java object (compare with C++ templates)
Interfaces • All interfaces are implicitly abstract • All members of an interface are implicitly public • All fields defined in an interface are implicitly static and final • A Java class can extend only one class, but it can implement any number of interfaces • Best practice for polymorphism
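The interface rules above can be sketched as follows. All names here (`Greeter`, `Counted`, `PoliteGreeter`) are hypothetical.

```java
interface Greeter {
    String PREFIX = "Hello, ";  // implicitly public static final
    String greet(String name);  // implicitly public abstract
}

interface Counted {
    int uses();
}

// One class may implement any number of interfaces.
class PoliteGreeter implements Greeter, Counted {
    private int uses;

    // Implementations must be declared public, matching the interface.
    public String greet(String name) {
        uses++;
        return PREFIX + name;
    }

    public int uses() {
        return uses;
    }
}

class InterfaceDemo {
    public static void main(String[] args) {
        Greeter g = new PoliteGreeter(); // program against the interface type
        System.out.println(g.greet("Hadoop"));
        System.out.println(((Counted) g).uses());
    }
}
```

Declaring variables by interface type, as `g` is here, is what lets callers swap implementations without changing their code.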
More on objects • Inner classes and inner interfaces • Anonymous classes and objects
Using Library (Java API) • In the Java API, classes are grouped into packages • you have already been using classes from a default package, java.lang, when calling System.out.println() • import java.util.ArrayList; or java.util.ArrayList<xx> list = ....
Data Structures • java.util.* • java generics
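A small sketch of `java.util` collections with generics, counting words in memory. The `WordTally` class is a hypothetical example chosen because it mirrors the word-count problem used later with Hadoop.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class WordTally {
    // Generic type parameters let the compiler check element types,
    // so no casts are needed when reading from the map.
    static Map<String, Integer> tally(List<String> words) {
        Map<String, Integer> counts = new HashMap<>();
        for (String word : words) {
            Integer current = counts.get(word);
            counts.put(word, current == null ? 1 : current + 1);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("hello", "world", "hello");
        Map<String, Integer> counts = tally(words);
        System.out.println("hello=" + counts.get("hello") + ", world=" + counts.get("world"));
    }
}
```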
Deploying your application • A Java program is a bunch of classes. • A JAR file is a Java Archive • create a manifest.txt that states which class has the main() method • Main-Class: MyApp • use the jar tool to package all class files and manifest.txt • $ jar -cvmf manifest.txt app.jar *.class • $ java -jar app.jar
Package • put your classes in packages • java.util, java.net, java.text .... • prefix your package name with your reversed domain name • set up a matching directory structure
References • Java Programming for C/C++ Developers • Head First Java
What is MapReduce? • Programming model for expressing distributed computations at a massive scale • Execution framework for organizing and performing such computations • Open-source implementation called Hadoop
Brief History of Hadoop • Hadoop was created by Doug Cutting, the creator of Apache Lucene and Nutch • 2003: Google published GFS • 2004: Google published MapReduce • 2005: Nutch ported to MapReduce/HDFS • 2006: Cutting joined Yahoo! • January 2008: Hadoop became a top-level project at Apache • February 2008: Hadoop ran on a 10,000-core cluster
New MapReduce API • favors abstract classes over interfaces • new API in org.apache.hadoop.mapreduce, old in org.apache.hadoop.mapred • new Context class • unifies the roles of JobConf, OutputCollector, and Reporter • new Job class • takes over job control from JobClient • reduce() method passes values differently • new: java.lang.Iterable, for (VALUEIN value : values) { ... } • old: java.util.Iterator, hasNext(), next()
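The difference in how `reduce()` receives its values can be illustrated without Hadoop on the classpath. The sketch below uses plain Java and hypothetical method names; it mimics only the iteration style of the two APIs, not the real Hadoop reducer signatures.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

class ReduceStyles {
    // Old-API style: values arrive as an Iterator, walked with hasNext()/next().
    static int sumOldStyle(Iterator<Integer> values) {
        int sum = 0;
        while (values.hasNext()) {
            sum += values.next();
        }
        return sum;
    }

    // New-API style: values arrive as an Iterable, so a for-each loop works.
    static int sumNewStyle(Iterable<Integer> values) {
        int sum = 0;
        for (int value : values) {
            sum += value;
        }
        return sum;
    }

    public static void main(String[] args) {
        List<Integer> counts = Arrays.asList(1, 1, 1);
        System.out.println(sumOldStyle(counts.iterator()));
        System.out.println(sumNewStyle(counts));
    }
}
```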
Hadoop Streaming & Pipes • Streaming • support any programming language, even shell scripts • uses standard input and output to communicate with the map and reduce code • Pipes • C++ interface to Hadoop MapReduce • uses sockets as the communication channel
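Because Streaming talks to the map and reduce code only through standard input and output, a mapper can be any program that reads lines and prints records. Below is a rough word-count mapper sketch; the class name is hypothetical, and the tab-separated `key<TAB>value` output format is assumed to be the Streaming default.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.List;

class StreamingWordMapper {
    // Turns one input line into "word<TAB>1" records, one per word.
    static List<String> mapLine(String line) {
        List<String> records = new ArrayList<>();
        for (String word : line.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                records.add(word + "\t1");
            }
        }
        return records;
    }

    public static void main(String[] args) throws IOException {
        // Read input records from stdin and write key/value pairs to stdout.
        BufferedReader in = new BufferedReader(new InputStreamReader(System.in));
        String line;
        while ((line = in.readLine()) != null) {
            for (String record : mapLine(line)) {
                System.out.println(record);
            }
        }
    }
}
```

The same logic could be written as a shell or Python script; nothing in it depends on Hadoop classes.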
Hadoop Command • docs in distribution • api • tutorial • hadoop • -conf xxx
Changping Cluster • 28 Nodes, 12 Cores/48GB RAM/10T DISK • Namenode/JobTracker server - changping11 • ip : 222.29.134.11 • hdfs port : 9000 • mapreduce port: 9001
How to use ChangpingCluster • 1. Add a hostname mapping • Windows: edit the file C:\WINDOWS\system32\drivers\etc\hosts, • Linux: /etc/hosts and add the following line: 222.29.134.11 changping11 • Otherwise, running a job will report a name resolution error
How to use ChangpingCluster • 2. Identity settings • 1). Write all output files under the "/cs402/YourName" directory • In code: FileOutputFormat.setOutputPath(conf, new Path("/cs402/YourName")); • 2). In the Mapred Location, set hadoop.job.ugi = YourName, cs402 • The user name must match the name in the file path above, • The group name must be cs402 • Alternatively, set it directly in the driver program. • Configuration conf = new Configuration(); • conf.set("hadoop.job.ugi", "YourName,cs402");
References • Tom White, Hadoop: The Definitive Guide, 3rd edition, O'Reilly, May 2012.