Large-Scale Data Processing / Cloud Computing
Lecture 4 – Word Co-occurrence Matrix
彭波 (Peng Bo), School of Electronics Engineering and Computer Science, Peking University, 7/10/2014
http://net.pku.edu.cn/~course/cs402/
Jimmy Lin, University of Maryland. SEWMGroup.
This work is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 United States License. See http://creativecommons.org/licenses/by-nc-sa/3.0/us/ for details.
Problems & Solutions
• Mac OS: notes on configuring under Mac OS X, by Xin Lv
• Eclipse: notes on connecting Eclipse 3.7 Indigo to Hadoop, by 朱瑜坚
• Linux: notes on manually configuring and running Hadoop under Linux, by Haoyan Huo
• VMPlayer: still missing :)
Homework Submission
• What to hand in
• Pack the ACCEPTED source code from the online evaluation into a single rar/tar.gz file, name it "assign1-YourPinYinName.rar" or "assign1-YourPinYinName.tar.gz", and send the package to our TA by email (cs402.pku AT gmail.com) with "CS40214-Assign1-YourPinYinName" as the subject.
Changping11 Usage Guidelines
• hadoop.job.ugi = YourName, cs402
• Input data lives under /public
• Data you upload yourself goes in your personal directory
• Output data must go in your personal directory under /cs402: /cs402/YourName
• Do not use the default /user/YourName
Hadoop Streaming • Hadoop streaming is a utility that comes with the Hadoop distribution. The utility allows you to create and run Map/Reduce jobs with any executable or script as the mapper and/or the reducer.
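As a rough sketch of launching a streaming job (the jar location varies by Hadoop version and installation; /bin/cat and /usr/bin/wc are stand-ins for your real mapper and reducer executables, and the output path follows the Changping11 rules above):

    hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming.jar \
        -input /public/Shakespeare \
        -output /cs402/YourName/streaming-out \
        -mapper /bin/cat \
        -reducer /usr/bin/wc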
How Does Streaming Work?
• Both the mapper and the reducer are executables that read input from stdin (line by line) and emit output to stdout.
• By default, the prefix of a line up to the first tab character is the key, and the rest of the line (excluding the tab character) is the value.
More Features
• Specifying Other Plugins for Jobs
  • -inputformat JavaClassName
  • -outputformat JavaClassName
  • -partitioner JavaClassName
  • -combiner JavaClassName
• Specifying Additional Configuration Variables for Jobs
• Customizing the Way to Split Lines into Key/Value Pairs
• A Useful Partitioner Class
• A Useful Comparator Class
• Working with the Hadoop Aggregate Package
What Constitutes Progress in MapReduce? • Hadoop will not fail a task that’s making progress. • Reading an input record (in a mapper or reducer) • Writing an output record (in a mapper or reducer) • Setting the status message (using Context’s setStatus() method) • Incrementing a counter (using Context’s getCounter().increment() method) • Calling Reporter’s progress() method
Counters & Status Messages
• Counters are a useful channel for gathering statistics about the job, for quality control or for application-level statistics.
• Status messages let a task report, in human-readable form, what it is currently doing; both are sketched below.
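A minimal sketch of both channels inside a mapper. The counter group and name ("CS402", "EMPTY_LINES") and the empty-line check are made up for illustration; context.getCounter() and context.setStatus() are the real new-API calls:

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class CountingMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
      @Override
      protected void map(LongWritable key, Text value, Context context)
          throws IOException, InterruptedException {
        if (value.getLength() == 0) {
          // Increment an application-level counter (hypothetical group/name).
          // This also counts as progress, so the task won't be failed.
          context.getCounter("CS402", "EMPTY_LINES").increment(1);
          return;
        }
        // Report what the task is doing; shows up in the web UI.
        context.setStatus("processing record at offset " + key.get());
        context.write(value, new LongWritable(1));
      }
    }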
Hadoop Logs • MapReduce task logs • Each tasktracker child process produces a logfile using log4j (called syslog), a file for data sent to standard out (stdout), and a file for standard error (stderr). • accessible through the web UI
[Figure: MapReduce data flow. Mappers consume input key-value pairs (k1–k6, v1–v6) and emit intermediate pairs; "Shuffle and Sort" aggregates values by key; reducers produce the final output (r1–r3, s1–s3).]
But, in a real system...
• How do we inject user code into a running system?
  • job submission
  • mapper & reducer class instantiation
  • reading/writing data in the mapper & reducer
Implementation in Hadoop
• job submission
• mapper & reducer class instantiation
• reading/writing data in the mapper & reducer
• Asks the jobtracker for a new job ID (by calling getNewJobId() on JobTracker) (step 2).
• Checks the output specification of the job. For example, if the output directory has not been specified or it already exists, the job is not submitted and an error is thrown to the MapReduce program.
• Computes the input splits for the job. If the splits cannot be computed (because the input paths don't exist, for example), the job is not submitted and an error is thrown to the MapReduce program.
Copies the resources needed to run the job, including the job JAR file, the configuration file, and the computed input splits, to the jobtracker’s filesystem in a directory named after the job ID. • The job JAR is copied with a high replication factor (controlled by the mapred.submit.replication property, which defaults to 10) so that there are lots of copies across the cluster for the tasktrackers to access when they run tasks for the job (step 3). • Tells the jobtracker that the job is ready for execution by calling submitJob() on JobTracker (step 4).
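From the client's side, all of these steps happen inside job.waitForCompletion(). A minimal driver sketch, assuming the Hadoop 1.x new API; PairsMapper and SumReducer are hypothetical classes (a matching sketch appears later in this deck):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class CooccurrenceDriver {
      public static void main(String[] args) throws Exception {
        Job job = new Job(new Configuration(), "word co-occurrence");
        job.setJarByClass(CooccurrenceDriver.class);   // the job JAR copied in step 3
        job.setMapperClass(PairsMapper.class);         // hypothetical, sketched below
        job.setReducerClass(SumReducer.class);         // hypothetical, sketched below
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // must not exist yet
        // waitForCompletion() runs the submission steps above, then polls progress.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }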
InputSplits
• An input split is a chunk of the input that is processed by a single map.
• Each map processes a single split.
• Each split is divided into records, and the map processes each record, a key-value pair, in turn.
• Splits and records are logical: nothing requires them to be tied to files or to any physical division of the data.
InputFormat • An InputFormat is responsible for creating the input splits and dividing them into records.
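For reference, the new-API contract (abridged from org.apache.hadoop.mapreduce.InputFormat):

    public abstract class InputFormat<K, V> {
      // Compute the logical splits; called once, on the client, at job submission.
      public abstract List<InputSplit> getSplits(JobContext context)
          throws IOException, InterruptedException;

      // Create the reader that turns one split into key-value records for one map task.
      public abstract RecordReader<K, V> createRecordReader(InputSplit split,
          TaskAttemptContext context) throws IOException, InterruptedException;
    }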
Serialization
• Serialization is the process of turning structured objects into a byte stream for transmission over a network or for writing to persistent storage.
• Deserialization is the reverse process of turning a byte stream back into a series of structured objects.
• In Hadoop, interprocess communication between nodes in the system is implemented using remote procedure calls (RPCs).
The Writable Interface

    public interface Writable {
      void write(DataOutput out) throws IOException;
      void readFields(DataInput in) throws IOException;
    }

    public interface WritableComparable<T> extends Writable, Comparable<T> {
    }

• A WritableComparable is a Writable which is also Comparable, i.e. it additionally implements public int compareTo(T w).
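A small round-trip sketch showing what write()/readFields() do in practice (the helper class and method names are illustrative, not part of Hadoop):

    import java.io.*;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Writable;

    public class WritableRoundTrip {
      // Serialize a Writable into a byte array.
      public static byte[] serialize(Writable w) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        DataOutputStream dataOut = new DataOutputStream(out);
        w.write(dataOut);
        dataOut.close();
        return out.toByteArray();
      }

      // Deserialize: fill an existing Writable from a byte array.
      public static void deserialize(Writable w, byte[] bytes) throws IOException {
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(bytes));
        w.readFields(in);
        in.close();
      }

      public static void main(String[] args) throws IOException {
        IntWritable before = new IntWritable(163);
        byte[] bytes = serialize(before);   // 4 bytes, big-endian
        IntWritable after = new IntWritable();
        deserialize(after, bytes);
        System.out.println(after.get());    // prints 163
      }
    }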
Tasks
• Do word co-occurrence analysis on the Shakespeare collection and the AP collection, which are under the directories /public/Shakespeare and /public/AP of our sewm cluster (or your own virtual cluster). By default you will get one line of text data as input to process in each map() call. (80 points)
• Try to optimize your program, and find the fastest version. Describe your approaches and evaluation in your report. (20 points)
• Analyze the resulting data matrix and find something interesting. (10 bonus points)
• Write a report describing your approach to each task, the problems you met, etc.
co-occurrence • Co-occurrence or cooccurrence is a linguistics term that can either mean concurrence / coincidence or, in a more specific sense, the above-chance frequent occurrence of two terms from a text corpus alongside each other in a certain order. Co-occurrence in this linguistic sense can be interpreted as an indicator of semantic proximity or an idiomatic expression. In contrast to collocation, co-occurrence assumes interdependency of the two terms. A co-occurrence restriction is identified when linguistic elements never occur together. Analysis of these restrictions can lead to discoveries about the structure and development of a language.[1] From Wikipedia, the free encyclopedia
Pairs vs. Stripes
• Pairs: emit one key-value pair per co-occurring pair, e.g. (a, b) → 1, (a, c) → 2, (a, d) → 5, (a, e) → 3, (a, f) → 2
• Stripes: group the pairs for a term into an associative array, e.g. a → { b: 1, c: 2, d: 5, e: 3, f: 2 }
• Partial stripes are summed element-wise:
    a → { b: 1, d: 5, e: 3 }
  + a → { b: 1, c: 2, d: 2, f: 2 }
  = a → { b: 2, c: 2, d: 7, e: 3, f: 2 }
• Key: a cleverly-constructed data structure brings together partial results
• Idea: group together pairs into an associative array
• Each mapper takes a sentence: generate all co-occurring term pairs; for each term, emit a → { b: count_b, c: count_c, d: count_d, ... }
• Reducers perform an element-wise sum of associative arrays
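A minimal sketch of the pairs approach. Tokenization and the co-occurrence window are deliberately simplified (every pair of distinct tokens on the same line co-occurs), and the key is a crude tab-joined Text; a real solution would use the TextPair key sketched on the next slide. PairsMapper and SumReducer are illustrative names:

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    public class PairsMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
      private static final IntWritable ONE = new IntWritable(1);
      private final Text pair = new Text();

      @Override
      protected void map(LongWritable key, Text value, Context context)
          throws IOException, InterruptedException {
        String[] terms = value.toString().split("\\s+");
        for (int i = 0; i < terms.length; i++) {
          for (int j = 0; j < terms.length; j++) {
            if (i == j || terms[i].isEmpty() || terms[j].isEmpty()) continue;
            pair.set(terms[i] + "\t" + terms[j]);  // crude stand-in for TextPair
            context.write(pair, ONE);              // emit ((a, b), 1)
          }
        }
      }
    }

    class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
      private final IntWritable result = new IntWritable();

      @Override
      protected void reduce(Text key, Iterable<IntWritable> values, Context context)
          throws IOException, InterruptedException {
        int sum = 0;                 // sum all partial counts for this pair
        for (IntWritable v : values) sum += v.get();
        result.set(sum);
        context.write(key, result);
      }
    }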
Pairs
• customized KEY: the pair (a, b) becomes a TextPair that implements WritableComparable<TextPair> (sketched below)
• customized Partitioner: all of (a, b), (a, c), (a, f), i.e. (a, *), must go to the same reducer
• The default partitioner, HashPartitioner, uses the key's hashCode() method.
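A minimal TextPair sketch, modeled on the version in Hadoop: The Definitive Guide; a production class would also provide a RawComparator for speed:

    import java.io.*;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.io.WritableComparable;

    public class TextPair implements WritableComparable<TextPair> {
      private final Text first = new Text();
      private final Text second = new Text();

      public TextPair() {}                       // required by the framework
      public TextPair(String first, String second) {
        this.first.set(first);
        this.second.set(second);
      }
      public Text getFirst() { return first; }
      public Text getSecond() { return second; }

      @Override public void write(DataOutput out) throws IOException {
        first.write(out);                        // delegate to Text's serialization
        second.write(out);
      }
      @Override public void readFields(DataInput in) throws IOException {
        first.readFields(in);
        second.readFields(in);
      }
      @Override public int compareTo(TextPair other) {
        int cmp = first.compareTo(other.first);  // order by left word first
        return cmp != 0 ? cmp : second.compareTo(other.second);
      }
      @Override public int hashCode() {
        return first.hashCode() * 163 + second.hashCode();
      }
      @Override public boolean equals(Object o) {
        if (!(o instanceof TextPair)) return false;
        TextPair tp = (TextPair) o;
        return first.equals(tp.first) && second.equals(tp.second);
      }
      @Override public String toString() { return first + "\t" + second; }
    }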
[Figure: MapReduce data flow with combiners and partitioners. Each mapper's output passes through a combine and a partition step before "Shuffle and Sort" aggregates values by key for the reducers.]
Partitioner

    public abstract class Partitioner<KEY, VALUE> {
      public abstract int getPartition(KEY key, VALUE value, int numPartitions);
    }

• job setup: job.setPartitionerClass(UserPartitioner.class);
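A sketch of the partitioner the pairs slide asks for: route every (a, *) key by the left word only, so all pairs sharing a first term land on the same reducer. FirstWordPartitioner is an illustrative name:

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.mapreduce.Partitioner;

    public class FirstWordPartitioner extends Partitioner<TextPair, IntWritable> {
      @Override
      public int getPartition(TextPair key, IntWritable value, int numPartitions) {
        // Hash the first word only; mask off the sign bit before the modulo.
        return (key.getFirst().hashCode() & Integer.MAX_VALUE) % numPartitions;
      }
    }

    // job setup:
    // job.setPartitionerClass(FirstWordPartitioner.class);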
Comparators
• Comparable<T>: compareTo()
• Comparator
  • RawComparator<>
  • WritableComparator
• job.setSortComparatorClass(): controls how map() output keys are sorted
• job.setGroupingComparatorClass(): controls how the shuffled keys are grouped in the reducer, i.e. which values are delivered to a single reduce() call (see the sketch below)
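A sketch of a grouping comparator that compares only the first word of a TextPair, so one reduce() call sees the values of all (a, *) keys together. FirstComparator is an illustrative name:

    import org.apache.hadoop.io.WritableComparable;
    import org.apache.hadoop.io.WritableComparator;

    public class FirstComparator extends WritableComparator {
      public FirstComparator() {
        super(TextPair.class, true);  // createInstances=true: deserialize keys for us
      }
      @Override
      public int compare(WritableComparable a, WritableComparable b) {
        // Group solely on the first word of the pair.
        return ((TextPair) a).getFirst().compareTo(((TextPair) b).getFirst());
      }
    }

    // job setup:
    // (no sort comparator needed: TextPair.compareTo() already gives the full order)
    // job.setGroupingComparatorClass(FirstComparator.class);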
Stripes
• associative array: map → MapWritable?
• Caution:
  • make sure the JVM memory is big enough to hold a whole stripe
  • set mapred.child.java.opts (-Xmx200m by default)
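A minimal stripes sketch using MapWritable as the associative array, under the same simplified "same line = co-occurs" assumption as the pairs sketch; StripesMapper and StripesReducer are illustrative names:

    import java.io.IOException;
    import java.util.Map;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.MapWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.io.Writable;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    public class StripesMapper extends Mapper<LongWritable, Text, Text, MapWritable> {
      @Override
      protected void map(LongWritable key, Text value, Context context)
          throws IOException, InterruptedException {
        String[] terms = value.toString().split("\\s+");
        for (int i = 0; i < terms.length; i++) {
          if (terms[i].isEmpty()) continue;
          MapWritable stripe = new MapWritable();   // one stripe per occurrence
          for (int j = 0; j < terms.length; j++) {
            if (i == j || terms[j].isEmpty()) continue;
            Text neighbor = new Text(terms[j]);
            IntWritable count = (IntWritable) stripe.get(neighbor);
            if (count == null) stripe.put(neighbor, new IntWritable(1));
            else count.set(count.get() + 1);
          }
          context.write(new Text(terms[i]), stripe);
        }
      }
    }

    class StripesReducer extends Reducer<Text, MapWritable, Text, MapWritable> {
      @Override
      protected void reduce(Text key, Iterable<MapWritable> stripes, Context context)
          throws IOException, InterruptedException {
        MapWritable sum = new MapWritable();
        for (MapWritable stripe : stripes) {
          // Element-wise sum of the partial stripes.
          for (Map.Entry<Writable, Writable> e : stripe.entrySet()) {
            IntWritable c = (IntWritable) sum.get(e.getKey());
            int add = ((IntWritable) e.getValue()).get();
            if (c == null) sum.put(new Text((Text) e.getKey()), new IntWritable(add));
            else c.set(c.get() + add);
          }
        }
        context.write(key, sum);
      }
    }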