MapReduce with Hadoop Streaming

Naoya Ito (naoya at hatena ne jp)

Agenda
• What is MapReduce?
• The MapReduce computation model
• Hadoop
• Hadoop Streaming
• Hadoop Streaming Frontend

What is MapReduce?

What is MapReduce?
• Google
• Processes large-scale data across many servers
• Batch processing
• A parallel distributed system plus a C++ framework

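MapReduce basics
• Specify the input files, define two functions, map() and reduce(), and submit the job to the MapReduce system
• MapReduce distributes the input across multiple servers and runs map() → reduce() on it
• Input and output go through a distributed file system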

MapReduce diagram (from the Google paper)

Components of MapReduce
• Client
• Master node
• Worker nodes
• Mapper
• Reducer
• Distributed file system (GFS)

The MapReduce computation model

The MapReduce computation model
• A large number of computational problems can be solved with just two operations on (key, value) pairs: map() and reduce()

map(), reduce()
• Functional languages
  • map function: mapping
  • reduce function: fold
• Perl
  • CORE::map()
  • List::Util::reduce()
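
To make the two building blocks concrete, here is a minimal sketch that uses only the two Perl functions named on this slide. The sample numbers and variable names are made up for illustration.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use List::Util qw(reduce);

# Sample input; the values are arbitrary.
my @numbers = (1, 2, 3, 4);

# map: apply the same function to every element (a mapping).
my @squares = map { $_ * $_ } @numbers;

# reduce: fold the list into a single value, left to right.
# $a is the accumulator and $b is the next element.
my $sum = reduce { $a + $b } @squares;

print "squares: @squares\n";
print "sum: $sum\n";
```

MapReduce borrows these two ideas but applies them to (key, value) pairs spread over many machines rather than to an in-memory list.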

Mapper / Reducer
• Mapper
  • map(key, value)
• Reducer
  • reduce(key, values)
  • values is an iterator

Example
• There are many document files
• Count the occurrences of each word across the documents

map()

    map(String key, String value):
      // key: document name
      // value: document contents
      for each word w in value:
        EmitIntermediate(w, "1");

reduce()

    reduce(String key, Iterator values):
      // key: a word
      // values: a list of counts
      int result = 0;
      for each v in values:
        result += ParseInt(v);
      Emit(AsString(result));

Map phase

Hamlet.txt:
    To be or not to be, that is the question.

    map("Hamlet.txt", "To be or not to be ...")

Intermediate file:
    To => 1
    be => 1
    or => 1
    ...

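Shuffle phase
• Converts the intermediate files written by the Mappers into the structure the Reducers expect
• Done internally by MapReduce
• Turns (key, value) into (key, values)
  • Values with the same key are grouped: key => [ value, value ... ]
  • Sorted by key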

Shuffle phase

Intermediate file:
    To => 1
    be => 1
    or => 1
    not => 1
    to => 1
    be => 1
    that => 1
    is => 1
    the => 1
    question => 1

Shuffle

Reducer input (sorted by key; "To" and "to" are treated as the same word here):
    be => 1, 1
    is => 1
    not => 1
    or => 1
    question => 1
    that => 1
    the => 1
    to => 1, 1

Reduce phase

Reducer input:
    be => 1, 1
    is => 1
    not => 1
    ...

    reduce("be", [1, 1])
    reduce("is", [1])
    reduce("not", [1])
    ...

Output:
    be => 2
    is => 1
    not => 1
    ...
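
The three phases can be traced end to end in a few lines of plain Perl. This is only a single-process sketch of the computation model, not how MapReduce or Hadoop is implemented. The helper names (map_words, shuffle, reduce_count) are hypothetical, and lowercasing words and stripping punctuation are assumptions made so that "To" and "to" land on the same key.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Map phase: emit a (word, 1) pair for every word in the document.
# Lowercasing and stripping punctuation are assumptions of this sketch.
sub map_words {
    my ($name, $contents) = @_;
    my @pairs;
    for my $word (split ' ', lc $contents) {
        $word =~ s/[^a-z]//g;
        push @pairs, [ $word, 1 ] if length $word;
    }
    return @pairs;
}

# Shuffle phase: group values by key, turning (key, value) into (key, values).
sub shuffle {
    my @pairs = @_;
    my %grouped;
    push @{ $grouped{ $_->[0] } }, $_->[1] for @pairs;
    return \%grouped;
}

# Reduce phase: sum the counts collected for one key.
sub reduce_count {
    my ($key, $values) = @_;
    my $total = 0;
    $total += $_ for @$values;
    return $total;
}

my @pairs   = map_words('Hamlet.txt', 'To be or not to be, that is the question.');
my $grouped = shuffle(@pairs);

# Reducers see the keys in sorted order.
for my $key (sort keys %$grouped) {
    printf "%s => %d\n", $key, reduce_count($key, $grouped->{$key});
}
```

Running it prints one line per word in key order, with a count of 2 for "be" and "to" and 1 for every other word.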

MapReduce::Lite
• Implements the MapReduce computation model in Perl
• Runs on a single host
• http://d.hatena.ne.jp/naoya/20080511/1210506301

Example: Apache log analysis with MapReduce::Lite
• Count the number of HTTP requests per status code

    % perl examples/analog.pl /var/log/httpd/access_log
    200 => 4606
    304 => 262
    404 => 24
    500 => 43

Mapper

    package Analog::Mapper;
    use Moose;
    with 'MapReduce::Lite::Mapper';

    # $key: line number, $value: line contents
    sub map {
        my ($self, $key, $value) = @_;
        my @elements = split /\s+/, $value;
        if ($elements[8]) {
            $self->emit($elements[8], 1);
        }
    }

Reducer

    package Analog::Reducer;
    use Moose;
    with 'MapReduce::Lite::Reducer';

    sub reduce {
        my ($self, $key, $values) = @_;
        $self->emit($key, $values->size);
    }

main

    #!/usr/bin/env perl
    use FindBin::libs;
    use MapReduce::Lite;

    my $spec = MapReduce::Lite::Spec->new(intermidate_dir => "./tmp");

    for (@ARGV) {
        my $in = $spec->create_input;
        $in->file($_);
        $in->mapper('Analog::Mapper');
    }

    $spec->out->reducer('Analog::Reducer');
    $spec->out->num_tasks(3);

    mapreduce($spec);

Google's MapReduce (C++, Mapper only, from the paper)

    #include "mapreduce/mapreduce.h"

    class WordCounter : public Mapper {
     public:
      virtual void Map(const MapInput& input) {
        const string& text = input.value();
        const int n = text.size();
        for (int i = 0; i < n; ) {
          // Skip past leading whitespace
          while ((i < n) && isspace(text[i]))
            i++;
          // Find word end
          int start = i;
          while ((i < n) && !isspace(text[i]))
            i++;
          if (start < i)
            Emit(text.substr(start,i-start),"1");
        }
      }
    };

    REGISTER_MAPPER(WordCounter);

MapReduce diagram (shown again)

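Problems MapReduce can solve
• Building the inverted index of a search engine
• grep
• Sorting
• Computing mean and variance
• Computing PageRank
• Finding web pages with a high PageRank
• Collecting the links in documents
• ...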
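
Of the problems listed, mean and variance show well why a task must be phrased so that partial results can be merged. The following is one possible formulation, not taken from the talk: each map task summarizes its chunk as (count, sum, sum of squares), and the reducer adds those summaries. The function names and the sample chunks are hypothetical.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Map side: summarize one chunk of numbers as [count, sum, sum of squares].
sub map_chunk {
    my @values = @_;
    my ($n, $sum, $sumsq) = (0, 0, 0);
    for my $x (@values) {
        $n++;
        $sum   += $x;
        $sumsq += $x * $x;
    }
    return [ $n, $sum, $sumsq ];
}

# Reduce side: add the partial summaries, then derive mean and variance.
sub reduce_stats {
    my @partials = @_;
    my ($n, $sum, $sumsq) = (0, 0, 0);
    for my $p (@partials) {
        $n     += $p->[0];
        $sum   += $p->[1];
        $sumsq += $p->[2];
    }
    return unless $n;
    my $mean     = $sum / $n;
    my $variance = $sumsq / $n - $mean ** 2;    # population variance
    return ($mean, $variance);
}

# Two chunks standing in for two input splits.
my @chunks = ([2, 4, 4, 4], [5, 5, 7, 9]);
my ($mean, $variance) = reduce_stats(map { map_chunk(@$_) } @chunks);
printf "mean=%.2f variance=%.2f\n", $mean, $variance;
```

The sum-of-squares formula is the simplest one to merge but can lose precision on large, tightly clustered values, so treat it as an illustration rather than a recommendation.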

Structure of an inverted index

    # term    list of document IDs
    be   => [1, 2, 5, 128, 333, 512, 666]
    is   => [1, 3, 8, 9]
    not  => [109, 211, 522]
    or   => [18, 32, 200, 412]
    the  => [5, 22, 515]
    that => [1, 10, 22, 200, 515, 600]
    ...
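
An index of this shape falls out of the same map, shuffle and reduce steps as the word count. The sketch below is a single-process illustration with made-up documents and hypothetical function names (map_doc, reduce_postings): the mapper emits (term, document ID) pairs and the reducer sorts the IDs collected for each term.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Map: emit (term, document ID) once per distinct term in the document.
sub map_doc {
    my ($doc_id, $text) = @_;
    my %seen;
    my @words = lc($text) =~ /\w+/g;
    my @terms = grep { !$seen{$_}++ } @words;
    return map { [ $_, $doc_id ] } @terms;
}

# Reduce: turn the document IDs collected for one term into a sorted posting list.
sub reduce_postings {
    my ($term, $doc_ids) = @_;
    return [ sort { $a <=> $b } @$doc_ids ];
}

# Made-up sample documents keyed by document ID.
my %docs = (
    1 => 'To be or not to be',
    2 => 'Let it be',
    3 => 'It is not',
);

# Shuffle: group the emitted document IDs by term.
my %grouped;
for my $doc_id (keys %docs) {
    push @{ $grouped{ $_->[0] } }, $_->[1] for map_doc($doc_id, $docs{$doc_id});
}

for my $term (sort keys %grouped) {
    my $postings = reduce_postings($term, $grouped{$term});
    printf "%s => [%s]\n", $term, join(', ', @$postings);
}
```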

MapReduce | MapReduce
• Complex problems are solved by running MapReduce several times
• Like pipes and filters in UNIX

What is impressive about Google's MapReduce
• Distributed processing
• Redundancy
• Load balancing (work on extremely slow nodes is handed to other nodes)
• GFS

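Distributed file systems and MapReduce
• With 1 TB of input data...
• Transfer 1 TB over the network?
• With a distributed file system
  • The 1 TB is stored split into 64 MB chunks
  • The GFS nodes that hold the target data locally become the workers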
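
As a quick check on the scale involved, the number of chunks follows directly from the two sizes on this slide. The calculation assumes binary units (1 TB = 1024^4 bytes, 1 MB = 1024^2 bytes), which the slide does not specify.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Sizes from the slide, assuming binary units.
my $input_bytes = 1024 ** 4;         # 1 TB of input
my $chunk_bytes = 64 * 1024 ** 2;    # 64 MB per chunk

# Number of chunks the input is split into.
my $chunks = $input_bytes / $chunk_bytes;

print "chunks: $chunks\n";
```

That is 16384 chunks, each small enough to be read from a local disk by the node that already stores it instead of being shipped across the network.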

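What is good about MapReduce
• You only have to write map() and reduce()
• MapReduce takes care of the tedious parts
  • Sorting the data, splitting the input, etc.
• The input to reduce() is sorted by key, which makes it widely applicable
• The system transparently turns processing of huge data into a collection of small tasks
  • Each small enough to be handled in memory
• Large-scale data processing can be solved as stream-style processing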

MapReduce performance
• ref: "Google を支える技術" and the paper
• Grepping one DVD's worth of data takes 0.2 seconds
• However, distributing the tasks adds overhead

Information about MapReduce
• The Google paper
  • http://labs.google.com/papers/mapreduce.html
• "Google を支える技術"
  • isbn:4774134325
• "Introduction to Information Retrieval", chapter 4
  • http://www-csli.stanford.edu/~hinrich/information-retrieval-book.html

Hadoop

What is Hadoop?
• An open-source clone of Google's infrastructure software
  • MapReduce
  • HDFS (Hadoop Distributed File System)
• Yahoo! Inc, Java, Apache

Installation
• Also runs on a single host
• Easy
• http://codezine.jp/a/article/aid/2485.aspx

MapReduce with Hadoop
• public static void main ...
• "This is Kansai.pm, remember?"

The end

Hadoop Streaming

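Hadoop Streaming
• Hadoop is written in Java
• MapReduce jobs are written in Java too
• We want to do MapReduce in languages other than Java
• An extension that handles input and output through STDIN and STDOUT → Hadoop Streaming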

Hadoop Streaming
• In short, all you need to provide is map.pl and reduce.pl

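map.pl

    #!/usr/bin/env perl
    use strict;
    use warnings;

    # Emit "status<TAB>1" for every access log line read from STDIN.
    while (<>) {
        chomp;
        my @segments = split /\s+/;
        # The 9th whitespace-separated field is the HTTP status code;
        # skip malformed lines that do not have it.
        next unless defined $segments[8];
        printf "%s\t%s\n", $segments[8], 1;
    }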

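reduce.pl

    #!/usr/bin/env perl
    use strict;
    use warnings;

    # Add up the values per key. Every key is kept in memory until the end.
    my %count;
    while (<>) {
        chomp;
        my ($key, $value) = split /\t/;
        next unless defined $value;
        $count{$key} += $value;
    }

    while (my ($key, $value) = each %count) {
        printf "%s\t%s\n", $key, $value;
    }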

Preparing the input

    /* Copy the whole log directory to DFS */
    % hadoop dfs -put /var/log/httpd httpd_logs

Running the job

    % hadoop jar $HADOOP_DIR/contrib/hadoop-0.15.3-streaming.jar \
        -input httpd_logs \
        -output analog_out \
        -mapper /home/naoya/work/analog/map.pl \
        -reducer /home/naoya/work/analog/reduce.pl \
        -inputformat TextInputFormat \
        -outputformat TextOutputFormat

Job in progress

Fetching the output

    /* The output is on DFS */
    % hadoop dfs -cat analog_out/*
    304    262
    200    4606
    500    43
    404    24

Hadoop Streaming Frontend

What makes Hadoop Streaming awkward to use
• The input to the Reducer is not structured

Ideal:
    200 => [ 1,1,1,1,1,1,1,1,1,... ]
    304 => [ 1,1,1,1,1,1,1,1,1,... ]
    404 => [ 1,1,1,1,1,1,1,1,1,... ]
    500 => [ 1,1,1,1,1,1,1,1,1,... ]

Reality:
    200 1
    200 1
    ...
    304 1
    304 1
    ...

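What happens when the input is not structured?
• It cannot be processed in a streaming fashion
  • One of the advantages of MapReduce is lost
• When the number of keys or the number of elements in values is huge, you run out of memory
  • An inverted index of the WWW, for example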
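
One way around this is to rely on the fact that Hadoop Streaming hands the reducer its lines sorted by key, so a group ends as soon as the key changes. The sketch below illustrates that idea only; it is not the frontend presented in this talk, and the function name reduce_sorted_stream and its callback interface are hypothetical.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Read "key<TAB>value" lines that are already sorted by key and call
# $on_group->($key, $total) each time the key changes.
# Only one running total is held in memory at any time.
sub reduce_sorted_stream {
    my ($in, $on_group) = @_;
    my ($current, $total) = (undef, 0);
    while (my $line = <$in>) {
        chomp $line;
        my ($key, $value) = split /\t/, $line, 2;
        next unless defined $value;    # skip malformed lines
        if (defined $current && $key ne $current) {
            $on_group->($current, $total);
            $total = 0;
        }
        $current = $key;
        $total  += $value;
    }
    $on_group->($current, $total) if defined $current;
}

# Demo on an in-memory stream standing in for the sorted reducer input.
my $sample = "200\t1\n200\t1\n304\t1\n404\t1\n404\t1\n";
open my $fh, '<', \$sample or die "cannot open in-memory stream: $!";
reduce_sorted_stream($fh, sub { printf "%s\t%d\n", @_ });
```

In a real reduce.pl the first argument would be \*STDIN. The same loop could hand each value to a callback instead of summing, which is what an iterator-style values interface needs.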
