Hadoop, HDFS and Microsoft Cloud Computing Technologies
Cloud Computing Systems
Lin Gu
Hong Kong University of Science and Technology
Oct. 3, 2011
The Microsoft Cloud
Categories of Services: • Application Services • Software Services • Platform Services • Infrastructure Services
Application Patterns
[Diagram: typical Azure application patterns — users reach the platform through web browsers, mobile browsers, Silverlight, or WPF applications; ASP.NET web roles and web-service roles serve requests while worker roles run grid/parallel-computing jobs; enterprise applications, private clouds, and public services connect through the Service Bus, Access Control, and Workflow services; data lives in the Table, Blob, and Queue storage services alongside enterprise data, storage, and identity services.]
Hadoop—History
• Started in 2005 by Doug Cutting
• Yahoo! became the primary contributor in 2006 and deployed large-scale science clusters in 2007
• Scaled to 4000-node clusters in 2009
• Many users today: Amazon/A9, Facebook, Google, IBM, Joost, Last.fm, New York Times, PowerSet, Veoh
Hadoop at Facebook
• Production cluster: 8000 cores, 1000 machines, 32 GB of RAM per machine (July 2009)
• 4 SATA disks of 1 TB each per machine
• 2-level network hierarchy, 40 machines per rack
• Total cluster size is 2 PB (projected to reach 12 PB in Q3 2009)
• A separate test cluster: 800 cores, 16 GB of RAM per machine
Source: Dhruba Borthakur
Hadoop—Motivation
• Need a general infrastructure for fault-tolerant, data-parallel distributed processing
• An open-source MapReduce implementation under the Apache License
• Workloads are expected to be I/O-bound rather than CPU-bound
First, a file system is needed—HDFS
• A very large distributed file system running on commodity hardware
– Replicated; detects failures and recovers from them
• Optimized for batch processing
– High aggregate bandwidth; locality-aware
• A user-space file system; runs on heterogeneous OSes
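Because HDFS runs in user space, applications reach it through a client library rather than through OS system calls. A minimal read sketch against the standard Java FileSystem API (the file path and class name here are made up for illustration):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();  // picks up core-site.xml / hdfs-site.xml
    FileSystem fs = FileSystem.get(conf);      // connects to the configured NameNode
    FSDataInputStream in = fs.open(new Path("/user/demo/input.txt"));
    byte[] buf = new byte[4096];
    int n;
    while ((n = in.read(buf)) > 0) {
      System.out.write(buf, 0, n);             // file data streams from the DataNodes
    }
    in.close();
    fs.close();
  }
}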
HDFS
Read path: 1. the client sends a filename to the NameNode; 2. the NameNode returns the block IDs and the DataNodes holding them; 3. the client reads the data directly from the DataNodes.
• NameNode: manages metadata
• DataNode: manages file data—maps a block ID to a physical location on disk
• Secondary NameNode: fault tolerance—periodically merges the transaction log
HDFS
• Provides a single namespace for the entire cluster
– Files, directories, and their hierarchy
• Files are broken into large blocks
– Typically 128 MB per block
– Each block is replicated on multiple DataNodes
• Metadata is kept in memory for high performance (high throughput, low latency)
– Names of files and directories, a list of blocks for each file, a list of DataNodes for each block, and file attributes (e.g., creation time, replication factor)
• A transaction log records file creations, deletions, etc.
• Data coherency: the design emphasizes the append operation
• Clients find the locations of blocks and access data directly from DataNodes (see the sketch below)
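To see the metadata/data separation in action, a client can ask for a file's block layout without moving any file data; only subsequent reads touch the DataNodes. A small sketch using the standard FileSystem API (the file path is hypothetical):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockLayoutSketch {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    FileStatus st = fs.getFileStatus(new Path("/user/demo/input.txt"));
    System.out.println("block size = " + st.getBlockSize()
        + ", replication = " + st.getReplication());
    // A pure metadata query, answered by the NameNode.
    for (BlockLocation b : fs.getFileBlockLocations(st, 0, st.getLen())) {
      System.out.println("block at offset " + b.getOffset()
          + " replicated on " + String.join(", ", b.getHosts()));
    }
  }
}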
Hadoop—Design
• Hadoop Core
– Distributed file system: distributes data
– Map/Reduce: distributes logic (processing)
• Written in Java; runs on Linux, Mac OS X, Windows, and Solaris
• Fault tolerance
– In a large cluster, failure is the norm
– Hadoop re-executes failed tasks
• Locality
– Map and Reduce query HDFS for the locations of input data
– Map tasks are scheduled close to their inputs when possible
Hadoop Ecosystem
• Hadoop Core
– Distributed file system
– MapReduce framework
• Pig (initiated by Yahoo!)
– Parallel programming language and runtime
• HBase (initiated by Powerset)
– Table storage for semi-structured data
• ZooKeeper (initiated by Yahoo!)
– Coordination for distributed systems
• Storm
– Distributed real-time stream processing
• Hive (initiated by Facebook)
– SQL-like query language and storage
Word Count Example
• Read text files and count how often each word occurs
– The input is a collection of text files
– The output is a text file; each line contains a word, a tab, and the count
• Map: produce (word, count) pairs
• Reduce: for each word, sum up the counts
WordCount Overview

public class WordCount {

  public static class Map extends MapReduceBase implements Mapper ... {
    public void map ...
  }

  public static class Reduce extends MapReduceBase implements Reducer ... {
    public void reduce ...
  }

  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(WordCount.class);
    ...
    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));

    JobClient.runJob(conf);
  }
}
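The "..." in main above elides the job configuration; in the classic Hadoop WordCount example (the old org.apache.hadoop.mapred API, matching the code on these slides) the driver looks roughly like this:

public static void main(String[] args) throws Exception {
  JobConf conf = new JobConf(WordCount.class);
  conf.setJobName("wordcount");

  // Key/value types emitted by the job
  conf.setOutputKeyClass(Text.class);
  conf.setOutputValueClass(IntWritable.class);

  conf.setMapperClass(Map.class);
  conf.setCombinerClass(Reduce.class);  // local pre-aggregation on the map side
  conf.setReducerClass(Reduce.class);

  conf.setInputFormat(TextInputFormat.class);
  conf.setOutputFormat(TextOutputFormat.class);

  FileInputFormat.setInputPaths(conf, new Path(args[0]));
  FileOutputFormat.setOutputPath(conf, new Path(args[1]));

  JobClient.runJob(conf);  // submits the job and blocks until it completes
}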
WordCount Mapper

public static class Map extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {
  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();

  public void map(LongWritable key, Text value,
                  OutputCollector<Text, IntWritable> output,
                  Reporter reporter) throws IOException {
    String line = value.toString();
    StringTokenizer tokenizer = new StringTokenizer(line);
    while (tokenizer.hasMoreTokens()) {
      word.set(tokenizer.nextToken());
      output.collect(word, one);  // emit (word, 1) for every token
    }
  }
}
WordCount Reducer

public static class Reduce extends MapReduceBase
    implements Reducer<Text, IntWritable, Text, IntWritable> {

  public void reduce(Text key, Iterator<IntWritable> values,
                     OutputCollector<Text, IntWritable> output,
                     Reporter reporter) throws IOException {
    int sum = 0;
    while (values.hasNext()) {
      sum += values.next().get();  // add up all counts emitted for this word
    }
    output.collect(key, new IntWritable(sum));
  }
}
Invocation of wordcount • /usr/local/bin/hadoop dfs -mkdir <hdfs-dir> • /usr/local/bin/hadoop dfs -copyFromLocal <local-dir> <hdfs-dir> • /usr/local/bin/hadoop jar hadoop-*-examples.jar wordcount [-m <#maps>] [-r <#reducers>] <in-dir> <out-dir>
Example Hadoop Applications: Search Assist™
• The database for Search Assist™ is built using Hadoop
• Three years of log data; a 20-step MapReduce pipeline
Large Hadoop Jobs [chart] Source: Eric Baldeschwieler, Yahoo!
Data Warehousing at Facebook
• Pipeline: Web Servers → Scribe Servers → Network Storage → Hadoop Cluster, with Oracle RAC and MySQL downstream
• 15 TB of uncompressed data ingested per day
• 55 TB of compressed data scanned per day
• 3200+ jobs on the production cluster per day
• 80M compute minutes per day
Source: Dhruba Borthakur
But all these are data analytics applications. Can the model extend to general computation? How can we construct a simple, generic, and automatic parallelization engine for the cloud? Let's look at an example...
Tomasulo's Algorithm
• Designed initially for the IBM System/360 Model 91
• Enables out-of-order execution
• Its descendants include the Alpha 21264, HP PA-8000, MIPS R10000, Pentium III, PowerPC 604, …
Three Stages of the Tomasulo Algorithm
1. Issue—get an instruction from the instruction queue
• Record the instruction's control information in the processor and rename its registers
2. Execute—operate on operands (EX)
• When all operands are ready, execute; otherwise, watch the Common Data Bus (CDB) for the missing results
3. Write result—finish execution (WB)
• Write the result to the CDB; all awaiting units receive it
[Diagram: Tomasulo organization — the FP op queue issues instructions into reservation stations (Add1–Add3 feeding the FP adders, Mult1–Mult2 feeding the FP multipliers); load buffers (Load1–Load6) bring operands from memory and store buffers send results to memory; all results are broadcast on the Common Data Bus (CDB) to the FP registers and any waiting stations.]
How does Tomasulo exploit parallelism?
• Naming and renaming
– Keep track of data dependences and resolve conflicts by renaming registers
• Reservation stations
– Record instructions' control information and the values of operands; data has versions
• In Tomasulo, data drives the logic: when the data is ready, execute!
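To make renaming and data-driven execution concrete, here is a deliberately small Java sketch. It is a toy model with invented names, not a cycle-accurate simulator: registers hold either a value or the tag of the reservation station that will produce it, and a CDB broadcast wakes every waiter.

import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;
import java.util.Queue;

public class TomasuloToy {
  static class RS {                 // a reservation station
    String tag, op;
    Double vj, vk;                  // operand values, once known
    String qj, qk;                  // tags of pending producers, else null
    boolean ready() { return qj == null && qk == null; }
  }

  Map<String, Double> regValue = new HashMap<>();  // architectural registers
  Map<String, String> regTag = new HashMap<>();    // register -> pending producer tag
  Queue<RS> stations = new ArrayDeque<>();
  int nextId = 0;

  // Issue: allocate a station, capture each operand's value or producer tag,
  // and rename the destination register to this station's tag.
  void issue(String op, String dst, String src1, String src2) {
    RS rs = new RS();
    rs.tag = "RS" + (nextId++);
    rs.op = op;
    if (regTag.containsKey(src1)) rs.qj = regTag.get(src1); else rs.vj = regValue.get(src1);
    if (regTag.containsKey(src2)) rs.qk = regTag.get(src2); else rs.vk = regValue.get(src2);
    regTag.put(dst, rs.tag);
    stations.add(rs);
  }

  // Execute + write result: any station whose operands are ready may fire;
  // its (tag, value) pair on the CDB satisfies all waiting consumers.
  void step() {
    for (Iterator<RS> it = stations.iterator(); it.hasNext(); ) {
      RS rs = it.next();
      if (rs.ready()) {
        double result = rs.op.equals("MUL") ? rs.vj * rs.vk : rs.vj + rs.vk;
        it.remove();
        broadcast(rs.tag, result);
        return;
      }
    }
  }

  void broadcast(String tag, double value) {
    for (RS rs : stations) {
      if (tag.equals(rs.qj)) { rs.vj = value; rs.qj = null; }
      if (tag.equals(rs.qk)) { rs.vk = value; rs.qk = null; }
    }
    regTag.entrySet().removeIf(e -> {
      if (e.getValue().equals(tag)) { regValue.put(e.getKey(), value); return true; }
      return false;
    });
  }

  public static void main(String[] args) {
    TomasuloToy t = new TomasuloToy();
    t.regValue.put("F2", 2.0);
    t.regValue.put("F4", 3.0);
    t.issue("MUL", "F0", "F2", "F4");  // F0 = F2 * F4
    t.issue("ADD", "F6", "F0", "F2");  // F6 waits on F0's producer tag
    t.step();                          // MUL fires; the CDB wakes the ADD
    t.step();                          // ADD fires
    System.out.println(t.regValue);    // F0=6.0, F6=8.0 among the entries
  }
}

Note how the second instruction issues before the first has produced F0: the rename resolves the dependence without stalling, which is the essence of the algorithm.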
Dryad
• Distributed/parallel execution
– Improves throughput, not latency
– Automatic management of scheduling, distribution, fault tolerance, and parallelization!
• Computations are expressed as a DAG
– Directed acyclic graph: vertices are computations, edges are communication channels
– Each vertex may have several input and output edges
Why use a dataflow graph?
• A general abstraction of computation
• The programmer may not have to know how to construct the graph
– "SQL-like" queries: LINQ
Can all computation be represented by a finite graph?
Yet Another WordCount, in Dryad
[DAG: a Count vertex on each input partition emits (word, n) pairs; a Distribute vertex partitions the pairs by word; a MergeSort vertex merges each partition; a final Count vertex sums the counts for each word.]
Job as a DAG (Directed Acyclic Graph)
[Diagram: input vertices feed processing vertices through channels (file, pipe, or shared memory), terminating at output vertices.]
Scheduling at the Job Manager (JM)
• A vertex can run on any computer once all of its inputs are ready
– The JM prefers to execute a vertex near its inputs (locality)
• Fault tolerance
– If a task fails, run it again
– If a task's inputs are gone, run the upstream vertices again (recursively)
– If a task is slow, run another copy elsewhere and use the output of the faster computation
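A sketch of the core scheduling idea — run a vertex once its inputs exist, and re-execute on failure — might look like the following. All names are invented for illustration; the real job manager also handles placement for locality and duplicate execution of stragglers.

import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class DagSchedulerToy {
  interface Task { boolean run(); }  // returns false on failure

  Map<String, Task> vertices = new HashMap<>();
  Map<String, List<String>> inputs = new HashMap<>();  // vertex -> upstream vertices
  Set<String> materialized = new HashSet<>();          // outputs known to exist

  // Ensure a vertex's output exists: recursively materialize its inputs,
  // then run the vertex, retrying a bounded number of times on failure.
  void materialize(String v) {
    if (materialized.contains(v)) return;
    for (String upstream : inputs.getOrDefault(v, List.of())) {
      materialize(upstream);  // lost inputs are regenerated by re-running upstream work
    }
    for (int attempt = 0; attempt < 3; attempt++) {
      if (vertices.get(v).run()) {
        materialized.add(v);
        return;
      }
      // Deterministic vertices and immutable inputs make re-execution safe.
    }
    throw new RuntimeException("vertex " + v + " failed repeatedly");
  }
}

Because vertices are deterministic and channels are write-once, re-running a failed or lost vertex cannot corrupt the computation, which is what makes this simple retry strategy sound.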
Distributed Data-Parallel Computing
• Research problem: how can we write distributed data-parallel programs for a compute cluster?
• The DryadLINQ programming model
– A sequential, single-machine programming abstraction
– The same program runs on a single core, on multiple cores, or on a cluster
– Familiar programming languages
– A familiar development environment
LINQ
• Is SQL Turing-complete? Is LINQ?
• LINQ: a language for relational queries
– Language INtegrated Query
– More general than distributed SQL
– Inherits the flexible C# type system and libraries
– Available in Visual Studio products
• A set of operators to manipulate datasets in .NET
– Supports traditional relational operators: Select, Join, GroupBy, Aggregate, etc.
• Integrated into .NET
– Programs can call operators
– Operators can invoke arbitrary .NET functions
• Data model
– Data elements are strongly typed .NET objects
– More expressive than SQL tables
LINQ + Dryad = DryadLINQ

Collection<T> collection;
bool IsLegal(Key k);
string Hash(Key k);

var results = from c in collection
              where IsLegal(c.key)
              select new { hash = Hash(c.key), c.value };

[Diagram: the C# query above compiles into a query plan (a Dryad job); the data collection is partitioned across machines, each partition runs the generated vertex code, and the partial outputs combine into results.]
DryadLINQ System Architecture
[Diagram: on the client, a .NET program builds a LINQ query expression and invokes DryadLINQ, which produces a distributed query plan and vertex code; on the cluster, Dryad executes the plan over the input tables and writes output tables; results return to the client as .NET objects, e.g., via ToTable and foreach.]
Yet Yet Another Word Count
Count word frequency in a set of documents:

var docs = [A collection of documents];
var words = docs.SelectMany(doc => doc.words);
var groups = words.GroupBy(word => word);
var counts = groups.Select(g => new WordCount(g.Key, g.Count()));
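For comparison with the Java MapReduce version earlier, the same pipeline can be written on a single machine with java.util.stream, where flatMap plays the role of SelectMany and groupingBy plays the role of GroupBy. A small self-contained sketch (the input documents are made up):

import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class StreamWordCount {
  public static void main(String[] args) {
    List<String> docs = List.of("a rose is a rose", "is it");
    Map<String, Long> counts = docs.stream()
        .flatMap(doc -> Arrays.stream(doc.split("\\s+")))  // ~ SelectMany
        .collect(Collectors.groupingBy(w -> w,             // ~ GroupBy
                 Collectors.counting()));                  // ~ per-group Count()
    counts.forEach((w, c) -> System.out.println(w + "\t" + c));
  }
}

DryadLINQ's contribution is that the same declarative pipeline, written once, can be compiled into a distributed Dryad job instead of running on one machine.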
Word Count in DryadLINQ
Count word frequency in a set of documents:

var docs = DryadLinq.GetTable<Doc>("file://docs.txt");
var words = docs.SelectMany(doc => doc.words);
var groups = words.GroupBy(word => word);
var counts = groups.Select(g => new WordCount(g.Key, g.Count()));
counts.ToDryadTable("counts.txt");
Distributed Execution of Word Count
[Diagram: the LINQ expression (IN → SelectMany → GroupBy → Select → OUT) is mapped by DryadLINQ onto a Dryad execution graph.]
DryadLINQ Design
• An optimizing compiler generates the distributed execution plan
– Static optimizations: pipelining, eager aggregation, etc.
– Dynamic optimizations: data-dependent partitioning, dynamic aggregation, etc.
• Automatic code generation and distribution by DryadLINQ and Dryad
– Generates the code that runs on vertices, channel serialization code, and callback code for runtime optimizations
– Code is automatically distributed to cluster machines
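Eager aggregation is easiest to see in miniature: aggregate within each partition first, so only small partial results flow between stages, then merge the partials. A hypothetical single-machine sketch of the idea (the partitions are made up):

import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class EagerAggregationSketch {
  public static void main(String[] args) {
    List<List<String>> partitions = List.of(
        List.of("a", "b", "a"),
        List.of("b", "b", "c"));
    Map<String, Long> total = new HashMap<>();
    for (List<String> part : partitions) {
      Map<String, Long> local = part.stream()  // local pre-aggregation per partition
          .collect(Collectors.groupingBy(w -> w, Collectors.counting()));
      local.forEach((w, c) -> total.merge(w, c, Long::sum));  // merge compact partials
    }
    System.out.println(total);  // e.g., {a=2, b=3, c=1}
  }
}

In the distributed setting the same transformation moves work to the producing side, shrinking the data that must be shuffled across the network.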
Summary
• The DAG dataflow graph is a powerful computation model
• Language integration enables programmers to use DAG-based computation easily
• Dryad and DryadLINQ are decoupled
– Dryad: the execution engine (given a DAG, schedules tasks and handles fault tolerance)
– DryadLINQ: the programming language and tools (given a query, generates the DAG)
Development
• Works with any LINQ-enabled language
– C#, VB, F#, IronPython, …
• Works with multiple storage systems
– NTFS, SQL, Windows Azure, Cosmos DFS
• Released within Microsoft and used on a variety of applications
• External academic release announced at PDC
– DryadLINQ in source, Dryad in binary
– UW, UCSD, Indiana, ETH, Cambridge, …
Advantages of DAGs over MapReduce
• Dependences are naturally specified
– MapReduce: a complex job runs as one or more MR stages, each adding tasking overhead, and the reduce tasks of every stage write to replicated storage
– Dryad: each job is represented as a single DAG, and intermediate results are written to local files
• Dryad provides a more flexible and general framework
– E.g., multiple types of input and output
DryadLINQ in the Software Stack
[Diagram: applications (machine learning, image processing, graph analysis, data mining, and others) are written in DryadLINQ or other languages; these run on Dryad, which sits atop storage and cluster services (CIFS/NTFS, SQL Servers, Azure Platform, Cosmos DFS) running on Windows Server machines.]