1. Grid Computing at Yahoo! Sameer Paranjpye
Mahadev Konar
Yahoo!
2. Outline Introduction
What do we mean by Grid?
Technology Overview
Technologies
HDFS
Hadoop Map-Reduce
Hadoop on Demand
3. Introduction
4. What do we mean by Grid? Computing platform that can support many distributed applications
Runs on dedicated clusters of commodity PCs (a Grid)
Hardware can be dynamically allocated to a job
Plan to support many applications per Grid
Good for batch data processing
Log Processing
Document Analysis and Indexing
Web Graphs and Crawling
Large scale is a primary design goal
10,000 PCs per Grid is a design goal (currently running at 1,000)
Very large data: 10 PB of storage is a design goal
100+ TB inputs to a single job
Bandwidth to data is a significant design driver
Large production deployments
The number of CPUs that can be applied gates what you can do
Several clusters of 1000s of nodes
5. Technology Overview Hadoop (Our primary Grid project)
An open source apache project, started by Doug Cutting
HDFS, a distributed file system
Implementation of Map-Reduce programming model
http://lucene.apache.org/hadoop
HOD (Hadoop-on-Demand)
Adaptor that runs Hadoop tools on batch systems
Hadoop expressed as a parallel job
Manages setup, startup, shutdown and cleanup of Hadoop
Currently supports Condor and Torque
6. Technologies
7. HDFS - Hadoop Distributed FS Very Large Distributed File System
We plan to support 10k nodes and 10 PB of data
Current deployment of 1k+ nodes and 1 PB of data
Assumes commodity hardware that fails
Files are replicated to handle hardware failure
Checksums for corruption detection and recovery
Continues operation as nodes / racks added / removed
Optimized for fast batch processing
Data location is exposed so that computation can move to the data
Stores data in chunks spread across the nodes in the cluster
Provides VERY high aggregate bandwidth
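A minimal sketch of a client writing and reading an HDFS file through the Hadoop FileSystem API; the path and file contents are hypothetical, and a configured Hadoop client is assumed:

// Minimal sketch, not production code: assumes a Hadoop client whose default
// file system is configured to point at an HDFS cluster (fs.default.name).
// The path and the file contents below are hypothetical.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsHello {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();   // reads the cluster's Hadoop config
    FileSystem fs = FileSystem.get(conf);       // the configured default FS, e.g. HDFS

    Path file = new Path("/user/example/hello.txt");

    // Write a small file; HDFS splits it into chunks and replicates them across nodes.
    FSDataOutputStream out = fs.create(file);
    out.writeUTF("hello, grid");
    out.close();

    // Read it back through the same interface.
    FSDataInputStream in = fs.open(file);
    System.out.println(in.readUTF());
    in.close();
  }
}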
8. Hadoop DFS Architecture
9. Hadoop Map-Reduce Implementation of the Map-Reduce programming model
Framework for distributed processing of large data sets
Resilient to nodes failing and joining during a job
Great for web data and log processing
Pluggable user code runs in generic reusable framework
Input records are transformed, sorted and combined to produce a new output
All actions pluggable / configurable
A reusable design pattern
Input | Map | Shuffle | Reduce | Output
(example)
cat * | grep | sort | uniq -c > file (see the Java sketch below)
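As a rough Java sketch of the same pipeline using the classic org.apache.hadoop.mapred API; the pattern "condor", the class names, and the input/output paths are hypothetical, not part of the original example:

// Minimal sketch of Input | Map | Shuffle | Reduce | Output,
// roughly equivalent to: cat * | grep condor | sort | uniq -c > file
import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

public class GrepCount {

  // Map: emit (line, 1) for every input line that matches the pattern (the "grep" step).
  public static class MatchMapper extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    public void map(LongWritable offset, Text line,
                    OutputCollector<Text, IntWritable> out, Reporter reporter)
        throws IOException {
      if (line.toString().contains("condor")) {
        out.collect(line, ONE);
      }
    }
  }

  // Reduce: the framework has already sorted and grouped by key (the "sort" step),
  // so summing the values gives the "uniq -c" count per distinct line.
  public static class CountReducer extends MapReduceBase
      implements Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text line, Iterator<IntWritable> counts,
                       OutputCollector<Text, IntWritable> out, Reporter reporter)
        throws IOException {
      int sum = 0;
      while (counts.hasNext()) {
        sum += counts.next().get();
      }
      out.collect(line, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws IOException {
    JobConf conf = new JobConf(GrepCount.class);
    conf.setJobName("grep-count");
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);
    conf.setMapperClass(MatchMapper.class);
    conf.setReducerClass(CountReducer.class);
    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));
    JobClient.runJob(conf);
  }
}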
10. HOD (Hadoop on Demand) Adaptor that enables Hadoop use with batch schedulers
Provisions Hadoop clusters on demand
Scheduling is handled by resource managers like Condor
Requests N nodes from a resource manager and provisions them with a Hadoop cluster
Condor interaction
User specifies: number of nodes, workload to launch
HOD generates ClassAds for the Hadoop master and slaves and submits them as Condor jobs
Cluster comes up when the jobs start running
HOD launches workload
11. HOD (Hadoop on Demand) HOD shell
User interface to HOD is a command shell
Workloads are specified as command lines
Example:
% bin/hod -c hodconf -n 100
>> run hadoop-streaming.jar -mapper 'grep condor' -reducer 'uniq -c'
-input /user/sameerp/data -output /user/sameerp/condor
Work in progress
Data affinity for workloads
Implementation of elastic workloads
Software distribution via BitTorrent
12. Hadoop on Condor Clients launch jobs
Condor dynamically allocates clusters
HOD is used to start Hadoop Map-Reduce on the cluster
Map-Reduce reads and writes data in HDFS
When done
Results are stored in HDFS and/or returned to the client
Condor reclaims nodes
13. Other things in the works Record I/O
Define a structure once, use it in C, Java, Python
Export it in a binary or XML format
Streaming
A simple way to use existing Unix filters, or any program that reads stdin and writes stdout, in any language with Map-Reduce
Pig - Y! Research
Higher level data manipulation language, uses Hadoop
Data analysis tasks expressed as queries, in the style of SQL or Relational Algebra
http://research.yahoo.com/project/pig
14. The end