MapReduce
michel.bruley@teradata.com
Extract from various presentations: Sudarshan, Chungnam, Teradata Aster, …
April 2012

What is MapReduce?
• Restricted parallel programming model meant for large clusters
  • User implements Map() and Reduce() functions
• Parallel computing framework
  • Libraries take care of EVERYTHING else
    • Parallelization
    • Fault Tolerance
    • Data Distribution
    • Load Balancing
• Useful model for many practical tasks

Map and Reduce
• The idea of Map and Reduce is 40+ years old
  • Present in all functional programming languages
  • See, e.g., APL, Lisp and ML
• Alternate name for Map: Apply-All
• Higher-order functions
  • take function definitions as arguments, or
  • return a function as output
• Map and Reduce are higher-order functions

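The higher-order-function idea above can be illustrated with a minimal Python sketch. The names `apply_all`, `fold`, and `scale_by` are hypothetical and chosen only for illustration; they wrap Python's built-in `map` and `functools.reduce`.

```python
from functools import reduce


def apply_all(fn, items):
    """Map ("Apply-All"): apply fn to every element independently."""
    return list(map(fn, items))


def fold(fn, items, initial):
    """Reduce: combine all elements into one value using fn."""
    return reduce(fn, items, initial)


def scale_by(factor):
    """A higher-order function that returns a function as output."""
    return lambda x: x * factor


# apply_all and fold both take a function definition as an argument.
doubled = apply_all(scale_by(2), [1, 2, 3])
total = fold(lambda acc, x: acc + x, doubled, 0)

print(doubled)  # [2, 4, 6]
print(total)    # 12
```

Because each call to `fn` inside `apply_all` is independent of the others, the map step is the part that a framework can spread across many machines.
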
Map and Reduce Functions
• Functions borrowed from functional programming languages (e.g. Lisp)
• Map()
  • Processes a key/value pair to generate intermediate key/value pairs
• Reduce()
  • Merges all intermediate values associated with the same key

Example: Counting Words
• Map()
  • Input: <filename, file text>
  • Parses the file and emits <word, count> pairs
  • e.g. <"hello", 1>
• Reduce()
  • Sums all values for the same key and emits <word, TotalCount>
  • e.g. <"hello", (3 5 2 7)> => <"hello", 17>

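The word-count example above can be sketched in plain Python on a single machine. This is an illustration only, not the API of any MapReduce library: `map_fn`, `reduce_fn`, `run_word_count`, and the sample file names are hypothetical, and lowercasing the text before splitting is an assumption of this sketch.

```python
from collections import defaultdict


def map_fn(filename, text):
    """Map: parse the file text and emit a <word, 1> pair per word."""
    for word in text.lower().split():
        yield (word, 1)


def reduce_fn(word, counts):
    """Reduce: sum all values for the same key."""
    return (word, sum(counts))


def run_word_count(files):
    """Group intermediate pairs by key, then reduce each group."""
    grouped = defaultdict(list)
    for filename, text in files.items():
        for word, count in map_fn(filename, text):
            grouped[word].append(count)
    return dict(reduce_fn(word, counts) for word, counts in grouped.items())


files = {"a.txt": "hello world hello", "b.txt": "hello MapReduce"}
counts = run_word_count(files)
print(counts["hello"])                   # 3
print(reduce_fn("hello", [3, 5, 2, 7]))  # ('hello', 17)
```

The grouping step in `run_word_count` stands in for the work a real framework does between the map and reduce phases.
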
Execution on Clusters
• Input files are split (M splits)
• A master and workers are assigned
• Map tasks run
• Intermediate data is written to disk (R regions)
• Intermediate data is read and sorted
• Reduce tasks run
• Results are returned

Map/Reduce Cluster Implementation
[Figure: input files (split 0 to split 4) feed M map tasks, which write intermediate files; R reduce tasks read them and write output files (Output 0, Output 1)]
• Several map or reduce tasks can run on a single computer
• Each intermediate file is divided into R partitions by a partitioning function
• Each reduce task corresponds to one partition

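The flow of M map tasks, R partitions, and R reduce tasks can be simulated in a single process. This is a rough sketch under stated assumptions: the partitioning function here is CRC32 of the key modulo R (the slides do not say which function is used), the input is split round-robin, and `partition`, `map_task`, `reduce_task`, and `run_job` are hypothetical names.

```python
import zlib


def partition(key, R):
    """Assumed partitioning function: deterministic hash of the key mod R."""
    return zlib.crc32(key.encode("utf-8")) % R


def map_task(split):
    """One map task: emit a <word, 1> pair for every word in its split."""
    return [(word, 1) for line in split for word in line.split()]


def reduce_task(pairs):
    """One reduce task: sort its partition by key, then sum per key."""
    result = {}
    for key, value in sorted(pairs):
        result[key] = result.get(key, 0) + value
    return result


def run_job(lines, M, R):
    # Split the input into M splits (round-robin, for simplicity).
    splits = [lines[i::M] for i in range(M)]
    # Each map task writes its output into R regions, one per reduce task.
    intermediate = [[[] for _ in range(R)] for _ in range(M)]
    for m, split in enumerate(splits):
        for key, value in map_task(split):
            intermediate[m][partition(key, R)].append((key, value))
    # Reduce task r reads region r from every map task's output.
    outputs = []
    for r in range(R):
        pairs = [pair for m in range(M) for pair in intermediate[m][r]]
        outputs.append(reduce_task(pairs))
    return outputs


outputs = run_job(["a b a", "b c", "a c c", "d"], M=3, R=2)
for r, output in enumerate(outputs):
    print("Output", r, output)
```

Because the partitioning function is deterministic, every pair with a given key lands in the same region, so each key is summed by exactly one reduce task.
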
Map Reduce vs. Parallel Databases
• Map Reduce is widely used for parallel processing
  • Google, Yahoo, and hundreds of other companies
  • Example uses: compute PageRank, build keyword indices, analyze web click logs, …
• Database people say:
  • but parallel databases have been doing this for decades
• Map Reduce people say:
  • we operate at scales of thousands of machines
  • we handle failures seamlessly
  • we allow procedural code in map and reduce, and allow data of any type

Map Reduce Implementations
• Google
  • Not available outside Google
• Hadoop
  • An open-source implementation in Java
  • Uses HDFS for stable storage
  • Download: http://lucene.apache.org/hadoop/
• Teradata Aster
  • Cluster-optimized SQL database that also implements MapReduce
  • IITB alumnus among founders
• And several other systems in this space, such as Cassandra (a distributed data store that originated at Facebook), etc.

Solutions Stack for Teradata Aster
[Figure: stack diagram. Labels, in order: Data Integration / ETL, Business Intelligence Tools, Query Tools, Analytics Specialists, Aster Data Ecosystem, Systems Management, Security, Aster Data nCluster, Operating System, Aster Data Platform, Infrastructure, Servers, Cloud Infrastructure, Storage]

Teradata Aster Platform Infrastructure
For physical infrastructure (non-cloud) deployments
• Aster Data Analytic Platform: Aster Data nCluster packaged software
• nCluster Operating System: certified Linux operating system
• Server Hardware: certified commodity (x86) server hardware with internal storage

Teradata Aster Infrastructure
For cloud deployments
• Aster Data Analytic Platform: Aster Data nCluster packaged software
• nCluster Operating System: Linux operating system
• Compute Instance: compute instance from cloud provider (e.g. Amazon Web Services EC2); figure labels: CC, xLarge
• Storage: storage connected to cloud computing capacity; figure labels: EBS, Ephemeral

Teradata Aster Architecture for Analytics
• Your Analytics & Advanced Reporting Applications (App, App, App, App)
  • Support for in-database processing of custom applications written in a broad variety of languages
  • Integration with third-party packaged software via ODBC/JDBC or in-database integration
• Aster Data nCluster Analytic Functions and Frameworks
  • Rich libraries of MapReduce analytics from Aster Data and partners
  • Visual development environment: develop in hours
• Unified Interface (SQL-MapReduce, SQL)
  • Standard SQL interface
  • MapReduce processing integrated with SQL via the SQL-MapReduce interface
• Analytics Processing Engines (SQL, MapReduce, …)
  • Optimized SQL engine
  • Fully integrated in-database MapReduce
• Massively Parallel Data Stores
  • Hybrid row/column DBMS
  • Linear, incremental scalability
  • Commodity hardware

Teradata Aster Ecosystem
*Oracle BIEE certification currently in process