Hadoop for Big Data Analytics on SPARC T5 Servers
Debabrata Sarkar, Pirama Arumuga Nainar, Jeff Taylor
Performance Technologies, Systems Group
Disclaimer The following is intended to outline our general product direction. It is intended for information purposes only, and may not be incorporated into any contract. It is not a commitment to deliver any material, code, or functionality, and should not be relied upon in making purchasing decisions. The development, release, and timing of any features or functionality described for Oracle’s products remains at the sole discretion of Oracle Corporation.
Program Agenda • What is Big Data Analytics and why we need Hadoop • Hadoop and Oracle RDBMS • New 12c feature enables In-Database MapReduce • Why run Hadoop on SPARC T5 servers • Hadoop performance on SPARC T5 servers • Takeaways
What is Big Data Analytics? • Big Data Analytics is the process of examining large amounts of data to uncover hidden patterns, unknown correlations, and other useful information. • The data is both structured and unstructured, and often comes from disparate sources. • Conventional BI tools usually operate on relational databases and are not designed to analyze large volumes of unstructured data • For example, Twitter feeds or click-stream data • In Big Data Analytics, the user often does not know how to formulate a query in advance. For example, they may want to divide customers into segments, but not know which dimensions to use.
Why Hadoop for Big Data Analytics? • Hadoop is also suitable for analyzing relational data, because the user writes custom analysis code and is not limited by SQL language constructs. • Hadoop is very good for "embarrassingly parallel" problems, and many Big Data Analytics problems are of that kind • Big Data Analytics often needs aggregation algorithms far more sophisticated than the usual Max/Min/Average. Mahout, an Apache project built on Hadoop, provides a rich set of libraries such as K-Means clustering, Logistic Regression, and Principal Components Analysis • The Hadoop Distributed File System reduces the cost of storing large volumes of unstructured data, when security is not a major concern
Apache Mahout: Machine Learning at Scale • A set of machine learning algorithms that run on top of the Hadoop Map-Reduce engine. • The goals of this Apache project (top level since 2010): • Build massively scalable machine learning libraries. • Be as fast and efficient as the algorithm allows. • Algorithms implemented include: • Collaborative filtering, user- and item-based recommendation, K-Means and Fuzzy K-Means clustering, Mean Shift clustering, Dirichlet process clustering, the Complementary Naive Bayes classifier, and many more (a minimal recommender sketch follows below)
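To make the library concrete, here is a minimal sketch of Mahout's classic user-based collaborative-filtering recommender (the non-distributed Taste API). This is not from the slides: the file name ratings.csv, the neighborhood size of 10, and the user/recommendation counts are all illustrative assumptions; the expected input format is one userID,itemID,preference triple per line.

```java
import java.io.File;
import java.util.List;

import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.recommender.Recommender;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;

public class RecommenderSketch {
    public static void main(String[] args) throws Exception {
        // "ratings.csv" is a hypothetical input: lines of userID,itemID,preference
        DataModel model = new FileDataModel(new File("ratings.csv"));

        // Pearson correlation between users' rating vectors
        UserSimilarity similarity = new PearsonCorrelationSimilarity(model);

        // Consider the 10 most similar users (10 is an arbitrary choice)
        UserNeighborhood neighborhood =
            new NearestNUserNeighborhood(10, similarity, model);

        Recommender recommender =
            new GenericUserBasedRecommender(model, neighborhood, similarity);

        // Top 3 recommendations for user 1
        List<RecommendedItem> items = recommender.recommend(1, 3);
        for (RecommendedItem item : items) {
            System.out.println(item.getItemID() + " : " + item.getValue());
        }
    }
}
```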
Hadoop/Mahout Use Cases [Chart: use cases plotted by Complexity vs. Volume; Traditional Data Warehousing and Text Mining sit at the low end, Big Data Analytics at the high end] • Customer segmentation • Credit & market risk analysis in banks • Analysis of which products are often bought together • Drug side-effect discovery • Video surveillance analysis • Social media sentiment analysis
Program Agenda • What is Big Data Analytics and why we need Hadoop • Hadoop and Oracle RDBMS • Why run Hadoop on SPARC T5 servers • Hadoop performance on SPARC T5 servers • Takeaways
Processing Relational Data using Hadoop [Diagram: the traditional approach; structured data flows from the RDBMS to a Hadoop cluster for MapReduce processing, and the aggregated data flows back for data mining]
Limitations of the current approach • Operational issues • The data is often too big to move across a wide area network • Data correctness/loss concerns • Legal issues, in some cases • Lack of enterprise-class security on the MapReduce infrastructure
Big Data with Oracle In-Database MapReduce [Diagram: unstructured data stays on the Hadoop cluster (HDFS, NoSQL, MapReduce), while structured data is processed with In-Database MapReduce inside the RDBMS, where data mining also runs]
Why run Hadoop on SPARC T5 servers • Hadoop is usually run on a cluster of small systems • However, researchers have experimented with Hadoop on multi-core systems as far back as 2006 • Advances in hardware make us revisit that scenario • Today's systems have the same number of cores as yesterday's moderate-sized clusters • 1 TB systems are becoming common, and 32 TB systems have already been announced; iterative MapReduce algorithms run better in main memory • Often, the dataset that needs to be analyzed is already inside a database running on a T5 server
Overview of Oracle SPARC T5 • SPARC T5 Chip • SPARC T5 Servers • SPARC T5 Benchmark Results
Oracle SPARC T5 Processor • 16 SPARC S3 cores per chip at 3.6 GHz (128 virtual CPUs per chip) • Improved chip scalability, memory bandwidth, and I/O bandwidth • 8 MB shared L3 cache • 4 integrated DDR3 on-die memory controllers, for 2x the memory bandwidth • Integrated PCI Express Generation 3 doubles the I/O bandwidth vs. T4 • S3 core features: • Dynamic threading • Sophisticated branch prediction and prefetching • On-chip cryptography for secure application deployment
T5 servers • T5-1B – one chip with up to 512 GB memory, in a blade chassis • T5-2 – two chips with up to 1 TB memory, in a 2RU chassis • T5-4 – four chips with up to 2 TB memory, in a 4RU chassis • T5-8 – eight chips with up to 4 TB memory, in a 5RU chassis
SPARC T4/T5 & Oracle Solaris: Twenty #1’s • #1 Siebel Loyalty Batch, #1 JD Edwards Online, #1 PeopleSoft Payroll Batch, #1 E-Business Order to Cash, #1 SPECjvm • Oracle Fusion Middleware & Oracle Apps: #1 Fusion: SPECjEnterprise, #1 E-Business Consolidation, #1 PeopleSoft HR/SS, #1 JD Edwards Consolidation, #1 Siebel PSPP, #1 PeopleSoft FMS Batch • Industry Applications: #1 Communications Activation, #1 Communications Service Broker, #1 BRM, #1 Financial Modeling • Database & Analytics: #1 3TB TPC-H 4-processor, #1 Oracle Database Security TDE, #1 Oracle OLAP, #1 Oracle BI EE, #1 Essbase • Leads in every area: ERP, HCM, SCM, CRM, FMS, SRM, BI-DW, OLTP (see benchmark disclosure slide)
Program Agenda • What is Big Data Analytics and why we need Hadoop • Hadoop and Oracle RDBMS • Why run Hadoop on SPARC T5 servers • Hadoop performance on SPARC T5 servers • Takeaways
Experimental Setup • SPARC T5-4 server • 512 virtual CPUs, 1024 GB memory • 20 VMs (Solaris Zones) • Number of VMs fixed across experiments • RAM Disks used as storage for HDFS • Network traffic over virtual interfaces (VNICs) • Hadoop 1.1.2, Mahout 0.7 • JDK 1.6, Solaris 11.1
Workloads • Terasort • Most common Hadoop benchmark • Co-occurrence • Often used to analyze business data • K-Means clustering • Machine learning algorithm • Also studied the effect of running them together
TeraSort • Not to be compared with published results • Apples-to-oranges: completely different hardware and software configurations • Dataset size varied from 10 GB to 100 GB • Limited so that the working data set stays in memory • Scaling is in line with expectations • From 10 GB to 100 GB, the data increases 10x • Computation increases ~30x more for an n log n algorithm • 8x more mappers limits the increase to roughly 4.5x
Co-occurrence • An instance of complex event processing (CEP) • From a stream of events, find subsets of events that meet a criterion • Example: • Input: customer purchase history • Output: products that are often purchased together • Used in product recommendation systems • Characteristics of the algorithm • Uses a sliding window over the event stream, hence worst-case quadratic • “Exploding map” characteristic: intermediate data size is much higher than the actual input and output size (see the sketch below)
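The slides do not include the actual code, so the following is only a minimal sketch of the idea in Hadoop's MapReduce API. Everything here is an illustrative assumption: one input line per customer holding a space-separated, time-ordered list of product IDs, a hypothetical window size WINDOW, and the classic "pairs" pattern where the mapper emits every co-occurring pair and the reducer sums the counts. The nested loop in the mapper is what makes the intermediate data explode relative to the input.

```java
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class CoOccurrence {

    // Hypothetical sliding-window size over each purchase sequence
    private static final int WINDOW = 5;

    public static class PairMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text pair = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // One line = one customer's time-ordered product IDs
            String[] items = value.toString().split("\\s+");
            // Pair each item with its neighbors inside the window:
            // the worst-case-quadratic, "exploding map" step
            for (int i = 0; i < items.length; i++) {
                for (int j = i + 1; j < items.length && j <= i + WINDOW; j++) {
                    // Canonical ordering so (a,b) and (b,a) count together
                    String a = items[i], b = items[j];
                    pair.set(a.compareTo(b) <= 0 ? a + "," + b : b + "," + a);
                    context.write(pair, ONE);
                }
            }
        }
    }

    public static class SumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values,
                Context context) throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            // (productA,productB) -> number of co-occurrences
            context.write(key, new IntWritable(sum));
        }
    }
}
```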
Co-occurrence Results [Chart: time (in seconds) for Co-occurrence] • The only tuning performed was adjusting the number of mappers and reducers (see the snippet below) • The optimal degree of parallelism (DOP) depends on the dataset size • 500M purchase orders is a realistic dataset size even for a big retailer, and Hadoop on T5 handles it with ease
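For reference, a sketch of how that tuning is typically expressed against Hadoop 1.x. The property names are the standard Hadoop 1.x knobs, but the specific values below are arbitrary placeholders, not the values used in these experiments (per-node map/reduce slot limits are set separately in mapred-site.xml via mapred.tasktracker.map.tasks.maximum and its reduce counterpart).

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class TuningSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Hint for the number of map tasks; the input split count
        // ultimately decides, Hadoop treats this as advisory
        conf.setInt("mapred.map.tasks", 64);

        Job job = new Job(conf, "co-occurrence");

        // The number of reduce tasks is honored exactly
        job.setNumReduceTasks(32);
    }
}
```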
K-Means Clustering • Widely used machine learning algorithm • Applications include market segmentation, product positioning, recommender systems, social network analysis, and medical imaging • Problem: • Input: points in an n-dimensional space and a parameter k • Output: a grouping of the points into k clusters and a representative/central point for each cluster (see the formulation below) • Distributed implementation in Apache Mahout
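For completeness (this is not on the slides), the standard K-Means formulation that a distributed implementation such as Mahout's iterates over. Each MapReduce pass assigns points to their nearest centroid in the mappers and recomputes the centroids in the reducers:

```latex
\text{minimize } J \;=\; \sum_{i=1}^{N} \min_{k \in \{1,\dots,K\}} \lVert x_i - \mu_k \rVert^2
\qquad
\mu_k \;\leftarrow\; \frac{1}{\lvert C_k \rvert} \sum_{x_i \in C_k} x_i
```

where C_k is the set of points currently assigned to centroid mu_k; the update rule is exactly what the reduce step computes.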
K-Means Clustering Results • 75M points in an n=20 dimensional space • Divided into k=32 clusters, over 5 iterations • The input data resides in memory the whole time
Effect of Multiple Simultaneous Workloads • Executed the K-Means and Co-occurrence workloads in parallel • No special tuning was done • K-Means slows down by 20-25% • Co-occurrence slows down by 17% [Chart: time (in seconds) for K-Means, solo vs. shared]
Key Findings • Hadoop can take advantage of the large number of cores on T5 • Scaling is achieved using virtualization (Solaris Zones) • Many VMs, each running a few mappers and reducers • Multiple Hadoop jobs can execute simultaneously without significant degradation in performance • Switching between a cluster of physical machines and a virtual cluster running on a massively multi-core machine is seamless and easy
Takeaways • Apache Hadoop and Mahout run very well on T5: • Today’s multi-core machines are equivalent to yesterday’s clusters • Not everyone wants to run Hadoop with 10,000 nodes • Many use cases have both relational data AND unstructured data • In-Database Hadoop is a great tool for MapReduce processing of relational data • Many Hadoop jobs can run simultaneously on a single T5 • Easier sharing using Solaris resource management features • Easy to implement multiple virtual Hadoop clusters