High Performance Computing Solutions for Data Mining Prof. Navneet Goyal
Topics
• Big Data: Sources, Characteristics, Management, Analytics, The Road Ahead
• The need for HPC for taming BIG DATA
• The holy grail of programming: Performance
  • Abstraction vs. Performance
  • Parallel Domain-Specific Languages (PDSLs)
• Distributed Computing: Clusters
• Parallel Computing: New Avtaar
• MapReduce & Hadoop
• What we are doing @ BITS-Pilani
• Motivation and Future Directions
BIG DATA
• Just a hype? Or a real challenge? Or a great opportunity?
• A challenge in terms of how to manage this data
• An opportunity in terms of what we can do with this data to enrich the lives of everybody around us and to make our mother Earth a better place to live
• IBM's Smarter Planet initiative!
Best quote so far…
"Data is the new 'oil', and there is a growing need for the ability to refine it."
- Dhiraj Rajaram, Founder & CEO of Mu Sigma, a leading data analytics company
Another quote
"We don't have better algorithms, we just have more data."
- Peter Norvig, Director of Research, Google
What Is Big Data?
There is no consensus on how to define big data:
"Big data exceeds the reach of commonly used hardware environments and software tools to capture, manage, and process it within a tolerable elapsed time for its user population." - Teradata Magazine article, 2011
"Big data refers to data sets whose size is beyond the ability of typical database software tools to capture, store, manage and analyze." - The McKinsey Global Institute, 2011
Big Data
In 2005 there were 1.3 billion RFID tags in circulation… by the end of 2011, this was about 30 billion and growing even faster.
Source: slides of Dean Compher, IBM
RFID
• Radio Frequency ID (RFID) tags
• Wal-Mart redesigned its supply chain around RFID
• The cost of RFID tags has come down so much that they have proliferated all over the world
• A good place to start with Big Data, because RFID tags are now ubiquitous, as is the opportunity for Big Data
• Used to track cars on toll routes, temperature-controlled food transport, livestock, inventories, luggage, retail, transportation tickets, …
Source: slides of Dean Compher, IBM
An increasingly sensor-enabled and instrumented business environment generates HUGE volumes of data with MACHINE-SPEED characteristics… LHR-JFK: 640 TB. 1 BILLION lines of code. EACH engine generating 10 TB every 30 minutes!
Aircraft
• Aircraft are hugely sensor-enabled devices, instrumented to collect data as they operate, and they generate huge volumes of data.
• This particular Airbus runs over a billion lines of code, and a single engine generates 10 terabytes of data every 30 minutes.
• And there are four engines, right? Over the roughly 8-hour UK to New York flight, that is 4 engines × 16 half-hour intervals × 10 TB, i.e., about 640 TB of data.
350B transactions/year. Meter reads every 15 min.: 120M meter reads/month, 3.65B meter reads/day.
In August 2010, Adam Savage of "MythBusters" took a photo of his vehicle using his smartphone and posted it to his Twitter account with the phrase "Off to work." Because the photo was taken on a smartphone, the image contained metadata revealing the exact geographical location at which it was taken. By simply taking and posting a photo, Savage revealed the exact location of his home, the vehicle he drives, and the time he leaves for work.
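As a concrete illustration of the metadata problem, here is a minimal Python sketch that reads the EXIF GPS tags embedded in a photo. It assumes the Pillow library is installed; the file name "photo.jpg" is a placeholder.

```python
# A minimal sketch of reading location metadata from a photo.
# Assumes the Pillow library (pip install Pillow); "photo.jpg" is a placeholder.
from PIL import Image
from PIL.ExifTags import TAGS, GPSTAGS

img = Image.open("photo.jpg")
exif = img._getexif() or {}  # raw EXIF data: tag-id -> value

for tag_id, value in exif.items():
    if TAGS.get(tag_id) == "GPSInfo":
        # GPSInfo is itself a dict of GPS tag-ids; decode the tag names
        gps = {GPSTAGS.get(t, t): v for t, v in value.items()}
        print(gps.get("GPSLatitude"), gps.get("GPSLongitude"))
```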
The Social Layer in an Instrumented, Interconnected World
• 4.6 billion camera phones worldwide
• 30 billion RFID tags today (1.3B in 2005)
• 12+ TB of tweet data every day
• 100s of millions of GPS-enabled devices sold annually
• ? TB of data every day
• 2+ billion people on the Web by end of 2011
• 25+ TB of log data every day
• 76 million smart meters in 2009… 200M by 2014
Big Data Includes Any of the Following Characteristics
Extracting insight from an immense volume, variety and velocity of data, in context, beyond what was previously possible.
• Volume: scale from terabytes to petabytes (1K TB) to zettabytes (1B TB)
• Velocity: streaming data and large-volume data movement
• Variety: manage the complexity of data in many different structures, ranging from relational, to logs, to raw text
Big Data
• Three key things with Big Data:
  • Volume (2.5 exabytes of data created every day, doubling every 40 months)
  • Velocity (real-time or near-real-time information needed)
  • Variety (new sources of data; unstructured)
• To turn all this information into competitive gold, we need innovative thinking and HPC tools (choose the right data; build models that predict and optimize)
• Using Big Data enables organizations to decide on the basis of evidence rather than intuition
• Data scientists are needed to manage Big Data and find insights in it
3 Vs of Big Data • The “BIG” in big data isn’t just about volume
Introduction
We are generating more data than we can handle!!! That's why we are here!!! BIG DATA. Using data to our benefit is still a far cry!!! In the future, everything will be data-driven. It is high time we figured out how to tame this "monster" and use it for the benefit of society.
Introduction
BIG DATA poses a big challenge to our capabilities:
• Data scaling is outpacing the scaling of compute resources
• CPU speeds are not increasing either
At the same time, BIG DATA offers a BIGGER opportunity: the opportunity to understand nature, evolution, human behavior/psychology/physiology, stock markets, road and network traffic, and more. An opportunity for us to be more and more innovative!!
Taming BIG DATA
Divide & Conquer: partition a large problem into smaller "independent" sub-problems that can be handled by different workers:
• threads in a processor core
• cores in a multi-/many-core processor
• multiple processors in a machine
• multiple machines in a cluster
• multiple clusters in a cloud
• …
Key idea: Abstraction (a minimal sketch follows below).
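To make the divide & conquer idea concrete, here is a minimal Python sketch (an illustration, not the deck's own code): a large input is partitioned into independent chunks, a pool of worker processes handles the chunks, and the partial results are combined. The same pattern scales from cores in one machine up to machines in a cluster.

```python
# Divide & conquer sketch: partition, process chunks independently, combine.
from multiprocessing import Pool

def process_chunk(chunk):
    # Stand-in for any per-partition computation (here: a partial sum)
    return sum(chunk)

def partition(data, n_parts):
    # Split data into n_parts roughly equal, independent chunks
    size = (len(data) + n_parts - 1) // n_parts
    return [data[i:i + size] for i in range(0, len(data), size)]

if __name__ == "__main__":
    data = list(range(1_000_000))
    with Pool(processes=4) as pool:           # the "workers"
        partials = pool.map(process_chunk, partition(data, 4))
    print(sum(partials))                      # combine the sub-results
```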
Analyzing BIG DATA
Data analysis, organization, retrieval, and modeling are other foundational challenges. Data analysis is a clear bottleneck in many applications, both due to the lack of scalability of the underlying algorithms and due to the complexity of the data that needs to be analyzed.*
*Challenges and Opportunities with Big Data: a community white paper developed by leading researchers across the United States
[Figure from: Challenges and Opportunities with Big Data, a community white paper developed by leading researchers across the United States]
Programming Models
Abstraction vs. Performance. Look at the generations of languages: the evolution from machine language to NLP.
• Abstraction is increasing
• But what about performance?
Are we paying too high a price for high levels of abstraction?? A delicate trade-off!!
Holy grail: the desired level of (abstraction + performance)
Domain-Specific Languages (DSLs)
There has been a lot of interest in DSLs recently: high-level languages optimized for a particular domain. Two types:
• Internal (embedded in a host language)
• External (new language, new compiler; more tedious)
OptiML is an internal DSL embedded in Scala. BIG DATA has necessitated the development of parallel DSLs (PDSLs). A toy sketch of the internal-DSL idea follows below.
Kevin J. Brown et al. A Heterogeneous Parallel Framework for Domain-Specific Languages. PACT 2011: 89-100.
A. K. Sujeeth et al. OptiML: An Implicitly Parallel Domain-Specific Language for Machine Learning. ICML, 2011.
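OptiML itself is embedded in Scala; purely to illustrate what an *internal* DSL is, the toy Python sketch below overloads the host language's operators so that expressions build a tree instead of evaluating immediately, giving a backend the chance to optimize or parallelize the whole expression before running it.

```python
# Toy internal-DSL sketch (illustration only, not OptiML):
# operators build an expression tree that a backend could rewrite.
class Expr:
    def __add__(self, other): return Op("+", self, other)
    def __mul__(self, other): return Op("*", self, other)

class Var(Expr):
    def __init__(self, name): self.name = name
    def __repr__(self): return self.name

class Op(Expr):
    def __init__(self, op, left, right):
        self.op, self.left, self.right = op, left, right
    def __repr__(self): return f"({self.left} {self.op} {self.right})"

a, b, c = Var("a"), Var("b"), Var("c")
print(a * b + c)   # -> ((a * b) + c): a tree the "compiler" can optimize
```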
Programming Issues
We need to go "wide" & "deep"!
• WIDE: more nodes in a cluster
• DEEP: more cores in a node
Active research in both WIDE & DEEP models. Nodes are typically multicore, but WIDE models are at the mercy of the OS for leveraging multicores. WIDE & DEEP models are not necessarily orthogonal, and combining them is nontrivial. Hybrid programming models: MPI + OpenMP (a Python analogue is sketched below).
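The hybrid MPI + OpenMP combination is a C/Fortran idiom; as a hedged Python analogue, the sketch below uses mpi4py (assumed installed) to go WIDE across ranks and a local process pool to go DEEP into each node's cores.

```python
# Hybrid WIDE + DEEP sketch using mpi4py (assumed installed).
# Run with e.g.: mpiexec -n 4 python hybrid.py
from mpi4py import MPI
from multiprocessing import Pool

def work(x):
    return x * x          # stand-in for a per-core computation

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

# WIDE: each MPI rank takes its own slice of the global problem
my_slice = range(rank * 1000, (rank + 1) * 1000)

# DEEP: fan the slice out over the local cores
with Pool(processes=4) as pool:
    local = sum(pool.map(work, my_slice))

# Combine the per-node results back at rank 0
total = comm.reduce(local, op=MPI.SUM, root=0)
if rank == 0:
    print(total)
```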
Distributed Computing
Existing distributed computing options like cluster & grid computing provide low levels of abstraction. The programmer has to deal with:
• synchronization, deadlocks, data dependencies, mutual exclusion, replication, reliability, platform scalability, and provisioning
Too much to ask of a data mining researcher!! What solutions are available?
MapReduce/Hadoop
MapReduce* is a programming model and its associated implementation:
• provides a high level of abstraction
• but has limitations: only data-parallel tasks stand to benefit!
MapReduce hides parallel/distributed computing concepts from users/programmers, so even novice users/programmers can leverage cluster computing for data-intensive problems. Cluster, grid, & MapReduce are intended as platforms for general-purpose computing. The Hadoop/Pig combo is very effective!
*MapReduce: Simplified Data Processing on Large Clusters. Jeffrey Dean and Sanjay Ghemawat, OSDI, 2004.
MapReduce/Hadoop pipeline: Input → Map → Shuffle → Reduce → Output
Source: http://www.slideshare.net/hadoop/practical-problem-solving-with-apache-hadoop-pig by Milind Bhandarkar
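Hadoop's real API is Java; the self-contained Python sketch below only mirrors the programming model, walking a word count through the Map, Shuffle, and Reduce stages shown above.

```python
# Word-count sketch of the Map -> Shuffle -> Reduce pipeline in plain Python.
from collections import defaultdict

def map_phase(document):
    for word in document.split():
        yield (word, 1)                 # emit (key, value) pairs

def shuffle(pairs):
    groups = defaultdict(list)
    for key, value in pairs:            # group all values by key
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    return (key, sum(values))           # aggregate per key

docs = ["big data needs big tools", "big clusters tame big data"]
pairs = (kv for d in docs for kv in map_phase(d))
print([reduce_phase(k, v) for k, v in shuffle(pairs).items()])
```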
Research Gaps
Distributed computing is the only option for large-scale data analytics. But:
• current distributed computing systems are for general-purpose computing, and
• they support low levels of abstraction (parallelization is nontrivial)
MapReduce does provide a higher level of abstraction:
• but it is not well suited for data mining (evident from the literature survey)
• in particular, it scales only for data-parallel problems
There is a need for a scalable distributed computing framework that provides both abstraction and performance.
Research Gaps
K-means on MapReduce/Hadoop:
• a typical data-parallel problem
• well suited for MapReduce (see the sketch below)
DBSCAN or OPTICS:
• not a data-parallel problem
• not suitable for MapReduce
• throws up new data-distribution and algorithmic challenges
There is a need for a scalable distributed computing framework that allows us to efficiently exploit all kinds of parallelism that exist in an algorithm.
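The sketch below (plain Python, 1-D points for brevity) shows why k-means is data-parallel: one iteration is a map that assigns each point to its nearest centroid independently, followed by a reduce that averages each centroid's points. DBSCAN/OPTICS have no such independent per-point step, which is why they resist this model.

```python
# One k-means iteration expressed in MapReduce style.
from collections import defaultdict

def nearest(point, centroids):
    return min(range(len(centroids)), key=lambda i: abs(point - centroids[i]))

def kmeans_iteration(points, centroids):
    # Map: assign each point to its nearest centroid; every point is
    # handled independently, so the data can be split freely across workers
    groups = defaultdict(list)
    for p in points:
        groups[nearest(p, centroids)].append(p)
    # Reduce: new centroid = mean of its assigned points
    return [sum(g) / len(g) for g in (groups[i] for i in sorted(groups))]

points = [1.0, 1.5, 2.0, 8.0, 9.0, 9.5]
print(kmeans_iteration(points, [2.0, 8.0]))   # -> [1.5, 8.833...]
```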
@ CS Department, BITS-Pilani
[Architecture diagram: a programming model serving researchers (DM algorithms) and end users (DM applications), on top of a distributed file system plus ~20 TB NAS for central storage, running on a 48-node multicore InfiniBand cluster]
@ CS Department, BITS-Pilani
[Architecture diagram, extended: the programming model exposed as a scalable service to end users (DM applications) and researchers (DM algorithms), on top of a distributed file system plus ~20 TB NAS for central storage, running on a 48-node multicore InfiniBand cluster and the cloud]
@ CS Department, BITS-Pilani
Advanced Data Analytics & Parallel Technologies Lab (ADAPT Lab): ADAPTing to the future
A new distributed computing framework for data mining
Funded by the HPC Division, Department of Electronics & Information Technology (DeitY): INR 1.20 crores (3 years)
Investigators: Navneet Goyal, Poonam Goyal, & Sundar B. Collaborators: IASRI, New Delhi
Full-time PhD students: Sonal Kumari (TCS PhD Fellow), Mohit Sati, Jagat Sesh Challa, Saiyedul Islam (TCS PhD Fellow)
@ CS Department, BITS-Pilani
ADAPT Lab infrastructure: 48-node Beowulf cluster, 20 TB NAS, Intel Cluster Studio, Vampir Standard, VMware vCenter, IBM SPSS Modeler
@ CS Department, BITS-Pilani
ADAPT Lab programming environment:
• MPI 2.0 (Open MPI) for distributed memory
  • does not exploit cores; MPI 3.0 exploits cores
• OpenMP vs. TBB (Intel) vs. OpenCL for shared memory
• Profiling & debugging tools: Vampir, PGI, TotalView, Intel Cluster Studio XE
Future Directions
• Interpretation of analysis: data visualization
• Better programming models
• Data modeling
• Sampling techniques
Skill Sets
• Probability & statistics
• Linear algebra
• Vector analysis
Thank you Q & A goel@pilani.bits-pilani.ac.in