Big Data and Hadoop on Windows
.Net SIG Cleveland
Image credit: morguefile.com/creative/imelenchon
About Me
• Serkan Ayvaz, Sr. Systems Analyst, Cleveland Clinic
• PhD Candidate, Computer Science, Kent State Univ.
• LinkedIn: serkanayvaz@gmail.com
• email: ayvazs@ccf.org
• Twitter: @sayvaz
Agenda • Introduction to Big Data • Hadoop Framework • Hadoop On Windows • Ecosystem • Conclusions
What is Big Data? (“Hype?”)
• Big data is a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications. The challenges include capture, curation, storage, search, sharing, transfer, analysis, and visualization. -Wikipedia
What is new?
• Enterprise data grows rapidly
• Emerging market for vendors
• New data sources
• Competitive industries need more insights
• Asking different questions
• Generating models instead of transforming data into models
What is the problem?
• Size of data: rapid growth; TBs to PBs are the norm for many organizations
• As of 2012, data sets that were feasible to process in a reasonable amount of time were on the order of exabytes
• Variety of data: relational, device-generated data, mobile, logs, web data, sensor networks, social networks, etc.
• Structured
• Unstructured
• Semi-structured
• Rate of data growth
• As of 2012, every day 2.5 quintillion (2.5×10^18) bytes of data were created -Wikipedia
• Particularly large datasets: meteorology, genomics, complex physics simulations, biological and environmental research, Internet search, finance and business informatics
Critique
• Even as companies invest eight- and nine-figure sums to derive insight from information streaming in from suppliers and customers, less than 40% of employees have sufficiently mature processes and skills to do so. To overcome this insight deficit, "big data", no matter how comprehensive or well analyzed, needs to be complemented by "big judgment", according to an article in the Harvard Business Review.
• Consumer privacy concerns raised by the increasing storage and integration of personal information
Things to consider
• Return on investment may differ
• Asking the wrong questions won't get the right answers
• Experts need to fit into the organization
• Requires a leadership decision
• Traditional systems might be fine (for now)
What is Hadoop?
Hadoop Core: HDFS (storage) + MapReduce (processing)
• Scalability
• Scales horizontally; vertical scaling has limits
• Scales seamlessly
• Moves processing to the data, as opposed to traditional methods
• Network bandwidth is a limited resource
• Processes data sequentially in chunks, avoiding random access
• Seeks are expensive, disk throughput is reasonable (see the sketch below)
• Fault tolerance
• Data replication
• Economical
• Commodity servers (“not low-end”) vs. specialized servers
• Ecosystem
• Integration with other tools
• Open source
• Innovative, extensible
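To make "seeks are expensive, throughput is reasonable" concrete, here is a minimal back-of-envelope sketch in C#. The disk numbers (~10 ms per seek, ~100 MB/s sequential transfer) are illustrative assumptions, not figures from the deck:

using System;

class SeekVsScan
{
    static void Main()
    {
        // Illustrative commodity-disk numbers (assumptions, not measurements):
        const double seekSec = 0.010;        // ~10 ms average seek
        const double mbPerSec = 100.0;       // ~100 MB/s sequential transfer
        const double fileMB = 1_000_000.0;   // a 1 TB file

        // Streaming the whole file sequentially: essentially pure transfer time.
        double streamHours = (fileMB / mbPerSec) / 3600.0;

        // Reading the same data as random 100 KB chunks: one seek per chunk.
        double chunkMB = 100.0 / 1024.0;
        double chunks = fileMB / chunkMB;
        double randomHours = chunks * (seekSec + chunkMB / mbPerSec) / 3600.0;

        Console.WriteLine($"Sequential scan: {streamHours:F1} h"); // ~2.8 h
        Console.WriteLine($"Random access:   {randomHours:F1} h"); // ~31 h
    }
}

Streaming the terabyte takes under three hours, while seeking for every small chunk pushes the same read past thirty; this is why MapReduce processes data in large sequential chunks.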
What can I do with Hadoop?
• Distributed programming (MapReduce)
• Storage, archive legacy data
• Transform data
• Analysis, ad hoc reporting
• Look for patterns
• Monitoring/processing logs
• Abnormality detection
• Machine learning and advanced algorithms
• Many more
HDFS Blocks
• Large enough to minimize the cost of seeks (64 MB default)
• A unit of abstraction that makes storage management simpler than whole files
• Fits well with the replication strategy and availability (a sizing sketch follows)
NameNode
• Maintains the filesystem tree and metadata for all the files and directories
• Stores the namespace image and edit log
DataNode
• Stores and retrieves blocks
• Reports its blocks back to the NameNode periodically
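As a quick illustration of the block model (the 1 GB file size is my example; the 64 MB block size and replication factor of 3 are the HDFS defaults):

using System;

class BlockMath
{
    static void Main()
    {
        const long fileBytes = 1L * 1024 * 1024 * 1024; // a 1 GB file
        const long blockBytes = 64L * 1024 * 1024;      // 64 MB HDFS default
        const int replication = 3;                      // HDFS default factor

        // Files are split into fixed-size blocks (the last block may be short).
        long blocks = (fileBytes + blockBytes - 1) / blockBytes;

        // Each block is replicated; the NameNode tracks every replica in memory.
        Console.WriteLine($"Blocks: {blocks}, replicas stored: {blocks * replication}");
        // Prints: Blocks: 16, replicas stored: 48
    }
}

The per-block metadata held in NameNode memory is also why many small files are a poor fit, as the next slide notes.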
HDFS
Good
• Designed for, and shines with, large files
• Fault tolerance: data replication within and across racks
• Hadoop breaks data into smaller blocks
• Data locality
• Most efficient with a write-once, read-many-times pattern
Not so good
• Low-latency data access: optimized for high-throughput data, possibly at the expense of latency; consider HBase for low latency
• Lots of small files: the NameNode holds filesystem metadata in memory, which limits the number of files in a filesystem
• Multiple writers, arbitrary file modifications: files in HDFS are written by a single writer
Data Flow
• Diagrams of the HDFS read path and write path
• Source: Hadoop: The Definitive Guide
MapReduce Programming
• Splits input files into blocks
• Operates on key-value pairs
• Mappers filter & transform input data
• Reducers aggregate mapper output
• Handles processing efficiently in parallel
• Moves code to data (data locality)
• The same code runs on all machines
• Can be difficult to implement some algorithms
• Can be implemented in almost any language
• Streaming MapReduce for Python, Ruby, Perl, PHP, etc.
• Pig Latin as a data flow language
• Hive for SQL users
MapReduce
• Programmers write two functions:
map (k, v) → <k’, v’>*
reduce (k’, v’) → <k’, v’>*
• All values with the same key are reduced together
• For efficiency, programmers typically also write:
partition (k’, number of partitions) → partition for k’
• Often a simple hash of the key, e.g., hash(k’) mod n
• Divides up the key space for parallel reduce operations
combine (k’, v’) → <k’, v’>*
• Mini-reducers that run in memory after the map phase
• Used as an optimization to reduce network traffic
• The framework takes care of the rest of the execution (see the combiner/partitioner sketch below)
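To make the combine and partition steps concrete, here is a minimal framework-free C# sketch of my own (not an API from the deck): the combiner pre-sums counts on the map side, and the partitioner routes each key to a reducer with hash(k) mod n:

using System;
using System.Collections.Generic;
using System.Linq;

class CombineAndPartition
{
    // combine (k, v)* -> (k, v')*: a mini-reduce on the map side that
    // collapses repeated keys before they cross the network.
    static Dictionary<string, int> Combine(IEnumerable<KeyValuePair<string, int>> mapOutput)
    {
        var sums = new Dictionary<string, int>();
        foreach (var kv in mapOutput)
            sums[kv.Key] = sums.TryGetValue(kv.Key, out var s) ? s + kv.Value : kv.Value;
        return sums;
    }

    // partition (k, n) -> reducer index: hash(k) mod n divides up the key space.
    // (string.GetHashCode is process-specific in .NET; real partitioners use a stable hash.)
    static int Partition(string key, int numReducers) =>
        (key.GetHashCode() & int.MaxValue) % numReducers;

    static void Main()
    {
        var mapped = "to be or not to be".Split(' ')
                                         .Select(w => new KeyValuePair<string, int>(w, 1));
        foreach (var kv in Combine(mapped))
            Console.WriteLine($"{kv.Key}:{kv.Value} -> reducer {Partition(kv.Key, 3)}");
    }
}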
Simple example - Word Count

// MapReduce functions in JavaScript
var map = function (key, value, context) {
    var words = value.split(/[^a-zA-Z]/);
    for (var i = 0; i < words.length; i++) {
        if (words[i] !== "") {
            context.write(words[i].toLowerCase(), 1);
        }
    }
};

var reduce = function (key, values, context) {
    var sum = 0;
    while (values.hasNext()) {
        sum += parseInt(values.next());
    }
    context.write(key, sum);
};
Divide and Conquer
• Diagram: input key-value pairs are split across mappers; each mapper's output is locally combined, then partitioned; shuffle and sort aggregates values by key; reducers produce the final output
How MapReduce Works?

Map(String docid, String text):
    for each word w in text:
        Emit(w, 1);

Reduce(String term, Iterator<Int> values):
    int sum = 0;
    for each v in values:
        sum += v;
    Emit(term, sum);

Source: Hadoop: The Definitive Guide
How is it different from other systems?
• Parallel: Message Passing Interface (MPI)
• Suited to compute-intensive jobs
• Struggles with larger data volumes: network bandwidth becomes the bottleneck and compute nodes sit idle
• Hard to implement
• The challenge of coordinating processes in a large-scale distributed computation
• Handling partial failure
• Managing checkpointing and recovery
MapReduce
• MapReduce is complementary to RDBMSs, not competing
• MapReduce is a good fit for analyzing a whole dataset in batch
• An RDBMS is good for point queries or updates
• Indexed to deliver low-latency retrieval
• Relatively small amounts of data
• MapReduce suits applications where the data is written once and read many times
• An RDBMS is good for datasets that are continually updated
Hadoop on Windows Overview
• HDInsight: HDInsight Server (on premises) and HDInsight on the cloud; familiar tools & functionality
• Hortonworks Data Platform: Windows platform; 100% open source; contributions to the community
• Apache Hadoop core: common framework from the open source community, shared by all distributions
Hadoop on Windows
• Standard Hadoop modules
• HDFS
• MapReduce
• Pig
• Hive
• Monitoring pages
• Easy installation and configuration
• Integration with Microsoft systems
• Active Directory
• System Center
• etc.
Why is Hadoop on Windows important?
• Windows Server's large market share
• Large developer and user community
• Existing enterprise tools
• Familiarity
• Simplicity of use and management
• Deployment options on both Windows Server and Windows Azure
• Users: self-service tools, data viewers, BI, visualization
• HADOOP (server and cloud): Java, Streaming, HiveQL, Pig Latin, .NET, other languages, over HDFS
• DATA (unstructured, semi-structured, structured): NoSQL, SQL, HDFS
• External data: web, mobile devices, social media
• Legacy data, RDBMS
Run Jobs
• Submit a JAR file (Java MapReduce)
• HiveQL
• Pig Latin
• .NET wrapper through Streaming
• .NET MapReduce
• LINQ to Hive
• JavaScript console
• Excel Hive Add-In
.NET MapReduce Example
NuGet packages:
install-package Microsoft.Hadoop.MapReduce
install-package Microsoft.Hadoop.Hive
install-package Microsoft.Hadoop.WebClient
• Reference “Microsoft.Hadoop.MapReduce.dll”
• Create a class that implements “HadoopJob<YourMapper>”
• Create a class called “FirstMapper” that implements “MapperBase”
• Run the DLL using the MRRunner utility:
> MRRunner -dll MyDll -class MyClass -- extraArg1 extraArg2
• Or invoke a job from your own executable:
var hadoop = Hadoop.Connect();
hadoop.MapReduceJob.ExecuteJob<JobType>(arguments);
.NET MapReduce Example

public class FirstJob : HadoopJob<SqrtMapper>
{
    public override HadoopJobConfiguration Configure(ExecutorContext context)
    {
        HadoopJobConfiguration config = new HadoopJobConfiguration();
        config.InputPath = "input/SqrtJob";
        config.OutputFolder = "output/SqrtJob";
        return config;
    }
}

public class SqrtMapper : MapperBase
{
    public override void Map(string inputLine, MapperContext context)
    {
        int inputValue = int.Parse(inputLine);

        // Perform the work.
        double sqrt = Math.Sqrt((double)inputValue);

        // Write output data.
        context.EmitKeyValue(inputValue.ToString(), sqrt.ToString());
    }
}
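A minimal in-process driver for this job, sketched using only the Hadoop.Connect() and ExecuteJob calls shown on the previous slide (cluster connection details omitted):

using Microsoft.Hadoop.MapReduce;

class Program
{
    static void Main()
    {
        // Connect to the default gateway, as on the previous slide.
        var hadoop = Hadoop.Connect();

        // Run FirstJob; its Configure() supplies the input/output paths,
        // and the SDK ships SqrtMapper to the cluster via streaming.
        hadoop.MapReduceJob.ExecuteJob<FirstJob>();
    }
}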
Hadoop Ecosystem • Hadoop • Common, MapReduce, HDFS • HBase • Column oriented distributed database • Hive • Distributed data warehouse-SQL like query platform • Pig • Data transformation language • Sqoop • Tool for bulk Import/export between HDFS, HBase, Hive and relational databases • Mahout • Data Mining Algorithms • ZooKeeper • Distributed Coordination service • Oozie • Job Running and scheduling workflow service
What’s HBase?
• Column-oriented distributed DB
• Inspired by Google BigTable
• Uses HDFS
• Interactive processing
• Can be used without MapReduce
• PUT, GET, SCAN commands
What’s Hive?
• Translates HiveQL, an SQL-like language, into MapReduce
• A distributed data warehouse
• Table file format over HDFS
• Integrates with BI products on tabular data via Hive ODBC and JDBC drivers
Hive
Good for
• HiveQL: familiar, high-level language
• Batch jobs, ad hoc queries
• Self-service BI tools via ODBC, JDBC (see the connection sketch below)
• Schemas, but not as strict as traditional RDBMSs
• Supports UDFs
• Easy access to Hadoop data
Not so good for
• No updates or deletes; insert only
• Limited indexes, built-in optimizer, no caching
• Not OLTP
• Not fast; queries run as MapReduce jobs
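A sketch of the ODBC route mentioned above, from C# via System.Data.Odbc; the "HiveDSN" data source name and the "weblogs" table are placeholders of my own, and the Hive ODBC driver must already be installed:

using System;
using System.Data.Odbc;

class HiveOdbcQuery
{
    static void Main()
    {
        // "HiveDSN" is a hypothetical system DSN pointing at the Hive server.
        using (var conn = new OdbcConnection("DSN=HiveDSN"))
        {
            conn.Open();

            // A simple HiveQL aggregate; "weblogs" is an illustrative table.
            var cmd = new OdbcCommand(
                "SELECT status, COUNT(*) FROM weblogs GROUP BY status", conn);

            using (OdbcDataReader reader = cmd.ExecuteReader())
            {
                while (reader.Read())
                    Console.WriteLine($"{reader.GetString(0)}: {reader.GetInt64(1)}");
            }
        }
    }
}

Note that the query compiles to a MapReduce job behind the scenes, so expect batch latency rather than RDBMS response times.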
Conclusion
• Hadoop is great for its purposes and here to stay
• BUT it is not a common cure for every problem
• Developing standards and best practices is very important
• Users may abuse the resources and scalability
• Integration with the Windows platform
• Existing systems, tools, expertise
• Parallelization
• Easier to scale as needed
• Economical
• Commodity hardware
• Relatively short training and application development time with Windows
Resources & References
• Hadoop: The Definitive Guide by Tom White
http://www.amazon.com/Hadoop-Definitive-Guide-Tom-White/dp/0596521979
• Data-Intensive Text Processing with MapReduce, Jimmy Lin and Chris Dyer. Morgan & Claypool Publishers, 2010.
• Apache Hadoop
http://hadoop.apache.org/
• Microsoft Big Data page
http://www.microsoft.com/en-us/sqlserver/solutions-technologies/business-intelligence/big-data.aspx
• Hortonworks Data Platform
http://hortonworks.com/products/hortonworksdataplatform/
• Hadoop SDK
http://hadoopsdk.codeplex.com/
Thank you! Any Questions?