Explore the concept of Big Data, the motivation behind its significance, and the role of Hadoop in handling massive data sets. Learn about Hadoop components, scalability principles, and how Hadoop compares with traditional databases. Discover Hadoop distributions and common tools for data analysis.
Big Data Technology: Introduction to Hadoop Antonino Virgillito
Motivation • The main characterization of Big Data is mostly to be…well… “Big” • Intuitive definition: a size that “creates problems” when handled with ordinary tools and methods • However, the exact definition of “Big” is a moving target • Where do we draw the line? • Big Data tools in IT were specifically tailored to handle those cases where common data handling tools fail for some reason • E.g. Google, Facebook…
Motivation • Large size that grows continuously and indefinitely • Difficult to define a storage size that will fit • Processing and querying huge data sets require a lot of memory and CPU • No matter how much you expand the technical specifications: if data is “Big” you eventually hit the roof…
Is Big Data Big in Official Statistics? • Do we really have to handle those massive dimensions? • Think about the largest dataset you ever used… • Yes • The example of scanner data in Istat • Maybe • We should be ready when it happens • No • Big Data technology can still be useful for complex processing of “normal” data sets
Big Data Technology • Handling volume -> Distributed platforms • The standard: Hadoop • Handling variety -> NoSQL databases
Hadoop • Open source platform for distributed processing of large data • Distributed: works on a cluster of servers • Functions: • Distribution of data and processing across machines • Management of the cluster • Distribution is transparent to the programmer-analyst
Hadoop scalability • Hadoop can reach massive scalability by exploiting a simple distribution architecture and coordination model • Huge clusters can be made up using (cheap) commodity hardware • A 1000-CPU machine would be much more expensive than 1000 single-CPU or 250 quad-core machines • Clusters can easily scale up with little or no modification to the programs
Hadoop Components • HDFS: Hadoop Distributed File System • Abstraction of a file system over a cluster • Stores large amounts of data by transparently spreading them on different machines • MapReduce • Simple programming model that enables parallel execution of data processing programs • Executes the work close to where the data is stored • In a nutshell: HDFS places the data on the cluster and MapReduce does the processing work
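As a minimal sketch of how data lands on the cluster (file and directory names here are hypothetical), the HDFS shell mirrors familiar Unix file commands:
# copy a local file into HDFS; it is transparently split into blocks
hadoop fs -put sales.csv /data/sales.csv
# list the HDFS directory
hadoop fs -ls /data
# read the file back
hadoop fs -cat /data/sales.csv | head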
Hadoop Principle • Hadoop is basically a middleware platform that manages a cluster of machines • The core component is a distributed file system (HDFS) • Files in HDFS are split into blocks that are scattered over the cluster • The cluster can grow indefinitely simply by adding new nodes
MapReduce and Hadoop • MapReduce is logically placed on top of HDFS
MapReduce and Hadoop • MR works on (big) files loaded on HDFS • Each node in the cluster executes the MR program in parallel, applying the map and reduce phases to the blocks it stores • Output is written back to HDFS • Scalability principle: perform the computation where the data is
The MapReduce Paradigm • Parallel processing paradigm • Programmer is unaware of parallelism • Programs are structured into a two-phase execution: Map and Reduce • Data elements are classified into categories • An algorithm is applied to all the elements of the same category
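As an illustrative sketch only (plain R, no Hadoop involved; all names are made up), the two phases can be mimicked on a vector of text lines: map emits a (key, value) pair per word, reduce aggregates all values of the same key:
# toy input: one "record" per element
lines <- c("big data big", "data tools")
# MAP: emit a (key = word, value = 1) pair for every word
words <- unlist(strsplit(lines, " "))
mapped <- lapply(words, function(w) list(key = w, value = 1))
# SHUFFLE: group the values by key
keys <- sapply(mapped, function(p) p$key)
grouped <- split(sapply(mapped, function(p) p$value), keys)
# REDUCE: apply the aggregation to each category
counts <- sapply(grouped, sum)
counts  # big: 2, data: 2, tools: 1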
Hadoop pros & cons • Good for • Repetitive tasks on big size data • Not good for • Replacing an RDBMS • Complex processing requiring various phases and/or iterations • Processing small to medium size data
Hadoop vs. RDBMS • Hadoop • is not transactional • is not optimized for random access • does not natively support data updates • privileges long-running, batch work • RDBMS • disk space is more expensive • cannot scale indefinitely
Hadoop Distributions • Hadoop is an open source project promoted by the Apache Foundation • As such, it can be downloaded and used for free • However, all the configuration and maintenance of all the components must be done by the user, mainly with command-line tools • Software vendors provide Hadoop distributions that facilitate in various ways the use of the platform • Distributions are normally free but there is a paid-for support • Additional features • User interface • Management console • Installation tools
Common Hadoop Distributions • Hortonworks • Completely open-source • Also has a Windows version • Used in: Big Data Sandbox • Cloudera • Mostly standard Hadoop but extended with proprietary components • Highlights: Cloudera Manager (console) and Impala (high-performance query engine) • Used in: Istat Big Data Platform
Tools for Data Analysis with Hadoop • The stack: Pig, Hive and statistical software sit on top of Hadoop (MapReduce over HDFS)
Hive • Hive is a SQL interface for Hadoop that facilitates queries of data on the file system and the analysis of large datasets stored in Hadoop • Hive provides a SQL-like language called HiveQL • Well, it is SQL • Due to its straightforward SQL-like interface, Hive is increasingly becoming the technology of choice for using Hadoop
Using Hive • Files in tabular format stored in HDFS can be represented as tables • Sets of typed columns • Tables are treated in the traditional way, as in a relational database • However, a query triggers one or more MapReduce jobs • Things can get slow… • All common SQL constructs can be used • Joins, subqueries, functions
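A minimal sketch of the idea, with hypothetical table and path names: a CSV directory already sitting in HDFS is exposed as a typed table, then queried with plain SQL:
-- expose a CSV directory on HDFS as a table
CREATE EXTERNAL TABLE users (username STRING, age INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/data/users';
-- this SELECT is compiled into one or more MapReduce jobs
SELECT age, COUNT(*) AS n FROM users GROUP BY age;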
Hive vs. RDBMS • Hive works on flat files and does not support indexes and transactions • Hive does not support updates and deletes; rows can only be added incrementally • A table is actually a directory in HDFS, so rows are inserted just by adding new files to the directory • In this sense, Hive works more as a data warehouse than as a DBMS
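For instance (hypothetical paths, continuing the sketch above), appending rows is just a matter of moving a new file into the table's directory:
-- moves the file from its staging location into the users table directory
LOAD DATA INPATH '/staging/new_users.csv' INTO TABLE users;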
Pig • Tool for querying data on Hadoop clusters • Widely used in the Hadoop world • Yahoo! estimates that 50% of the Hadoop workload on its 100,000-CPU clusters is generated by Pig scripts • Allows writing data manipulation scripts in a high-level language called Pig Latin • Interpreted language: scripts are translated into MapReduce jobs • Mainly targeted at joins and aggregations
Pig: Motivations • Pig is another high-level interface to MapReduce • Scripts written in Pig Latin translate into MapReduce jobs • However, working in Pig is much simpler than writing native MapReduce programs
Pig Commands
Loading datasets from HDFS
users = load 'Users.csv' using PigStorage(',') as (username: chararray, age: int);
pages = load 'Pages.csv' using PigStorage(',') as (username: chararray, url: chararray);
Pig Commands
Filtering data
users_1825 = filter users by age >= 18 and age <= 25;
Pig Commands
Join datasets
joined = join users_1825 by username, pages by username;
Pig Commands
Group records
grouped = group joined by url;
Creates a new dataset with elements named group and joined. There will be one record for each distinct url:
dump grouped;
(www.twitter.com, {(alice, 15), (bob, 18)})
(www.facebook.com, {(carol, 24), (alice, 14), (bob, 18)})
Pig Commands
Apply a function to the records in a dataset
summed = foreach grouped generate group as url, COUNT(joined) as views;
Pig Commands
Sort a dataset
sorted = order summed by views desc;
Keep only the first n rows
top_5 = limit sorted 5;
Pig Commands
Write a dataset to HDFS
store top_5 into 'top5_sites.csv';
Word Count in Pig
A = load '/tmp/bible+shakes.nopunc';              -- one line of text per record
B = foreach A generate flatten(TOKENIZE((chararray)$0)) as word;  -- split lines into words
C = filter B by word matches '\\w+';              -- keep alphanumeric tokens only
D = group C by word;
E = foreach D generate COUNT(C) as count, group as word;  -- count occurrences per word
F = order E by count desc;
store F into '/tmp/wc';
Pig: User Defined Functions • There are times when Pig’s built-in operators and functions will not suffice • Pig provides the ability to implement your own • Filter • Ex: res = FILTER bag BY udfFilter(post); • Load Function • Ex: res = load 'file.txt' using udfLoad(); • Eval • Ex: res = FOREACH bag GENERATE udfEval($1) • Choice between several programming languages • Java, Python, JavaScript
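As a minimal sketch of a filter UDF in Java (the class name, field and keyword are hypothetical), you extend Pig’s FilterFunc and implement exec():
import java.io.IOException;
import org.apache.pig.FilterFunc;
import org.apache.pig.data.Tuple;

// keeps only the records whose first field mentions a given keyword
public class UdfFilter extends FilterFunc {
    @Override
    public Boolean exec(Tuple input) throws IOException {
        if (input == null || input.size() == 0) return false;
        String post = (String) input.get(0);  // first argument of the call
        return post != null && post.contains("hadoop");
    }
}
The compiled class is then registered in the script (e.g. REGISTER myudfs.jar;) before it can be used in a FILTER expression.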
Hive vs. Pig • Hive • Uses plain SQL so it is straightforward to start with • Requires data to be in tabular format • Only allows single queries to be issued • Pig • Requires learning a new language • Allows working on data with a free schema • Allows writing scripts with multiple processing steps • Both languages can be used for pre-processing and analysis
Interactive Querying in Hadoop • Response times of MapReduce are typically slow, which makes it unsuitable for interactive workloads • Hadoop distributions provide alternative solutions for querying data with low latency • Hortonworks: Hive-on-Tez • Cloudera: Impala • The idea is to bypass the MapReduce mechanism and avoid its high latency • Great advantage for aggregation queries • Plain Hive still makes sense for low-throughput data transformations
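As a hypothetical example (assuming a Cloudera cluster and the users table from the Hive sketch above), the same aggregation can be issued at low latency through Impala’s shell:
impala-shell -q "SELECT age, COUNT(*) FROM users GROUP BY age;"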
Using Hadoop from Statistical Software • R • Packages rhdfs, rmr • Issue HDFS commands and write MapReduce jobs • SAS • SAS In-Memory Statistics • SAS/ACCESS • Makes data stored in Hadoop appear as native SAS datasets • Uses the Hive interface • SPSS • Transparent integration with Hadoop data
RHadoop • Set of packages that allows integration of R with HDFS and MapReduce • Hadoop provides the storage while R brings the analysis • Just a library • Not a special run-time, not a different language, not a special-purpose language • Incrementally port your code and use all packages • Requires R installed and configured on all nodes in the cluster
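A minimal sketch of the HDFS side (assuming RHadoop is installed and the Hadoop environment is configured; the path is hypothetical):
library(rhdfs)
hdfs.init()                # connect to the cluster configured in the environment
files <- hdfs.ls("/data")  # list an HDFS directory into a data frame
print(files$file)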
WordCount in R
wordcount = function(input, output = NULL, pattern = " ") {
  wc.map = function(., lines) {
    # MAP: split each line into words, emit (word, 1) pairs
    keyval(unlist(strsplit(x = lines, split = pattern)), 1)
  }
  wc.reduce = function(word, counts) {
    # REDUCE: sum the counts collected for each word
    keyval(word, sum(counts))
  }
  mapreduce(input = input, output = output, input.format = "text",
            map = wc.map, reduce = wc.reduce, combine = TRUE)
}