This course introduces the characteristics of Big Data, presents the 3V model, discusses data variety, velocity, and volume, and explores different sources of Big Data. It also provides an overview of the Apache Hadoop framework and other frameworks for handling Big Data.
Big Data Course, Imam Khomeini International University, 2019. Dr. Ali Khaleghi | Kamran Mahmoudi
Session one: Introduction to Big Data
Session objectives:
• Introducing the characteristics of Big Data
• Presenting the 3V model for defining Big Data
• Discussing the variety of data structures, the velocity of data generation, and the volume of data
• Introducing different sources of Big Data
• Introducing frameworks for handling Big Data
• Presenting a snapshot of the Apache Hadoop framework
• Technical information on running Hadoop
Big Data, massive data, small data: what's the difference? Big Data refers to large volumes of varied, mostly UNSTRUCTURED or SEMI-STRUCTURED data sets that are produced rapidly and must be processed rapidly.
• 12+ TB of tweet data every day
• 25+ TB of log data every day
• ? TB of data every day
• 4.6 billion camera phones worldwide
• 30 billion RFID tags today (1.3 billion in 2005)
• Hundreds of millions of GPS-enabled devices sold annually
• 2+ billion people on the Web by the end of 2011
• 76 million smart meters in 2009, 200 million by 2014
Volume (Scale)
• Data volume is increasing exponentially
• 44x increase from 2009 to 2020: from 0.8 zettabytes to 35 ZB
Variety (Complexity)
• Relational data (tables, transactions, legacy data)
• Text data (the Web)
• Semi-structured data (XML)
• Graph data: social networks, the Semantic Web (RDF), …
• Streaming data: you can only scan the data once
• Big public data (online, weather, finance, etc.)
• A single application can be generating/collecting many types of data
To extract knowledge, all these types of data need to be linked together.
Velocity (Speed)
• Data is being generated fast and needs to be processed fast
• Online data analytics: late decisions mean missed opportunities
• Examples:
• E-promotions: based on your current location, your purchase history, and what you like, send promotions right now for the store next to you
• Healthcare monitoring: sensors monitoring your activity and body; any abnormal measurement requires an immediate reaction
• Contrast the classic Iris flower data set (150 records) with the Twitter Firehose (about 6,000 tweets per second)
Real-time/Fast Data
Progress and innovation are no longer hindered by the ability to collect data, but by the ability to manage, analyze, summarize, visualize, and discover knowledge from the collected data in a timely manner and in a scalable fashion.
• Mobile devices (tracking all objects all the time)
• Social media and networks (all of us are generating data)
• Scientific instruments (collecting all sorts of data)
• Sensor technology and networks (measuring all kinds of data)
Big scientific data
• EarthScope is the world's largest science project. Designed to track North America's geological evolution, this observatory records data over 3.8 million square miles, amassing 67 terabytes of data. It analyzes seismic slips in the San Andreas Fault, as well as the plume of magma underneath Yellowstone, and much more.
• CERN's Large Hadron Collider (LHC) generates 15 PB of data a year.
Hadoop
• First up is the all-time classic, and one of the top frameworks in use today. It is so prevalent that it has almost become synonymous with Big Data.
• If your data can be processed in batch, split into smaller processing jobs, spread across a cluster, and the results recombined in a logical manner, Hadoop will probably work just fine for you (see the MapReduce sketch below).
https://www.kdnuggets.com/2016/03/top-big-data-processing-frameworks.html
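As a hedged illustration of that "split, process in parallel, recombine" model, the sketch below is a word count written for Hadoop Streaming, where the map and reduce steps are ordinary Python scripts reading stdin and writing stdout; the file names, input/output paths, and the invocation are illustrative, not taken from the original slides.

```python
#!/usr/bin/env python3
# mapper.py - runs on each input split; emits "word<TAB>1" for every word
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print(f"{word}\t1")
```

```python
#!/usr/bin/env python3
# reducer.py - Hadoop delivers keys sorted, so identical words arrive on
# consecutive lines; sum the counts per word and emit the totals.
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t")
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print(f"{current_word}\t{current_count}")
        current_word, current_count = word, int(count)
if current_word is not None:
    print(f"{current_word}\t{current_count}")
```

The two scripts would typically be submitted with the streaming jar, along the lines of `hadoop jar hadoop-streaming-*.jar -files mapper.py,reducer.py -mapper mapper.py -reducer reducer.py -input /data/in -output /data/out`.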
Hadoop timeline
• Nutch, 2002: Apache Nutch was started as part of the Lucene project
• GFS, 2003: Google published the details of its distributed file system (GFS)
• NDFS, 2004: work started on an open-source version of GFS, called the Nutch Distributed File System
• MapReduce, 2004: Google published a paper introducing MapReduce
• MapReduce, 2005: the Nutch algorithms were reported to run with MapReduce and NDFS
• Leaving Nutch, 2006: the developers moved out of Nutch to form an independent subproject of Lucene called Hadoop
• Independence, 2008: Hadoop became its own Apache top-level project
What is Hive?
• A data warehouse is a system used for reporting and data analysis, and is considered a core component of business intelligence. DWs are central repositories of integrated data from one or more disparate sources.
• Hive is a data warehouse infrastructure built on top of Hadoop for providing data summarization, query, and analysis.
• HiveQL (HQL) is similar to SQL, and Hive automatically translates SQL-like queries into MapReduce jobs.
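A minimal sketch of what an SQL-like query over Hadoop data can look like from Python, assuming a HiveServer2 instance on its default port 10000 and the PyHive client library; the weblogs table, its columns, and the connection details are hypothetical.

```python
from pyhive import hive  # assumes the PyHive package is installed

# Connect to HiveServer2 (host, port, user, and database are placeholders).
conn = hive.Connection(host="localhost", port=10000,
                       username="hadoop", database="default")
cursor = conn.cursor()

# HiveQL looks like SQL; behind the scenes Hive compiles the query into
# MapReduce jobs that run over the data stored in HDFS.
cursor.execute("""
    SELECT page, COUNT(*) AS hits
    FROM weblogs
    GROUP BY page
    ORDER BY hits DESC
    LIMIT 10
""")
for page, hits in cursor.fetchall():
    print(page, hits)
```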
HBase
• HBase is a non-relational database that allows low-latency, quick lookups in Hadoop. It adds transactional capabilities to Hadoop, allowing users to conduct updates, inserts, and deletes. eBay and Facebook use HBase heavily.
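To illustrate the kind of low-latency row operations described above, here is a hedged sketch using the happybase Python client, which talks to HBase through its Thrift gateway; the users table, the info column family, the row key, and the host are all illustrative and assume an HBase Thrift server is running.

```python
import happybase  # Python client for HBase's Thrift gateway

# Connect to a local HBase Thrift server (host is a placeholder) and open
# a hypothetical table 'users' with a column family 'info'.
connection = happybase.Connection("localhost")
table = connection.table("users")

# Insert/update a single row (a "put"), read it back, then delete a column.
table.put(b"user-1001", {b"info:name": b"Alice", b"info:city": b"Qazvin"})
print(table.row(b"user-1001"))
table.delete(b"user-1001", columns=[b"info:city"])
```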
Apache Sqoop
• Sqoop is a connectivity tool for moving data from non-Hadoop data stores – such as relational databases and data warehouses – into Hadoop. It allows users to specify the target location inside Hadoop and to instruct Sqoop to move data from Oracle, Teradata, or other relational databases to that target.
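Sqoop is normally driven from the command line; the sketch below simply wraps one illustrative `sqoop import` invocation in Python so the shape of the command is visible. The MySQL host, database, table, credentials file, and target directory are all placeholders.

```python
import subprocess

# Illustrative Sqoop import: copy the relational table 'orders' from a MySQL
# database into HDFS under /user/hadoop/orders, using 4 parallel map tasks.
subprocess.run([
    "sqoop", "import",
    "--connect", "jdbc:mysql://dbhost/sales",        # source database (placeholder)
    "--username", "analyst",
    "--password-file", "/user/hadoop/.db-password",  # HDFS file holding the password
    "--table", "orders",                             # table to import
    "--target-dir", "/user/hadoop/orders",           # destination inside HDFS
    "--num-mappers", "4",                            # degree of parallelism
], check=True)
```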
Apache Flume
• Flume is a framework for populating Hadoop with data. Agents are deployed throughout one's IT infrastructure – inside web servers, application servers, and mobile devices, for example – to collect data and integrate it into Hadoop.
Apache Mahout
• Mahout is a data mining library. It takes the most popular data mining algorithms for clustering, regression, and statistical modeling and implements them using the MapReduce model.
Apache Spark
• Languages: Scala, Java, Python, R, SQL
• Libraries on top of Spark Core: Spark SQL (DataFrames), Spark Streaming, MLlib (ML Pipelines), GraphX
• Data sources: Hadoop HDFS, HBase, Hive, Amazon S3, streaming sources, JSON, MySQL, and HPC-style file systems (GlusterFS, Lustre)
Apache Spark supports data analysis, machine learning, graph processing, streaming data, etc. It can read from and write to a range of data sources and allows development in multiple languages.
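A short, hedged PySpark sketch of the DataFrame and Spark SQL APIs mentioned above; the JSON input path and the page column are assumptions made up for illustration.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Start (or reuse) a Spark session.
spark = SparkSession.builder.appName("page-hits").getOrCreate()

# Read JSON access logs from HDFS (path and schema are placeholders).
logs = spark.read.json("hdfs:///data/access-logs/*.json")

# DataFrame API: count requests per page and show the top 10.
(logs.groupBy("page")
     .agg(F.count("*").alias("hits"))
     .orderBy(F.desc("hits"))
     .limit(10)
     .show())

# The same question expressed through Spark SQL.
logs.createOrReplaceTempView("logs")
spark.sql("""
    SELECT page, COUNT(*) AS hits
    FROM logs
    GROUP BY page
    ORDER BY hits DESC
    LIMIT 10
""").show()
```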
Hadoop up and running
• Quick-start VMs
• Cloud-based implementations
• Your own setup
Installing Hadoop
Setup modes:
• Single node: the NameNode and DataNode are installed on the same machine
• Multi node: at least two different machines, one running the NameNode as master and the others running DataNodes as slaves
Requirements:
• Linux/Unix operating system
• OpenSSH server
• Java Development Kit
Minimum configurations
To have a running cluster, at least the following components must be configured (an illustrative single-node configuration follows below):
• Hadoop core (core-site.xml)
• HDFS (hdfs-site.xml)
• YARN (yarn-site.xml)
• MapReduce (mapred-site.xml)
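As a hedged illustration, the sketch below generates one typical property per file, using the values from the Apache Hadoop single-node ("pseudo-distributed") setup guide; the install path is a placeholder for $HADOOP_HOME/etc/hadoop, and a real cluster will usually need additional properties.

```python
from pathlib import Path

# Placeholder for $HADOOP_HOME/etc/hadoop on the target machine.
HADOOP_CONF = Path("/opt/hadoop/etc/hadoop")

# One representative property per configuration file.
SETTINGS = {
    "core-site.xml":   ("fs.defaultFS", "hdfs://localhost:9000"),            # NameNode address
    "hdfs-site.xml":   ("dfs.replication", "1"),                             # one DataNode, so no replication
    "mapred-site.xml": ("mapreduce.framework.name", "yarn"),                 # run MapReduce on YARN
    "yarn-site.xml":   ("yarn.nodemanager.aux-services", "mapreduce_shuffle"),
}

TEMPLATE = """<configuration>
  <property>
    <name>{name}</name>
    <value>{value}</value>
  </property>
</configuration>
"""

for filename, (name, value) in SETTINGS.items():
    (HADOOP_CONF / filename).write_text(TEMPLATE.format(name=name, value=value))
```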
Installation steps
• Set up the Linux environment (Ubuntu 16.04)
• Network configuration
  • Set a static IPv4 address
  • Edit /etc/hosts
• Set up SSH
  • Install the OpenSSH server
  • Configure passwordless SSH connections
• Install Java
• Install Hadoop
  • Extract the binary distribution
  • Edit the configuration files
Hands-on: installing a single-node Hadoop instance