
Big Data and Its Technologies




  1. Big Data and Its Technologies CISC 6930 Data Mining

  2. What We Are Going to Learn • What is Big Data? • Characteristics of Big Data • What To Do With The Data? • What Technology Do We Have For Big Data? • A Simple Big Data Mining Example • Hadoop in the Wild • Big Data in the Cloud

  3. Imagine: You are working in a company. Tomorrow morning you go to your office and there’s a mail from your CEO regarding a new task: Dear <Your Name>, As you know we are building a blogging platform, blogger2.com, and I need some statistics. I need to find out, across all blogs ever written on blogger.com, how many times one-character words occur (like 'a', 'I'), how many times two-character words occur (like 'be', 'is')… and so on, up to how many times ten-character words occur. I know it’s a really big job. So, I will assign all 50,000 employees working in our company to work with you on this for a week. I am going on vacation for a week, and it’s really important that I have this when I return. Good luck. Regards, The CEO P.S.: One more thing. Everything has to be done manually, except going to the blog and copy-pasting it into Notepad. I read somewhere that if you write programs, Google can find out about it.

  4. Picture yourself in that position for a moment. • You have 50,000 people to work for you for a week. And you need to find out the number of one-character words, the number of two-character words, etc., covering the maximum number of blogs on BlogSpot. • Finally you need to give a report to your CEO with something like this: • Occurrence of one-character words – Around 937688399933 • Occurrence of two-character words – Around 23388383830753434 • … and so on, up to ten • If homicide, suicide, or resigning the job is not an option, how would you solve it? • How would you avoid the chaos of so many people working? • How will you coordinate that many people, since the output of one has to be merged with another’s?

  5. The Big Questions • What is Big Data? • What makes Data “Big”? • How to manage very large amounts of data and extract value and knowledge from them?

  6. What is Big Data? • No single standard definition… “Big Data” is data whose scale, diversity, and complexity require new architecture, techniques, algorithms, and analytics to manage it and extract value and hidden knowledge from it…

  7. What is Big Data? Here is from Wikipedia: • Big data is the term for a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications. • The challenges include capture, curation, storage, search, sharing, transfer, analysis, and visualization. • The trend to larger data sets is due to the additional information derivable from analysis of a single large set of related data, as compared to separate smaller sets with the same total amount of data, allowing correlations to be found to "spot business trends, determine quality of research, prevent diseases, link legal citations, combat crime, and determine real-time roadway traffic conditions.”

  8. Big Data Everywhere! • Lots of data is being collected and warehoused • Web data, e-commerce • Purchases at department/grocery stores • Bank/credit card transactions • Social networks

  9. How Much Data? • Man on the moon with 32KB (1969); my laptop had 8GB RAM (2013) • Google collected 270PB of data in a month (2007), 20PB a day (2008) • Facebook has 2.5 PB of user data + 15 TB/day (4/2009) • eBay has 6.5 PB of user data + 50 TB/day (5/2009) • “640K ought to be enough for anybody.”

  10. How Much Data? • 2.7 Zettabytes of data exist in the digital universe today. • 235 Terabytes of data had been collected by the U.S. Library of Congress as of April 2011. • The Obama administration is investing $200 million in big data research projects. • According to estimates, the volume of business data worldwide, across all companies, doubles every 1.2 years. • An estimated 140,000 to 190,000 more people with deep analytical skills will be needed to fill the demand for Big Data jobs in the U.S. by 2018.

  11. We Are in a Knowledge Economy • Data is an important asset to any organization • Discovery of knowledge • Enabling discovery • Annotation of data • We are looking at newer • Programming models, and • Supporting algorithms and data structures. • NSF refers to it as “data-intensive computing” and industry calls it “big-data” and “cloud computing”

  12. What We Are Going to Learn • What is Big Data? • Characteristics of Big Data • What To Do With The Data? • What Technology Do We Have For Big Data? • A Simple Big Data Mining Example • Hadoop in the Wild • Big Data in the Cloud

  13. Characteristics of Big Data: 1-Scale (Volume) • Data Volume • 44x increase from 2009 to 2020 • From 0.8 ZB to 35 ZB • Data volume is increasing exponentially • Exponential increase in collected/generated data

  14. CERN’s Large Hadron Collider (LHC) generates 15 PB a year

  15. 1. The Earthscope • The Earthscope is the world's largest science project. • Designed to track North America's geological evolution • This observatory records data over 3.8 million square miles, amassing 67 terabytes of data. • It analyzes seismic slips in the San Andreas fault, sure, but also the plume of magma underneath Yellowstone and much, much more. (http://www.msnbc.msn.com/id/44363598/ns/technology_and_science-future_of_technology/#.TmetOdQ--uI)

  16. • 4.6 billion camera phones worldwide • 30 billion RFID tags today (1.3B in 2005) • 12+ TBs of tweet data every day • 100s of millions of GPS-enabled devices sold annually • ? TBs of data every day • 25+ TBs of log data every day • 2+ billion people on the Web by end of 2011 • 76 million smart meters in 2009… 200M by 2014

  17. Characteristics of Big Data: 2-Complexity (Variety) • Various formats, types, and structures • Text, numerical, images, audio, video, sequences, time series, social media data, multi-dimensional arrays, etc. • Static data vs. streaming data • A single application can be generating/collecting many types of data • To extract knowledge, all these types of data need to be linked together

  18. Characteristics of Big Data: 2-Complexity (Variety) • Types of Data • Relational data (tables/transactions/legacy data) • Text data (Web) • Semi-structured data (XML) • Graph data • Social networks, Semantic Web (RDF), … • Streaming data • You can only scan the data once • A single application can be generating/collecting many types of data • Big public data (online, weather, finance, etc.)

  19. A Single View of the Customer: a single customer record linking Banking, Finance, Social Media, Gaming, Entertainment, Purchases, and Our Known History

  20. Real-Time Analytics/Decision Requirement • Product recommendations that are relevant & compelling • Learning why customers switch to competitors and their offers, in time to counter • Friend invitations to join a game or activity that expands the business • Improving the marketing effectiveness of a promotion while it is still in play • Preventing fraud as it is occurring & preventing more proactively

  21. Characteristics of Big Data: 3-Speed (Velocity) • Data is generated fast and needs to be processed fast • Online data analytics • Late decisions → missing opportunities • Examples • E-Promotions: Based on your current location, your purchase history, and what you like → send promotions right now for the store next to you • Healthcare monitoring: sensors monitoring your activities and body → any abnormal measurement requires an immediate reaction

  22. Characteristics of Big Data: 3-Speed (Velocity) • Real-time/fast data • Mobile devices (tracking all objects all the time) • Social media and networks (all of us are generating data) • Scientific instruments (collecting all sorts of data) • Sensor technology and networks (measuring all kinds of data) • Progress and innovation are no longer hindered by the ability to collect data • But by the ability to manage, analyze, summarize, visualize, and discover knowledge from the collected data in a timely manner and in a scalable fashion

  23. 3 Vs of Big Data • The “BIG” in big data isn’t just about volume

  24. Some Make it 4V’s

  25. Some Make it 4V’s

  26. What We Are Going to Learn • What is Big Data? • Characteristics of Big Data • What To Do With The Data? • What Technology Do We Have For Big Data? • A Simple Big Data Mining Example • Hadoop in the Wild • Big Data in the Cloud

  27. Harnessing Big Data • OLTP: Online Transaction Processing (DBMSs) • OLAP: Online Analytical Processing (Data Warehousing) • RTAP: Real-Time Analytics Processing (Big Data Architecture & technology)

  28. What To Do With These Data? • Aggregation and Statistics • Data warehouse and OLAP • Indexing, Searching, and Querying • Keyword based search • Pattern matching (XML/RDF) • Knowledge discovery • Data Mining • Statistical Modeling

  29. Who’s Generating Big Data? • Mobile devices (tracking all objects all the time) • Social media and networks (all of us are generating data) • Scientific instruments (collecting all sorts of data) • Sensor technology and networks (measuring all kinds of data) • Progress and innovation is no longer hindered by the ability to collect data, but by the ability to manage, analyze, summarize, visualize, and discover knowledge from the collected data in a timely manner and in a scalable fashion

  30. The Model Has Changed… • The Model of Generating/Consuming Data has Changed Old Model: Few companies are generating data, all others are consuming data New Model: all of us are generating data, and all of us are consuming data

  31. The Evolution of Business Intelligence • 1990’s: BI Reporting, OLAP & Data Warehouse (Business Objects, SAS, Informatica, Cognos, other SQL reporting tools) • 2000’s: Interactive Business Intelligence & In-memory RDBMS (QlikView, Tableau, HANA) • 2010’s: Big Data: Batch Processing & Distributed Data Stores (Hadoop/Spark; HBase/Cassandra), then Big Data: Real Time & Single View (graph databases) • Each generation increases in both speed and scale

  32. Value of Big Data Analytics • Big data is more real-time in nature than traditional DW applications • Traditional DW architectures (e.g., Exadata, Teradata) are not well-suited for big data apps • Shared-nothing, massively parallel, scale-out architectures are well-suited for big data apps

  33. Challenges in Handling Big Data • The bottleneck is in technology • New architectures, algorithms, and techniques are needed • Also in technical skills • Experts in using the new technology and dealing with big data

  34. What We Are Going to Learn • What is Big Data? • Characteristics of Big Data • What To Do With The Data? • What Technology Do We Have For Big Data? • A Simple Big Data Mining Example • Hadoop in the Wild • Big Data in the Cloud

  35. Big Data Landscape • Apps • Data as a Service • Infrastructure • Technology

  36. Big Data Technology

  37. Hadoop/MapReduce Technology • What is Hadoop and why does it matter? • Hadoop is the core platform for structuring Big Data: an open-source software framework for storing data and running applications on clusters of commodity hardware • Hadoop uses a distributed computing architecture consisting of many servers • It also solves the problem of formatting the data for analytic purposes • It has two core parts: a storage part, the Hadoop Distributed File System (HDFS), and a processing part, MapReduce
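The two parts fit together in the classic word-count pattern. Below is a minimal Python sketch of the map and reduce phases in the line-oriented style of Hadoop Streaming (key, tab, value per record); the in-memory sorted() standing in for the framework's shuffle-and-sort is an assumption for illustration, not how a real cluster runs it.

```python
from itertools import groupby

def mapper(lines):
    """Map phase: emit one 'word<TAB>1' record per word, as a streaming mapper would."""
    for line in lines:
        for word in line.split():
            yield word + "\t1"

def reducer(sorted_records):
    """Reduce phase: records arrive grouped by key; sum the counts for each word."""
    key = lambda rec: rec.split("\t")[0]
    for word, group in groupby(sorted_records, key=key):
        total = sum(int(rec.split("\t")[1]) for rec in group)
        yield word + "\t" + str(total)

# sorted() plays the role of Hadoop's shuffle-and-sort between the two phases.
records = sorted(mapper(["to be or", "not to be"]))
counts = dict(rec.split("\t") for rec in reducer(records))
print(counts)  # {'be': '2', 'not': '1', 'or': '1', 'to': '2'}
```

On a real cluster, HDFS supplies the input splits and MapReduce runs many copies of these two functions in parallel; the program logic stays this small.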

  38. Hadoop/MapReduce Technology • Hadoop was created by Doug Cutting and Mike Cafarella in 2005. Cutting, who was working at Yahoo! at the time, named it after his son's toy elephant. • It was originally developed to support distribution for the Nutch search engine project. • The design objective was to answer one question: “How do we process big data with reasonable cost and time?”

  39. Hadoop/MapReduce Technology • Why is Hadoop important? It is a flexible, scalable, and highly available architecture for distributed computation and data processing on a network of commodity hardware.

  40. What We Are Going to Learn • What is Big Data? • Characteristics of Big Data • What To Do With The Data? • What Technology Do We Have For Big Data? • A Simple Big Data Mining Example • Hadoop in the Wild • Big Data in the Cloud

  41. Let’s Have A Simple Big Data Mining Example • Tomorrow morning you go to your office and there’s a mail from your CEO regarding a new task: Dear <Your Name>, As you know we are building the blogging platform blogger2.com, and I need some statistics. I need to find out, across all blogs ever written on blogger.com, how many times one-character words occur (like 'a', 'I'), how many times two-character words occur (like 'be', 'is')… and so on, up to how many times ten-character words occur. I know it’s a really big job. So, I will assign all 50,000 employees working in our company to work with you on this for a week. I am going on vacation for a week, and it’s really important that I have this when I return. Good luck. Regards, The CEO P.S.: One more thing. Everything has to be done manually, except going to the blog and copy-pasting it into Notepad. I read somewhere that if you write programs, Google can find out about it.

  42. Let’s Have A Simple Big Data Mining Example • Chapter 1: Picture yourself in that position for a moment. • You have 50,000 people to work for you for a week. And you need to find out the number of one-character words, the number of two-character words, etc., covering the maximum number of blogs on BlogSpot. • Finally you need to give a report to your CEO with something like this: • Occurrence of one-character words – Around 937688399933 • Occurrence of two-character words – Around 23388383830753434 • … and so on, up to ten • If homicide, suicide, or resigning the job is not an option, how would you solve it? • How would you avoid the chaos of so many people working? • How will you coordinate that many people, since the output of one has to be merged with another’s?

  43. How to Mine the Data? Or, How to Solve It

  44. Let’s Have A Simple Big Data Mining Example • Chapter 2: Proclamation: Let there be caste • The next day, you stand with a mic before the 50,000 and proclaim: • For a week, you will all be divided into several groups: • The Mappers (tens of thousands of people will be in this group) • The Grouper (assume just one person for now) • The Reducers (around 10 employees) and… • The Master (that’s you). • Then you talk to each one of the groups.

  45. Let’s Have A Simple Big Data Mining Example • Chapter 3: Your talk with the Mappers • Each mapper will get a set of 50 blog URLs and a really big sheet of paper. • Each one of you needs to go to each of those URLs, and for each word in those blogs, write one line on the paper. • The format of that line should be the number of characters in the word, then a comma, and then the actual word. • For example, if you find the word “a”, you write “1,a” on a new line of your paper, since the word “a” has only 1 character. If you find the word “hello”, you write “5,hello” on a new line. • This will take 4 days. So, after 4 days, your sheet might look like this: • “1,a” • “5,hello” • “2,if” • … and a million more lines • At the end of the 4th day, each one of you will give your completely filled sheet to the Grouper
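A mapper's paper-and-pen procedure translates directly into code. Here is a minimal Python sketch of one mapper's job on one blog's text (the word regex and the 1-to-10-character filter are my assumptions about what counts as a word):

```python
import re

def map_blog(text):
    """Emit one 'length,word' line per word, just as a mapper writes on the sheet."""
    lines = []
    for word in re.findall(r"[A-Za-z]+", text):
        if len(word) <= 10:  # only 1..10 character words matter for the report
            lines.append(str(len(word)) + "," + word)
    return lines

print(map_blog("a hello if"))  # ['1,a', '5,hello', '2,if']
```

Each mapper needs nothing beyond its own 50 URLs, which is exactly why thousands of them can work in parallel.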

  46. Let’s Have A Simple Big Data Mining Example • Chapter 4: Your talk with the Grouper • Someone gives you 10 sheets of paper. The first sheet is marked one, the second sheet is marked two, and so on, up to ten. You collect the output from the mappers, and for each line in a mapper’s sheet, if it starts with “1,”, you write the word on sheet 1; if it starts with “2,”, you write it on sheet 2. For example, if the first line of a mapper’s sheet says “1,a”, you write “a” on sheet 1. If it says “2,if”, you write “if” on sheet 2. If it says “5,hello”, you write “hello” on sheet 5.

  47. Let’s Have A Simple Big Data Mining Example • Chapter 4: Your talk with the Grouper • So at the end of your work, the 10 sheets might look like this: • Sheet 1: a, a, a, I, I, i, a, i, i, i… millions more • Sheet 2: if, of, it, of, of, if, at, im, is, is, of, of… millions more • Sheet 3: the, the, and, for, met, bet, the, the, and… millions more • … • Sheet 10: …… • Once you are done, you distribute each sheet to one reducer: sheet 1 goes to reducer 1, sheet 2 goes to reducer 2, and so on.
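The grouper's routing rule, where the key before the comma decides the sheet, can be sketched in a few lines of Python (the "length,word" line format follows the mappers' instructions above):

```python
from collections import defaultdict

def group(mapper_lines):
    """Route each 'length,word' line to the sheet numbered by its key (the length)."""
    sheets = defaultdict(list)
    for line in mapper_lines:
        length, word = line.split(",", 1)
        sheets[int(length)].append(word)
    return sheets

sheets = group(["1,a", "5,hello", "2,if", "2,of"])
print(sheets[2])  # ['if', 'of']
```

Notice the grouper never looks at the words themselves, only at the keys, which is why a single grouper can work so fast.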

  48. Let’s Have A Simple Big Data Mining Example • Chapter 5: Your talk with the Reducers • Each one of you gets one sheet from the grouper. For your sheet, you count the number of words written on it and write that count in big bold letters on the back side of the paper. For example, if you are reducer 2, you get sheet 2 from the grouper, which looks like this: “Sheet 2: if, of, it, of, of, if, at, im, is, is, of, of…” You count the number of words on that sheet; say the number of words is 28838380044. You write it on the back side of the paper, in big bold letters, and give it to me (the Master).
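A reducer's job is nothing more than counting, which is the whole point: the hard coordination happened upstream. As a sketch, using the (truncated) sheet 2 from the slide:

```python
def reduce_sheet(words):
    """Count the words on one sheet: the number written in big bold letters on the back."""
    return len(words)

# The visible portion of sheet 2 from the grouper's example:
sheet_2 = ["if", "of", "it", "of", "of", "if", "at", "im", "is", "is", "of", "of"]
print(reduce_sheet(sheet_2))  # 12
```

Because each reducer holds a complete sheet for its key, no reducer ever has to wait on, or merge with, another.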

  49. Let’s Have A Simple Big Data Mining Example • Chapter 6: The controlled chaos and the climax • At the end of this process you have 10 sheets: sheet 1, having the count of the number of words with 1 character on the back side; sheet 2, having the count of the number of words with 2 characters on the back side; and so on. It is done. Genius!

  50. Let’s Have A Simple Big Data Mining Example • Comments • You essentially did MapReduce. The greatest advantages of your approach were: • the Mappers can work independently • the Reducers can work independently • the Grouper can work really fast • The process can be easily applied to other kinds of problems. In such a case: • The work of the Master (dividing the work) and the Grouper (grouping the values by key, i.e., the value before the comma) remains the same. This is what any MapReduce library provides. • The work of the Mappers and Reducers differs according to the problem. This is what you write.
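Putting the roles together, the whole exercise runs end to end as one small program. A self-contained Python sketch (the two short blog texts are made-up stand-ins for the 50 URLs each mapper would actually visit):

```python
import re
from collections import defaultdict

def map_blog(text):
    """Mapper: one (length, word) record per 1..10-character word."""
    return [(len(w), w) for w in re.findall(r"[A-Za-z]+", text) if len(w) <= 10]

def group(records):
    """Grouper: collect words onto sheets keyed by length."""
    sheets = defaultdict(list)
    for length, word in records:
        sheets[length].append(word)
    return sheets

def reduce_sheet(words):
    """Reducer: count the words on one sheet."""
    return len(words)

blogs = ["a hello if", "I am here"]  # stand-in blog texts
mapped = [rec for blog in blogs for rec in map_blog(blog)]
report = {length: reduce_sheet(words) for length, words in group(mapped).items()}
print(report)  # {1: 2, 5: 1, 2: 2, 4: 1}
```

Only map_blog and reduce_sheet are problem-specific; the grouping by key is generic, which is exactly the split a MapReduce library gives you.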
