
Chapter 1: Course Information and Introduction to Big Data








  1. COMP9313: Big Data Management • Lecturer: Xin Cao • Course web site: http://www.cse.unsw.edu.au/~cs9313/

  2. Chapter 1: Course Information and Introduction to Big Data

  3. Course Info • Lectures: 6:00 – 9:00 pm (Thursday) • Location: Ainsworth Building 202 (K-J17-202) • Lab: Weeks 2-13 • Consultation (Weeks 1-12): questions regarding lectures, course materials, assignments, exam, etc. • Time: 3:00 – 4:00 pm (Thursday) • Place: 201D K-17 • Tutors • Longbin Lai, llai@cse.unsw.edu.au • Jianye Yang, jianyey@cse.unsw.edu.au • Fei Bi, z3492319@cse.unsw.edu.au • Discussion and QA: https://groups.google.com/forum/#!forum/comp9313-qa

  4. Lecturer in Charge • Lecturer: Xin Cao • Office: 201D K17 (turn left outside the lift) • Email: xin.cao@unsw.edu.au • Ext: 55932 • Research interests • Spatial Databases • Data Mining • Data Management • Big Data Technologies • My publication list on Google Scholar: https://scholar.google.com.au/citations?user=kJIkUagAAAAJ&hl=en

  5. Course Aims • This course aims to introduce you to the concepts behind Big Data, the core technologies used in managing large-scale data sets, and a range of technologies for developing solutions to large-scale data analytics problems. • This course is intended for students who want to understand modern large-scale data analytics systems. It covers a wide range of topics and technologies, and will prepare students to build such systems as well as to use them efficiently and effectively to address challenges in big data management. • It is not possible to cover every aspect of big data management.

  6. Lectures • Lectures focus on the frontier technologies in big data management and their typical applications • Lectures will try to run in an interactive mode • A few lectures may run in a more practical manner (e.g., like a lab/demo) to cover the applied aspects • Lecture length varies slightly depending on the progress of that lecture • Note: attendance at every lecture is assumed

  7. Resources • Text Books • Hadoop: The Definitive Guide. Tom White. 4th Edition, O'Reilly Media • Mining of Massive Datasets. Jure Leskovec, Anand Rajaraman, Jeff Ullman. 2nd Edition, Cambridge University Press • Data-Intensive Text Processing with MapReduce. Jimmy Lin and Chris Dyer. University of Maryland, College Park • Reference Books and Other Readings • Advanced Analytics with Spark. Josh Wills, Sandy Ryza, Sean Owen, and Uri Laserson. O'Reilly Media • Apache MapReduce Tutorial • Apache Spark Quick Start • Many other online tutorials … • Big Data is a relatively new topic (so there is no fixed syllabus)

  8. Prerequisite • The official prerequisites of this course are COMP9024 (Data Structures and Algorithms) and COMP9311 (Database Systems). • Before commencing this course, you should: • have experience and good knowledge of algorithm design (equivalent to COMP9024) • have a solid background in database systems (equivalent to COMP9311) • have solid programming skills in Java • be familiar with working on Unix-style operating systems • have basic knowledge of linear algebra (e.g., vector spaces, matrix multiplication), probability theory and statistics, and graph theory • No previous experience is necessary in • MapReduce • parallel and distributed programming

  9. Please do not enrol if you • don't have COMP9024/9311 knowledge • cannot produce a correct Java program on your own • have never worked on Unix-style operating systems • have poor time management • are too busy to attend lectures/labs • If any of these apply, you are likely to perform badly in this subject

  10. Learning outcomes • After completing this course, you are expected to be able to: • explain the important characteristics of Big Data • develop an appropriate storage structure for a Big Data repository • utilize the map/reduce paradigm and the Spark platform to manipulate Big Data • use a high-level query language to manipulate Big Data • develop efficient solutions for analytical problems involving Big Data

  11. Assessment

  12. Assignments • 1 warm-up programming assignment on Hadoop • 1 programming assignment on HBase/Hive/Pig • 1 warm-up programming assignment on Spark • Another, harder assignment on Hadoop • Another, harder assignment on Spark • Both results and source code will be checked. • If your code cannot run due to bugs, you will not lose all marks.

  13. Final exam • Final written exam (100 pts) • If you are ill on the day of the exam, do not attend the exam – I will not accept any medical special consideration claims from people who have already attempted the exam.

  14. You May Fail Because … • *Plagiarism* • Code failed to compile due to a mistake of one character or one word • Late submission • 1 sec late = 1 day late • Submitted the wrong files • Program did not follow the spec • I am unlikely to accept the following excuses: • “Too busy” • “It took longer than I thought it would take” • “It was harder than I initially thought” • “My dog ate my homework” and modern variants thereof

  15. Tentative course schedule

  16. Your Feedback Is Important • Big data is a new topic, and thus the course is tentative • The technologies keep evolving, and the course materials need to be updated correspondingly • Please advise where I can improve after each lecture, on the discussion and QA website • CATEI system

  17. Why Attend the Lectures?

  18. What is Big Data? • Big data is like teenage sex: • everyone talks about it • nobody really knows how to do it • everyone thinks everyone else is doing it • so everyone claims they are doing it... --Dan Ariely, Professor at Duke University

  19. What is Big Data? • No standard definition! Here is one from Wikipedia: • Big data is a term for data sets that are so large or complex that traditional data processing applications are inadequate. • Challenges include analysis, capture, data curation, search, sharing, storage, transfer, visualization, querying, updating, and information privacy. • Analysis of data sets can find new correlations to "spot business trends, prevent diseases, combat crime and so on."

  20. Who is generating Big Data? • User tracking & engagement • Homeland security • Social • eCommerce • Financial services • Real-time search

  21. Big Data Characteristics: 3V

  22. Volume (Scale) • Data volume • Growing 40% per year • From 8 zettabytes (2016) to 44 ZB (2020) • Data volume is increasing exponentially • Exponential increase in collected/generated data
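
As a back-of-envelope check on the endpoints quoted above (this derivation is added for illustration, not from the slide), growing from 8 ZB in 2016 to 44 ZB in 2020 corresponds to a compound annual growth rate of

\[ \left(\frac{44}{8}\right)^{1/4} - 1 = 5.5^{1/4} - 1 \approx 0.53, \]

i.e., roughly 53% per year, a vivid illustration of how quickly exponential growth compounds.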

  23. How much data? • Processes 20 PB a day (2008) • Crawls 20B web pages a day (2012) • Search index is 100+ PB (5/2014) • Bigtable serves 2+ EB, 600M QPS (5/2014) • 150 PB on 50k+ servers running 15k apps (6/2011) • Hadoop: 365 PB, 330K nodes (6/2014) • 400B pages, 10+ PB (2/2014) • Hadoop: 10K nodes, 150K cores, 150 PB (4/2014) • LHC: ~15 PB a year • 300 PB data in Hive + 600 TB/day (4/2014) • S3: 2T objects, 1.1M requests/second (4/2013) • "640K ought to be enough for anybody." • LSST: 6-10 PB a year (~2020) • SKA: 0.3 – 1.5 EB per year (~2020)

  24. Variety (Complexity) • Different types: • Relational data (tables/transactions/legacy data) • Text data (Web) • Semi-structured data (XML) • Graph data • Social networks, Semantic Web (RDF), … • Streaming data • You can only scan the data once • A single application can be generating/collecting many types of data • Different sources: • Movie reviews from IMDB and Rotten Tomatoes • Product reviews from different provider websites • To extract knowledge, all these types of data need to be linked together

  25. A Single View to the Customer • (Diagram: a single customer linked to banking & finance, social media, gaming, our known history, entertainment, and purchase data)

  26. A Global View of Linked Big Data • (Diagram: a heterogeneous information network around "Ebola" linking prescription, diagnosis, drug, target protein, mutation, patient, tissue, and gene data, connected to a diversified social network of doctors)

  27. Velocity (Speed) • Data is being generated fast and needs to be processed fast • Online data analytics • Late decisions → missing opportunities • Examples • E-Promotions: based on your current location, your purchase history, and what you like → send promotions right now for the store next to you • Healthcare monitoring: sensors monitoring your activities and body → any abnormal measurements require immediate reaction • Disaster management and response

  28. Real-Time Analytics/Decision Requirement • (Diagram, centered on the customer:) • Product recommendations that are relevant & compelling • Learning why customers switch to competitors and their offers, in time to counter • Friend invitations to join a game or activity that expands the business • Improving the marketing effectiveness of a promotion while it is still in play • Preventing fraud as it is occurring & preventing more proactively • Influencing behavior

  29. Extended Big Data Characteristics: 6V • Volume: in a big data environment, the amounts of data collected and processed are much larger than those stored in typical relational databases. • Variety: big data consists of a rich variety of data types. • Velocity: big data arrives at the organization at high speed and from multiple sources simultaneously. • Veracity: data quality issues are particularly challenging in a big data context. • Visibility/Visualization: after big data has been processed, we need a way of presenting the data in a manner that's readable and accessible. • Value: ultimately, big data is meaningless if it does not provide value toward some meaningful goal.

  30. Veracity (Quality & Trust) • Data = quantity + quality • When we talk about big data, we typically mean its quantity: • What capacity does a system provide to cope with the sheer size of the data? • Is a query feasible on big data within our available resources? • How can we make our queries tractable on big data? • … • Can we trust the answers to our queries? • Dirty data routinely leads to misleading financial reports and strategic business planning decisions, resulting in loss of revenue, credibility, and customers, with disastrous consequences • The study of data quality is as important as that of data quantity

  31. Data in real life is often dirty • 81 million National Insurance numbers, but only 60 million eligible citizens • 98,000 deaths each year caused by errors in medical data • 500,000 dead people retain active Medicare cards

  32. Visibility/Visualization • Big data must be visible to the process of big data management • Big Data – visibility = Black Hole? • Big data visualization tools (e.g., a visualization of Divvy bike rides across Chicago)

  33. Value • Big data is meaningless if it does not provide value toward some meaningful goal

  34. Big Data: 6V in Summary • (Source: Transforming Energy and Utilities through Big Data & Analytics, by Anders Quitzau @ IBM)

  35. Other V's • Variability • Variability refers to data whose meaning is constantly changing. This is particularly the case when gathering data relies on language processing. • Viscosity • This term is sometimes used to describe the latency or lag time in the data relative to the event being described. We found that this is just as easily understood as an element of Velocity. • Virality • Defined by some users as the rate at which the data spreads: how often it is picked up and repeated by other users or events. • Volatility • Big data volatility refers to how long data is valid and how long it should be stored. You need to determine at what point data is no longer relevant to the current analysis. • More V's in the future …

  36. Big Data Tag Cloud

  37. Cloud Computing • The buzzword before "Big Data" • Larry Ellison's response in 2009 • Cloud computing is a general term for a new class of network-based computing that takes place over the Internet • A collection/group of integrated and networked hardware, software, and Internet infrastructure (called a platform) • Uses the Internet for communication and transport, providing hardware, software, and networking services to clients • These platforms hide the complexity and details of the underlying infrastructure from users and applications by providing a very simple graphical interface or API • From a technical point of view • Internet-based computing (i.e., computers attached to a network) • From a business-model point of view • Pay-as-you-go (i.e., rental)

  38. Cloud Computing Architecture

  39. Cloud Computing Services

  40. Cloud Computing Services • Infrastructure as a Service (IaaS) • Offering hardware-related services using the principles of cloud computing. These could include storage services (database or disk storage) or virtual servers. • Amazon EC2, Amazon S3 • Platform as a Service (PaaS) • Offering a development platform on the cloud. • Google App Engine, Microsoft's Azure • Software as a Service (SaaS) • Offering a complete software application on the cloud. Users can access a software application hosted by the cloud vendor on a pay-per-use basis. This is a well-established sector. • Google's Gmail and Microsoft's Hotmail, Google Docs

  41. Cloud Services • Software as a Service (SaaS): e.g., Salesforce CRM, LotusLive • Platform as a Service (PaaS): e.g., Google App Engine • Infrastructure as a Service (IaaS)

  42. Why Study Big Data Technologies? • The hottest topic in both research and industry • Highly demanded in the real world • A promising future career • Research and development of big data systems: distributed systems (e.g., Hadoop), visualization tools, data warehouses, OLAP, data integration, data quality control, … • Big data applications: social marketing, healthcare, … • Data analysis: to get value out of big data by discovering and applying patterns, predictive analytics, business intelligence, privacy and security, … • Graduate from UNSW

  43. Big Data Open Source Tools

  44. What will the course cover? • Topic 1. Big data management tools • Apache Hadoop • MapReduce • HDFS • HBase • Hive and Pig • Mahout • Spark • Topic 2. Typical big data applications • Link analysis • Graph data processing • Data stream mining • Some machine learning topics

  45. Philosophy to Scale for Big Data Processing • Divide work • Combine results
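
To make the divide-work/combine-results philosophy concrete, here is a minimal sketch of the classic word-count job written against the Hadoop MapReduce Java API (covered properly in later chapters). It mirrors the structure of the standard Apache WordCount tutorial; the class name and the command-line input/output paths are illustrative assumptions, not material from the slides.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // "Divide work": each mapper independently processes one split of the input
  // and emits (word, 1) for every token it sees.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // "Combine results": the framework groups emitted pairs by word, and each
  // reducer sums the counts for the words it is responsible for.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class); // local pre-aggregation on each mapper
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // input path (placeholder)
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // output path (placeholder)
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

The key design point is that mappers never talk to each other: the framework alone handles partitioning, shuffling, and failure recovery, which is exactly what makes the divide/combine pattern scale.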

  46. Distributed processing is non-trivial • How to assign tasks to different workers in an efficient way? • What happens if tasks fail? • How do workers exchange results? • How to synchronize distributed tasks allocated to different workers?

  47. Big data storage is challenging • Data volumes are massive • Reliably storing PBs of data is challenging • All kinds of failures: disk/hardware/network failures • The probability of failure simply increases with the number of machines …
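
A hedged back-of-envelope model makes the last point vivid; the per-machine daily failure probability p and the cluster size n below are illustrative assumptions, not figures from the slide. If each of n machines independently fails on a given day with probability p, then

\[ P(\text{at least one failure}) = 1 - (1 - p)^n. \]

With p = 0.001 (roughly one failure per machine every three years) and n = 10,000, this gives 1 - 0.999^{10000} ≈ 1 - e^{-10} ≈ 0.99995: some failure is essentially certain every single day, so the storage layer must treat failure as the normal case rather than the exception.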

  48. What is Hadoop? • Open-source data storage and processing platform • Before the advent of Hadoop, storage and processing of big data was a big challenge • Massively scalable, automatically parallelizable • Based on work from Google • Google: GFS + MapReduce + BigTable (not open) • Hadoop: HDFS + Hadoop MapReduce + HBase (open source) • Named by Doug Cutting in 2006 (who worked at Yahoo! at that time), after his son's toy elephant

  49. Hadoop offers • Redundant, fault-tolerant data storage • A parallel computation framework • Job coordination • Programmers no longer need to worry about: • Q: Where is the file located? • Q: How to handle failures & data loss? • Q: How to divide computation? • Q: How to program for scaling?

  50. Why Use Hadoop? • Cheaper • Scales to Petabytes or more easily • Faster • Parallel data processing • Better • Suited for particular types of big data problems
