COP 6727: Advanced Database Systems Spring 2013 Dr. Tao Li Florida International University
Student Self-Introduction • Name • I will try to remember your names, but if you have a long name, please let me know how I should address you • Anything you want us to know COP6727
Course Overview • Meeting time: • Tuesday and Thursday 12:30pm – 1:45pm • Office hours: • Thursday 2:30pm – 4:30pm or by appointment • Course webpage: • http://www.cs.fiu.edu/~taoli/class/CAP6727-S13/index.html
Course Objectives • This is an advanced database course • Students are expected to have already taken COP5725 • Knowledge of the fundamental concepts of relational databases is assumed • Covers the core principles and techniques of data and information management • Discusses advanced techniques that can be applied to traditional database systems to efficiently support new emerging applications
Tentative Topics • Query processing and optimization • Transaction management • Database tuning • Data stream systems • Spatial databases • XML • Information retrieval and Web data management • Scalable data processing • Readings in recent developments in database systems and applications • SQL vs. NoSQL databases • Nearest neighbor queries • High-dimensional indexing • Database retrieval and ranking • Stream processing • Big Data • Incremental and online query processing • Mobile databases
Assignments and Grading • Reading/Written Assignments • Programming Projects • Midterm Exam • Final Project/Presentations • Class attendance is mandatory. • Evaluation will be a subjective process • Effort is a very important component • Regular In-class Students • Quizzes and Class Participation: 5% • Midterm Exam: 30% • Final Project: 30% • Assignments and Projects: 35% • Online Students • Midterm Exam: 30% • Final Project: 30% • Homework Assignments: 40%
Text and References Raghu Ramakrishnan and Johannes Gehrke. Database Management Systems. Third Edition, McGraw-Hill, 2003. ISBN: 0-07-246563-8. In addition, the course materials will also be drawn from recent research literature.
Lecture 1 & 2 • Lecture 1 & 2: Introduction to MapReduce (most of the slides are adapted from Bill Graham, Spiros Papadimitriou, and Cloudera tutorials)
Outline • Motivation for MapReduce • What is MapReduce? • What is Hadoop? • What is Hive?
Motivation for MapReduce • Big data • How do we handle big data?
Big Data • Big data is everywhere • Documents • Blogs (77 million Tumblr blogs and 56.6 million WordPress blogs as of 2012), microblogs, news, reviews • Images • Instagram, Flickr (more than 6 billion images) • Videos • YouTube, broadcast video • Others • Maps (Google Maps) • Human genome data • Aeronautics and space data
Another view on “big” • 2008: Google processes 20 PB a day • 2009: Facebook has 2.5 PB user data + 15 TB/day • 2009: eBay has 6.5 PB user data + 50 TB/day • 2011: Yahoo! has 180-200 PB of data • 2012: Facebook ingests 500 TB/day
Why do we care about those data? • Modeling and predicting information flow • Recommend/predict links in social networks • Relevance classification / information filtering • Sentiment analysis and opinion mining • Topic modeling and evolution • Measuring influence in social networks • Concept mapping • Search • …
Big data analysis • Scalability (at reasonable cost) • Algorithmic improvements • Intuitive approach: divide and conquer
Divide and Conquer
Challenges • Parallel processing is complicated • How do we assign tasks to workers? • What if we have more tasks than slots? • What happens when tasks fail? • How do we handle distributed synchronization?
Challenges (cont’d) • Data storage is not trivial • A traditional database is not reliable at this scale • Data volumes are massive • Reliably storing PBs of data is challenging • Disk/hardware/network failures • The probability of a failure event increases with the number of machines • For example: • 1,000 hosts, each with 10 disks; a disk lasts 3 years • How many failures per day?
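The question on the slide above can be answered with a quick back-of-the-envelope calculation (a sketch assuming failures are spread evenly over each disk's 3-year lifetime):

```python
# Back-of-the-envelope failure estimate, using the numbers from the slide:
# 1,000 hosts, 10 disks per host, each disk lasting about 3 years on average.
hosts = 1000
disks_per_host = 10
disk_lifetime_days = 3 * 365

total_disks = hosts * disks_per_host                  # 10,000 disks
failures_per_day = total_disks / disk_lifetime_days   # 10,000 / 1,095

print(f"~{failures_per_day:.1f} disk failures per day")  # ~9.1 per day
```

So a modest 1,000-machine cluster should expect roughly nine disk failures every day, which is why replication and automatic recovery are built into the storage layer rather than treated as exceptional events.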
What is MapReduce? • A programming model for expressing distributed computations at a massive scale • An execution framework for organizing and performing such computations • An open-source implementation called Hadoop
Workflow of a Large Data Problem
MapReduce paradigm • Implement two functions: Map(k1, v1) -> list(k2, v2) Reduce(k2, list(v2)) -> list(v3) • The framework handles everything else • Values with the same key go to the same reducer
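The two-function contract can be illustrated with a minimal in-memory sketch in plain Python (not Hadoop code; `run_mapreduce` is an illustrative stand-in for the framework's shuffle-and-sort step), using the classic word count:

```python
from collections import defaultdict

def map_fn(doc_id, text):                 # Map(k1, v1) -> list(k2, v2)
    return [(word, 1) for word in text.split()]

def reduce_fn(word, counts):              # Reduce(k2, list(v2)) -> list(v3)
    return [(word, sum(counts))]

def run_mapreduce(inputs, map_fn, reduce_fn):
    groups = defaultdict(list)            # shuffle: same key -> same reducer
    for k1, v1 in inputs:
        for k2, v2 in map_fn(k1, v1):
            groups[k2].append(v2)
    output = []
    for k2 in sorted(groups):             # each reducer sees keys in sorted order
        output.extend(reduce_fn(k2, groups[k2]))
    return output

docs = [("doc1", "hello world"), ("doc2", "hello mapreduce")]
print(run_mapreduce(docs, map_fn, reduce_fn))
# [('hello', 2), ('mapreduce', 1), ('world', 1)]
```

The user writes only `map_fn` and `reduce_fn`; everything inside `run_mapreduce` (grouping, sorting, delivery) is what the real framework handles across many machines.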
MapReduce Flow
An Example
MapReduce paradigm (cont’d) • There’s more! • Partitioners decide which key goes to which reducer • partition(k’, numPartitions) -> partNumber • Divides the key space into chunks processed by parallel reducers • Default is hash-based • Combiners can combine mapper output before it is sent to the reducer • combine(k2, list(v2)) -> list(v2)
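A default hash-based partitioner is only a few lines; this sketch mirrors what Hadoop's `HashPartitioner` does (hash the key, take it modulo the number of reducers), so every occurrence of a key lands on the same reducer:

```python
def partition(key, num_partitions):
    # partition(k', numPartitions) -> partNumber
    # Same key -> same partition number -> same reducer.
    return hash(key) % num_partitions

num_reducers = 4
keys = ["apple", "banana", "apple", "cherry"]
assignments = {k: partition(k, num_reducers) for k in keys}

# Deterministic within a run: both "apple" records go to one reducer.
assert assignments["apple"] == partition("apple", num_reducers)
print(assignments)
```

A custom partitioner with the same signature can replace the hash, e.g. to route key ranges to specific reducers for a globally sorted output.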
MapReduce Flow
MapReduce: additional details • Reduce starts after all mappers complete • Mapper output is written to disk • Intermediate data can be copied sooner • Each reducer receives its keys in sorted order • Keys are not sorted across reducers • A global sort requires a single reducer or smart partitioning
MapReduce is good at • Embarrassingly parallel algorithms • Summing, grouping, filtering, joining • Off-line batch jobs on massive data sets • Analyzing an entire large dataset
MapReduce can do • Iterative jobs (e.g., PageRank, k-means clustering) • Each iteration must read/write data to disk • The I/O and latency cost of an iteration is high
MapReduce is not good at • Jobs that need shared state/coordination • Tasks are shared-nothing • Shared state requires a scalable state store • Low-latency jobs • Jobs on small datasets • Finding individual records
Summary of MapReduce • Simple programming model • Scalable, fault-tolerant • Ideal for (pre-)processing large volumes of data
What is Hadoop? • Hadoop is an open-source implementation based on GFS and MapReduce from Google • Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung. (2003) The Google File System. SOSP 2003 • Jeffrey Dean and Sanjay Ghemawat. (2004) MapReduce: Simplified Data Processing on Large Clusters. OSDI 2004
Hadoop provides • Redundant, fault-tolerant data storage • Parallel computation framework • Job coordination
Hadoop Stack
Who uses Hadoop? • Yahoo! • Facebook • Last.fm • Rackspace • Digg • Apache Nutch • ...
HDFS • The Hadoop Distributed File System • Redundant storage • Designed to reliably store data using commodity hardware • Designed to expect hardware failures • Intended for large files • Designed for batch inserts
Some Concepts about HDFS • Files are stored as a collection of blocks • Blocks are 64 MB chunks of a file (configurable) • Blocks are replicated on 3 nodes (configurable) • The NameNode (NN) manages metadata about files and blocks • The SecondaryNameNode (SNN) holds a backup of the NN data • DataNodes (DN) store and serve blocks
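The block and replication arithmetic is worth making concrete. A short sketch, using the slide's defaults (64 MB blocks, replication factor 3) and a hypothetical 200 MB file:

```python
import math

# How HDFS splits a file into blocks, with the defaults from the slide:
# 64 MB blocks, each replicated on 3 DataNodes.
block_size = 64 * 1024 * 1024            # 64 MB in bytes
replication = 3

file_size = 200 * 1024 * 1024            # example: a 200 MB file
num_blocks = math.ceil(file_size / block_size)   # last block may be partial
stored_copies = num_blocks * replication

print(num_blocks, stored_copies)         # 4 blocks, 12 stored block replicas
```

So a 200 MB file occupies 4 blocks (three full, one partial) and 12 block replicas across the cluster, which is the metadata the NameNode tracks per file.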
Write
Read
If a DataNode fails • DNs check in with the NN to report health • Upon failure, the NN orders DNs to replicate under-replicated blocks
Jobs and Tasks in Hadoop • Job: a user-submitted map and reduce implementation to apply to a data set • Task: a single mapper or reducer task • Failed tasks are retried automatically • Tasks ideally run local to their data • JobTracker (JT) manages job submission and task delegation • TaskTrackers (TT) ask for work and execute tasks
Architecture
How to handle failed tasks? • The JT retries failed tasks up to N attempts • After N failed attempts for a task, the job fails • Some tasks are slower than others • With speculative execution, the JT starts multiple copies of the same task • The first one to complete wins; the others are killed
Data locality • Move computation to the data • Moving data between nodes has a cost • Hadoop tries to schedule tasks on nodes with the data • When that is not possible, the TT has to fetch data from a DN
Hadoop execution environment • Local machine (standalone or pseudo-distributed) • Virtual machine • Cloud (e.g., Amazon EC2) • Own cluster
Demo: word count
Homework • Write a Hadoop program to index the words within the text document dataset • Example: • Input: • Doc1: Hello World! • Doc2: Hello Java! • Expected output: • Hello \t Doc1 Doc2 • World \t Doc1 • Java \t Doc2 • Due: beginning of class on 01/10 • If you have any questions, email Jingxuan Li (jli003@cs.fiu.edu)
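To make the expected input-to-output mapping unambiguous, here is a plain-Python illustration of the inverted-index behavior on the example above. This is not a valid submission (the assignment requires a Hadoop program); it only shows what the output lines should contain:

```python
from collections import defaultdict

# Illustration of the expected indexing behavior on the example input.
docs = {"Doc1": "Hello World!", "Doc2": "Hello Java!"}

index = defaultdict(list)                 # word -> list of documents containing it
for doc_id, text in docs.items():
    for word in text.replace("!", "").split():
        if doc_id not in index[word]:     # list each document at most once
            index[word].append(doc_id)

for word, doc_ids in index.items():       # one tab-separated line per word
    print(word + "\t" + " ".join(doc_ids))
# Hello   Doc1 Doc2
# World   Doc1
# Java    Doc2
```

In the Hadoop version, the mapper would emit (word, docId) pairs and the reducer would concatenate the document IDs for each word.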
Login Info • Below is the login information for our Hadoop cluster • Server: datamining-node03.cs.fiu.edu • U: dbstudent p: ******* (announced during class) • To gain access to the working directory in HDFS (do not modify or remove the other directories!): hadoop fs -ls /user/dbstudent • Input dataset for the homework (everyone will be working on this dataset, so do not modify it!): /user/dbstudent/dataset • Output directory (including the source code and the indexing results) format: /user/dbstudent/output-PID
What is Hive? • Data warehousing tool on top of Hadoop • Originally developed at Facebook • Now a Hadoop sub-project • Data warehouse infrastructure • Execution: MapReduce • Storage: HDFS files • Large datasets, e.g. Facebook daily logs • 30GB (Jan’08), 200GB (Mar’08), 15+TB (2009) • Hive QL: SQL-like query language
Motivation • Missing components when using Hadoop MapReduce jobs to process data • Command-line interface for “end users” • Ad-hoc query support • … without writing full MapReduce jobs • Schema information
Hive Applications • Log processing • Text mining • Document indexing • Customer-facing business intelligence (e.g., Google Analytics) • Predictive modeling, hypothesis testing
Hive Components • Shell: allows interactive queries, like a MySQL shell connected to a database • Also supports web and JDBC clients • Driver: session handles, fetch, execute • Compiler: parse, plan, optimize • Execution engine: DAG of stages (M/R, HDFS, or metadata) • Metastore: schema, location in HDFS