Homework 4 • Code for word count • http://grepcode.com/file/repository.cloudera.com/content/repositories/releases/com.cloudera.hadoop/hadoop-examples/0.20.2-320/org/apache/hadoop/examples/WordCount.java#WordCount
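For reference, here is a minimal sketch of the mapper and reducer from the classic Hadoop WordCount; the linked example additionally contains the job driver (main method) that wires these classes together.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

  // Mapper: emit (word, 1) for every token in the input line.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reducer (also usable as a combiner): sum the counts per word.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }
}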
Databases in Cloud Environments Based on: Md. Ashfakul Islam, Department of Computer Science, The University of Alabama
Data Today • Data sizes are increasing exponentially every day. • Key difficulties in processing large-scale data: • acquiring the required amount of on-demand resources • auto scaling up and down with dynamic workloads • distributing and coordinating a large-scale job across several servers • replication – maintaining update consistency • A cloud platform can address most of the above.
Large Scale Data Management • Large-scale data management is attracting attention. • Many organizations produce data at the PB level. • Managing such an amount of data requires huge resources. • The ubiquity of huge data sets inspires researchers to think in new ways.
Issues to Consider • Distributed or centralized application? • How can ACID guarantees be maintained? • CAP theorem • Consistency, Availability, Partition tolerance • Data availability and reliability (even under network partition) are achieved by compromising consistency • Traditional consistency techniques become obsolete • Consistency becomes the bottleneck of data management deployment in the cloud • Costly to maintain
Evaluation Criteria for Data Management • Evaluation criteria: • Elasticity • scalable, distributes new resources, offloads unused resources, parallelizable, low coupling • Security • untrusted host, moving off premises, new rules/regulations • Replication • available, durable, fault tolerant, replication across the globe
Evaluation of Analytical DB • An analytical DB handles historical data with little or no updates - no ACID properties needed. • Elasticity • Since there is no ACID requirement – easier • E.g. no updates, so locking is not needed • A number of commercial products support elasticity. • Security • requires sensitive and detailed data • a third-party vendor stores the data • potential risk of data leakage and privacy violation • Replication • A recent snapshot of the DB serves the purpose. • Strong consistency isn't required.
Analytical DBs - Data Warehousing • Data warehousing (DW) - a popular application of Hadoop • Typically a DW is relational (OLAP) • but also holds semi-structured and unstructured data • Can also run on parallel DBs (Teradata) • column oriented • Expensive, $10K per TB of data • Hadoop for DW • Facebook abandoned Oracle for Hadoop (Hive) • Also Pig – for semi-structured data
Evaluation of Transactional DM • Elasticity • data is partitioned over sites • locking and commit protocols become complex and time consuming • huge distributed data processing overhead • Security • requires sensitive and detailed data • a third-party vendor stores the data • potential risk of data leakage and privacy violation
Evaluation of Transactional DM • Replication • data is replicated in the cloud • CAP theorem: of Consistency, Availability, and Partition tolerance, only two can be achieved • under partition, one must choose between consistency and availability • availability is the main goal of the cloud • so consistency is sacrificed • ACID is violated
Transactional Data Management Needed because: • Transactional data management is • the heart of the database industry • how almost all financial transactions are conducted • reliant on ACID guarantees • ACID properties are the main challenge in deploying transactional DM in the cloud.
Scalable Transactions for Web Applications in the Cloud • Two important properties of Web applications: • all transactions are short-lived • data requests can be answered with a small set of well-identified data items • Scalable database services like Amazon SimpleDB and Google BigTable allow data to be queried only by primary key. • These services maintain only eventual data consistency.
Relational Joins • Hadoop is not a DB • Debate between parallel DBs and MR for OLAP • DeWitt/Stonebraker called MR a "step backwards" • Parallel DBs are faster because they can create indexes
Relational Joins - Example • Given 2 datasets S and T: • (k1, (s1, S1)) – k1 is the join attribute, s1 is the tuple ID, S1 is the rest of the attributes • (k2, (s2, S2)) • (k1, (t1, T1)) – the same layout for T • (k2, (t2, T2)) • S could be user profiles – k is the PK, the tuple holds age, gender, etc. • T could be logs of online activity – each tuple is a particular URL, k is a FK
Reduce-side Join 1:1 • Map over both datasets, emit (join key, tuple) • All tuples are grouped by join key – exactly what is needed for the join • Which type of join is this? A parallel sort-merge join • In a one-to-one join – at most 1 tuple from each of S and T matches • If there are 2 values, one must be from S and the other from T (we don't know which, since there is no order), so join them – see the sketch below
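As a concrete illustration, a minimal sketch of the reduce-side 1:1 join. The tab-separated input layout (join key first) and the "S|"/"T|" source tags are assumptions for this sketch, not part of the slides; a real job would attach one such mapper per dataset, e.g. via MultipleInputs.

import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Mapper for dataset S: input lines are "joinKey<TAB>restOfTuple".
// A twin mapper for T would tag its tuples with "T|" instead.
public class SJoinMapper extends Mapper<Object, Text, Text, Text> {
  @Override
  public void map(Object key, Text value, Context context)
      throws IOException, InterruptedException {
    String[] fields = value.toString().split("\t", 2);
    if (fields.length < 2) return;             // skip malformed lines
    // Tag the tuple with its source so the reducer can tell S from T.
    context.write(new Text(fields[0]), new Text("S|" + fields[1]));
  }
}

// Reducer: in a 1:1 join, at most one tuple from each side shares a key.
class OneToOneJoinReducer extends Reducer<Text, Text, Text, Text> {
  @Override
  public void reduce(Text key, Iterable<Text> values, Context context)
      throws IOException, InterruptedException {
    String sTuple = null, tTuple = null;
    for (Text v : values) {                    // at most two values per key
      String s = v.toString();
      if (s.startsWith("S|")) sTuple = s.substring(2);
      else                    tTuple = s.substring(2);
    }
    if (sTuple != null && tTuple != null) {    // both sides present: join
      context.write(key, new Text(sTuple + "\t" + tTuple));
    }
  }
}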
Reduce-side Join 1:N • If one-to-many: • If S is the "one" side (based on its PK), the same approach as 1:1 will work • But – which tuple is from S? (no ordering) • Solution: buffer the values in memory and pick out the tuple from S • Then perform the join • Scalability problem – buffering is limited by memory
Reduce-side Join 1:N • Use value-to-key conversion • Create a composite key: (join key, tuple ID) • Define the sort order so that keys: • sort first by join key • then IDs from S sort before • IDs from T • Define the partitioner to use only the join key, so all tuples with the same join key reach the same reducer
Reduce-side Join 1:N • Can remove the join key and tuple ID from the value to save space • Whenever the reducer sees a new join key, the first tuple will be from S, not T • keep only that S tuple in memory • Join it with the T tuples that follow, until the next new join key • No more memory bottleneck – see the sketch below
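A minimal sketch of the custom partitioner used in value-to-key conversion. Here the composite key is assumed to be encoded as the plain string "joinKey|tag" (tag "0" for S, "1" for T, so the default string sort puts the single S tuple first); a production job would instead define a custom WritableComparable plus a grouping comparator on the join key so one reduce call sees all tuples for that key.

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class JoinKeyPartitioner extends Partitioner<Text, Text> {
  @Override
  public int getPartition(Text key, Text value, int numPartitions) {
    // Partition on the join key only, ignoring the source tag, so all
    // tuples with the same join key reach the same reducer.
    String joinKey = key.toString().split("\\|", 2)[0];
    return (joinKey.hashCode() & Integer.MAX_VALUE) % numPartitions;
  }
}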
Transactional DM • A transaction is a sequence of read & write operations. • Guarantee the ACID properties of transactions: • Atomicity - either all operations execute or none. • Consistency - the DB remains consistent after each transaction execution. • Isolation - the effect of a transaction can't be altered by another concurrent one. • Durability - the impact of a committed transaction is guaranteed to persist.
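To make the ACID guarantees concrete, a hedged JDBC sketch of an atomic funds transfer in a conventional relational DB (the accounts table and its columns are hypothetical):

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

public class TransferDemo {
  // Both updates commit together or neither does (atomicity), so the
  // total balance across accounts is preserved (consistency).
  static void transfer(Connection conn, int from, int to, long cents)
      throws SQLException {
    conn.setAutoCommit(false);                 // start an explicit transaction
    try (PreparedStatement debit = conn.prepareStatement(
             "UPDATE accounts SET balance = balance - ? WHERE id = ?");
         PreparedStatement credit = conn.prepareStatement(
             "UPDATE accounts SET balance = balance + ? WHERE id = ?")) {
      debit.setLong(1, cents);  debit.setInt(2, from);  debit.executeUpdate();
      credit.setLong(1, cents); credit.setInt(2, to);   credit.executeUpdate();
      conn.commit();                           // durability: persists once committed
    } catch (SQLException e) {
      conn.rollback();                         // atomicity: undo partial work
      throw e;
    }
  }
}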
ACID Properties • Atomicity is maintained by two-phase commit (2PC). • Eventual consistency is maintained. • Isolation is maintained by decomposing transactions. • Timestamp ordering is introduced to order conflicting transactions. • Durability is maintained by replicating data items across several local transaction managers (LTMs).
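A minimal sketch of basic timestamp ordering for a single data item, under the usual rule that an operation from a transaction older than the item's latest conflicting access must abort (class and field names are illustrative, not taken from the cited system):

public class TimestampedItem {
  private long maxReadTs = 0;   // largest timestamp that has read this item
  private long maxWriteTs = 0;  // largest timestamp that has written it
  private Object value;

  // Reject a read if a younger transaction has already written the item.
  public synchronized Object read(long ts) {
    if (ts < maxWriteTs) throw new IllegalStateException("abort: too old");
    maxReadTs = Math.max(maxReadTs, ts);
    return value;
  }

  // Reject a write if a younger transaction has already read or written it.
  public synchronized void write(long ts, Object v) {
    if (ts < maxReadTs || ts < maxWriteTs)
      throw new IllegalStateException("abort: too old");
    maxWriteTs = ts;
    value = v;
  }
}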
Consistency in Clouds • A consistent database must remain consistent after the execution of successful operations. • Inconsistency may cause huge damage. • Consistency is routinely sacrificed to achieve availability and scalability. • Strong consistency maintenance in the cloud is very costly.
Traditional DM is becoming obsolete. • Thin portable devices and concentrated computing power show a new way. • ACID guarantees become the main challenge. • Some solutions have been proposed to overcome the challenge. • Consistency remains the bottleneck. • Our goal is to provide low-cost solutions to ensure data consistency in the cloud.
Current DB Market Status • MS SQL doesn't support auto scaling under dynamic load. • MySQL is recommended for "lower traffic" sites. • New products advertise: "replace MySQL with us". • Oracle recently released on-demand resource allocation. • IBM DB2 can auto scale with dynamic workloads. • Azure Relational DB – great performance