Handling BigData On the Public Cloud. Dealing with datasets that grow so large that they become awkward to work with using on-hand database management tools.
Based on the InterOp 2011 presentation by Liran Zelkha (Liran.zelkha@scalebase.com) • Co-founder of ScaleBase • Before that, led Aluna – a database and architecture consulting company • Over 15 years of hands-on technology experience
Agenda • What is Big Data • Big Data On Public Clouds • Some solutions
Big Data (from wikipedia) • …are datasets that grow so large that they become awkward to work with using on-hand database management tools. Difficulties include capture, storage, search, sharing, analytics, and visualizing. This trend continues because of the benefits of working with larger and larger datasets allowing analysts to "spot business trends, prevent diseases, combat crime." Though a moving target, current limits are on the order of terabytes, exabytes and zettabytes of data.
Number 3: • ... you get a call from the utility company asking you not to run 'that brownout query' again. (@aristippus303 at Datawatch)
Number 2: • ... it piles up so high that it disappears into the clouds (@evertlammerts - I assume pun was intended?)
Number 1: • ... the SAN undergoes gravitational collapse and you get cited by OSHA for an unlicensed singularity. (@datamartist)
But seriously • It's not a single number • It is a set of parameters
Big Data Parameters • [Diagram: Big Data sits at the intersection of three dimensions: Volume of Data, Velocity of Data, and Complexity of Analysis] • Source: http://www2.neilmcgovern.com/main.html
Where do we see big data? • Everywhere • Data Warehouse • OLTP • Web 2.0 • SaaS • Billing • Fraud detection • CMS • Family history • Social networking
Volume of data • How much data do you have? • The more, the merrier: better analysis • Used to be measured in 100s of GB, then TB, now PB • But even a 300GB DB can still have Big Data problems • “If you have over 1TB of data – you have a Big Data problem”, IDC
Velocity of data • How many users access the data? • How many writes occur on your data? • How many transactions does your database handle? • Measured in TPS, counted by the thousands
Complexity of Analysis • How complex are your queries? • An example:
SELECT * FROM (
  SELECT w.*, ROWNUM rnum FROM (
    SELECT distinct w.watcher_id from watch w
    left outer join Profile p on p.watcher_id = w.watcher_id
    join atom_feed af on af.resource_id_hash = w.resource_id_hash
    join atom_feed_entry afe on afe.atom_feed_id = af.atom_feed_id
    where (p.LAST_ENTRY_PROCESSED_DATE is null
      or p.LAST_ENTRY_PROCESSED_DATE < afe.create_date)
      and (p.email_enabled_flag is null or p.email_enabled_flag != 'F')
      and af.resource_id = w.resource_id
      and afe.create_date <= sysdate - ?
    ORDER BY w.watcher_id ASC ) w
  where ROWNUM <= ? ) where rnum > ?;
Again from Wikipedia • Public cloud or external cloud describes cloud computing in the traditional mainstream sense, whereby resources are dynamically provisioned on a fine-grained, self-service basis over the Internet, via web applications/web services, from an off-site third-party provider who bills on a fine-grained utility computing basis.
Public Cloud Implications • Pros: • Elastic • Unlimited storage • Unlimited capacity • Cons: • Performance • Standard hardware (no appliances...)
Column Store Database • New databases that internally store data in columns rather than rows • Very good for OLAP • Excellent for BigData
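To make the column-vs-row idea concrete, here is a minimal illustrative sketch in plain Python (a hypothetical example, not the internals of any real column-store product): an aggregate over one attribute only has to touch that attribute's column instead of walking every row.

# Hypothetical illustration only: contrasts row-oriented and column-oriented
# storage for a simple aggregate; not how any specific column store works.

rows = [
    {"order_id": 1, "customer": "acme", "amount": 120.0},
    {"order_id": 2, "customer": "initech", "amount": 75.5},
    {"order_id": 3, "customer": "acme", "amount": 300.0},
]

# Row store: every full row is touched even though only one attribute is needed.
total_row_store = sum(r["amount"] for r in rows)

# Column store: each attribute lives in its own contiguous array,
# so an aggregate reads just the one column it needs.
columns = {
    "order_id": [1, 2, 3],
    "customer": ["acme", "initech", "acme"],
    "amount": [120.0, 75.5, 300.0],
}
total_column_store = sum(columns["amount"])

assert total_row_store == total_column_store
print(total_column_store)  # 495.5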
Again, from Wikipedia: • NoSQL is the term used to designate database management systems that differ from classic relational database management systems (RDBMSes) in some way. These data stores may not require fixed table schemas, and usually avoid join operations and typically scale horizontally. Academics and papers typically refer to these databases as structured storage, a term that would include classic relational databases as a subset.
NoSQL Database • Non-relational databases • Usually store data in memory, replicated across multiple machines • Very low latency
Unstructured Schema • Since SQL is not used, the ERD can be dynamic • Some solutions store data as objects of any kind • Some use binary serialization of the object • Others use a Map API (put, get), as sketched below • Players include: Cassandra, HiveDB, Membase, MongoDB
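As a rough illustration of the put/get Map API style mentioned above, the following hypothetical Python sketch (not the client API of Cassandra, Membase, or MongoDB) keeps schema-free values in memory and writes each one to a couple of replicas, the basic pattern these stores use for low latency and redundancy.

# Hypothetical key-value store sketch: an in-memory Map API (put/get) that
# writes every value to N replicas. Not the API of any real NoSQL product.

class Replica:
    def __init__(self, name):
        self.name = name
        self.data = {}  # plain dict: no fixed schema, values can be anything

class ReplicatedKVStore:
    def __init__(self, replicas, write_copies=2):
        self.replicas = replicas
        self.write_copies = min(write_copies, len(replicas))

    def _replicas_for(self, key):
        # Pick replicas deterministically from the key's hash.
        start = hash(key) % len(self.replicas)
        return [self.replicas[(start + i) % len(self.replicas)]
                for i in range(self.write_copies)]

    def put(self, key, value):
        for replica in self._replicas_for(key):
            replica.data[key] = value

    def get(self, key):
        # Read from the first replica that holds the key.
        for replica in self._replicas_for(key):
            if key in replica.data:
                return replica.data[key]
        return None

store = ReplicatedKVStore([Replica("node-a"), Replica("node-b"), Replica("node-c")])
store.put("user:42", {"name": "Dana", "tags": ["beta", "premium"]})  # schema-free value
print(store.get("user:42"))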
NewSQL • Dubbed by 451 Group analyst Matthew Aslett • "NewSQL" is our shorthand for the various new scalable/high performance SQL database vendors. We have previously referred to these products as 'ScalableSQL' to differentiate them from the incumbent relational database products. Since this implies horizontal scalability, which is not necessarily a feature of all the products, we adopted the term 'NewSQL' in the new report. And to clarify, like NoSQL, NewSQL is not to be taken too literally: the new thing about the NewSQL vendors is the vendor, not the SQL.
New Databases • New database engines • Usually scale very well, can store a lot of data, and are targeted at virtual environments • Players include: NimbusDB, VoltDB
New MySQL Storage Engines • New databases that look like MySQL from the outside: MySQL network protocol, MySQL SQL flavor • Players include: Akiban, ScaleDB
ScaleBase • ScaleBase offers Database Load Balancers • Scalability and high availability for your database, totally transparent to your application
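ScaleBase's actual implementation is not detailed here; purely to illustrate the general idea of a database load balancer sitting between the application and several database servers, here is a hypothetical Python sketch that routes a query to one of a few MySQL endpoints by hashing a shard key (the endpoint names and the route function are made up for this example). A real product would also handle connections, replication, failover, and cross-shard queries.

# Hypothetical sketch of shard routing: maps a shard key to one of several
# database endpoints. A toy illustration of the general idea only, not how
# ScaleBase or any specific product is implemented.

import hashlib

SHARDS = [
    "mysql://db-shard-0.example.internal:3306/app",
    "mysql://db-shard-1.example.internal:3306/app",
    "mysql://db-shard-2.example.internal:3306/app",
]

def shard_for(customer_id):
    # Pick a backend deterministically from the shard key.
    digest = hashlib.md5(str(customer_id).encode()).hexdigest()
    return SHARDS[int(digest, 16) % len(SHARDS)]

def route(sql, customer_id):
    # A real router would open a connection to the backend and execute the SQL;
    # here we just return the chosen endpoint alongside the query.
    return shard_for(customer_id), sql

backend, sql = route("SELECT * FROM orders WHERE customer_id = %s", 1138)
print(backend, "<-", sql)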
Summary • There are many ways to handle BigData on cloud environments • Understand your data requirements well – and use the right tool for the job • No one tool fits them all!