DROSS: Distributed & Resilient Open Source Software
Andrew Hardie, http://ashardie.com
ECPRD WGICT, 17-21 November 2010, Chamber of Deputies, Bucharest
Topics
• Distributed, not virtualized or ‘cloud’
• DRBD
• Gluster
• Heartbeat
• Nginx
• Trends:
  • NoSQL
  • Map / Reduce
  • Cassandra, Hadoop & family
• Other stuff ‘out there’
• Predictions…
DRBD
• Block-level disk replicator (effectively, network RAID-1)
DRBD – Good/bad points
• Good for HA clusters (e.g. LAMP servers)
• Ideal for block-level apps, e.g. MySQL
• Sync/async operation
• Auto recovery from disk, net or node failure
• In Linux kernels from 2.6.33 (Ubuntu 10.10 is 2.6.35)
• Supports InfiniBand, LVM, Xen, dual-primary config
• Hard to extend beyond two systems; three is the maximum
• Remote offsite replication really needs DRBD Proxy (commercial)
• Requires a dedicated disk/partition
• Moderately difficult to configure
• Documentation could be better
Gluster
• Filesystem-level replicator
• More like NAS than RAID
• Claims to scale to petabytes
• Nodes can be servers, clients or both
• On-the-fly reconfiguration of disks & nodes
• Scripting interface
• ‘Cloud compliant’ (isn’t everything?)
Gluster – Use case: Dublin
Real-time mirroring of digital audio
Gluster – Good/bad points
• Moving to a “turnkey system” (black box)
• N-way replication easy
• Easier than DRBD to configure
• Dedicated partitions or disks not required
• Supports InfiniBand
• Background self-healing (pull rather than push)
• Aggregate and/or replicate volumes
• POSIX support
• Native support for NFS, CIFS, HTTP & FTP
• No specific features for slow-link replication
• Similar documentation-vs-revenue-earning tension
Heartbeat
• HA cluster infrastructure (“cluster glue”)
• Needs a Cluster Resource Manager (CRM), e.g. Pacemaker, to be useful
• Part of the Linux-HA project
• Provides:
  • Hot-swap of a synthetic IP address between nodes (the synthetic IP is in addition to each node’s own IPs)
  • Node failure/restore detection
  • Start/stop of the services to be managed, via init scripts
Heartbeat/DRBD – Use case
HA LAMP server pair
Heartbeat – Good/bad points
• Lots of resource agents available
  • e.g. Apache, Squid, Sphinx search, VMware, DB2, WebSphere, Oracle, JBoss, Tomcat, Postfix, Informix, SAP, iSCSI, DRBD, …
• Beyond a simple 2-way hot-swap, configuration can get very complicated
• Good for stateless services (e.g. HTTP); not so good for file shares (e.g. Samba)
• Documentation out of date in some areas, e.g. Ubuntu ‘upstart’ scripts (boot-time startup of services to be managed by Heartbeat has to be disabled)
NGINX
• Fast, simple Russian HTTP server
• Reverse proxy server
• Mail proxy server
• Fast static content serving
• Very low memory footprint
• Load balancing and fault tolerance
• Name- and IP-based virtual servers
• Embedded Perl
• FLV streaming
• Non-threaded, event-driven architecture
• Modular architecture
• Can front-end Apache (instead of mod_proxy)
Trends – NoSQL, etc…
• NoSQL
  • Or is it really NoACID (atomicity, consistency, isolation, durability)?
  • It’s really the ACID that’s hard to scale, especially in very large, very active data stores (e.g. social networks)
  • Some NoSQL stores now have SQL for query only
  • Ways of solving ACID scalability are being discussed
• The problems:
  • Huge numbers of simultaneous updates
  • Large JOINs across very large tables (= big SQL queries)
  • Lots of updates & searches on small data elements in vast data sets
• The alternative (sketched in code below):
  • Key/value stores
  • De-normalized data
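To make the trade-off concrete, here is a minimal sketch in plain Python dicts (no real database involved; the member and speech values are invented): the normalized version needs a JOIN-style lookup across two “tables” to answer a query, while the key/value version pays extra storage to answer the same question with a single get by key.

```python
# Relational-style: normalized tables; answering a query needs a JOIN.
members = {1: {"name": "A. Deputy"}, 2: {"name": "B. Deputy"}}
speeches = [
    {"member_id": 1, "sitting": "2010-11-17", "text": "Point of order..."},
    {"member_id": 2, "sitting": "2010-11-17", "text": "On the budget..."},
]
# "JOIN": look up the member row for every speech row.
joined = [{"member": members[s["member_id"]]["name"], **s} for s in speeches]

# Key/value alternative: de-normalized, one lookup per key, no JOIN,
# at the cost of repeating the member's name in every speech record.
kv_store = {
    "speech:2010-11-17:0001": {"member": "A. Deputy", "text": "Point of order..."},
    "speech:2010-11-17:0002": {"member": "B. Deputy", "text": "On the budget..."},
}
print(kv_store["speech:2010-11-17:0001"]["member"])  # single get by key
```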
Consequences of de-normalizing
• Order(s)-of-magnitude increase in storage requirements
• Difficulty of updating numerous “key equivalents” in many places – can’t be done synchronously (see the sketch below)
• Breaking relationship links allows parallel processing:
  • Helps the bottleneck of storage read speed (storage capacity is growing much faster than transfer rates)
• No JOINs or transactions
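A toy illustration of the update problem, again in plain Python rather than any particular store’s API: once a value has been copied into many de-normalized records, changing it means rewriting every copy, which in practice is done as an asynchronous background job rather than a single transaction.

```python
# Toy illustration (not any real store's API): a value that was copied into
# many de-normalized records must be rewritten everywhere it appears.
store = {
    f"speech:{i}": {"member": "A. Deputy", "party": "Old Party", "text": "..."}
    for i in range(5)
}

def fan_out_update(store, member, field, new_value):
    """Rewrite every record embedding the stale value.

    In a real system this would be a queued background job, since it
    cannot be done as one synchronous transaction."""
    for key, record in store.items():
        if record.get("member") == member:
            record[field] = new_value
            yield key  # pretend each yield is one queued write

updated = list(fan_out_update(store, "A. Deputy", "party", "New Party"))
print(len(updated), "records rewritten")  # 5
```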
Name/value models (sketched in code below)
• Just name/value pairs, e.g. memcachedb, Dynamo
• Name/value pairs plus associated data, e.g. CouchDB, MongoDB – think document stores with metadata
• Name/value pairs with nesting, e.g. Cassandra
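Roughly how the three flavours differ, sketched as plain Python data structures rather than the stores’ real APIs (all keys and field names are invented):

```python
# 1. Just name/value pairs (memcachedb / Dynamo style): the value is opaque.
plain = {"page_hits:home": b"1042"}

# 2. Name/value plus associated data (CouchDB / MongoDB style):
#    the value is a document whose named fields the store understands.
documents = {
    "doc:42": {"title": "Order paper", "format": "xml", "body": "<op>...</op>"},
}

# 3. Name/value with nesting (Cassandra style): the value is itself a
#    collection of named columns grouped under the key.
nested = {
    "user:42": {
        "profile":  {"name": "A. User", "lang": "ro"},
        "settings": {"timezone": "Europe/Bucharest"},
    }
}
```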
Cassandra
• Distributed, fault-tolerant database, based on ideas in Dynamo (Amazon) & BigTable (Google)
• Developed by Facebook, open-sourced in 2008; now an Apache project
• Key/value pairs, in column-oriented format (see the sketch below)
  • Standard column: name, value, timestamp
  • Super-column: name, plus a map of columns, each with name, value, timestamp (think array of hashes)
• Grouped by column family, also either standard or super
  • A column family contains ‘rows’, roughly like a DB table
  • Column families then go in key-spaces
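A rough sketch of that hierarchy as nested Python dicts; in Cassandra itself each column is a (name, value, timestamp) triple and the families are defined per key-space. The row keys, column names and timestamps below are invented for illustration.

```python
# key-space -> column family -> row key -> columns (all names invented)
keyspace = {
    "demo": {                                    # key-space
        "Users": {                               # standard column family
            "ahardie": {                         # row key ('row' in the family)
                # standard columns: name -> (value, timestamp)
                "full_name": ("Andrew Hardie", 1290000000),
                "site":      ("http://ashardie.com", 1290000000),
            },
        },
        "Contacts": {                            # super column family
            "ahardie": {                         # row key
                # super-columns: name -> map of columns (array of hashes)
                "work": {"email": ("work@example.org", 1290000001)},
                "home": {"email": ("home@example.org", 1290000002)},
            },
        },
    }
}
```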
Cassandra – NoACID
• Cassandra et al. (e.g. Voldemort at LinkedIn) trade consistency and atomicity for speed, distribution and availability
• No single point of failure
• “Eventually consistent” model
• Tunable levels of consistency (see the sketch below)
• Atomicity only guaranteed within a column family
• Accessed using Thrift (also developed by Facebook)
• Used by:
  • Facebook
  • Digg
  • Twitter
  • Reddit
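One way to picture “tunable consistency” is the replica-overlap rule from the Dynamo lineage; this is a conceptual sketch, not Cassandra’s API. With N replicas, a write acknowledged by W of them and a read that consults R of them are guaranteed to overlap on an up-to-date copy only when R + W > N; choosing smaller R and W buys speed and availability at the price of only eventual consistency.

```python
def read_sees_latest_write(n_replicas: int, w: int, r: int) -> bool:
    """Conceptual rule, not an API call: a read quorum of R and a write
    quorum of W must overlap on at least one replica holding the latest
    write whenever R + W > N."""
    return r + w > n_replicas

# N = 3 replicas: quorum writes (2) + quorum reads (2) overlap...
print(read_sees_latest_write(3, 2, 2))   # True
# ...while 1 + 1 is fast and available but only eventually consistent.
print(read_sees_latest_write(3, 1, 1))   # False
```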
NoSQL for Parliaments?
• Much parliamentary material is naturally unstructured and suited to the name/value model (think XML)
• Remember the old discussions about how to map such parliamentary material into relational databases?
• Think of every MP’s contribution (speech) in chamber or committee as a key/value pair, i.e. a column
• Think of every PQ & answer as a super-column of name/value pairs for question, answer, holding, supplementary, pursuant, referral …
• Hansard becomes a super-column family! (see the sketch below)
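A hedged sketch of that idea, once more as plain Python data: one PQ as a row in a hypothetical super-column family, with each stage of the question’s life as a named group of name/value pairs (all field names invented).

```python
# One row of a hypothetical "PQ" super-column family: the row key is the
# question number; each super-column groups the name/value pairs for one
# stage of the question's life.
pq_row = {
    "PQ-2010/1234": {
        "question":      {"member": "A. Deputy", "text": "To ask the Minister...",
                          "tabled": "2010-11-01"},
        "holding":       {"text": "I will reply as soon as possible.",
                          "date": "2010-11-08"},
        "answer":        {"minister": "B. Minister", "text": "The scheme will...",
                          "date": "2010-11-15"},
        "supplementary": {"member": "A. Deputy", "text": "Pursuant to that answer...",
                          "date": "2010-11-16"},
    }
}
# A chamber day's speeches would likewise be one row per sitting, one column
# per contribution: "Hansard as a super-column family".
```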
Map / Reduce
• Column- (or record-) oriented design & de-normalized data power the parallel “map/reduce” model (think “sharding on speed”); a minimal sketch follows below
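A minimal, pure-Python sketch of the model (no Hadoop or sharding framework involved; the records are invented): the map step turns each de-normalized record into (key, value) pairs with no cross-record dependencies, so it can run on every shard in parallel; a shuffle groups values by key; the reduce step then aggregates each key independently.

```python
from collections import defaultdict

# Shards of de-normalized "speech" records, as they might sit on different nodes.
shards = [
    [{"member": "A. Deputy", "words": 420}, {"member": "B. Deputy", "words": 180}],
    [{"member": "A. Deputy", "words": 305}],
]

def map_phase(record):            # runs independently per record (parallelizable)
    yield record["member"], record["words"]

def reduce_phase(key, values):    # runs independently per key
    return key, sum(values)

# Shuffle: group all mapped values by key.
grouped = defaultdict(list)
for shard in shards:
    for record in shard:
        for key, value in map_phase(record):
            grouped[key].append(value)

print(dict(reduce_phase(k, v) for k, v in grouped.items()))
# {'A. Deputy': 725, 'B. Deputy': 180}
```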
Hadoop
• Nothing to do with NoSQL
• Hadoop is an infrastructure, and now a family of tools, for managing distributed systems and immense datasets
• How immense? Hundreds of GB and a 10-node cluster is ‘entry-level’ in Hadoop terms
• Developed by Yahoo for their cloud, now an Apache project
• Supports map/reduce by pre-dividing & distributing data (see the streaming sketch below)
• “Moves computation to the data instead of data to the computation”
• The HDFS file system is particularly interesting – distributed, resilient (far more advanced than DRBD or Gluster), but not real time (more eventually consistent…)
• Hive data-warehouse front end – has SQL-like queries
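Hadoop’s native API is Java, but the same kind of job can be sketched through Hadoop Streaming, which pipes each input split through any executable that reads stdin and writes tab-separated key/value lines. The script below is a word-count sketch; the file name and the example invocation in the comment are assumptions about a typical installation, not taken from the talk.

```python
#!/usr/bin/env python3
# Hadoop Streaming sketch: one script used as both mapper and reducer.
# Streaming pipes each input split through the mapper on stdin and feeds
# the reducer its lines already sorted by key, tab-separated.
# Example invocation (paths are an assumption about a typical install):
#   hadoop jar hadoop-streaming.jar -input /speeches -output /wordcounts \
#     -mapper "wordcount.py map" -reducer "wordcount.py reduce"
import sys
from itertools import groupby

def mapper():
    for line in sys.stdin:
        for word in line.split():
            print(f"{word.lower()}\t1")

def reducer():
    parsed = (line.rstrip("\n").split("\t", 1) for line in sys.stdin)
    for word, group in groupby(parsed, key=lambda kv: kv[0]):
        print(f"{word}\t{sum(int(count) for _, count in group)}")

if __name__ == "__main__":
    mapper() if sys.argv[1:] == ["map"] else reducer()
```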
Who uses Hadoop?
• Twitter
• AOL
• IBM
• Last.fm
• LinkedIn
• eBay
• Yahoo
  • 36,000 machines with > 100,000 cores running Hadoop
  • Largest single cluster is only 4,000 nodes
• Largest known cluster is Facebook!
  • 2,000 machines with 22,400 cores
  • 21 petabytes in a single HDFS store
Hadoop for Parliaments?
• Hadoop may seem overkill for parliaments now…
• But when you start your legacy-collection digitization and digital preservation projects, its model for managing large datasets that essentially do not change and don’t need real-time commit is a very good fit!
• Other interesting Hadoop projects:
  • ZooKeeper (distributed apps co-ordination)
  • Hive (data warehouse infrastructure)
  • Pig (high-level data flow language)
  • Mahout (scalable machine learning library)
  • Scribe (for aggregating streaming log data) [not strictly a Hadoop project, but can be integrated with it, using an interesting work-around for the non-real-time commit & the NameNode single point of failure]
Other things ‘out there’
• Drizzle
  • A database “optimized for Cloud infrastructure and Web applications”
  • “Design for massive concurrency on modern multi-cpu architecture”
  • But doesn’t actually explain how to use it for these…
  • It’s SQL and ACID
  • Mostly seems to be a reaction against what’s happening at MySQL…
  • Has to be compiled from source – no packages in the distros yet
• CouchDB (small API sketch below)
  • Distributed, fault-tolerant, schema-free, document-oriented database
  • RESTful JSON API (i.e. a Web front end)
  • Incremental replication with bi-directional conflict detection
  • Written in Erlang (highly reliable language developed by Ericsson)
  • Supports ‘map/reduce’-like querying and indexing
  • Interesting model, different from most other offerings
  • Also now an Apache project
  • Still too immature for anything beyond experimentation
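A small sketch of what “RESTful JSON API” means in practice, using only the Python standard library; it assumes a CouchDB instance on the default localhost:5984, and the database and document names are invented.

```python
import json
import urllib.request

BASE = "http://localhost:5984"   # assumes a default local CouchDB

def couch(method, path, body=None):
    """Send one JSON request to CouchDB's HTTP API and decode the reply."""
    data = json.dumps(body).encode() if body is not None else None
    req = urllib.request.Request(BASE + path, data=data, method=method,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())

print(couch("PUT", "/hansard"))                       # create a database
print(couch("PUT", "/hansard/pq-2010-1234",           # create a document
            {"type": "pq", "text": "To ask the Minister..."}))
print(couch("GET", "/hansard/pq-2010-1234"))          # read it back (includes _rev)
```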
Also ‘out there’
• Voldemort
  • Another distributed key/value storage system
  • Used at LinkedIn
  • Doesn’t seem to have much of a future
  • Cassandra is similar, better & more widely used
• MonetDB
  • “database system for high-performance applications in data mining, OLAP, GIS, XML Query, text and multimedia retrieval”
  • SQL and XQuery front ends
  • Also hard to see where it’s going…
• MongoDB (see the sketch below)
  • Tries to bridge the gap between RDBMS and map/reduce
  • JSON document storage (like CouchDB)
  • No JOINs, no transactions
  • Supports atomic operations only on single documents
  • Interesting, but may ‘fall between two stools’
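A short sketch using the pymongo driver (it assumes a local mongod and pymongo installed; the database, collection and field names are invented). The update illustrates the single-document atomicity mentioned above: the two operators apply together atomically within that one document.

```python
from pymongo import MongoClient   # assumes `pip install pymongo` and a local mongod

client = MongoClient("mongodb://localhost:27017")
questions = client["parliament"]["questions"]   # database / collection names invented

questions.insert_one({"_id": "PQ-2010/1234",
                      "member": "A. Deputy",
                      "status": "tabled",
                      "supplementaries": 0})

# Single-document atomicity: both operators below apply together atomically,
# but only within this one document.
questions.update_one({"_id": "PQ-2010/1234"},
                     {"$set": {"status": "answered"}, "$inc": {"supplementaries": 1}})

print(questions.find_one({"_id": "PQ-2010/1234"}))
```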
Predictions
• Hadoop and Cassandra are the ones to watch
• There will likely be some sort of re-convergence between NoSQL and query languages of some kind – you can’t do everything with map/reduce (especially not ad hoc queries)
• SQL may be destined to become like COBOL – still around and running things, but not something to use for new projects
• Distributed storage models (with or without map/reduce) have a good future
• Datasets will only get bigger – compliance, audit, digital preservation, the shift to visuals, etc.
• Information management models (“strategy”) and access speed will remain key problems
Questions
“What’s it all about?”
http://ashardie.com