Big Data, NoSQL. . . So What? Iran Hutchinson
Me • I work for InterSystems, which: • Drives the http://globalsdb.org NoSQL project • Has 20+ years of NoSQL production deployments • Has 20+ years of Big Data production deployments • Built a ~250 million Euro business on the above • Email: iran.hutchinson@intersystems.com • Twitter: #iranic
Big Data is … • Important data, in varying formats and volumes, that is generated across every area affecting your business and is generally not centrally correlated or managed. • Examples include: • Word Files, PowerPoint, PDFs • Emails, Instant Messaging, Texts • Blogs and Social Media • Automated data from machine activities • Stream data from financial stock markets
Some Big Data Numbers • Source: McKinsey Global Institute • 5 Billion mobile phones in use in 2010 • 30 Billion pieces of info shared on Facebook each month • 40% projected growth in global data generated per year • 235 Terabytes collected by the US Library of Congress (April 2011) • 15 out of 17 sectors in the US have more data stored per company than the US Library of Congress
Some Big Data Numbers … • Source: McKinsey Global Institute • $300 Billion in potential value in US Healthcare system • €250 Billion in Europe’s public sector administration • $600 Billion in annual consumer surplus using location data • 60% Potential increase in retail operating margins • 140,000 – 190,000 analytical talent positions in US • 1.5 Million data-savvy managers needed in US
Case Study: Credit Suisse • Key Challenges: • Revamp order routing architecture • Revamp order management architecture • Serve current demand and scale to new levels • Address downtime challenges
Case Study: Credit Suisse … • Big Data in the form of volumes of transactions • Leveraged Caché’s: • In-memory architecture for performance • On-disk resiliency for availability • Distributed architecture for data coherency • Can easily process 1,000,000,000 transactions during business hours
Case Study: European Space Agency (ESA) • Key Challenges • Make the largest, most precise 3-D map of our Galaxy • Monitor 1,000,000,000 stars over 5 years, precisely charting position, movement, and brightness • Along the way discover hundreds of thousands of new celestial objects
Case Study: ESA Continued … • Challenge Calculation: • Capture data for 1 Billion Celestial Objects • 1,000,000,000 objects × 100 observations per object × 600 bytes per observation = 60,000,000,000,000 bytes (60 TB) • Solution: Caché/XEP, delivering 100,000+ sustained inserts per second per server, stored as real objects with SQL access • Source: http://www.intersystems.com/cache/whitepapers/pdf/Charting_the_Galaxy.pdf
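The challenge calculation above can be checked with a few lines of Python. This is only a back-of-the-envelope sketch: the three input figures and the insert rate come from the slide, while the single-server load time is an added assumption (one insert per observation, one server, the 100,000 inserts per second lower bound) and not a figure from the talk.

```python
# Figures taken from the slide.
OBJECTS = 1_000_000_000          # celestial objects to capture
OBSERVATIONS_PER_OBJECT = 100    # observations per object
BYTES_PER_OBSERVATION = 600      # bytes per observation
INSERTS_PER_SECOND = 100_000     # sustained inserts per second per server (lower bound)

total_observations = OBJECTS * OBSERVATIONS_PER_OBJECT
total_bytes = total_observations * BYTES_PER_OBSERVATION
total_terabytes = total_bytes / 10**12   # decimal terabytes

# Assumption (not from the slides): one server, one insert per observation.
seconds_on_one_server = total_observations / INSERTS_PER_SECOND
days_on_one_server = seconds_on_one_server / 86_400

print(f"{total_bytes:,} bytes = {total_terabytes:.0f} TB")
print(f"about {days_on_one_server:.1f} days of sustained inserts on one server")
```

Under those assumptions the raw insert work is a matter of days on one server, which is why the sustained insert rate matters more here than the total volume.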
Enabling Technology • Focus on Caché • A quick look at the architecture • [Architecture diagram; surviving labels: Paradigm, Language]
Enabling Technology … • Java + C database kernel run in same process
Enabling Technology … • ECP, Distributed Computing
Enabling Technology … • Multiple, simultaneous data-to-disk writers • [Diagram; surviving labels: Caché Buffer, Journalers, Hard Disk]
Who is this Guy? • Edgar Frank “Ted” Codd • Known for his 12 Rules for Relational Data Systems (actually 13 rules, numbered 0 to 12)
NoSQL … Breaking the Rules • Rule 1: The Information Rule • All information is represented in 1 and only 1 way, namely by values in column positions within rows of tables • Rule 12: The Nonsubversion Rule • If the system provides a low-level (record-at-a-time) interface, then that interface cannot be used to subvert the system, for example by bypassing relational security or integrity constraints.
Why NoSQL? • No to ACID transactions • No to the impedance mismatch with SQL • Dealing with Big Data and Web Scale • High prices from RDBMS vendors • Use commodity hardware • Flexible data models • It’s a cool movement ….
Is NoSQL a new Concept? • No • Remember MUMPS? • SET ^Car("Door","Color")="BLUE" • Remember Multi-value/PICK? • MATWRITE array.variable ON file.variable,id. …. • Ever heard of the NoSQL RDB? • Carlo Strozzi • http://www.strozzi.it/cgi-bin/CSA/tw7/I/en_US/nosql/Home%20Page
CAP Theorem • Consistency • A service that is consistent operates fully or not at all. • Availability • The service is available to operate fully or not at all. • Partition Tolerance • Data is managed on multiple nodes. One node is one partition, so it either works or it does not when it comes to processing data. • Significant because you can have only 2 of these 3 at once …
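To make the "2 of 3" trade-off concrete, here is a deliberately simplified toy in Python. It is not a real replication protocol and not how any product named in this talk works; the class names and the `mode` flag are hypothetical. It only shows that, once two replicas cannot talk to each other, a write must either be refused (giving up availability) or accepted on one side (giving up consistency).

```python
class Replica:
    """One node holding a copy of the data."""

    def __init__(self):
        self.data = {}


class TwoNodeStore:
    """Toy two-replica store; `mode` picks what to give up during a partition."""

    def __init__(self, mode):
        if mode not in ("CP", "AP"):
            raise ValueError("mode must be 'CP' or 'AP'")
        self.mode = mode
        self.a, self.b = Replica(), Replica()
        self.partitioned = False   # True when the replicas cannot reach each other

    def write(self, key, value):
        """Write through replica A; return True if the write was accepted."""
        if not self.partitioned:
            self.a.data[key] = value
            self.b.data[key] = value   # replicate synchronously
            return True
        if self.mode == "CP":
            return False               # refuse: stay consistent, lose availability
        self.a.data[key] = value       # accept: stay available, replicas diverge
        return True

    def consistent(self):
        return self.a.data == self.b.data


cp = TwoNodeStore("CP")
cp.partitioned = True
cp_accepted = cp.write("x", 1)     # refused during the partition

ap = TwoNodeStore("AP")
ap.partitioned = True
ap_accepted = ap.write("x", 1)     # accepted, but only replica A sees it

print(cp_accepted, cp.consistent())
print(ap_accepted, ap.consistent())
```

The articles linked on the next slide argue about how far this simple picture holds in practice.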
CAP Theorem … • Arguments and links • http://www.julianbrowne.com/article/viewer/brewers-cap-theorem • http://ksat.me/a-plain-english-introduction-to-cap-theorem/ • http://voltdb.com/company/blog/clarifications-cap-theorem-and-data-related-errors
Distributed computing • Fallacies (Peter Deutsch) • The network is reliable • Latency is zero • Bandwidth is infinite • The network is secure • Topology doesn’t change • There is one administrator • Transport cost is zero • The network is homogeneous • Remember JINI? (See Apache River project)
NoSQL: Which project? • http://nosql-database.org/ lists 122 projects today. • Depends on your model selection. • Most likely, choose a well-known project. • Don’t forget about shared risk!
NoSQL: Querying • Some solutions have no querying • When available, query languages differ • Lack of general ad hoc querying (“no” SQL) • Have you heard of UnQL? • http://www.unqlspec.org/display/UnQL/Home • NOTE: Toad for Cloud
NoSQL: How to Succeed? • Know your application • Don’t forget the lessons of the past • Consider a hybrid approach • Fight the desire to Roll-Your-Own-DB • Start small but significant
NoSQL: Hybrid Approach 1 • Two Systems • NoSQL System • SQL/RDBMS • [Diagram; surviving labels: NoSQL, SQL/RDBMS, Data Mapper / Translator]
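A minimal sketch of what the Data Mapper / Translator in this two-system approach might do, in Python. Everything here is illustrative: the order document, the table names and `map_order_to_rows` are hypothetical, and the standard-library `sqlite3` module simply stands in for "the SQL/RDBMS". The idea is that a nested, schema-free record from the NoSQL side is flattened into rows that the relational side can query with ordinary SQL.

```python
import sqlite3

# Hypothetical document as a NoSQL store might hold it (nested, schema-free).
order_doc = {
    "id": 1001,
    "customer": "ACME",
    "items": [
        {"sku": "A-1", "qty": 2},
        {"sku": "B-7", "qty": 1},
    ],
}


def map_order_to_rows(doc):
    """Flatten one nested order document into relational rows."""
    order_row = (doc["id"], doc["customer"])
    item_rows = [(doc["id"], item["sku"], item["qty"]) for item in doc["items"]]
    return order_row, item_rows


# In-memory SQLite database standing in for the SQL/RDBMS side.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT)")
conn.execute("CREATE TABLE order_items (order_id INTEGER, sku TEXT, qty INTEGER)")

order_row, item_rows = map_order_to_rows(order_doc)
conn.execute("INSERT INTO orders VALUES (?, ?)", order_row)
conn.executemany("INSERT INTO order_items VALUES (?, ?, ?)", item_rows)

# Ad hoc SQL over data that originated as a document.
total_qty = conn.execute(
    "SELECT SUM(qty) FROM order_items WHERE order_id = ?", (order_doc["id"],)
).fetchone()[0]
print(total_qty)
```

The cost of this approach is the mapper itself: every change to the document shape has to be mirrored in the mapping and the relational schema.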
NoSQL: Hybrid Approach 2 • One system does both NoSQL and SQL
GlobalsDB.org Project • Name comes from the underlying data structure • Multi-dimensional array • Basis for the commercial Caché data system • Free for development and production deployment • NoSQL DB with Java and Node.js APIs • Code base is the same as the commercial product • APIs are open sourced or being open sourced • Database kernel is not open source
A “Global” Definition • A Global is a persistent, sparse, multi-dimensional array, which consists of one or more storage elements or "nodes". Each node is identified by a node reference (which is, essentially, its logical address) • simple == "some data" • complex["subscript-1", "subscript-2"] == "some data" • Example • product[item, type, os, processor] == quantity • product["computer", "laptop", "Mac", "i7"] == 3
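The definition above can be mimicked with a small in-memory Python class. This is a toy model only: it is not the GlobalsDB Java or Node.js API, the `Global` class and its method names are made up for illustration, and persistence is not modelled at all. It shows the two properties the slide names: nodes are addressed by a list of subscripts, and the array is sparse, so only nodes that were explicitly set hold data.

```python
class Global:
    """Toy in-memory stand-in for a global: a sparse array keyed by subscript tuples."""

    def __init__(self, name):
        self.name = name
        self._nodes = {}   # maps a tuple of subscripts to the node's value

    def set(self, value, *subscripts):
        """Store a value at the node addressed by the given subscripts."""
        self._nodes[subscripts] = value

    def get(self, *subscripts, default=None):
        """Return the value at a node, or `default` if the node holds no data."""
        return self._nodes.get(subscripts, default)

    def children(self, *prefix):
        """Sorted distinct subscripts found one level below the given prefix."""
        depth = len(prefix)
        return sorted({key[depth] for key in self._nodes
                       if len(key) > depth and key[:depth] == prefix})


product = Global("product")
product.set(3, "computer", "laptop", "Mac", "i7")     # the slide's example node
product.set(5, "computer", "desktop", "Linux", "i5")  # second node, made up for illustration

print(product.get("computer", "laptop", "Mac", "i7"))
print(product.get("computer", "laptop"))   # intermediate node was never set
print(product.children("computer"))
```

Walking the subscripts level by level, as `children` does, is what makes it possible to layer other models such as objects or tables on top of the same structure.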
GlobalsDB Architecture • Current Architecture • [Architecture diagram; surviving labels: Paradigm, Language]
GlobalsDB, NoSQL, Big Data • http://nosql.mypopescu.com/ • http://highscalability.com/ • http://nosqltapes.com/ • http://globalsdb.wordpress.com