NoSQL with MongoDB and R • Bob Wakefield • (I know stuff.) • Sit down. Strap in. Hang on.
NoSQL with MongoDB and R • This is a discussion. Feel free to hop in and correct me if I say something crazy.
WARNING! • “We’ll talk about this a little bit later.”
Motivations for this presentation • Kaggle competition experience • Recent experience on a client site • NoSQL skills starting to be in demand
RecSys Challenge 2013: Yelp business rating prediction • Build a recommender system based on user ratings • AKA ETL Hell
System 1 Schema changes break everything downstream!
Wakefield Career Management System • Step 1: Scan market for high value skills. • Step 2: Acquire skills. • Step 3: Sell skills to the highest bidder. • Step 4: Get Paid.
Sample of Job Board Post • Designs NoSQL dynamic schemas to leverage simplicity and power of NoSQL • Experience with NoSql-based databases, such as MongoDB • Knowledge of NoSQL databases (MongoDB, Hadoop, Couch DB, etc…) • Five years' experience with NoSQL, columnar, and key/value databases
MongoDB ETL Example • Data from Yelp Kaggle Competition • Data in JSON format • 229,907 reviews
{
  'type': 'review',
  'business_id': (encrypted business id),
  'user_id': (encrypted user id),
  'stars': (star rating),
  'text': (review text),
  'date': (date, formatted like '2012-03-14', %Y-%m-%d in strptime notation),
  'votes': {'useful': (count), 'funny': (count), 'cool': (count)}
}
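As a rough illustration of the ETL step, here is a minimal Python sketch that parses one review record and flattens it for tabular analysis. The sample ids and review text are made up; only the field names and the date format come from the slide.

```python
import json
from datetime import datetime

# Hypothetical sample record that follows the field layout on the slide.
# The ids and the review text are invented for illustration.
raw_line = json.dumps({
    "type": "review",
    "business_id": "biz-123",
    "user_id": "user-456",
    "stars": 4,
    "text": "Great service.",
    "date": "2012-03-14",
    "votes": {"useful": 2, "funny": 0, "cool": 1},
})


def parse_review(line):
    """Parse one JSON review and flatten the nested votes into columns."""
    doc = json.loads(line)
    return {
        "business_id": doc["business_id"],
        "user_id": doc["user_id"],
        "stars": doc["stars"],
        # The slide gives the date format as %Y-%m-%d in strptime notation.
        "date": datetime.strptime(doc["date"], "%Y-%m-%d").date(),
        "votes_useful": doc["votes"]["useful"],
        "votes_funny": doc["votes"]["funny"],
        "votes_cool": doc["votes"]["cool"],
    }


review = parse_review(raw_line)
print(review["date"].year, review["votes_useful"])
```

Flattening the votes sub-document is one possible choice for analysis in a data frame; keeping it nested would be just as valid.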
Main Topics for this Evening • NoSQL • MongoDB • Modeling unstructured data • MongoDB and R
Intended Audience • Business/Data Analyst • Data Architect • General Scenario • You have read-only access to MongoDB and need to retrieve your own data.
People I’m Going to Ignore • Software Developers • DBAs
Source Material - Books • NoSQL Distilled by P. Sadalage and M. Fowler • MongoDB Applied Design Patterns by R. Copeland • The Definitive Guide to MongoDB by Plugge, Membrey and Hawkins • MongoDB online docs
Source Material - YouTube • Introduction to NoSQL by Martin Fowler – GOTO conferences • Workshop: NoSQL Data Modelling (Jan Steemann) Teil2 – ArangoDB • NoSQL Data Modelling for Scalable eCommerce – Dataversity Net • Domain Driven Design – Zend • Webinar on the rmongodb R package - comsystotv
Why NoSQL • Handles schema changes well (easy development) • Solves the impedance mismatch problem • Rise of JSON • Python module: simplejson
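A small sketch of the first two points, using Python's standard json module (it offers the same basic dumps/loads interface as simplejson). The field names are borrowed from the Yelp business document used in this deck; the "hours" field is a hypothetical later addition to the schema.

```python
import json

# Fields borrowed from the Yelp business document in this deck.
old_doc = {
    "name": "Peoria Income Tax Service",
    "city": "Peoria",
    "categories": ["Accountants", "Professional Services", "Tax Services"],
}

# Hypothetical schema change: newer documents carry an extra "hours" field.
new_doc = dict(old_doc, hours={"mon": "9-5"})

# The nested structure survives a round trip as-is, with no mapping layer
# between objects and tables (the impedance mismatch).
restored = json.loads(json.dumps(new_doc))


def opening_hours(doc):
    """Tolerate the schema change by treating the new field as optional."""
    return doc.get("hours", {})


print(restored == new_doc)
print(opening_hours(old_doc), opening_hours(restored))
```

Old and new documents can sit side by side; the reading code decides how to handle a missing field.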
A really generic and unofficial definition of NoSQL • An ill-defined set of mostly open-source databases, mostly developed in the 21st century, and mostly not using SQL.
Common Characteristics of NoSQL Databases • Non-relational • Open source • Cluster friendly • Built from the ground up to handle 21st century data challenges • Schema-less*
What is an aggregate? • Not what you think. • Definition from Domain Driven Design • “A group of related entities and value objects.” • aggregate = document
What is a document? • Not what you think. • Word document <> NoSQL document
Example of a document
{
  "business_id": "rncjoVoEFUJGCUoC1JgnUA",
  "full_address": "8466 W Peoria Ave\nSte 6\nPeoria, AZ 85345",
  "open": true,
  "categories": ["Accountants", "Professional Services", "Tax Services"],
  "city": "Peoria",
  "review_count": 3,
  "name": "Peoria Income Tax Service",
  "neighborhoods": [],
  "longitude": -112.241596,
  "state": "AZ",
  "stars": 5.0,
  "latitude": 33.581867000000003,
  "type": "business"
}
Facts about MongoDB that will BLOW YOUR MIND!! • No Schemas • No transactions • No joins • Max document size of 16MB • Larger documents handled with GridFS
Facts about MongoDB that are fairly mundane. • Runs on most common OSs • Windows • Linux • Mac • Solaris • Data stored as BSON (Binary JSON) • used for speed • translation handled by language drivers
Rules for building NoSQL Data Structures • Rule 1: Every document must have an _id. • Rule 2: There is only one rule.
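To make Rule 1 concrete, here is a toy in-memory sketch, not MongoDB itself. The insert function and the DuplicateId exception are hypothetical names for illustration. MongoDB generates an ObjectId when a document is inserted without an _id; a uuid4 hex string stands in for that here.

```python
import uuid


class DuplicateId(Exception):
    """Raised when two documents in one collection share an _id."""


def insert(collection, doc):
    """Insert doc into an in-memory dict keyed by _id (a toy collection).

    If the document has no _id, one is generated. The uuid4 hex string is
    a stand-in; MongoDB itself would generate an ObjectId.
    """
    stored = dict(doc)
    stored.setdefault("_id", uuid.uuid4().hex)
    if stored["_id"] in collection:
        raise DuplicateId(stored["_id"])
    collection[stored["_id"]] = stored
    return stored["_id"]


posts = {}
first_id = insert(posts, {"_id": "First Post", "author": "Rick"})
auto_id = insert(posts, {"author": "Bob"})  # no _id supplied
print(first_id, len(auto_id), len(posts))
```

Inserting a second document with the _id "First Post" would raise DuplicateId, mirroring the uniqueness of _id within a collection.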
Designing NoSQL Data Structures • NoSQL data structures are driven by application design. • Need to take into account the necessary CRUD operations • To embed or not to embed. That is the question! • Rule of thumb is to embed whenever possible. • No modeling standards or CASE tools!
A (denormalized) embedded structure • An array of values
{
  "business_id": "rncjoVoEFUJGCUoC1JgnUA",
  "full_address": "8466 W Peoria Ave\nSte 6\nPeoria, AZ 85345",
  "open": true,
  "categories": ["Accountants", "Professional Services", "Tax Services"],
  "city": "Peoria",
  "review_count": 3,
  "name": "Peoria Income Tax Service",
  "neighborhoods": [],
  "longitude": -112.241596,
  "state": "AZ",
  "stars": 5.0,
  "latitude": 33.581867000000003,
  "type": "business"
}
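With an array of values, a typical question is "which documents contain this value?". A minimal Python sketch of that membership test over already-retrieved documents follows; the second business is invented for contrast, and with_category is a hypothetical helper name.

```python
businesses = [
    {
        "name": "Peoria Income Tax Service",
        "categories": ["Accountants", "Professional Services", "Tax Services"],
    },
    # Invented second document, for contrast only.
    {"name": "Hypothetical Pizza Place", "categories": ["Restaurants", "Pizza"]},
]


def with_category(docs, category):
    """Return the names of documents whose categories array contains category."""
    return [d["name"] for d in docs if category in d.get("categories", [])]


tax_services = with_category(businesses, "Tax Services")
print(tax_services)
```

Using .get with an empty list as the default keeps the filter working for documents that have no categories field at all.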
A (denormalized) embedded structure • An array of sub documents
{
  "_id": "First Post",
  "comments": [
    {"author": "Bob", "text": "Nice Post!"},
    {"author": "Tom", "text": "Dislike!"}
  ],
  "comment_count": 2
}
This makes for a hairy query!
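One way to see why this gets hairy: any question about a single comment means reaching inside the array, and the denormalized comment_count has to be kept in sync by hand. A sketch in plain Python over the document above (comments_by and add_comment are hypothetical helper names):

```python
post = {
    "_id": "First Post",
    "comments": [
        {"author": "Bob", "text": "Nice Post!"},
        {"author": "Tom", "text": "Dislike!"},
    ],
    "comment_count": 2,
}


def comments_by(doc, author):
    """Dig into the embedded array to find one author's comment texts."""
    return [c["text"] for c in doc["comments"] if c["author"] == author]


def add_comment(doc, author, text):
    """Append a comment and keep the denormalized counter in sync."""
    doc["comments"].append({"author": author, "text": text})
    doc["comment_count"] = len(doc["comments"])


bobs = comments_by(post, "Bob")
add_comment(post, "Ann", "Thanks!")  # "Ann" is an invented commenter
print(bobs, post["comment_count"])
```

The upside of embedding is that a single read returns the post together with all of its comments.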
A normalized structure
//db.post schema
{
  "_id": "First Post",
  "author": "Rick",
  "text": "This is my first post."
}
//db.comments schema
{
  "_id": ObjectId(...),
  "post_id": "First Post",
  "author": "Bob",
  "text": "Nice Post!"
}
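With the normalized layout and no joins in the database, the application stitches posts and comments together itself. A minimal sketch of that application-side join, assuming both collections have already been read into lists; the integer _id values on the comments are stand-ins for ObjectIds, and join_comments is a hypothetical helper name.

```python
from collections import defaultdict

posts = [
    {"_id": "First Post", "author": "Rick", "text": "This is my first post."},
]
# Integer _id values are stand-ins for the ObjectIds MongoDB would assign.
comments = [
    {"_id": 1, "post_id": "First Post", "author": "Bob", "text": "Nice Post!"},
    {"_id": 2, "post_id": "First Post", "author": "Tom", "text": "Dislike!"},
]


def join_comments(posts, comments):
    """Attach each post's comments by matching comments.post_id to posts._id."""
    by_post = defaultdict(list)
    for comment in comments:
        by_post[comment["post_id"]].append(comment)
    return [dict(post, comments=by_post.get(post["_id"], [])) for post in posts]


joined = join_comments(posts, comments)
print(len(joined[0]["comments"]))
```

Grouping comments by post_id first keeps the join to one pass over each list instead of a nested loop.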
A polymorphic structure • When all the documents in a collection are similarly, but not identically, structured. • Enables simpler schema migration. • custom_field_1 • no more of this crap • Better mapping of object-oriented inheritance and polymorphism.
A polymorphic structure
//Page document (stored in nodes collection)
{
  _id: 1,
  title: "Welcome",
  url: "/",
  type: "page",
  text: "Welcome to my wonderful wiki."
}
//Photo document (also stored in nodes collection)
{
  _id: 3,
  title: "Cool Photo",
  url: "/photo.jpg",
  type: "photo",
  content: Binary(...)
}
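On the consuming side, a polymorphic collection usually means branching on the type field. A small sketch with the two node documents above; the renderer functions are hypothetical, and the four bytes of photo content are placeholder data standing in for the Binary(...) value.

```python
nodes = [
    {"_id": 1, "title": "Welcome", "url": "/", "type": "page",
     "text": "Welcome to my wonderful wiki."},
    # Placeholder bytes stand in for the Binary(...) content on the slide.
    {"_id": 3, "title": "Cool Photo", "url": "/photo.jpg", "type": "photo",
     "content": b"\x89PNG"},
]


def render_page(node):
    """Pages carry a text field."""
    return f"{node['title']}: {node['text']}"


def render_photo(node):
    """Photos carry binary content instead of text."""
    return f"{node['title']}: {len(node['content'])} bytes of image data"


# One handler per document type; shared fields (title, url) are common to all.
RENDERERS = {"page": render_page, "photo": render_photo}


def render(node):
    """Dispatch on the type field stored in each document."""
    return RENDERERS[node["type"]](node)


rendered = [render(n) for n in nodes]
print(rendered)
```

Adding a new node type then means adding one handler, with no change to the documents already stored.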
rmongodb • Two packages available • RMongo = Dodge Omni • rmongodb = Porsche • rmongodb usage example
Final Thoughts • Data Architects should NOT be designing NoSQL data structures • Are NoSQL DBs going to totally replace RDBMS? • Polyglot Persistence
Questions? • You should consider this presentation a book report. • I’ve only been studying this stuff for a month. • I MIGHT have an answer to your question. • I might not...