840 likes | 1.17k Views
NoSQL and NOSQL. Beginning with o-r databases. First, CRUD…. Global notions of managing persistent data, regardless of the model or system Create, Read, Update, Delete But there is also DDL And there are implementation issues, like sorts and indices. Why not standard tables?.
E N D
NoSQL and NOSQL Beginning with o-r databases
First, CRUD… • Global notions of managing persistent data, regardless of the model or system • Create, Read, Update, Delete • But there is also DDL • And there are implementation issues, like sorts and indices
Why not standard tables? • Extreme data structuring conflict between host language and database language: • Impedance mismatch • Atomic values are the only common data type • To retrieve all of an object requires lots of joins • Difficult to look for objects that are similar but have some different attributes • Difficult to retrieve an attribute that is a collection • You have to program with two programming languages • We have value-based semantics – difficult to know if two people have the same mother, or just a mother with the same name/ID and this causes us to make inference
What is o-o? • Relation is a set of tuples • Objects are arranged in sets of objects • In a relation, a tuple’s components are primitive (int, string) • The components of an object can be complex types (sets, tuples, other objects) • SQL: programs are global • Object: programs are local
Key concept: Object Id’s • Every object has a unique Id: different objects have different Ids • Immutable: does not change as the object changes • Different from primary key! • Like a key, identifies an object uniquely • But key values can change – oidscannot • And there are inferences based on values
Objects and Values • An object is a pair: (oid, value) • Example: A Joe Public’s object (#32, [ SSN: 111-22-3333, Name: “Joe Public”, PhoneN: {“516-123-4567”, “516-345-6789”}, Child: {#445, #73} ] )
Classes • Class: set of semantically similar objects (eg, people, students, cars, motorcycles) • A class has: • Type: describes common structure of all objects in the class (semantically similar objects are also structurally similar) • Method signatures: declarations of the operations that can be applied to all objects in the class. • Extent: the set of all objects in the class • Classes are organized in a class hierarchy • The extent of a class contains the extent of any of its subclasses
The ODMG Standard • ODMG 3.0 was released in 2000 • Includes the data model (more or less) • ODL: The object definition language • OQL: The object query language • A transaction specification mechanism • Language bindings: How to access an ODMG database from C++, Smalltalk, and Java (expect C# to be added to the mix)
Main Idea: Host Language = Data Language • Objects in the host language are mapped directly to database objects • Some objects in the host program are persistent.
Objects in SQL • Object-relational extension of SQL-92 • Includes the legacy relational model • SQLdatabase= a finite set of relations • relation = a set of tuples (extends legacy relations) OR a set of objects (completelynew) • object = (oid, tuple-value) • tuple = tuple-value • tuple-value = [Attr1: v1, …, Attrn: vn] • multiset-value = {v1, …, vn}
Path expressions SELECT T.Student.Name, T.Grade FROM TRANSCRIPT T WHERE T.Student.Address.Street = ‘Main St.’
PostgreSQL vs. MySQL • PostgreSQL is a generation newer • It has nice UDT capabilities • There are libraries of UDTs that can be imported and used • Both PostgreSQL and MySQL • Full text search • XML data types • To some degree free • MySQL • Never underestimate the value of a heavily understood piece of software • Lots of stacks and development environments come configured to work with it (but to a lesser extent, this is true of PostgreSQL, too). • It is a “core” SQL database, in that we can move pretty much to any other server-based DBMS is we start with MySQL
Triggers in PostgreSQL • Triggers automatically fire stored procedures when some event happens, like an insert or update. They allow the database to enforce some required behavior in response to changing data. • PL/pgSQL – Procedural Language of PostgreSQL
Example CREATE TABLE logs ( event_idinteger, old_titlevarchar(255), old_startstimestamp, old_ends timestamp, logged_at timestamp DEFAULT current_timestamp ); A logs table
Continued CREATE OR REPLACE FUNCTION log_event() RETURNS trigger AS $$ DECLARE BEGIN INSERT INTO logs (event_id, old_title, old_starts, old_ends) VALUES (OLD.event_id, OLD.title, OLD.starts, OLD.ends); RAISE NOTICE 'Someone just changed event #%', OLD.event_id; RETURN NEW; END; A function to insert old data in to the log
Continued… a trigger CREATE TRIGGER log_events AFTER UPDATE ON events FOR EACH ROW EXECUTE PROCEDURE log_event(); Logs changes after any row is updated
Rules • A RULE is a description of how to alter the parsed query tree. • Every time Postgresruns an SQL statement, it parses the statement into a query tree (generally called an abstract syntax tree).
Back to PostgreSQL Page 32 of Seven Databases, onward Fuzzy searches, full text searching
Postgres and spatial data • For manipulating 2D/3D spatial data • Points, lines, and polygons formed from points and lines • Can perform union, intersection, operations • Can project shapes into 2D areas • Has a 3D geometry type (relatively new) • Can calculate accurate distances in meters • Works with an open source server that allows folks to share geospatial data • Command line interface • Also supports some forms of raster data • Provides spatial indices • Has a notion of a geometric column
Queries SELECT superhero.name FROM city, superhero WHEREST_Contains(city.geom, superhero.geom) and city.name = 'Gotham'; SELECTAsBinary(the_geom) as wkb_geometryFROMriver AS r, state AS s WHERE intersects(r.the_geom, s.the_geom)
Mapnik • Used for OSM (open street map) data and uses postgis • Mapnik is an open source system for rendering maps • Used to design maps • Written in C++ • It renders maps from postgis databases
Next: full text and approximate text search • But first, not to be confused with the Like operator • Used % as the wild card • Or with regular expressions for character string comparison
Full text search • First, you index the words in a document and create an array of lexemes • Second, specify a boolean phrase using and, or, not, and parens • We typically don’t index “stop” words like and, or, the, etc. • Dictionaries are used to find roots of related words, like dead and dying • Thesauruses dictionaries are used to for recognition of domain-specific and similar words
Documents • A document is a text attribute in a row of a table • Often we use part of a document or concatenate various parts of documents
Details: dictionaries • Define stop words that should not be indexed • Map synonyms to a single word. • Map phrases to a single word using a thesaurus. • Map different variations of a word to a canonical form
Searching • Uses a match operator - @@ • Basic search consists of asking about the relationship to a vector of words to a given document, which is also a vector • The vector can have and, or, etc. in it • tsvector – document – normalized lexemes • tsquery – query
Examples SELECT title FROM pgweb WHERE to_tsvector(title || ' ' || body) @@ to_tsquery('create & table') ORDER BY last_mod_date DESC LIMIT 10; select the ten most recent documents that contain create and table in the title or body Results can be ranked
Recent addition: fuzziness • soundex(text) returns text • Converts a string to its Soundexcode • Based on pronunciation • difference(text, text) returns int • converts two strings to their Soundex codes and then reports the number of matching code positions • 0 is a no match • 4 is a full match • Def: A phonetic coding system intended to suppress spelling variation and determining the relationship between two (similar) words
Levenshtein • Levenshtein distance is a metric for evaluating the difference between two sequences, in particular, words • E.g.: test=# SELECT levenshtein('GUMBO', 'GAMBOL'); • E.g.: SELECT * FROM some_table WHERE levenshtein(code, 'AB123-lHdfj') <= 3 ORDER BY levenshtein(code, 'AB123-lHdfj') LIMIT 10 • Used in particular, to detect nicknames
Metaphone • E.g., metaphone(text source, intmax_output_length) returns text • Similar to soundex • Used to classify words according to their english pronunciation • Apparently better for non-english languages, compared to soundex • E.g.: SELECT * FROM users WHERE METAPHONE(users.first_name, 2) = METAPHONE('Willem', 2) should detect similarity to word William
Class project, beginning • Build an application on top of • PostgreSQL or MySQL • And one of the other NoSQL databases in the 3 books • But pick from only key/value, key/document, or column-based databases • The application is written in a language of your choice • Each of the databases must be used to manage the kind of data it is intended for • Traditional relational table data • And nontraditional data
Final grades • Choice 1: each of exam 1, exam 2, project are 1/3 of your final grade • Choice 2: if you build a web app for your project, I will use the best two of three grades
NoSQL DBs • Why?
Relational DBs • SQL - Fixed schema, row oriented & optimized • SQL - Rigid 2Phase transactions, with locking bottleneck • SQL - Set theoretic • SQL - Centralized distribution • SQL - Computational, not navigational/inter-connected & set-oriented • Sql - Poor support for heterogeneity & compression
No SQL - no or not only • Column-oriented - HBase (uses column families and no schema, has versioning and consistence transactions) • Key/value pairs - Google Dynamo • Graph like - Neo4J • Document based - MongoDB (cluster based for huge scale, supports nested docs, and uses JavaScript for queries, and no schema)
But remember - • Categories not distinct - take each one for what it is • Heterogeneous structure & polyglot language environment is common • NoSQL DBs tend to be unsupported with funky GUIs - but there are very active volunteer user bases maintaining and evolving them • NoSQL DBs also tend to use programming languages for queries
When do you want non-2P transactions and no SQL? • Interactive, zillion user apps where user fixes errors via some form of compensation • Minimal interconnectedness • Individual data values are not mission-critical • Read-heavy environments • Cloud -based environments • Queries are not set-oriented & are computational and imperative, and perhaps long • Real time apps
SQL is here to stay... • Formal & unambiguous semantics • Declarative language with clean separation of application and queries • Consistent • Flexible • Black boxed, tested, and supported - and very well understood with many thousands of trained programmers - SQL is a basic language, like Java, Javascript, PHP, C#. etc. • Great GUIs that are very rich and debugged
And importantly... • Lots of apps need clean, well understood stacks, not speed or the cloud • In particular, websites that do retail business need consistent transactions and do not need the speed that comes with delayed updates • Relational DBs scale reasonably well, too, at least in non-cloud environments
Again… • The classification of the various nosql databases is imprecise, semi-controversial, and we have to be careful about reading too much into it. • Rather than focusing on categorizing dbs, we should be concerned with what they do, how they relate to each other with respect to functionality, and how they compare to sql databases.
Key-value and key-document DBs • Databases that access aggregate data • Key-value dbs know nothing about the structure of the aggregate • Key-document databases do know, but the interpretation of these aggregates happens outside the db • Keep in mind that these two categories of databases overlap in practice • Importantly, both of these two database systems categories focus on storing and retrieving individual aggregates, and not on interrelating (horizontally) multiple aggregates • There is something similar to this in SQL DBs – and that is highly un-normalized tables
Important notions… • It can be a difficult problem to represent some domains as key-value or key-document databases, as the boundaries of aggregates might not be easy to determine. • This basic data modeling issue has a lot of influence on the sort of database you should use. • Relational databases don’t manipulate aggregates, but they are aggregate neutral for the most part, leaving the construction of aggregates to run time … but we might have hidden, un-normalized tables that make some commonly used aggregates much faster to materialize
Key-value vs. key-document • In key-value databases, we can only retrieve data via a key • In key-document databases, we may be able to ask questions about the content of documents – but again, we are not cross-associating them • Mongo is perhaps the most talked about key-document system, and so we will start there
Installing Mongo • Mongo • http://docs.mongodb.org/manual/installation • A GUI • http://www.mongodb.org/display/DOCS/Admin+UIs
GUIs for Mongo • There are a few GUIs that seem pretty good • Mongo-vision: http://code.google.com/p/mongo-vision/ (web page) • Needs Prudence as a web server • MongoVue: http://mongovue.com, but Windows only • RockMongo (web based): http://rockmongo.com/ (web page) • Needs an apache web server • Very easy to install, just download • http://docs.mongodb.org/manual/installation
Getting an Apache web server • XAMPP for windows (mac version is way out of date) • MAMP for Macs (on the app store) • WAMP for windows (bitnami.org) • All of these give you PHP and MySQL as well. If we have time, we will look at MySQL full text search. • You might want to install PostgreSQL, too. There is a bitnami stack. If there is time, we will look at PostgreSQL UDTs and full text search.
Mongo overview • Document based • Focuses on clusters for extremely large scaling • Supports nested documents • Uses JavaScript for queries • No schema
Terminology • A database consists of collections • Collections are made up of documents • A document is made up of fields • There are also indices • There are also cursors
When to use Mongo • Medical records and other large document systems • Read heavy environments like analytics and mining • Partnered with relational databases • Relational for live data • Mongo for huge largely read only archives • Online applications • Massively wide e-commerce