Exploring NoSQL and Hadoop Systems: A Comprehensive Overview

NoSQL systems
• Hadoop
• HBase
• Hypertable
• MongoDB
• Redis
• Cassandra
• Elasticsearch
• CouchDB
• Accumulo
• OrientDB
• Neo4j
• Etc.
http://kkovacs.eu/cassandra-vs-mongodb-vs-couchdb-vs-redis
(c) Ian Davis

Apache Hadoop (Distributed File System)
• Open source, written in Java
• Replication / load balancing / multiple servers
• Replicates code execution at each server
• Map (a bit like SQL grouping)
  • Select, transform, distribute to servers for reducing
• Reduce (a bit like SQL aggregation)
  • Each server aggregates/reduces its own data
  • Results from one or more servers are merged

Map / Reduce
• Code written in Java
  • Shipped and run on every data provider
• Map
  • Compute a key from each data record
    • Might be the identity function
  • Sort data by generated key on each machine
• Reduce
  • Compute a summary value for each key
  • Then reduce to a summary value across machines
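The map, sort, and reduce steps can be sketched in a few lines. This is a minimal single-process illustration in Python, not Hadoop's Java API: the function names (`map_phase`, `local_reduce`, `merge`) and the two-machine word list are assumptions made up for the example.

```python
from itertools import groupby
from operator import itemgetter

def map_phase(records, key_fn):
    """Compute a key for each record and sort by it, as one machine would."""
    return sorted(((key_fn(r), r) for r in records), key=itemgetter(0))

def local_reduce(keyed_records, reduce_fn):
    """Compute one summary value per key from a machine's sorted records."""
    return {
        key: reduce_fn(value for _, value in group)
        for key, group in groupby(keyed_records, key=itemgetter(0))
    }

def merge(partials, combine_fn):
    """Reduce the per-machine summaries to one summary per key."""
    merged = {}
    for partial in partials:
        for key, value in partial.items():
            merged[key] = combine_fn(merged[key], value) if key in merged else value
    return merged

# Two hypothetical machines, each holding part of a word list.
machines = [["apple", "bee", "ant"], ["bat", "axe"]]

# Count words by first letter: map and reduce locally, then merge across machines.
partials = [
    local_reduce(
        map_phase(words, key_fn=lambda w: w[0]),
        reduce_fn=lambda ws: sum(1 for _ in ws),
    )
    for words in machines
]
totals = merge(partials, combine_fn=lambda a, b: a + b)
print(totals)
```

Each machine only ever touches its own records; only the small per-key summaries cross machine boundaries in `merge`.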

Advantages/Disadvantages
• Advantages
  • Divide and conquer (speeds up massive data searches)
  • Highly parallel and scalable (execution on local data)
  • Robust / re-entrant (node execution can be restarted)
  • Replication (can use whichever replica responds first)
• Disadvantages
  • Must write Java Map and Reduce logic from scratch
  • Still have to wait for all parallel activity to complete
  • Not a database solution
    • But HBase and Hypertable, built on top of Hadoop, are

MongoDB (NoSQL)
• Distributed database system
• Document-oriented database
• Each record is encoded as Binary JSON (BSON)
  • JSON is a way of describing nested structure/content
• No enforced schema
• Usable from Java, JavaScript, PHP, etc.
• Indexed searching supported
• Replication / load balancing / sharding

Advantages / Disadvantages
• Advantages
  • Uses JSON while still supporting transactions
  • Uses Map/Reduce, like Hadoop
• Disadvantages
  • Custom query language
  • Complexity of stored data structures
  • Not as flexible as an RDBMS
  • No clear schema
    • Risk of storing erroneous data

http://kkovacs.eu/cassandra-vs-mongodb-vs-couchdb-vs-redis

Redis
• Main memory storage
• Pros
  • Supports both fast in-memory reads and updates
  • Can store objects referenced by name
  • Can be distributed across many machines
• Cons
  • Database must fit in memory
  • Recovery relies on periodic backup of changes to a disk log


Cassandra
• Pros
  • Data duplication across nodes
    • Improves reliability and recovery
  • Row content can be distributed across nodes
    • Parallel searching of row content
  • Fast Map/Reduce capabilities
• Cons
  • Record read/write cost grows linearly with distribution
  • Doesn’t support relational joins etc.


Elasticsearch
• Pros
  • Can search HTML text
  • Supports indexing of that text
  • Supports distribution of the indices
• Cons
  • Lacks distributed transactions
  • If documents change, must indices be rebuilt?


CouchDB
• Data is stored as documents
• Pros
  • Documents can change over time
  • Documents may be offline
  • Changes are timestamped
  • Atomicity, consistency, isolation, durability (ACID)
• Cons
  • Applications are responsible for enforcing ACID


Accumulo
• Built on top of Hadoop
• Pros
  • Bigtable-style store
  • Different security levels for column info
  • Map/Reduce functionality
• Cons
  • Uncertain future


Other data mining strategies
• Genetic Programming
  • Try to find random formulas that signal something interesting
• Gradient Descent
  • Construct a cost formula for how close the answer is to the desired answer (0 = perfect)
  • Find the optimal parameterization
• Regression - Fourier Transform
  • Prediction of future patterns
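The gradient descent bullet can be made concrete with a one-parameter example. This is a minimal sketch: the model `y = w * x`, the mean-squared-error cost, the learning rate, and the three data points are all assumptions chosen for illustration.

```python
def cost(w, points):
    """Mean squared error of the model y = w * x; 0 means a perfect fit."""
    return sum((w * x - y) ** 2 for x, y in points) / len(points)

def gradient(w, points):
    """Derivative of the cost with respect to the single parameter w."""
    return sum(2 * x * (w * x - y) for x, y in points) / len(points)

def gradient_descent(points, w=0.0, learning_rate=0.05, steps=200):
    """Repeatedly step the parameter against the gradient of the cost."""
    for _ in range(steps):
        w -= learning_rate * gradient(w, points)
    return w

# Hypothetical data generated by y = 2 * x, so the optimal w is 2.
points = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]
w = gradient_descent(points)
print(w, cost(w, points))
```

The learning rate matters: too large a step makes the updates overshoot and diverge, too small a step makes convergence slow.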

Methodology
• Filter
  • Clean up / sanitize the data (optional)
• Train
  • Optimise logic on a given set of data with given ground truth (achieve answer ~= ground truth)
• Validate
  • Show that the logic is not over-fitting or under-fitting by testing on a second data set (want to see similar accuracy)
• Test
  • See if the results are good on a third data set
• Use
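The filter, train, validate, and test steps can be sketched on a toy data set. Everything here is an assumption for illustration: the data is synthetic (roughly `y = 3 * x` plus small deterministic noise and one bad row), the model is a single least-squares slope, and the three sets are formed by interleaving rows.

```python
def clean(rows):
    """Filter: drop rows with missing values."""
    return [(x, y) for x, y in rows if x is not None and y is not None]

def train(rows):
    """Train: least-squares slope for the model y = w * x."""
    return sum(x * y for x, y in rows) / sum(x * x for x, _ in rows)

def mean_abs_error(w, rows):
    """Average distance between the prediction and the ground truth."""
    return sum(abs(w * x - y) for x, y in rows) / len(rows)

# Hypothetical data: y is roughly 3 * x, plus one unusable row.
raw = [(float(i), 3.0 * i + ((i * 7) % 5 - 2) * 0.01) for i in range(1, 31)]
raw.append((None, 1.0))

rows = clean(raw)
# Interleave so that each set covers the whole range of x.
train_set, validation_set, test_set = rows[0::3], rows[1::3], rows[2::3]

w = train(train_set)
train_error = mean_abs_error(w, train_set)
validation_error = mean_abs_error(w, validation_set)  # should be similar to train_error
test_error = mean_abs_error(w, test_set)              # final check on unseen data
print(w, train_error, validation_error, test_error)
```

A validation error much larger than the training error would suggest over-fitting; large errors on both would suggest under-fitting.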

https://en.wikipedia.org/wiki/CAP_theorem

The Basic Problem
• A distributed system can become partitioned
  • Partitioned subsystems can’t communicate
  • More generally this issue is about latency
• Consistency → A read matches the most recent write
• Availability → Can read from any available node
  • But can’t know if the other partition has the latest write
• Don’t worry about talking to the other partition → Availability
• Either return an error or wait for the other partition to respond → Consistency

Basic Proof
• Consider two partitioned nodes
• Write a new record to one node
• Then read this same record from the other node
• If partitioned, the other node can’t know the latest version of the record to be read
• So either return a version earlier than the current one
• Or wait at this node to obtain the latest version
• Or insist both nodes must remain up all the time
  • The system is then not partition tolerant
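The two-node argument can be replayed in code. This is a toy sketch, not a real replication protocol: the `Node` class, its `partitioned` flag, and the two read methods are hypothetical names invented to show the choice between answering with possibly stale data and refusing to answer.

```python
class Node:
    """One replica in a hypothetical two-node store."""

    def __init__(self):
        self.data = {}
        self.peer = None
        self.partitioned = False

    def write(self, key, value):
        self.data[key] = value
        if not self.partitioned:
            self.peer.data[key] = value  # replicate while the link is up

    def read_available(self, key):
        """Favour availability: answer from local state, which may be stale."""
        return self.data.get(key)

    def read_consistent(self, key):
        """Favour consistency: refuse to answer while the peer is unreachable."""
        if self.partitioned:
            raise TimeoutError("cannot confirm the latest version during a partition")
        return self.data.get(key)

a, b = Node(), Node()
a.peer, b.peer = b, a

a.write("x", 1)                        # replicated to both nodes
a.partitioned = b.partitioned = True   # the network splits
a.write("x", 2)                        # only node a sees this write

stale = b.read_available("x")          # available, but returns the old value
try:
    b.read_consistent("x")
    refused = False
except TimeoutError:
    refused = True                     # consistent, but not available
print(stale, refused)
```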

Horses for Courses


Process Control Architectures

Process-Control Style
• Suitable for applications whose purpose is to maintain specified properties of the outputs of the process at (sufficiently near) given reference values.
• Components:
  • Process Definition, which includes mechanisms for manipulating some process variables.
  • Control Algorithm for deciding how to manipulate process variables.

Process-Control Style (Cont’d)
• Connectors are the data flow relations for:
  • Process Variables:
    • Input variable that measures an input to the process.
    • Manipulated variable whose value can be changed by the controller.
    • Controlled variable whose value the system is intended to control.
  • Set Point, the desired value for a controlled variable.
  • Sensors to obtain values of process variables pertinent to control.

Open-Loop Control System (Non-feedback System)
• Information about process variables is not used to adjust the system.

Lack of feedback problems
• No regulation of software behaviour
• Fixed behaviour in response to a change request
• Disturbances in behaviour are not considered
• No check that the actual value is close to the desired value

Feed-Back Control System
• The controlled variable is measured and the result is used to manipulate one or more of the process variables.
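The difference between open-loop and feedback control shows up as soon as the process is disturbed. The sketch below is a toy speed controller under stated assumptions: the process dynamics, the drag values, and the proportional-integral gains `kp` and `ki` are invented for illustration and do not model any real vehicle.

```python
def simulate(set_point, actual_drag, feedback, expected_drag=0.1,
             kp=0.5, ki=0.2, steps=1000, dt=0.1):
    """Return the final speed after running a toy process for steps * dt seconds."""
    speed, integral = 0.0, 0.0
    for _ in range(steps):
        if feedback:
            error = set_point - speed               # measure the controlled variable
            integral += error * dt
            throttle = kp * error + ki * integral   # adjust the manipulated variable
        else:
            throttle = set_point * expected_drag    # fixed setting, never corrected
        speed += (throttle - actual_drag * speed) * dt  # hypothetical process dynamics
    return speed

# Disturbance: the real drag is twice what the open-loop setting assumed.
open_loop_speed = simulate(100.0, actual_drag=0.2, feedback=False)
closed_loop_speed = simulate(100.0, actual_drag=0.2, feedback=True)
print(open_loop_speed, closed_loop_speed)
```

With the doubled drag, the open-loop run settles near half the set point, while the feedback run settles at the set point because the measured error keeps correcting the throttle.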

Benefits
• Self checking
• Self improving
• Example:
  • Prediction
  • Monitor accuracy of prediction
  • Employ a mixture of experts

Process Control Examples
• Real-time system software to control:
  • Automobile anti-lock brakes
  • Nuclear power plants
  • Automobile cruise control

Process Control Examples
• Hardware circuits that implement clocks, count, add, etc.
• Logic circuits
  • Can employ feedback
• Quantum circuits
  • Open loop (no feedback)
  • Completely deterministic internally
  • But subject to significant problems with error

Rule Based Architectures

Iterative enhancement style
• Want the appearance of intelligent behaviour.
• Impossible to quantify what intelligence is.
• Start by writing a very dumb program.
• Keep adding logic which makes it less dumb.
• Terminate when the behaviour of the resulting logic can no longer be improved.

Iterative enhancement pros/cons
• Allows concurrent design and development.
• Can lead to surprising intelligence.
• Displays the same characteristics as human intelligence: rather unpredictable and not always right.
• Very hard to predict a priori how successful the exercise will be.

Iterative enhancement example
• Bridge program
  • Deal hand
  • Enforce basic rules of play
  • Add sensible rules for how to play well
  • Consider making finesses etc.
  • Logic identifies the least bad card to play, based on a huge number of empirical rules drawn from observation of the code’s prior behaviour.
  • Release code when changes do not improve play

Cribbage
• Table driven discard logic
  • For every pair of cards I can discard
  • And for every cut
    • What is the average value of my resulting hand?
    • What is the average value of the box?
    • Add if it is my box, else subtract since it is the opponent’s
  • Maximize score

Pegging
• For each card I might still play
  • For every possible card played in response
    • And for each card I might then play
• Play the card that maximizes my score minus their score.

Model Driven Architectures

Model driven architecture
• Architecture is formalised in a model
• Model is computer readable
• A computer tool produces the implementation
• Pros
  • Consistent mapping from design to implementation
  • Efficient approach to software development
• Cons
  • Limited by the power of the model and translation tool
  • Hard to test and maintain, and very hard to debug

Model driven examples
• RPCgen / DCOM
  • Marshalling/unmarshalling interfaces
• Object Management Group (OMG)
  • Given UML, produce code
• Liqui|>
  • Given a quantum machine expressed in F#
  • Generate code to simulate this machine
  • If you imagine compiling rather than interpreting

Other Architectures

C2 (Proposed by Taylor et al.)
• Architecture built using components & connectors
  • Both have a top and a bottom
  • The top and the bottom of a component each attach to at most one connector
  • Opposite ends connect: the top of a component attaches to the bottom of a connector, and vice versa
  • The other end of a connector can connect to anything
    • Behaves as if connected to a bus or broadcasting
• All communication is via messages
  • Notifications reach components via implicit invocation
• A component has no knowledge of the architecture below it
https://users.soe.ucsc.edu/~ejw/papers/c2-tse.pdf


• Requests travel up
  • They may receive no reply (no reply may be possible)
• Notifications travel down
  • These are notifications of state changes / results
  • Received by implicit invocation
• Connectors are responsible for:
  • Routing, broadcasting, message filtering
  • Components register to receive notifications
  • Messages can be prioritized
  • Requests can be ignored (functionality not available)
  • Domain translation when necessary
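The message flow can be sketched with one connector and three components. This is an illustrative toy, not the C2 framework from the paper: the `Connector` and `Component` classes and all method names are assumptions, and filtering, prioritization, and domain translation are left out.

```python
class Connector:
    """Routes requests upward and broadcasts notifications downward."""

    def __init__(self):
        self.above = []  # components attached to the connector's top
        self.below = []  # components attached to the connector's bottom

    def request(self, message):
        for component in self.above:
            component.handle_request(message)

    def notify(self, message):
        for component in self.below:
            component.handle_notification(message)


class Component:
    """Knows only its own top and bottom connectors, not the wider architecture."""

    def __init__(self, name, top=None, bottom=None):
        self.name, self.top, self.bottom = name, top, bottom
        self.received = []
        if top is not None:
            top.below.append(self)      # our top attaches to the connector's bottom
        if bottom is not None:
            bottom.above.append(self)   # our bottom attaches to the connector's top

    def send_request(self, message):
        if self.top is not None:
            self.top.request(message)

    def send_notification(self, message):
        if self.bottom is not None:
            self.bottom.notify(message)

    def handle_request(self, message):
        self.received.append(("request", message))
        self.send_notification(f"done: {message}")  # announce the state change

    def handle_notification(self, message):
        self.received.append(("notification", message))


bus = Connector()
store = Component("store", bottom=bus)   # sits above the connector
view_a = Component("view_a", top=bus)    # sits below the connector
view_b = Component("view_b", top=bus)

view_a.send_request("save")
print(store.received, view_a.received, view_b.received)
```

Note that `view_b` receives the notification although it never sent a request: the connector broadcasts, and the sender of a notification does not know who is listening.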

Objectives of C2 Architecture
• Substrate independence
  • Above us this is achieved by domain translation
  • Below us this is achieved by information hiding
  • This fosters substitutability/reusability
• Message based architecture
  • This permits parallelism, distribution, scaling
  • Different components share no state
• Components are compartmentalized
  • They have no need to know the architecture

Aspect oriented architecture
• Cross cutting concerns
  • Runtime assertions
  • Tracing (event logging)
  • Profiling (performance issues)
  • Error handling
  • Testing
  • Debugging
  • Reliability (buffer overflow etc.)
  • Encryption mechanisms

Implementation
• Extend the language to specify aspects
  • So: no need to embed code everywhere
• Can be implemented statically during compilation
• And/or dynamically at runtime via request
• Consider debugging C++
  • The -g option statically introduces symbol tables
  • The break command in GDB dynamically alters code so that it is interrupted
  • The watch command dynamically watches variables
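A tracing aspect can be approximated without a language extension. The sketch below uses a Python decorator as a stand-in for an aspect: `traced`, `trace_log`, and `add` are hypothetical names, and a decorator is only one lightweight way to attach a cross-cutting concern, not a full aspect-oriented system.

```python
import functools

trace_log = []

def traced(func):
    """Tracing aspect: record each call and result without editing the function body."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        trace_log.append(f"enter {func.__name__}{args}")
        result = func(*args, **kwargs)
        trace_log.append(f"exit {func.__name__} -> {result}")
        return result
    return wrapper

@traced
def add(a, b):
    # The business logic contains no tracing code.
    return a + b

total = add(2, 3)
print(total, trace_log)
```

The tracing concern lives in one place and is attached where the function is defined, so removing the `@traced` line removes the tracing without touching `add` itself.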
