NoSQL systems

brantj

Presentation Transcript


  1. NoSQL systems • Hadoop • HBase • Hypertable • MongoDB • Redis • Cassandra • ElasticSearch • CouchDB • Accumulo • OrientDB • Neo4j • Etc. Etc. Etc. http://kkovacs.eu/cassandra-vs-mongodb-vs-couchdb-vs-redis (c) Ian Davis

  2. Apache Hadoop (Distributed File System) • Open source, written in Java • Replication / Load Balance / multiple servers • Replicates code execution at each server • Map (a bit like SQL grouping) • Select, Transform, Distribute to servers for reducing • Reduce • Each server aggregates/reduces its own data • Results from one or more servers merged • (A bit like SQL aggregation)

  3. Map / Reduce • Code written in Java • Shipped and run on every data provider • Map • Compute from each data record a key • Might be identity function • Sort data by generated key on each machine • Reduce • Compute a summary value for each key • Then reduce to summary value across machines
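
The map, per-machine reduce, and cross-machine reduce steps on this slide can be sketched in a few lines. This is a minimal single-process illustration in Python, not Hadoop's Java API; the function names (`map_phase`, `reduce_phase`, `merge`) and the sample records are assumptions made for the example.

```python
from collections import defaultdict


def map_phase(records, key_fn):
    """Compute a key for each record, then sort by key (done on each machine)."""
    return sorted(((key_fn(r), r) for r in records), key=lambda kv: kv[0])


def reduce_phase(keyed_records, value_fn):
    """Compute one summary value (here a sum) per key on a single machine."""
    summary = defaultdict(int)
    for key, record in keyed_records:
        summary[key] += value_fn(record)
    return dict(summary)


def merge(partials):
    """Reduce the per-machine summaries to one summary across machines."""
    total = defaultdict(int)
    for partial in partials:
        for key, value in partial.items():
            total[key] += value
    return dict(total)


# Two hypothetical "machines", each holding its own local records.
machines = [
    [("apple", 3), ("pear", 1), ("apple", 2)],
    [("pear", 4), ("plum", 5)],
]
partials = [
    reduce_phase(map_phase(local, key_fn=lambda r: r[0]), value_fn=lambda r: r[1])
    for local in machines
]
totals = merge(partials)
```

Each machine only touches its own records until the final merge, which is what makes the approach parallel.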

  4. Advantages/Disadvantages • Advantages • Divide and Conquer (speeds massive data searches) • Highly parallel and scalable (execution on local data) • Robust / Reentrant (node execution can be restarted) • Replication (can use whichever replica responds first) • Disadvantages • Must write Java Map and Reduce logic from scratch • Still have to wait for all parallel activity to complete • Not a database solution • But HBase & Hypertable, built on top of Hadoop, are

  5. MongoDB (NoSQL) • Distributed database system • Object Oriented DB • Each record is encoded as Binary JSON (BSON) • JSON is a way of describing nested structure/content • No Meta Schema • Usable from Java, JavaScript, PHP, etc. • Indexed searching supported • Replication / Load Balancing / Sharding

  6. Advantages / Disadvantages • Advantages • Uses JSON while still supporting transactions • Supports Map/Reduce, like Hadoop • Disadvantages • Custom query language • Complexity of stored data structures • Not as flexible as RDBMS • No clear meta schema • Risk of storing erroneous data

  7. http://kkovacs.eu/cassandra-vs-mongodb-vs-couchdb-vs-redis

  8. Redis • Main memory storage • Pros • Supports both fast memory read and update • Can store objects referenced by name • Can be distributed across many machines • Cons • Database must fit in memory • Recovery relies on periodic backup of changes to a disk log

  10. Cassandra • Pros • Data duplication across nodes • Improves reliability and recovery • Row content can be distributed across nodes • Parallel searching of row content • Fast Map/Reduce capabilities • Cons • Record read/write cost grows linearly with distribution • Doesn’t support relational joins etc.

  12. ElasticSearch • Pros • Can search HTML text • Supports indexing of that text • Supports distribution of the indices • Cons • Lacks distributed transactions • If documents change, must indices be rebuilt?

  14. CouchDB • Databases are documents • Pros • Documents can change over time • Documents may be offline • Changes are timestamped • Atomicity, consistency, isolation, durability (ACID) • Cons • Applications responsible for enforcing ACID

  16. Accumulo • Built on top of Hadoop • Pros • Big table store • Different security levels for column info • Map Reduce functionality • Cons • Uncertain future

  18. Other data mining strategies • Genetic Programming • Try to find random formulas that signal something interesting • Gradient Descent • Construct a cost formula for how close the answer is to the desired answer (0 = perfect) • Find optimal parameterization • Regression – Fourier Transform • Prediction of future patterns
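
As a rough illustration of the gradient descent bullet, the sketch below fits a single parameter `w` so that `w * x` matches known answers, using mean squared error as the cost formula (0 = perfect). The model, the data, the learning rate, and the step count are all assumptions chosen for the example.

```python
def cost(w, xs, ys):
    """Mean squared error: 0 means the answers match the desired answers exactly."""
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def gradient(w, xs, ys):
    """Derivative of the cost with respect to the parameter w."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


def gradient_descent(xs, ys, w=0.0, rate=0.01, steps=1000):
    """Repeatedly step against the gradient to find the best parameterization."""
    for _ in range(steps):
        w -= rate * gradient(w, xs, ys)
    return w


# Hypothetical data generated by y = 2x, so the optimal parameter is 2.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]
w = gradient_descent(xs, ys)
final_cost = cost(w, xs, ys)
```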

  19. Methodology • Filter • Clean up / sanitize the data (optional) • Train • Optimise logic on given set of data with given ground truth (achieve answer ~= ground truth) • Validate • Prove that not over-fitting or under-fitting by testing on a second data set (want to see similar accuracy) • Test • See if the results are good on a third data set • Use
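
The filter / train / validate / test sequence might look like the following sketch. The 60/20/20 split, the synthetic noisy data, and the one-parameter least-squares model are assumptions made purely for illustration; the slide does not prescribe any of them.

```python
import random


def split(rows, seed=0):
    """Shuffle, then cut into train (60%), validate (20%) and test (20%) sets."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    a, b = len(rows) * 6 // 10, len(rows) * 8 // 10
    return rows[:a], rows[a:b], rows[b:]


def fit_slope(rows):
    """Train: least-squares slope of a line through the origin."""
    return sum(x * y for x, y in rows) / sum(x * x for x, _ in rows)


def mean_squared_error(slope, rows):
    """How far the model's answers are from the ground truth on a data set."""
    return sum((slope * x - y) ** 2 for x, y in rows) / len(rows)


# Hypothetical raw data: y is roughly 3x, plus one record with no ground truth.
noise = random.Random(1)
raw = [(x, 3.0 * x + noise.gauss(0.0, 0.5)) for x in range(1, 21)]
raw.append((7, None))

clean = [(x, y) for x, y in raw if y is not None]   # Filter
train, validate, test = split(clean)
slope = fit_slope(train)                            # Train
errors = {                                          # Validate, then Test
    name: mean_squared_error(slope, rows)
    for name, rows in [("train", train), ("validate", validate), ("test", test)]
}
```

If the validation error were much larger than the training error, that would suggest over-fitting; similar errors across the three sets are the desired outcome.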

  20. https://en.wikipedia.org/wiki/CAP_theorem

  21. The Basic Problem • A distributed system can become partitioned • Partitioned subsystems can’t communicate • More generally this issue is about latency • Consistency → Read matches most recent write • Availability → Can read from available node • But can’t know if partition has latest write • Don’t worry about talking to other partition → Availability • Either error or wait for other partition to respond → Consistency

  22. Basic Proof • Consider two partitioned nodes • Write new record to one node • Then read this same record from the other node • If partitioned, the other node can’t know the latest version of the record to be read • So either return a version earlier than current • Or wait at this node to obtain latest version • Or insist both nodes must remain up all the time • System is then not partition tolerant
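
The argument can be played out with a toy model of two replicas. This is only a sketch: the `Node` class, the `prefer` flag, and the use of `TimeoutError` to stand in for "error or wait" are assumptions invented for the illustration.

```python
class Node:
    """One replica holding a single record (a deliberately tiny model)."""

    def __init__(self, name):
        self.name = name
        self.value = None
        self.peer = None
        self.partitioned = False

    def write(self, value):
        self.value = value
        if not self.partitioned:
            self.peer.value = value  # replicate while the link is up

    def read(self, prefer):
        if not self.partitioned or prefer == "availability":
            return self.value  # may be stale during a partition
        # Preferring consistency: refuse rather than risk a stale answer.
        raise TimeoutError(f"{self.name} cannot confirm the latest write")


left, right = Node("left"), Node("right")
left.peer, right.peer = right, left

left.write("v1")                                # replicated to both nodes
left.partitioned = right.partitioned = True     # the network splits
left.write("v2")                                # only the left node sees this

stale = right.read(prefer="availability")       # answers, but with old data
try:
    right.read(prefer="consistency")
    refused = False
except TimeoutError:
    refused = True
```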

  23. Horses for Courses

  25. Process Control Architectures

  26. Process-Control Style • Suitable for applications whose purpose is to maintain specified properties of the outputs of the process at (sufficiently near) given reference values. • Components: • Process Definition includes mechanisms for manipulating some process variables. • Control Algorithm for deciding how to manipulate process variables.

  27. Process-Control Style (Cont’d) • Connectors are the data flow relations for: • Process Variables: • Input variable that measures an input to the process. • Manipulated variable whose value can be changed by the controller. • Controlled variable whose value the system is intended to control. • Set Point is the desired value for a controlled variable. • Sensors to obtain values of process variables pertinent to control.

  28. Open-Loop Control System (Non-feedback System) • Information about process variables is not used to adjust the system.

  29. Lack of feedback problems • No regulation of software behaviour • Fixed behaviour in response to change request • Disturbances in behaviour not considered • No check that actual behaviour is close to desired

  30. Feed-Back Control System • The controlled variable is measured and the result is used to manipulate one or more of the process variables.
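
A small simulation can contrast the open-loop and feed-back styles. In this sketch the controlled variable is a vehicle's speed, the manipulated variable is the throttle, and a constant disturbance (say, a headwind) slows the vehicle each step. The proportional control rule, the gain, and all numbers are assumptions chosen for the example.

```python
def drive(set_point, feedback, steps=50, gain=0.5, disturbance=-0.2):
    """Simulate speed over time; return the final speed."""
    speed = 0.0
    planned = set_point / steps  # open-loop plan, ignores the disturbance
    for _ in range(steps):
        if feedback:
            # Measure the controlled variable and correct towards the set point.
            throttle = gain * (set_point - speed)
        else:
            # Fixed behaviour: no measurement is used to adjust the throttle.
            throttle = planned
        speed += throttle + disturbance
    return speed


open_loop = drive(30.0, feedback=False)
closed_loop = drive(30.0, feedback=True)
```

With these numbers the open-loop run ends well short of the set point because nothing checks the actual speed, while the feed-back run settles close to it (a purely proportional rule leaves a small steady offset).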

  31. Benefits • Self checking • Self improving • Example: • Prediction • Monitor accuracy of prediction • Employ mixture of experts

  32. Process Control Examples • Real-Time System Software to Control: • Automobile Anti-Lock Brakes • Nuclear Power Plants • Automobile Cruise-Control

  33. Process Control Examples • Hardware circuits that implement clocks, count, add etc. • Logic circuits • Can employ feedback • Quantum circuits • (Open Loop – No feedback) • Completely deterministic internally • But subject to significant problems with error

  34. Rule Based Architectures

  35. Iterative enhancement style • Want appearance of intelligent behaviour. • Impossible to quantify what intelligence is. • Start by writing a very dumb program. • Keep adding logic which makes it less dumb. • Terminate when can’t improve behaviour of resulting logic.

  36. Iterative enhancement pros/cons • Allows concurrent design and development. • Can lead to surprising intelligence. • Displays the same characteristics as human intelligence: rather unpredictable and not always right. • Very hard to predict a priori how successful the exercise will be.

  37. Iterative enhancement example • Bridge program • Deal hand • Enforce basic rules of play • Add sensible rules for how to play well • Consider making finesses etc. • Logic identifies the least bad card to play based on a huge number of empirical rules drawn from observation of the code’s prior behaviour. • Release code when changes do not improve play

  38. Cribbage • Table driven discard logic • For every pair of cards I can discard • And for every cut • What is average value of my resulting hand • What is average value of the box • Add if it is my box, else subtract since it is the opponent’s • Maximize score

  39. Pegging • For each card I might still play • For every possible card played in response • And for each card I might then play • Play the card that maximizes my score minus their score.
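
The three nested loops on this slide amount to a short look-ahead search. The sketch below is heavily simplified and makes several assumptions that are not in the slides: cards are reduced to their pip values, the hypothetical `peg_points` function only awards 2 points for making the count 15 or 31 (real pegging also scores pairs and runs), the opponent's replies are averaged, and my best follow-up card is assumed.

```python
def peg_points(count, card):
    """Hypothetical simplified scoring; None means the play is illegal (> 31)."""
    total = count + card
    if total > 31:
        return None
    return 2 if total in (15, 31) else 0


def best_card(count, my_cards, possible_replies):
    """Pick the card maximizing my score minus their score over three plays."""
    best, best_value = None, float("-inf")
    for i, first in enumerate(my_cards):              # each card I might play
        mine = peg_points(count, first)
        if mine is None:
            continue
        remaining = my_cards[:i] + my_cards[i + 1:]
        outcomes = []
        for reply in possible_replies:                # each possible response
            theirs = peg_points(count + first, reply)
            if theirs is None:
                continue
            after = count + first + reply
            follow_ups = [peg_points(after, c) for c in remaining]
            follow = max((p for p in follow_ups if p is not None), default=0)
            outcomes.append(mine - theirs + follow)   # my score minus theirs
        value = sum(outcomes) / len(outcomes) if outcomes else mine
        if value > best_value:
            best, best_value = first, value
    return best


# Count is 5 and I hold a ten and a four; the opponent may hold any pip value.
choice = best_card(5, [10, 4], list(range(1, 11)))
```

Here playing the ten makes the count 15 immediately, whereas playing the four leaves the count at 9 and gives the opponent a chance to make 15.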

  40. Model Driven Architectures

  41. Model driven architecture • Architecture is formalised in a model • Model is computer readable • A computer tool produces the implementation • Pros • Consistent mapping from design to implementation • Efficient approach to software development • Cons • Limited by power of the model and translation tool • Hard to test, maintain, and very hard to debug
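
To make the idea of "a computer tool produces the implementation" concrete, here is a toy generator. The model format (a dictionary with a name and a list of fields) and the `generate` function are assumptions invented for this sketch; real tools work from far richer models such as UML.

```python
def generate(model):
    """Translate a computer-readable model into Python source for a class."""
    args = ", ".join(model["fields"])
    lines = [
        f"class {model['name']}:",
        f"    def __init__(self, {args}):",
    ]
    for field in model["fields"]:
        lines.append(f"        self.{field} = {field}")
    return "\n".join(lines) + "\n"


# A hypothetical model describing a two-field record.
model = {"name": "Point", "fields": ["x", "y"]}
source = generate(model)

# Load the generated implementation and use it.
namespace = {}
exec(source, namespace)
point = namespace["Point"](1, 2)
```

The generated class can only express what the model format allows, and any bug appears in code nobody wrote by hand, which reflects the cons listed on the slide.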

  42. Model driven Examples • RPCgen / DCOM • Marshalling/unmarshalling interfaces • Object Management Group (OMG) • Given UML produce code • Liqui|> • Given quantum machine expressed in F# • Generate code to simulate this machine • If you imagine compiling rather than interpreting

  43. Other Architectures

  44. C2 (Proposed by Taylor et al.) • Architecture built using components & connectors • Both have a top and a bottom • Top and bottom of component has ≤ 1 connector • Opposite ends of component/connector connect • Other end of connector connects to anything • Behaves as if connected to a bus or broadcasting • All communication is via messages • Notifications reach components via implicit invocation • No knowledge of architecture below you https://users.soe.ucsc.edu/~ejw/papers/c2-tse.pdf

  46. Requests travel up • They may receive no reply (no reply may be possible) • Notifications travel down • These are notification of state change / results • Received by implicit invocation • Connectors are responsible for: • Routing, broadcasting, message filtering • Components register to receive notifications • Messages can be prioritized • Requests can be ignored (functionality not available) • Domain translation when necessary
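
The message flow can be sketched with one connector between two layers. The class and method names here (`Connector`, `Component`, `Counter`, `attach`) are assumptions made for the illustration and are not taken from the C2 paper; filtering, prioritization, and domain translation are omitted.

```python
class Connector:
    """Routes requests up and broadcasts notifications down."""

    def __init__(self):
        self.above = []  # components attached to the connector's top
        self.below = []  # components attached to the connector's bottom

    def request(self, message):
        for component in self.above:
            component.handle_request(message)

    def notify(self, message):
        for component in self.below:
            component.handle_notification(message)


class Component:
    def __init__(self, name):
        self.name = name
        self.top = None       # at most one connector above
        self.bottom = None    # at most one connector below
        self.received = []

    def handle_request(self, message):
        pass  # requests may simply be ignored

    def handle_notification(self, message):
        self.received.append(message)  # implicit invocation by the connector


class Counter(Component):
    """Upper-layer component: knows nothing about who is below it."""

    def __init__(self, name):
        super().__init__(name)
        self.value = 0

    def handle_request(self, message):
        if message == "increment":
            self.value += 1
            self.bottom.notify(("value_changed", self.value))


def attach(upper, connector, lower):
    upper.bottom = connector
    if upper not in connector.above:
        connector.above.append(upper)
    lower.top = connector
    connector.below.append(lower)


bus = Connector()
counter = Counter("counter")
view_a, view_b = Component("view_a"), Component("view_b")
attach(counter, bus, view_a)
attach(counter, bus, view_b)

view_a.top.request("increment")  # request travels up
view_a.top.request("reset")      # unsupported request: ignored, no reply
```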

  47. Objectives of C2 Architecture • Substrate independence • Above us this is achieved by domain translation • Below us this is achieved by information hiding • This fosters substitutability/reusability • Message based architecture • This permits parallelism, distribution, scaling • Different components share no state • Components are compartmentalized • They have no need to know architecture

  48. Aspect oriented architecture • Cross cutting concerns • Runtime assertions • Tracing (event logging) • Profiling (performance issues) • Error handling • Testing • Debugging • Reliability (buffer overflow etc) • Encryption mechanisms

  49. Implementation • Extend the language to specify aspects • So: no need to embed code everywhere • Can be implemented statically during compilation • And/or dynamically at runtime via request • Consider debugging C++ • The -g option statically introduces symbol tables • The break command in GDB dynamically alters code so that it is interrupted • The watch command dynamically watches variables
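
Python has no built-in aspect syntax, but wrapping functions at runtime gives a feel for the dynamic variant: a tracing concern is written once and woven into a class without editing its methods. The `traced` aspect, the `weave` helper, and the `Account` class are assumptions made for this sketch.

```python
import functools

trace_log = []


def traced(func):
    """Tracing aspect: record entry and exit around any function."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        trace_log.append(f"enter {func.__name__}")
        try:
            return func(*args, **kwargs)
        finally:
            trace_log.append(f"exit {func.__name__}")
    return wrapper


def weave(cls, aspect):
    """Apply an aspect to every public method of a class at runtime."""
    for name, attr in list(vars(cls).items()):
        if callable(attr) and not name.startswith("_"):
            setattr(cls, name, aspect(attr))


class Account:
    """Business logic with no tracing code embedded in it."""

    def __init__(self, balance):
        self.balance = balance

    def deposit(self, amount):
        self.balance += amount
        return self.balance


weave(Account, traced)
account = Account(10)
result = account.deposit(5)
```

Because the weaving happens in one place, the tracing concern can be switched on or off without touching `Account` itself.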
