Big Data Open Source Software and Projects ABDS in Summary XIV: Level 14B

Big Data Open Source Software and Projects
ABDS in Summary XIV: Level 14B
I590 Data Science Curriculum, August 15 2014
Geoffrey Fox, gcf@indiana.edu, http://www.infomall.org
School of Informatics and Computing, Digital Science Center, Indiana University Bloomington

HPC-ABDS Layers
• Message Protocols
• Distributed Coordination
• Security & Privacy
• Monitoring
• IaaS Management from HPC to hypervisors
• DevOps
• Interoperability
• File systems
• Cluster Resource Management
• Data Transport
• SQL / NoSQL / File management
• In-memory databases & caches / Object-relational mapping / Extraction Tools
• Inter process communication: collectives, point-to-point, publish-subscribe
• Basic Programming model and runtime: SPMD, Streaming, MapReduce, MPI
• High level Programming
• Application and Analytics
• Workflow-Orchestration
Here are 17 functionalities. Technologies are presented in this order: the 4 cross-cutting functionalities at the top, then the other 13 in the order of the layered diagram, starting at the bottom.

Apache Storm
• https://storm.incubator.apache.org/
• Apache Storm is a distributed real time computation framework for processing streaming data.
• Storm is used for real time analytics, online machine learning, distributed RPC, etc.
• It provides scalable, fault tolerant and guaranteed message processing.
• Trident is a high level API on top of Storm which provides functions like stream joins, groupings and filters. Trident also has exactly-once processing guarantees.
• The project was originally developed at Twitter for processing Tweets from users and was donated to ASF in 2013.
• Storm has been used in very large deployments in Fortune 500 companies like Twitter and Yahoo.
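The "guaranteed message processing" bullet can be illustrated with a minimal sketch of acknowledge-and-replay, written in plain Python rather than against Storm's API. The names `ReplayingSource` and `run` are hypothetical and chosen only for this illustration; the sketch assumes a message counts as processed only once it has been acknowledged, and that a failed message is queued again for another attempt.

```python
from collections import deque


class ReplayingSource:
    """Hypothetical stream source that replays messages that were not acknowledged."""

    def __init__(self, messages):
        self.queue = deque(enumerate(messages))  # (msg_id, payload) pairs
        self.pending = {}                        # emitted but not yet acked

    def next_message(self):
        if not self.queue:
            return None
        msg_id, payload = self.queue.popleft()
        self.pending[msg_id] = payload
        return msg_id, payload

    def ack(self, msg_id):
        # Processing succeeded: forget the message.
        self.pending.pop(msg_id, None)

    def fail(self, msg_id):
        # Processing failed: put the message back at the end of the queue.
        self.queue.append((msg_id, self.pending.pop(msg_id)))


def run(source, process):
    """Drain the source, retrying any message whose processing raised RuntimeError."""
    results = []
    while True:
        item = source.next_message()
        if item is None:
            break
        msg_id, payload = item
        try:
            results.append(process(payload))
            source.ack(msg_id)
        except RuntimeError:
            source.fail(msg_id)
    return results
```

With this scheme a message may be attempted more than once, so every message is processed at least once but the output order can change. Exactly-once behaviour, as offered by Trident, needs additional bookkeeping that this sketch leaves out.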

Apache Samza (LinkedIn)
• http://samza.incubator.apache.org/
• Similar to Apache Storm, Apache Samza is a distributed real time computation framework for processing streaming data.
• Apache Samza is built on top of Apache Kafka and Apache YARN. Samza uses Kafka as its messaging layer and YARN to manage the cluster of nodes running Samza processes.
• Samza is scalable, fault tolerant and provides guaranteed message processing.
• Samza was originally developed at LinkedIn and was donated to ASF in 2013.

Apache S4
• http://incubator.apache.org/s4/
• Apache S4 is a distributed real time computation framework for processing unbounded streams of data.
• Unlike Storm and Samza, S4 provides a key-value based system for processing data.
• The system is scalable, fault tolerant and provides guaranteed message processing.
• S4 was originally developed at Yahoo and was donated to ASF in 2011.
• S4 is not as popular as Apache Storm.
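One way to read "key-value based" processing is that each event carries a key, and all events with the same key are handled by the same stateful processing element. The sketch below shows that idea in plain Python. It is not S4's API: `CountingPE`, `KeyedDispatcher` and `word_counts` are hypothetical names, and creating one element per distinct key is an assumption made for this illustration.

```python
class CountingPE:
    """Hypothetical processing element that holds state for exactly one key."""

    def __init__(self, key):
        self.key = key
        self.count = 0

    def on_event(self, value):
        # The value is ignored here; the element only counts events for its key.
        self.count += 1


class KeyedDispatcher:
    """Routes each (key, value) event to the element that owns that key."""

    def __init__(self, pe_factory):
        self.pe_factory = pe_factory
        self.instances = {}

    def dispatch(self, key, value):
        pe = self.instances.get(key)
        if pe is None:
            # First event for this key: create its processing element lazily.
            pe = self.instances[key] = self.pe_factory(key)
        pe.on_event(value)


def word_counts(words):
    """Count occurrences of each word using one element per distinct word."""
    dispatcher = KeyedDispatcher(CountingPE)
    for word in words:
        dispatcher.dispatch(word, 1)
    return {key: pe.count for key, pe in dispatcher.instances.items()}
```

Because state is partitioned by key, elements for different keys never share data, which is what would let such a design spread keys across many machines.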

Databus (LinkedIn)
• Closed source Databus: http://data.linkedin.com/projects/databus
• Databus provides a timeline-consistent stream of change capture events for a database. It enables applications to watch a database and to view and process updates in near real-time.
• Databus provides a complete after-image of every new or changed record, as well as deletes, while maintaining timeline consistency and transactional boundaries.
• The application integration is decoupled from the source database, and each application integration is isolated, which allows for parallel development and rapid innovation.
• Databus has a few key parts:
  • a database connector to watch changes and maintain a clock or sequence value
  • an in-memory relay that keeps recent changes for efficient retrieval
  • a bootstrap service/database that enables long lookback queries (including from the beginning of time)
  • a client that provides a simple API to get changes since a point in time
• To use Databus, the consuming application simply maintains a high watermark and periodically requests all changes since that point in time using the Databus client. Each consuming application maintains its own high watermark, which isolates the applications from one another.
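The high-watermark pattern described above can be sketched in a few lines of plain Python. This is not the Databus client API: `ChangeRelay`, `Consumer`, `record`, `changes_since` and `poll` are hypothetical names, the relay is a simple in-memory list, and a delete is represented by an after-image of `None` purely as a convention for this illustration.

```python
class ChangeRelay:
    """Hypothetical in-memory relay holding change events in sequence order."""

    def __init__(self):
        self.events = []   # list of (sequence_number, key, after_image)
        self.next_scn = 1

    def record(self, key, after_image):
        # after_image is the full record after the change, or None for a delete.
        self.events.append((self.next_scn, key, after_image))
        self.next_scn += 1

    def changes_since(self, watermark):
        return [event for event in self.events if event[0] > watermark]


class Consumer:
    """Hypothetical consuming application with its own high watermark."""

    def __init__(self, relay):
        self.relay = relay
        self.high_watermark = 0
        self.view = {}     # local copy of the watched records

    def poll(self):
        changes = self.relay.changes_since(self.high_watermark)
        for scn, key, after_image in changes:
            if after_image is None:
                self.view.pop(key, None)
            else:
                self.view[key] = after_image
            self.high_watermark = scn  # advance only after applying the change
        return len(changes)
```

Each `Consumer` keeps its own `high_watermark`, so a slow or newly started consumer can catch up without affecting the others, which is the isolation property the slide describes.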

Google MillWheel
• http://research.google.com/pubs/pub41378.html
• MillWheel is a distributed real time computation framework by Google.
• It provides scalable, fault tolerant processing with exactly-once message processing guarantees.
• MillWheel's key data abstraction is the key-value pair, and data is processed in a directed acyclic graph whose nodes are computations.
• The project is not open source and is planned to be made available to the general public through Google Cloud Platform as SaaS.
• Similar functionality to Apache Storm.
• Part of Google Cloud Dataflow http://googlecloudplatform.blogspot.com/2014/06/sneak-peek-google-cloud-dataflow-a-cloud-native-data-processing-service.html, which also has Google Pub-Sub and FlumeJava.
• See Amazon Kinesis http://aws.amazon.com/kinesis/, which combines Pub-Sub and Apache Storm capabilities.
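The combination of key-value records, a graph of computations and exactly-once effects can be sketched in plain Python. MillWheel is not publicly available, so nothing here is its API: `Computation`, `split_words`, `count` and `build_pipeline` are hypothetical names, and dropping records whose ID has already been seen is one common way to obtain exactly-once effects, used here as an assumption rather than as a description of Google's implementation.

```python
from collections import defaultdict


class Computation:
    """Hypothetical graph node: applies fn to each key-value record exactly once."""

    def __init__(self, name, fn):
        self.name = name
        self.fn = fn                   # fn(key, value, state) -> list of (key, value)
        self.downstream = []
        self.state = defaultdict(int)  # per-key state owned by this node
        self.seen_ids = set()          # IDs of records already processed

    def connect(self, other):
        self.downstream.append(other)
        return other

    def receive(self, record_id, key, value):
        if record_id in self.seen_ids:
            return                     # duplicate delivery: ignore it
        self.seen_ids.add(record_id)
        for i, (out_key, out_value) in enumerate(self.fn(key, value, self.state)):
            for node in self.downstream:
                # Derive a unique ID for each output from the input record's ID.
                node.receive((record_id, self.name, i), out_key, out_value)


def split_words(key, line, state):
    return [(word, 1) for word in line.split()]


def count(key, value, state):
    state[key] += value
    return []                          # terminal node: emits nothing


def build_pipeline():
    splitter = Computation("split", split_words)
    counter = Computation("count", count)
    splitter.connect(counter)
    return splitter, counter
```

If the same input record is delivered twice, for example after a retry, the second delivery is dropped at the first node and the counts downstream are unchanged.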
