Apache Arrow

What Is Apache Arrow?
● A development platform for in-memory data
● It has a columnar memory format
● It provides efficient analytic operations on modern hardware
● Used for in-memory processing
● Cross-language support
● Open source / Apache 2.0 license
● Supports zero-copy reads for lightning-fast data access
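The two ideas on this slide, a columnar layout and zero-copy reads, can be illustrated with a small sketch. This is not the Arrow API; it assumes only Python's standard library (`array` and `memoryview`), and the `ids` and `prices` columns are made-up sample data.

```python
from array import array

# Row-oriented layout: the fields of each record are stored together.
rows = [(1, 10.5), (2, 20.0), (3, 7.25)]

# Columnar layout: each field lives in its own contiguous, typed buffer.
ids = array("q", (r[0] for r in rows))      # 64-bit signed integers
prices = array("d", (r[1] for r in rows))   # 64-bit floats

# A memoryview is a zero-copy window onto the column's buffer.
view = memoryview(prices)
tail = view[1:]            # slicing a memoryview copies no data

prices[2] = 99.0           # change the underlying column in place
total = sum(tail)          # the view sees the change: 20.0 + 99.0
```

Because `tail` refers to the same memory as `prices`, the update is visible through the view without any data being copied.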

Languages supported
● Arrow supports many languages
● C
● C++
● C#
● Go
● Java
● JavaScript
● MATLAB
● Python
● R
● Ruby
● Rust

Open Source Community Support
● Many open source projects support Arrow
● Calcite
● Cassandra
● Drill
● Hadoop
● HBase
● Ibis
● Impala
● Kudu
● Pandas
● Parquet
● Phoenix
● Spark
● Storm

The problem Arrow tackles
● Each system has its own internal memory format
● 70-80% of computation is wasted on serialization and deserialization
● Similar functionality is implemented in multiple projects
● Overheads for cross-system communication
● All systems utilize different memory formats

The problem Arrow tackles
● No shared in-memory data model

Arrow solves this problem
● All systems utilize the same memory format
– In memory
– Columnar format
– Optimized for modern CPUs and GPUs
● No overhead for cross-system communication
● Projects can share functionality
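The difference between per-system formats and one shared format can be sketched as follows. This is only an illustration using Python's standard library, not Arrow itself: JSON stands in for an arbitrary serialization format, and a contiguous native 64-bit integer buffer stands in for the agreed memory layout.

```python
import json
from array import array

values = array("q", range(5))

# Approach 1: each system has its own format, so data is encoded and parsed.
wire = json.dumps(list(values))      # producer serializes
decoded = json.loads(wire)           # consumer deserializes into new objects

# Approach 2: both sides agree on one layout (contiguous native int64 here).
shared = memoryview(values)          # no encoding step and no copy
# The consumer reinterprets the same bytes using the agreed element type.
consumer_view = shared.cast("B").cast("q")
```

Both consumers end up with the same numbers, but in the second approach `consumer_view` still points at the producer's buffer, so nothing was encoded, parsed, or copied.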

Arrow solves this problem
● Arrow shared data model

Arrow works with Parquet
● Arrow is an in-memory format
● Parquet is designed for disk storage
● Arrow and Parquet are intended to be used together
● Parquet is a columnar file format
● Used for data serialization
● Parquet is a streaming format
● Data must be decoded from start to end
● Files are compressed and encoded
● Means smaller files on disk
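The trade-off between a compressed on-disk form and a directly usable in-memory form can be sketched as below. Note the assumption: `zlib` is only a stand-in for "compressed and encoded" storage; it is not Parquet's actual file format or encoding scheme, and the column contents are made up.

```python
import zlib
from array import array

# A repetitive column of 2,000 values, held as a plain in-memory buffer.
column = array("q", [7] * 1000 + [8] * 1000)

raw = column.tobytes()
on_disk = zlib.compress(raw)     # stand-in for an encoded, compressed file

# Values cannot be read from the stored form directly: decode first.
restored = array("q")
restored.frombytes(zlib.decompress(on_disk))
```

The stored bytes are much smaller than the in-memory buffer, but every read pays the decoding cost, which is why a storage format and an in-memory format complement each other.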

Arrow Memory Buffer
● Arrow supports data adjacency for sequential access
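Data adjacency means that the values of a column sit back to back in one buffer, so the position of any element is simple arithmetic. The sketch below shows this with a standard-library `array`; it illustrates the idea only and does not use Arrow's own buffer classes.

```python
import struct
from array import array

column = array("d", [1.5, 2.5, 3.5, 4.5])
view = memoryview(column)
raw = view.cast("B")                 # the same memory, seen as plain bytes

# Element i starts at byte offset i * itemsize.
offset = 2 * view.itemsize
third = struct.unpack_from("d", raw, offset)[0]

# A sequential scan walks the buffer from start to end without jumping around.
running_total = sum(view)
```

Here `third` is read straight from its computed byte offset, and the scan touches neighbouring memory locations in order, which is the access pattern modern CPUs handle best.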

Available Books
● See “Big Data Made Easy”, Apress, Jan 2015
● See “Mastering Apache Spark”, Packt, Oct 2015
● See “Complete Guide to Open Source Big Data Stack”, Apress, Jan 2018
● Find the author on Amazon: www.amazon.com/Michael-Frampton/e/B00NIQDOOM/
● Connect on LinkedIn: www.linkedin.com/in/mike-frampton-38563020

Connect
● Feel free to connect on LinkedIn
– www.linkedin.com/in/mike-frampton-38563020
● See my open source blog at
– open-source-systems.blogspot.com/
● I am always interested in
– New technology
– Opportunities
– Technology-based issues
– Big data integration
