This presentation gives an overview of the Apache Arrow project. It explains Arrow in terms of its in-memory structure, its purpose, its language interfaces, and its supporting projects.
Links for further information and connecting:
http://www.amazon.com/Michael-Frampton/e/B00NIQDOOM/
https://nz.linkedin.com/pub/mike-frampton/20/630/385
https://open-source-systems.blogspot.com/
What Is Apache Arrow? ● A development platform for in-memory data ● It has a columnar memory format ● It provides efficient analytic operations on modern hardware ● Used for in-memory processing ● Cross-language support ● Open source / Apache 2.0 license ● Supports zero-copy reads, so data can be accessed without first being copied or deserialized
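To make "zero-copy reads" concrete, here is a minimal sketch using only the Python standard library. It is not the Arrow API: the `array` column and the `memoryview` window are stand-ins chosen for illustration, and the names `prices`, `window` and `shared` are hypothetical.

```python
from array import array

# A column of 64-bit integers stored contiguously (a stand-in for an
# Arrow-style buffer; this is NOT the Arrow library itself).
prices = array("q", [10, 20, 30, 40, 50])

# memoryview exposes the same bytes without copying them.
view = memoryview(prices)
window = view[1:4]  # slicing the view allocates no new data buffer

# A write to the original column is visible through the window,
# which shows that both names refer to the same memory.
prices[2] = 99
shared = window.tolist()
print(shared)
```

The point of the sketch is only that a reader can be handed a window onto existing memory instead of a private copy, which is the idea behind the zero-copy bullet above.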
Languages supported ● Arrow supports many languages ● C ● C++ ● C# ● Go ● Java ● JavaScript ● MATLAB ● Python ● R ● Ruby ● Rust
Open Source Community Support ● Many open source projects support Arrow ● Calcite ● Cassandra ● Drill ● Hadoop ● HBase ● Ibis ● Impala ● Kudu ● Pandas ● Parquet ● Phoenix ● Spark ● Storm
The problem Arrow tackles ● Each system has its own internal memory format ● 70-80% of computation is wasted on serialization and deserialization ● Similar functionality is implemented in multiple projects ● Cross-system communication carries overhead ● All systems utilize different memory formats
The problem Arrow tackles ● No shared in-memory data model
Arrow solves this problem ● All systems utilize the same memory format – In memory – Columnar format – Optimized for modern CPUs and GPUs ● No overhead for cross-system communication ● Projects can share functionality
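The difference between exchanging data through serialization and sharing one agreed memory layout can be sketched as follows. This is an illustration under stated assumptions, not Arrow code: JSON stands in for any system-specific wire format, and "contiguous native 64-bit integers" stands in for the shared columnar layout. All variable names are hypothetical.

```python
import json
from array import array

values = list(range(1000))

# Without a shared format: system A serializes, system B parses and rebuilds.
wire = json.dumps(values)    # serialization cost on the producer
rebuilt = json.loads(wire)   # deserialization cost on the consumer

# With an agreed layout (assumed here: contiguous native 64-bit ints),
# the consumer interprets the producer's bytes in place.
column = array("q", values)
raw = memoryview(column).cast("B")  # the bytes as they would be handed over
consumer_view = raw.cast("q")       # reinterpreted with no parsing and no copy

total_parsed = sum(rebuilt)
total_shared = sum(consumer_view)
print(total_parsed, total_shared)
```

Both paths yield the same values, but the second one skips the encode and decode steps entirely, which is the overhead the slide says a common format removes.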
Arrow solves this problem ● Arrow shared data model
Arrow works with Parquet ● Arrow is an in-memory format ● Parquet is designed for disk storage ● Arrow and Parquet are intended to be used together ● Parquet is a columnar file format ● Used for data serialization ● Parquet is a streaming format ● Data must be decoded from start to end ● Files are compressed and encoded, which means smaller files on disk
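The trade-off between a compressed on-disk format and an in-memory format can be sketched with the standard library. This is only an analogy: `zlib` stands in for Parquet's compression and encoding, and a plain `array` stands in for an Arrow column. Neither is the real file format or library, and the names are hypothetical.

```python
import zlib
from array import array

column = array("q", range(100_000))
raw_bytes = column.tobytes()

# On-disk stand-in: compressed bytes are smaller but opaque.
on_disk = zlib.compress(raw_bytes)

# To read one element, the stream is decoded first.
decoded = array("q")
decoded.frombytes(zlib.decompress(on_disk))
from_disk = decoded[70_000]

# In-memory stand-in: the value is addressed directly by position.
in_memory = column[70_000]

smaller = len(on_disk) < len(raw_bytes)
print(from_disk, in_memory, smaller)
```

The compressed form saves space at the cost of a decode step before any value can be used, while the in-memory form spends space to make every value directly addressable. That is one way to read why the two formats are described as complementary.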
Arrow Memory Buffer ● Arrow supports data adjacency for sequential access
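Data adjacency can be illustrated by comparing where the values of one field sit in a row-oriented buffer and in a column-oriented buffer. The sketch below uses `struct` with a hypothetical two-field record (`id`, `price`); it shows the general layout idea and does not reproduce Arrow's actual buffer specification.

```python
import struct

rows = [(1, 2.5), (2, 3.5), (3, 4.5)]  # hypothetical (id, price) records

# Row-oriented: the fields of one record are adjacent, so consecutive
# prices are 16 bytes apart (8-byte id + 8-byte price per record).
row_buf = b"".join(struct.pack("<qd", i, p) for i, p in rows)
row_price_offsets = [n * 16 + 8 for n in range(len(rows))]

# Column-oriented: all prices are adjacent, 8 bytes apart.
ids_buf = struct.pack("<3q", *(i for i, _ in rows))
price_buf = struct.pack("<3d", *(p for _, p in rows))
col_price_offsets = [n * 8 for n in range(len(rows))]

prices_from_rows = [
    struct.unpack_from("<d", row_buf, off)[0] for off in row_price_offsets
]
prices_from_cols = list(struct.unpack("<3d", price_buf))
print(row_price_offsets, col_price_offsets)
```

Scanning a single field touches every other 8 bytes in the row layout but one unbroken run in the column layout, which is the sequential access pattern the slide refers to.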
Available Books ● See “Big Data Made Easy” – Apress Jan 2015 ● See “Mastering Apache Spark” – Packt Oct 2015 ● See “Complete Guide to Open Source Big Data Stack” – Apress Jan 2018 ● Find the author on Amazon – www.amazon.com/Michael-Frampton/e/B00NIQDOOM/ ● Connect on LinkedIn – www.linkedin.com/in/mike-frampton-38563020
Connect ● Feel free to connect on LinkedIn – www.linkedin.com/in/mike-frampton-38563020 ● See my open source blog at – open-source-systems.blogspot.com/ ● I am always interested in – New technology – Opportunities – Technology based issues – Big data integration