Data Formats

Data Formats CMSC 491/691 Hadoop-Based Distributed Computing Spring 2014 Adam Shook

Agenda • Apache Avro • Parquet

Apache Avro

Overview • Avro is a data serialization system • Implemented in C, C++, C#, Java, PHP, Python, and Ruby

Avro Provides • Rich data structures • Compact, fast, binary data format • A container file to store persistent data • Remote Procedure Call (RPC) • Simple integration with dynamic languages

Schema Declaration • A schema is declared as one of: • A JSON string, naming a defined type • A JSON object of the form {"type": "typeName", ...attributes...} • A JSON array, representing a union of types
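The three declaration forms can be told apart by their top-level JSON type. The sketch below uses only Python's json module; the function name schema_form and its return labels are made up for illustration and are not part of any Avro library.

```python
import json

def schema_form(schema_text):
    """Classify a schema declaration by its top-level JSON form (illustrative helper)."""
    schema = json.loads(schema_text)
    if isinstance(schema, str):
        return "named type"                       # e.g. "string"
    if isinstance(schema, dict):
        return "type definition: " + str(schema["type"])
    if isinstance(schema, list):
        return "union"
    raise ValueError("not a valid schema declaration")

forms = [schema_form(text) for text in (
    '"string"',
    '{"type": "array", "items": "long"}',
    '["string", "null"]',
)]
print(forms)
```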

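Primitive Types • null (no value) • boolean • int (32-bit signed) • long (64-bit signed) • float (32-bit IEEE 754) • double (64-bit IEEE 754) • bytes (sequence of 8-bit unsigned bytes) • string (Unicode character sequence)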

Complex Types • Records • Enums • Arrays • Maps • Unions • Fixed

Record Example - LinkedList
{
  "type": "record",
  "name": "LongList",
  "aliases": ["LinkedLongs"],                      // old name for this
  "fields" : [
    {"name": "value", "type": "long"},             // each element has a long
    {"name": "next", "type": ["LongList", "null"]} // optional next element
  ]
}
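To make the recursive "next" field concrete, here is a small sketch of what a LongList datum could look like as nested Python dicts, with None standing in for the "null" branch of the union. The helper names to_long_list and from_long_list are hypothetical.

```python
def to_long_list(values):
    """Build a LongList-shaped datum (nested dicts) from a non-empty list of ints."""
    if not values:
        raise ValueError("a LongList record needs at least one value")
    node = None                       # the last element's "next" is null
    for value in reversed(values):
        node = {"value": value, "next": node}
    return node

def from_long_list(node):
    """Walk the chain of "next" references and collect the values."""
    values = []
    while node is not None:
        values.append(node["value"])
        node = node["next"]
    return values

chain = to_long_list([1, 2, 3])
print(chain)
```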

Enum Example – Playing Cards { "type": "enum", "name": "Suit", "symbols" : ["SPADES", "HEARTS", "DIAMONDS", "CLUBS"] }

Array { "type": "array", "items": "string" }

Maps { "type": "map", "values": "long" }

Unions • Represented using JSON arrays • ["string", "null"] declares a schema which may be either a string or null • May not contain more than one schema of the same type, except for the named types record, fixed, and enum • Two arrays or two maps? No. Two differently named record types? Yes! • May not immediately contain other unions
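The union rules above can be sketched as a small checker. This is a simplified illustration, not the validation code of any Avro implementation: it ignores namespaces, and the names branch_key and check_union are invented here.

```python
NAMED_TYPES = {"record", "enum", "fixed"}

def branch_key(branch):
    """Return the key that must be unique among a union's branches."""
    if isinstance(branch, list):
        raise ValueError("a union may not immediately contain another union")
    if isinstance(branch, str):
        return branch                 # primitive name or reference to a named type
    kind = branch["type"]
    if kind in NAMED_TYPES:
        return branch["name"]         # named types are told apart by name
    return kind                       # array, map, ...: at most one of each

def check_union(union):
    """Return True if the union obeys the rules, else raise ValueError."""
    seen = set()
    for branch in union:
        key = branch_key(branch)
        if key in seen:
            raise ValueError("duplicate union branch: " + key)
        seen.add(key)
    return True

print(check_union(["string", "null"]))
```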

Fixed { "type": "fixed", "size": 16, "name": "md5" }

A bit on Naming • Records, enums, and fixed types are all named • The full name is composed of the name and a namespace • Names start with [A-Za-z_] and can only contain [A-Za-z0-9_] • A namespace is a dot-separated sequence of names • Named types can be aliased to map a writer’s schema to a reader’s schema
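The naming rules translate directly into a regular expression. The sketch below assumes that a name containing a dot is already a full name, in which case a separately given namespace is ignored; is_valid_name and full_name are illustrative helpers, not library functions.

```python
import re

# Start with [A-Za-z_], then only [A-Za-z0-9_].
NAME_RE = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")

def is_valid_name(name):
    return NAME_RE.fullmatch(name) is not None

def full_name(name, namespace=None):
    """Combine a name and an optional namespace into a dotted full name."""
    if "." in name:
        parts = name.split(".")       # already a full name; namespace is ignored
    else:
        parts = (namespace.split(".") if namespace else []) + [name]
    for part in parts:
        if not is_valid_name(part):
            raise ValueError("invalid name component: " + repr(part))
    return ".".join(parts)

print(full_name("User", "example.avro"))
```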

Encodings! • Binary • JSON • Binary is compact and meant for machines; JSON is readable by humans and handy for debugging • Details of how they are encoded can be found at http://avro.apache.org/docs/current/spec.html
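As a taste of the binary encoding, the sketch below encodes a long (zig-zag coding followed by a variable-length, 7-bits-per-byte layout) and a string (a long byte count followed by the UTF-8 bytes), and prints the JSON form next to it. It covers only these two types; the function names are chosen for this example.

```python
import json

def encode_long(n):
    """Zig-zag, then variable-length encode a signed 64-bit integer."""
    z = (n << 1) ^ (n >> 63)          # zig-zag: small magnitudes get small codes
    out = bytearray()
    while z & ~0x7F:                  # more than 7 bits left
        out.append((z & 0x7F) | 0x80) # low 7 bits, continuation bit set
        z >>= 7
    out.append(z)
    return bytes(out)

def encode_string(text):
    """A string is its UTF-8 byte count (as a long) followed by the bytes."""
    data = text.encode("utf-8")
    return encode_long(len(data)) + data

print(encode_string("foo").hex())     # binary: for the machines
print(json.dumps("foo"))              # JSON: for the humans
```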

Compression • Codecs applied to the data blocks of a container file • null (no compression) • deflate • snappy (optional)
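A rough sketch of the null and deflate codecs using Python's zlib module. It assumes the deflate codec means raw deflate (RFC 1951, no zlib header or checksum), which is selected here with wbits=-15; the CODECS table is purely illustrative, and snappy is left out because it needs a third-party package.

```python
import zlib

def deflate_block(data):
    """Compress with raw deflate (no zlib header or checksum)."""
    compressor = zlib.compressobj(zlib.Z_DEFAULT_COMPRESSION, zlib.DEFLATED, -15)
    return compressor.compress(data) + compressor.flush()

def inflate_block(data):
    """Inverse of deflate_block."""
    return zlib.decompress(data, -15)

# codec name -> (compress, decompress); "null" passes data through unchanged
CODECS = {
    "null": (lambda data: data, lambda data: data),
    "deflate": (deflate_block, inflate_block),
}

block = b"SPADES HEARTS DIAMONDS CLUBS " * 100
for name, (compress, decompress) in CODECS.items():
    packed = compress(block)
    assert decompress(packed) == block
    print(name, len(block), "->", len(packed))
```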

Other Features • RPC via Protocols: message passing between readers and writers • Schema Resolution: for when the reader's schema and the written data don’t align • Parsing Canonical Form (PCF): transform schemas into PCF to determine “sameness” between schemas • Schema Fingerprints: to “uniquely” identify schemas
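The idea behind canonical forms and fingerprints can be sketched as follows. This is a deliberately simplified normalization, not the full Parsing Canonical Form from the specification: it strips non-essential attributes, fixes key order, collapses {"type": "long"} to "long", and removes whitespace, but it does not resolve full names. SHA-256 is used as the hash here; the helper names are invented.

```python
import hashlib
import json

# Attributes kept, in the order they are written out.
ORDER = ("name", "type", "fields", "symbols", "items", "values", "size")
PRIMITIVES = {"null", "boolean", "int", "long", "float", "double", "bytes", "string"}

def canonical(node):
    """Return a normalized copy of a parsed schema (simplified sketch)."""
    if isinstance(node, list):
        return [canonical(item) for item in node]
    if isinstance(node, dict):
        kept = {key: canonical(node[key]) for key in ORDER if key in node}
        kind = kept.get("type")
        if list(kept) == ["type"] and isinstance(kind, str) and kind in PRIMITIVES:
            return kind               # {"type": "long"} becomes "long"
        return kept
    return node

def fingerprint(schema_text):
    """Hash the normalized schema text so equal schemas get equal fingerprints."""
    text = json.dumps(canonical(json.loads(schema_text)), separators=(",", ":"))
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

a = '{"type":"record","name":"LongList","aliases":["LinkedLongs"],"fields":[{"name":"value","type":"long"}]}'
b = '{ "name": "LongList", "doc": "a list", "fields": [ {"type": {"type": "long"}, "name": "value"} ], "type": "record" }'
print(fingerprint(a) == fingerprint(b))
```

The two schemas differ in key order, whitespace, and documentation attributes, yet normalize to the same text, so their fingerprints match.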

Code Generation!
[shadam1@491vm ~]$ cat user.avsc
{
  "namespace": "example.avro",
  "type": "record",
  "name": "User",
  "fields": [
    {"name": "name", "type": "string"},
    {"name": "favorite_number", "type": ["int", "null"]},
    {"name": "favorite_color", "type": ["string", "null"]}
  ]
}

Code Generation!
[shadam1@491vm ~]$ java -jar avro-tools-1.7.6.jar compile \
    schema user.avsc .
Input files to compile:
  user.avsc
[shadam1@491vm ~]$ vi example/avro/User.java
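A generated Java class bakes the schema's field types into the code. In a dynamic language the same schema can instead be checked at runtime. The sketch below is a hand-rolled checker for just the types used in user.avsc (string, int, null, unions, records); it is not how the Avro libraries do it, and CHECKS and matches are names invented for this example.

```python
import json

USER_SCHEMA = json.loads("""
{"namespace": "example.avro", "type": "record", "name": "User",
 "fields": [
   {"name": "name", "type": "string"},
   {"name": "favorite_number", "type": ["int", "null"]},
   {"name": "favorite_color", "type": ["string", "null"]}
 ]}
""")

# Runtime checks for the primitive types this schema uses.
CHECKS = {
    "null": lambda v: v is None,
    "string": lambda v: isinstance(v, str),
    "int": lambda v: isinstance(v, int) and not isinstance(v, bool)
                     and -2**31 <= v < 2**31,
}

def matches(schema, value):
    """Return True if value conforms to schema (records, unions, a few primitives)."""
    if isinstance(schema, list):                       # union: any branch may match
        return any(matches(branch, value) for branch in schema)
    if isinstance(schema, dict):
        if schema["type"] != "record":
            raise NotImplementedError(schema["type"])
        fields = schema["fields"]
        return (isinstance(value, dict)
                and set(value) == {f["name"] for f in fields}
                and all(matches(f["type"], value[f["name"]]) for f in fields))
    return CHECKS[schema](value)

user = {"name": "Alyssa", "favorite_number": 256, "favorite_color": None}
print(matches(USER_SCHEMA, user))
```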

Java and Python Demo! • See my VM

Parquet

Overview • Parquet is an open-source columnar storage format for Hadoop • Based on the Google Dremel paper and created largely by Twitter and Cloudera • Supports very efficient compression and encoding schemes, since values of the same column are stored together
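To see why a columnar layout helps encoding, consider plain run-length encoding of a single column. This is only an illustration of the principle: Parquet's actual encodings are more involved than this, and the sample rows are made up.

```python
from itertools import groupby

def rle_encode(column):
    """Collapse runs of equal values into (value, run_length) pairs."""
    return [(value, len(list(run))) for value, run in groupby(column)]

def rle_decode(pairs):
    """Expand (value, run_length) pairs back into the original column."""
    return [value for value, count in pairs for _ in range(count)]

# Made-up rows of (country, user_id); stored row by row, the repeated
# country values are interleaved with ids and form no runs.
rows = [("us", 1), ("us", 2), ("us", 3), ("uk", 4), ("uk", 5)]

# Stored column by column, the country values sit next to each other.
country_column = [row[0] for row in rows]
encoded = rle_encode(country_column)
print(encoded)
```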

Serialization • Objects are written to Parquet by WriteSupport implementations and read back by ReadSupport implementations • Existing support for Avro, Thrift, Pig, Hive SerDe, MapReduce • You can write your own, but it’s easier to leverage what exists today

File Hierarchy • Row Group: a logical horizontal partitioning of the data into groups of rows • Column Chunk: the data for one particular column within a row group, stored contiguously in the file • Page: column chunks are divided up into pages • One or more Row Groups per file, and exactly one Column Chunk per column in each Row Group
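The hierarchy can be mimicked with plain Python lists: rows are cut into row groups, each row group is pivoted into one column chunk per column, and each chunk is cut into pages. The group and page sizes below are arbitrary illustration values, and the function names are invented.

```python
def to_row_groups(rows, columns, rows_per_group):
    """Split rows horizontally, then pivot each group into column chunks."""
    groups = []
    for start in range(0, len(rows), rows_per_group):
        group_rows = rows[start:start + rows_per_group]
        # exactly one column chunk per column in this row group
        groups.append({name: [row[i] for row in group_rows]
                       for i, name in enumerate(columns)})
    return groups

def to_pages(column_chunk, values_per_page):
    """Divide one column chunk into pages."""
    return [column_chunk[i:i + values_per_page]
            for i in range(0, len(column_chunk), values_per_page)]

rows = [("a", 1), ("b", 2), ("c", 3), ("d", 4), ("e", 5)]
row_groups = to_row_groups(rows, ["letter", "number"], rows_per_group=3)
print(row_groups)
print(to_pages(row_groups[0]["number"], values_per_page=2))
```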

File Format
4-byte magic number "PAR1"
<Column 1 Chunk 1 + Column Metadata>
<Column 2 Chunk 1 + Column Metadata>
...
<Column N Chunk 1 + Column Metadata>
<Column 1 Chunk 2 + Column Metadata>
<Column 2 Chunk 2 + Column Metadata>
...
<Column N Chunk 2 + Column Metadata>
...
<Column 1 Chunk M + Column Metadata>
<Column 2 Chunk M + Column Metadata>
...
<Column N Chunk M + Column Metadata>
File Metadata
4-byte length in bytes of file metadata
4-byte magic number "PAR1"
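Because the metadata length and magic number sit at the very end, a reader can locate the file metadata by working backwards from the last 8 bytes. The sketch below builds a toy byte string with this layout and reads the footer back. The chunk and metadata contents are placeholder bytes (real Parquet metadata is a serialized structure), the length is assumed to be little-endian, and both function names are invented.

```python
import struct

MAGIC = b"PAR1"

def build_file(column_chunks, file_metadata):
    """Lay out: magic, column chunks, file metadata, metadata length, magic."""
    body = b"".join(column_chunks)
    footer_length = struct.pack("<I", len(file_metadata))   # 4 bytes, little-endian
    return MAGIC + body + file_metadata + footer_length + MAGIC

def read_file_metadata(buf):
    """Find the file metadata by reading the footer from the end of the file."""
    if len(buf) < 12 or buf[:4] != MAGIC or buf[-4:] != MAGIC:
        raise ValueError("missing PAR1 magic number")
    (meta_len,) = struct.unpack("<I", buf[-8:-4])
    meta_start = len(buf) - 8 - meta_len
    if meta_start < 4:
        raise ValueError("metadata length exceeds file size")
    return buf[meta_start:len(buf) - 8]

toy_file = build_file([b"<col1 chunk1>", b"<col2 chunk1>"], b"<file metadata>")
print(read_file_metadata(toy_file))
```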

Data Types • Boolean (1 bit) • Int32, Int64, Int96 (32-, 64-, and 96-bit signed integers) • Float (32-bit IEEE) • Double (64-bit IEEE) • Byte Array (arbitrarily long)

Parquet Example - Avro • See my VM

References • http://avro.apache.org • http://parquet.io
