Data Formats CMSC 491/691 Hadoop-Based Distributed Computing Spring 2014 Adam Shook
Agenda • Apache Avro • Parquet
Overview • Avro is a data serialization system • Implemented in C, C++, C#, Java, PHP, Python, and Ruby
Avro Provides • Rich data structures • Compact, fast, binary data format • A container file to store persistent data • Remote Procedure Call (RPC) • Simple integration with dynamic languages
Schema Declaration • A schema is written in JSON as one of: • A JSON string, naming a type • A JSON object of the form {"type": "typeName" ...attributes...} • A JSON array, representing a union of types
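As a rough illustration of the three forms above, the following Python sketch parses a schema with the standard json module and reports which form it takes. The helper name schema_form is made up for this example and is not part of any Avro library.

```python
import json

def schema_form(schema_json):
    """Classify an Avro schema by its top-level JSON form (hypothetical helper)."""
    schema = json.loads(schema_json)
    if isinstance(schema, str):
        return "named or primitive type: " + schema
    if isinstance(schema, dict):
        return "object of type: " + str(schema["type"])
    if isinstance(schema, list):
        return "union of %d types" % len(schema)
    raise ValueError("not one of the three schema forms")

print(schema_form('"string"'))
print(schema_form('{"type": "array", "items": "long"}'))
print(schema_form('["string", "null"]'))
```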
Primitive Types • Null • Boolean • Int • Long • Float • Double • Bytes • String
Complex Types • Records • Enums • Arrays • Maps • Unions • Fixed
Record Example - LinkedList
{
  "type": "record",
  "name": "LongList",
  "aliases": ["LinkedLongs"],                      // old name for this
  "fields" : [
    {"name": "value", "type": "long"},             // each element has a long
    {"name": "next", "type": ["LongList", "null"]} // optional next element
  ]
}
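To make the recursive record concrete, here is a small Python sketch of what a LongList datum could look like in memory. It assumes records are represented as dicts and null as None; the helper names to_long_list and from_long_list are invented for this example.

```python
def to_long_list(values):
    """Build a LongList datum (nested dicts) from a list of ints."""
    node = None  # the "null" branch of the union marks the end of the list
    for v in reversed(values):
        node = {"value": v, "next": node}
    return node

def from_long_list(node):
    """Walk the "next" links and collect each "value"."""
    out = []
    while node is not None:
        out.append(node["value"])
        node = node["next"]
    return out

datum = to_long_list([1, 2, 3])
print(datum)
print(from_long_list(datum))
```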
Enum Example – Playing Cards { "type": "enum", "name": "Suit", "symbols" : ["SPADES", "HEARTS", "DIAMONDS", "CLUBS"] }
Array { "type": "array", "items": "string" }
Maps { "type": "map", "values": "long" }
Unions • Represented using JSON arrays • ["string", "null"] declares a schema which may be a string or null • May not contain more than one schema with the same type, except in the case of named types like record, fixed, and enum. • Two arrays or maps? No. But two record types? Yes! • Cannot contain other unions
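The union rules above can be sketched as a small checker. This is an illustration only, not the validation code of any Avro library: check_union and branch_key are hypothetical names, and the sketch does not resolve string references to named types.

```python
NAMED = {"record", "enum", "fixed"}

def branch_key(branch):
    """Return the value that must be unique among the branches of a union."""
    if isinstance(branch, list):
        raise ValueError("unions cannot contain other unions")
    if isinstance(branch, str):
        return branch  # a primitive name or a reference to a named type
    kind = branch["type"]
    if kind in NAMED:
        return (kind, branch["name"])  # named types are told apart by name
    return kind  # e.g. array or map: at most one of each

def check_union(branches):
    """Raise ValueError if the union breaks a rule, otherwise return True."""
    seen = set()
    for branch in branches:
        key = branch_key(branch)
        if key in seen:
            raise ValueError("duplicate branch in union: %r" % (key,))
        seen.add(key)
    return True

print(check_union(["string", "null"]))
```

Two arrays in one union fail the check, while two records with different names pass.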
Fixed { "type": "fixed", "size": 16, "name": "md5" }
A bit on Naming • Records, enums, and fixed types are all named • The full name is composed of the name and a namespace • Names start with [A-Za-z_] and can only contain [A-Za-z0-9_] • A namespace is a dot-separated sequence of names • Named types can be aliased to map a writer’s schema to a reader’s schema
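The naming rules can be expressed with a regular expression. The sketch below is illustrative; is_valid_name and full_name are made-up helpers, and the handling of a name that already contains dots (treated as a full name, ignoring the namespace) follows my reading of the Avro specification.

```python
import re

NAME_RE = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")

def is_valid_name(name):
    """True if the name starts with [A-Za-z_] and contains only [A-Za-z0-9_]."""
    return NAME_RE.fullmatch(name) is not None

def full_name(name, namespace=None):
    """Combine a name and an optional namespace into a dot-separated full name."""
    if "." in name:
        parts = name.split(".")  # already a full name; namespace is ignored
    else:
        parts = (namespace.split(".") if namespace else []) + [name]
    if not all(is_valid_name(p) for p in parts):
        raise ValueError("invalid name: %r" % ".".join(parts))
    return ".".join(parts)

print(full_name("User", "example.avro"))
print(is_valid_name("9lives"))
```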
Encodings! • Binary • JSON • Binary is more readable by the machines, JSON is more readable by the humans • Details of how they are encoded can be found at http://avro.apache.org/docs/current/spec.html
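For a feel of the binary encoding, here is a minimal Python sketch of how the specification encodes int and long values (zig-zag, then variable-length bytes) and strings (a long byte count followed by UTF-8 data). The function names are my own; only these two types are covered.

```python
def encode_long(n):
    """Avro binary encoding of an int/long: zig-zag, then 7 bits per byte."""
    z = (n << 1) ^ (n >> 63)  # zig-zag maps small magnitudes to small unsigned values
    out = bytearray()
    while z & ~0x7F:          # more than 7 bits left: emit a continuation byte
        out.append((z & 0x7F) | 0x80)
        z >>= 7
    out.append(z)
    return bytes(out)

def encode_string(s):
    """Avro binary encoding of a string: byte length as a long, then UTF-8 bytes."""
    data = s.encode("utf-8")
    return encode_long(len(data)) + data

for value in (0, -1, 1, 64):
    print(value, encode_long(value).hex())
print(encode_string("foo").hex())
```

In the JSON encoding the same long is simply written as a JSON number, which is easier to read but larger on disk.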
Compression • Null • Deflate • Snappy (optional)
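As a sketch of the deflate codec: the Avro specification describes it as raw RFC 1951 deflate data, without the zlib header or checksum. In Python this can be approximated with the standard zlib module by passing a negative wbits value. The helper names below are hypothetical, and this only compresses a byte string; it does not write an Avro container file.

```python
import zlib

def deflate_block(data, level=6):
    """Compress bytes as raw deflate (negative wbits means no zlib header)."""
    comp = zlib.compressobj(level, zlib.DEFLATED, -15)
    return comp.compress(data) + comp.flush()

def inflate_block(data):
    """Reverse deflate_block."""
    return zlib.decompress(data, -15)

block = b"avro " * 1000
packed = deflate_block(block)
print(len(block), "->", len(packed))
```

The null codec corresponds to storing the block unchanged.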
Other Features • RPC via Protocols: message passing between readers and writers • Schema Resolution: when schema and data don’t align • Parsing Canonical Form: transform schemas into PCF to determine “sameness” between schemas • Schema Fingerprints: to “uniquely” identify schemas
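The idea behind canonical forms and fingerprints can be sketched in a few lines of Python. This is a deliberately simplified approximation, not the full Parsing Canonical Form algorithm from the specification: it drops non-essential attributes, fixes the key order, qualifies names with their namespace, and hashes the compact JSON with SHA-256. The function names canonical and fingerprint are my own.

```python
import hashlib
import json

PRIMITIVES = {"null", "boolean", "int", "long", "float", "double", "bytes", "string"}

def canonical(schema, ns=""):
    """Reduce a parsed schema to a canonical structure (simplified sketch)."""
    if isinstance(schema, str):
        # Qualify references to named types with the enclosing namespace.
        if schema in PRIMITIVES or "." in schema or not ns:
            return schema
        return ns + "." + schema
    if isinstance(schema, list):
        return [canonical(s, ns) for s in schema]
    kind = schema["type"]
    if not isinstance(kind, str):
        return canonical(kind, ns)
    if kind in PRIMITIVES:
        return kind  # {"type": "int"} collapses to "int"
    out = {}
    if "name" in schema:
        name = schema["name"]
        if "." in name:
            ns = name.rsplit(".", 1)[0]
        else:
            ns = schema.get("namespace", ns)
            name = ns + "." + name if ns else name
        out["name"] = name
    out["type"] = kind
    if "fields" in schema:
        out["fields"] = [{"name": f["name"], "type": canonical(f["type"], ns)}
                         for f in schema["fields"]]
    if "symbols" in schema:
        out["symbols"] = list(schema["symbols"])
    if "items" in schema:
        out["items"] = canonical(schema["items"], ns)
    if "values" in schema:
        out["values"] = canonical(schema["values"], ns)
    if "size" in schema:
        out["size"] = schema["size"]
    return out

def fingerprint(schema_json):
    """SHA-256 of the compact canonical JSON text."""
    text = json.dumps(canonical(json.loads(schema_json)), separators=(",", ":"))
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

a = '{"type": "record", "name": "User", "namespace": "example.avro", "doc": "a user", "fields": [{"name": "name", "type": "string"}]}'
b = '{"namespace": "example.avro", "name": "User", "type": "record", "fields": [{"type": {"type": "string"}, "name": "name"}]}'
print(fingerprint(a) == fingerprint(b))
```

Two schemas that differ only in documentation, key order, or whitespace end up with the same fingerprint under this sketch.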
Code Generation!
[shadam1@491vm ~]$ cat user.avsc
{
  "namespace": "example.avro",
  "type": "record",
  "name": "User",
  "fields": [
    {"name": "name", "type": "string"},
    {"name": "favorite_number", "type": ["int", "null"]},
    {"name": "favorite_color", "type": ["string", "null"]}
  ]
}
Code Generation!
[shadam1@491vm ~]$ java -jar avro-tools-1.7.6.jar compile \
    schema user.avsc .
Input files to compile:
  user.avsc
[shadam1@491vm ~]$ vi example/avro/User.java
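In a dynamic language the same schema can be used without generated classes. The Python sketch below checks a dict against the user.avsc schema at runtime. It is a toy validator written for this example (the function matches is not an Avro API) and it only covers the types that appear in this schema: record, string, int, null, and unions.

```python
import json

USER_SCHEMA = json.loads("""
{"namespace": "example.avro", "type": "record", "name": "User",
 "fields": [
   {"name": "name", "type": "string"},
   {"name": "favorite_number", "type": ["int", "null"]},
   {"name": "favorite_color", "type": ["string", "null"]}
 ]}
""")

def matches(schema, datum):
    """Return True if datum conforms to schema (covers only this example's types)."""
    if isinstance(schema, list):  # union: any branch may match
        return any(matches(s, datum) for s in schema)
    if isinstance(schema, dict):
        if schema["type"] != "record":
            return matches(schema["type"], datum)
        return isinstance(datum, dict) and all(
            matches(f["type"], datum.get(f["name"])) for f in schema["fields"])
    if schema == "null":
        return datum is None
    if schema == "string":
        return isinstance(datum, str)
    if schema == "int":  # 32-bit signed; bool is excluded explicitly
        return (isinstance(datum, int) and not isinstance(datum, bool)
                and -2**31 <= datum < 2**31)
    raise ValueError("type not covered by this sketch: %r" % (schema,))

print(matches(USER_SCHEMA, {"name": "Alyssa", "favorite_number": 256, "favorite_color": None}))
print(matches(USER_SCHEMA, {"favorite_number": 7, "favorite_color": "red"}))
```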
Java and Python Demo! • See my VM
Overview • Parquet is an open-source columnar storage format for Hadoop • Based on the Google Dremel paper and created largely by Twitter and Cloudera • Supports very efficient compression and encoding schemes
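One way to see why a columnar layout compresses well: values from the same column sit next to each other and tend to repeat. The Python sketch below pivots a few made-up rows into columns and applies a plain run-length encoding. The sample data and helper names are invented for illustration; Parquet's actual encodings are more involved than this.

```python
from itertools import groupby

rows = [
    {"user": "a", "country": "US", "clicks": 3},
    {"user": "b", "country": "US", "clicks": 3},
    {"user": "c", "country": "US", "clicks": 5},
    {"user": "d", "country": "CA", "clicks": 5},
]

def to_columns(rows):
    """Pivot a list of row dicts into one list of values per column."""
    return {key: [r[key] for r in rows] for key in rows[0]}

def run_length_encode(values):
    """Collapse consecutive repeats into (value, count) pairs."""
    return [(v, len(list(group))) for v, group in groupby(values)]

columns = to_columns(rows)
print(columns["country"])
print(run_length_encode(columns["country"]))
```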
Serialization • Objects are serialized to Parquet format by ReadSupport and WriteSupport implementations • Support for Avro, Thrift, Pig, Hive SerDe, MapReduce • Can write your own, but it’s easier to leverage what exists today
File Hierarchy • Row Group – logical horizontal partitioning of data into rows • Column Chunk – chunk of the data for a particular column, living in a row group and contiguous in the file • Page – column chunks are divided up into pages • One or more Row Groups per file, and exactly one Column Chunk per column in each Row Group
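The hierarchy can be sketched with nested Python containers: a list of row groups, each mapping a column name to its column chunk, which is itself a list of pages. The function build_layout and its count-based parameters are assumptions made for this illustration; real writers typically size row groups and pages in bytes.

```python
def build_layout(rows, columns, rows_per_group, values_per_page):
    """Split rows into row groups, then column chunks, then pages."""
    layout = []
    for start in range(0, len(rows), rows_per_group):
        group_rows = rows[start:start + rows_per_group]
        row_group = {}
        for col in columns:  # exactly one column chunk per column
            values = [r[col] for r in group_rows]
            row_group[col] = [values[i:i + values_per_page]
                              for i in range(0, len(values), values_per_page)]
        layout.append(row_group)
    return layout

rows = [{"id": i, "name": "n%d" % i} for i in range(5)]
for number, group in enumerate(build_layout(rows, ["id", "name"], 3, 2)):
    print("row group", number, group)
```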
File Format
4-byte magic number "PAR1"
<Column 1 Chunk 1 + Column Metadata>
<Column 2 Chunk 1 + Column Metadata>
...
<Column N Chunk 1 + Column Metadata>
<Column 1 Chunk 2 + Column Metadata>
<Column 2 Chunk 2 + Column Metadata>
...
<Column N Chunk 2 + Column Metadata>
...
<Column 1 Chunk M + Column Metadata>
<Column 2 Chunk M + Column Metadata>
...
<Column N Chunk M + Column Metadata>
File Metadata
4-byte length in bytes of file metadata
4-byte magic number "PAR1"
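Because the metadata length and magic number sit at the end of the file, a reader can locate the file metadata by working backwards from the last 8 bytes. The Python sketch below builds a fake file in memory and reads the footer back. It assumes the 4-byte length is little-endian, and the chunk and metadata bytes are placeholders rather than real Parquet structures; build_fake_file and read_footer are hypothetical helpers.

```python
import struct

MAGIC = b"PAR1"

def build_fake_file(column_chunks, file_metadata):
    """Assemble magic + chunks + metadata + metadata length + magic."""
    body = b"".join(column_chunks)
    return MAGIC + body + file_metadata + struct.pack("<I", len(file_metadata)) + MAGIC

def read_footer(data):
    """Return the raw file metadata bytes, located via the trailing length field."""
    if len(data) < 12 or data[:4] != MAGIC or data[-4:] != MAGIC:
        raise ValueError("missing PAR1 magic number")
    (meta_len,) = struct.unpack("<I", data[-8:-4])  # assumed little-endian
    meta_start = len(data) - 8 - meta_len
    if meta_start < 4:
        raise ValueError("metadata length is larger than the file")
    return data[meta_start:len(data) - 8]

fake = build_fake_file([b"col1-chunk1", b"col2-chunk1"], b"placeholder-metadata")
print(read_footer(fake))
```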
Data Types • Boolean • Int 32, 64, 96 • Float • Double • Byte Array
Parquet Example - Avro • See my VM
References • http://avro.apache.org • http://parquet.io