Sqoop HCatalog Integration
Venkat Ranganathan, Sqoop Meetup, 10/28/13
Agenda
• HCatalog Overview
• Sqoop HCatalog Integration Goals
• Features
• Demo
• Benefits
HCatalog Overview
• Table and storage management service for Hadoop
• Enables Pig/MapReduce and Hive to more easily share data on the grid
• Uses the Hive metastore
• Abstracts the location and format of the data
• Supports reading and writing files in any format for which a Hive SerDe is available
• Now part of Hive
Sqoop HCatalog Integration Goals
• Support HCatalog features consistent with Sqoop usage
• Support both imports into and exports from HCatalog tables
• Enable Sqoop to read and write data in various file formats
• Automatic table schema mapping
• Data fidelity
• Support for static and dynamic partition keys
Support imports and exports
• Allows an HCatalog table to be either the source or the destination of a Sqoop job
• For an HCatalog import, the --target-dir / --warehouse-dir options are replaced by the HCatalog table name
• Similarly for exports, the export directory is replaced by the HCatalog table name
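A minimal sketch of the import and export forms, assuming a hypothetical MySQL source at jdbc:mysql://dbhost/sales and an existing HCatalog table named txn; the connection string, credentials, and table names are placeholders, not from the original talk.

# Import: no --target-dir / --warehouse-dir; the HCatalog table is the destination
sqoop import --connect jdbc:mysql://dbhost/sales \
    --username sqoopuser -P \
    --table TXN \
    --hcatalog-table txn

# Export: no --export-dir; the HCatalog table is the source
sqoop export --connect jdbc:mysql://dbhost/sales \
    --username sqoopuser -P \
    --table TXN_COPY \
    --hcatalog-table txn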
File format support
• HCatalog integration now enables Sqoop to import/export files in any format for which a Hive SerDe exists
• Text files, SequenceFiles, RCFile, ORCFile, …
• This makes Sqoop agnostic of the file format used, which can change over time based on new innovations and needs
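A sketch showing that the import command stays the same regardless of the table's underlying file format, assuming an HCatalog table orders_rc that was already defined with its own storage format (for example RCFile or ORCFile); the JDBC URL and names are illustrative.

# The HCatalog table's SerDe determines how rows are written, so the Sqoop
# command does not change when the table's file format changes
sqoop import --connect jdbc:mysql://dbhost/sales \
    --username sqoopuser -P \
    --table ORDERS \
    --hcatalog-table orders_rc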
Automatic table schema mapping
• Sqoop allows a Hive table to be created based on the enterprise data store's schema
• This is enabled for HCatalog table imports as well
• Automatic mapping with optional user overrides
• Ability to provide storage options for the newly created table
• All HCatalog primitive types are supported
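A sketch of an import that also creates the HCatalog table, using the create and storage-stanza options plus an optional per-column type override; the column name CUST_SINCE, the storage clause, and the JDBC URL are hypothetical placeholders.

sqoop import --connect jdbc:mysql://dbhost/sales \
    --username sqoopuser -P \
    --table CUSTOMERS \
    --hcatalog-table customers \
    --create-hcatalog-table \
    --hcatalog-storage-stanza "stored as rcfile" \
    --map-column-hive CUST_SINCE=string
# --create-hcatalog-table   : create the HCatalog table as part of the import
# --hcatalog-storage-stanza : storage clause used in the generated CREATE TABLE
# --map-column-hive         : user override of the automatic type mapping (hypothetical column)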
Data fidelity
• With text-based imports (as with the Sqoop hive-import option), text values have to be massaged so that delimiters are not misinterpreted
• Sqoop provides two options to handle this:
--hive-delims-replacement
--hive-drop-import-delims
• This is error prone, and the data is modified before it is stored in Hive
Data fidelity
• With HCatalog table imports into file formats like RCFile, ORCFile, etc., there is no need to strip these delimiters from column values
• Data is preserved without any massaging
• If the target HCatalog table's file format is text, the two options can still be used as before:
--hive-delims-replacement
--hive-drop-import-delims
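A sketch contrasting the two paths, assuming a source table whose text column may contain embedded Hive delimiters; table names and the JDBC URL are placeholders.

# Text-based Hive import: embedded delimiters must be dropped or replaced,
# so the stored values can differ from the source
sqoop import --connect jdbc:mysql://dbhost/sales \
    --username sqoopuser -P \
    --table NOTES \
    --hive-import --hive-table notes_text \
    --hive-drop-import-delims

# HCatalog import into an RCFile/ORCFile-backed table: values go through the
# table's SerDe, so no delimiter handling is needed and the data is preserved as-is
sqoop import --connect jdbc:mysql://dbhost/sales \
    --username sqoopuser -P \
    --table NOTES \
    --hcatalog-table notes_rc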
Support for static and dynamic partitioning
• HCatalog table partition keys can be static or dynamic
• Static partition keys have their values provided as part of the DML (known at query compile time)
• Dynamic partition keys have their values provided at execution time, based on the value of a column being imported
Support for static and dynamic partitioning
• Both types of partitioning are supported during import
• Multiple partition keys per table are supported
• Only one static partition key can be specified (a Sqoop restriction)
• Only a table with a single partition key can be automatically created
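A sketch of an import into a partitioned HCatalog table, assuming (as in Sqoop 1.4.4) that the single static partition key is supplied through the existing Hive partition options, and that the table txn_part also has a dynamic partition column matching an imported column; every name here is illustrative.

sqoop import --connect jdbc:mysql://dbhost/sales \
    --username sqoopuser -P \
    --table TXN \
    --hcatalog-table txn_part \
    --hive-partition-key country \
    --hive-partition-value "US"
# "country" is the static key: every imported row lands under country=US.
# Any other partition column of txn_part (e.g. a hypothetical "state" column)
# is filled dynamically from the corresponding imported column values.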
Benefits
• Future-proof your Sqoop jobs by making them agnostic of the file formats used
• Remove additional steps before taking data to the target table format
• Preserve data contents
Availability & Documentation • Part of Sqoop 1.4.4 release • A chapter devoted to HCatalog integration in the User Guide • URL: https://sqoop.apache.org/docs/1.4.4/SqoopUserGuide.html#_sqoop_hcatalog_integration