Sqoop 2 Introduction
Mengwei Ding, Software Engineer Intern at Cloudera
What is Sqoop
• Apache Top-Level Project
• SQL + Hadoop (SQl and hadOOP)
• Transfers large volumes of data in bulk
  • From relational data warehouses: Teradata, MySQL, PostgreSQL, Oracle, Netezza
  • To the Hadoop ecosystem: HDFS, Hive, HBase, Avro
  • And vice versa
• Sqoop 1 (1.4.3) and Sqoop 2 (1.99.2)
Sqoop 1 Challenges
• Command-line tool, configured through command-line arguments (60+!)
• Connector-driven:
  • Connectors are responsible for both metadata lookups and data transfer
  • JDBC vocabulary enforced (--connect)
  • Implicit connector selection
  • Non-uniform, duplicated functionality
• Client accesses Hadoop configurations and databases directly
• Security concerns:
  • Client needs to know the database credentials
• Type mapping is not clearly defined
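To make the Sqoop 1 style concrete, here is a small sketch of an import invocation assembled as an argument list. The flags shown are standard Sqoop 1 import options; the host, database, table, and paths are placeholder values, and `implied_connector` is a hypothetical helper that only imitates, in simplified form, how the connector is implied by the JDBC URL rather than chosen explicitly.

```python
# Illustrative only: a Sqoop 1 style invocation as an argument list.
# Host, database, table, and paths are placeholders, not real systems.
sqoop1_import = [
    "sqoop", "import",
    "--connect", "jdbc:mysql://db.example.com/sales",  # JDBC URL, also implies the connector
    "--username", "etl_user",
    "--password-file", "/user/etl/.db-password",       # the client side holds the credential
    "--table", "orders",
    "--target-dir", "/data/sales/orders",
    "--num-mappers", "4",
]


def implied_connector(jdbc_url: str) -> str:
    """Hypothetical helper: derive the connector from the JDBC URL scheme."""
    # "jdbc:mysql://host/db" -> "mysql"
    return jdbc_url.split(":")[1]


connect_url = sqoop1_import[sqoop1_import.index("--connect") + 1]
print(implied_connector(connect_url))
```

Even this short command mixes connection details, credentials, and per-table settings in one flat list, which is the pattern the Sqoop 2 design sets out to break up.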
Sqoop 2 - Design Goals
• Same goal: transfer data around
• Ease of Use
  • Sqoop as a Service
  • Domain-specific interactions without too many arguments
• Ease of Extension
  • No low-level Hadoop knowledge needed
  • Uniform functionality of connectors, no functional overlap between connectors
• Security and Separation of Concerns
  • Role-based access and use
Sqoop 2 - Connection vs Job Metadata
• There are two distinct sets of options
  • Connection (distinct per database)
  • Job (distinct per table)
Sqoop 2 - Connection vs Job Metadata
• Arguments also fall into two further distinct sets
  • Connector-specific
  • Shared across all connectors
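The two splits described on these slides (connection vs job, connector-specific vs shared) can be pictured with a small data model. This is a sketch under stated assumptions, not Sqoop 2's actual metadata classes: the class names, field names, and option keys below are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Connection:
    """Options defined once per database (hypothetical field names)."""
    name: str
    connector: str
    connector_options: dict = field(default_factory=dict)  # connector-specific
    framework_options: dict = field(default_factory=dict)  # shared across all connectors


@dataclass
class Job:
    """Options defined per table; refers to a Connection by name."""
    name: str
    connection: str
    connector_options: dict = field(default_factory=dict)  # connector-specific
    framework_options: dict = field(default_factory=dict)  # shared across all connectors


sales_db = Connection(
    name="sales-db",
    connector="generic-jdbc",
    connector_options={"jdbc_url": "jdbc:mysql://db.example.com/sales",
                       "username": "etl_user"},
    framework_options={"max_connections": 10},
)

# Two jobs (one per table) reuse the same connection definition.
jobs = [
    Job("import-orders", sales_db.name,
        {"table": "orders"}, {"output_dir": "/data/sales/orders"}),
    Job("import-customers", sales_db.name,
        {"table": "customers"}, {"output_dir": "/data/sales/customers"}),
]

print([job.connection for job in jobs])
```

The point of the split is reuse: database-level details are entered once, and each job only adds what is specific to its table.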
Sqoop 2 - Security
• Support for secure access to external systems via role-based access to connection objects
  • Administrators create/edit/delete connections
  • Operators use connections
• Connections encapsulate credentials
• A connection is created once, then reused later
  • Created by an administrator and used by an operator, which keeps the credentials away from the end user
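The role split above can be sketched as a tiny in-memory store. This is only an illustration of the idea, assuming two roles named "admin" and "operator"; `ConnectionStore` and its methods are hypothetical and do not correspond to Sqoop 2's real API.

```python
class ConnectionStore:
    """Hypothetical in-memory store illustrating role-based access to connections."""

    def __init__(self):
        self._connections = {}

    def create(self, role, name, jdbc_url, username, password):
        # Only administrators may create (or edit/delete) connection objects.
        if role != "admin":
            raise PermissionError("only administrators may create connections")
        self._connections[name] = {
            "jdbc_url": jdbc_url,
            "username": username,
            "password": password,
        }

    def describe(self, role, name):
        # Operators can see which connection they use, but never the password.
        if role not in ("admin", "operator"):
            raise PermissionError("unknown role")
        conn = self._connections[name]
        return {"name": name,
                "jdbc_url": conn["jdbc_url"],
                "username": conn["username"]}


store = ConnectionStore()
store.create("admin", "sales-db",
             "jdbc:mysql://db.example.com/sales", "etl_user", "s3cret")

visible = store.describe("operator", "sales-db")
print("password" in visible)

try:
    store.create("operator", "hr-db",
                 "jdbc:mysql://db.example.com/hr", "hr_user", "pw")
except PermissionError as err:
    print(err)
```

Because the operator only ever refers to the connection by name, the credential stays on the server side with whoever created the connection.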
Sqoop 2 - Resource Management
• Connections allow specification of a resource policy
  • An administrator can limit the total number of physical connections open at one time
  • Connections can be disabled
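A resource policy of this kind could look roughly like the following. It is a minimal sketch, assuming a simple counter per connection; `ManagedConnection`, `max_open`, and the method names are hypothetical and not taken from Sqoop 2.

```python
class ManagedConnection:
    """Hypothetical connection object carrying a simple resource policy."""

    def __init__(self, name, max_open):
        self.name = name
        self.max_open = max_open  # administrator-defined cap on physical connections
        self.open_count = 0
        self.enabled = True

    def acquire(self):
        """Try to open one more physical connection; return True on success."""
        if not self.enabled:
            raise RuntimeError(f"connection '{self.name}' is disabled")
        if self.open_count >= self.max_open:
            return False  # cap reached, caller must wait or give up
        self.open_count += 1
        return True

    def release(self):
        """Close one physical connection, if any are open."""
        if self.open_count > 0:
            self.open_count -= 1


conn = ManagedConnection("sales-db", max_open=2)
results = [conn.acquire() for _ in range(3)]
print(results)

conn.enabled = False  # an administrator disables the connection
try:
    conn.acquire()
except RuntimeError as err:
    print(err)
```

Keeping the limit on the connection object means every job that reuses the connection is subject to the same cap.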
Sqoop 2 - Current Status
• Primary focus of the Sqoop community
• Second cut: 1.99.2
• Bits and docs: http://sqoop.apache.org