Cloudera Enterprise 5.16.x | Other versions

Setting Up Apache Sqoop 2 Using the Command Line

Sqoop 2 is a server-based tool designed to transfer data between Hadoop and relational databases. You can use Sqoop 2 to import data from a relational database management system (RDBMS), such as MySQL or Oracle, into the Hadoop Distributed File System (HDFS), transform the data with Hadoop MapReduce, and then export it back into an RDBMS.

  Note: Sqoop 2 is deprecated. Cloudera recommends you use Sqoop 1.

Sqoop 2 has three packaging options for installation:

  • Tarball (.tgz) that contains both the Sqoop 2 server and the client.
  • Separate RPM packages for Sqoop 2 server (sqoop2-server) and client (sqoop2-client)
  • Separate Debian packages for Sqoop 2 server (sqoop2-server) and client (sqoop2-client)

These topics describe the steps to install Sqoop 2.

Feature Differences - Sqoop 1 and Sqoop 2

  Note: Sqoop 2 is being deprecated. Cloudera recommends using Sqoop 1.
Feature Sqoop 1 Sqoop 2
Connectors for all major RDBMS Supported.

Not supported.

Workaround: Use the generic JDBC Connector which has been tested on the following databases: Microsoft SQL Server, PostgreSQL, MySQL and Oracle.

This connector should work on any other JDBC compliant database. However, performance might not be comparable to that of specialized connectors in Sqoop.

Kerberos Security Integration Supported.

Supported.

Data transfer from RDBMS to Hive or HBase Supported.

Not supported.

Workaround: Follow this two-step approach.
  1. Import data from RDBMS into HDFS
  2. Load data into Hive or HBase manually using appropriate tools and commands such as the LOAD DATA statement in Hive
Data transfer from Hive or HBase to RDBMS Not supported.
Workaround: Follow this two-step approach.
  1. Extract data from Hive or HBase into HDFS (either as a text or Avro file)
  2. Use Sqoop to export output of previous step to RDBMS

Not supported.

Follow the same workaround as for Sqoop 1.

Sqoop 2 Prerequisites

  Note: Sqoop 2 is deprecated. Cloudera recommends you use Sqoop 1.
  • An operating system supported by CDH 5.
  • Oracle JDK.
  • Hadoop must be installed on the host that runs the Sqoop 2 server component.

  • Services that you want to use with Sqoop, such as HBase, Hive HCatalog, and Accumulo. Sqoop checks for these services when you run it, and finds services that are installed and configured. It logs warnings for services it does not find. These warnings, shown below, are harmless.
    > Warning: /usr/lib/sqoop/../hbase does not exist! HBase imports will fail.
    > Please set $HBASE_HOME to the root of your HBase installation.
    > Warning: /usr/lib/sqoop/../hive-hcatalog does not exist! HCatalog jobs will fail.
    > Please set $HCAT_HOME to the root of your HCatalog installation.
    > Warning: /usr/lib/sqoop/../accumulo does not exist! Accumulo imports will fail.
    > Please set $ACCUMULO_HOME to the root of your Accumulo installation. 

Installing Sqoop 2

Sqoop 2 is distributed as two separate packages: a client package (sqoop2-client) and a server package (sqoop2-server). Install the server package on one host in the cluster; because the Sqoop 2 server acts as a MapReduce client, this host must have Hadoop installed and configured.

Install the client package on each host that acts as a client. A Sqoop 2 client always connects to the Sqoop 2 server to perform any actions, so Hadoop does not need to be installed on the client hosts.

Depending on what you are planning to install, choose the appropriate package and install it using your preferred package manager application.

  Note: The Sqoop 2 packages cannot be installed on the same machines as Sqoop1 packages. However you can use both versions in the same Hadoop cluster by installing Sqoop1 and Sqoop 2 on different hosts.

To install the Sqoop 2 server package on a RHEL-compatible system:

  Note: Install Cloudera Repository
Before using the instructions on this page to install or upgrade:
  • Install the Cloudera yum, zypper/YaST or apt repository.
  • Install or upgrade CDH 5 and make sure it is functioning correctly.
For instructions, see Installing and Deploying Unmanaged CDH Using the Command Line and Upgrading Unmanaged CDH Using the Command Line.
$ sudo yum install sqoop2-server

To install the Sqoop 2 client package on a RHEL-compatible system:

$ sudo yum install sqoop2-client

To install the Sqoop 2 server package on a SLES system:

$ sudo zypper install sqoop2-server

To install the Sqoop 2 client package on a SLES system:

$ sudo zypper install sqoop2-client

To install the Sqoop 2 server package on an Ubuntu or Debian system:

$ sudo apt-get install sqoop2-server

To install the Sqoop 2 client package on an Ubuntu or Debian system:

$ sudo apt-get install sqoop2-client
  Note:

Installing the sqoop2-server package creates a sqoop-server service configured to start Sqoop 2 at system startup time.

Configuring Sqoop 2

This section explains how to configure the Sqoop 2 server.

  Note: Sqoop 2 is deprecated. Cloudera recommends you use Sqoop 1.

Configuring which Hadoop Version to Use

The Sqoop 2 client does not interact directly with Hadoop MapReduce, and so it does not require any MapReduce configuration.

The Sqoop 2 server can work with either MRv1 or YARN. It cannot work with both simultaneously.You set the MapReduce version the Sqoop 2 server works with by means of the alternatives command (or update-alternatives, depending on your operating system):
  • To use YARN:
    alternatives --set sqoop2-tomcat-conf /etc/sqoop2/tomcat-conf.dist
  • To use MRv1:
    alternatives --set sqoop2-tomcat-conf /etc/sqoop2/tomcat-conf.mr1
  Important: If you are upgrading from a release earlier than CDH 5 Beta 2

In earlier releases, the mechanism for setting the MapReduce version was the CATALINA_BASEvariable in the /etc/defaults/sqoop2-server file. This does not work as of CDH 5 Beta 2, and in fact could cause problems. Check your /etc/defaults/sqoop2-server file and make sure CATALINA_BASE is not set.

Configuring Sqoop 2 to Use PostgreSQL instead of Apache Derby

Deciding which Database to Use

Sqoop 2 has a built-in Derby database, but Cloudera recommends that you use a PostgreSQL database instead, for the following reasons:
  • Derby runs in embedded mode and it is not possible to monitor its health.
  • Though it might be possible, Cloudera currently has no live backup strategy for the embedded Derby database.
  • Under load, Cloudera has observed locks and rollbacks with the embedded Derby database that do not happen with server-based databases.
See CDH and Cloudera Manager Supported Databases for tested database versions.
  Note:

Cloudera currently has no recommended way to migrate data from an existing Derby database into the new PostgreSQL database.

Use the procedure that follows to configure Sqoop 2 to use PostgreSQL instead of Apache Derby.

  1. Install PostgreSQL 8.4.x or 9.0.x
  2. Create the Sqoop User and Sqoop Database
  3. Stop the Sqoop 2 Server
  4. Configure Sqoop 2 to use PostgreSQL
  5. Restart the Sqoop 2 Server
Install PostgreSQL 8.4.x or 9.0.x

See Install and Configure PostgreSQL for Cloudera Software.

Create the Sqoop User and Sqoop Database

For example, using the PostgreSQL psql command-line tool:

$ psql -U postgres
Password for user postgres: *****

postgres=# CREATE ROLE sqoop LOGIN ENCRYPTED PASSWORD 'sqoop'
 NOSUPERUSER INHERIT CREATEDB NOCREATEROLE;
CREATE ROLE

postgres=# CREATE DATABASE "sqoop" WITH OWNER = sqoop
 ENCODING = 'UTF8'
 TABLESPACE = pg_default
 LC_COLLATE = 'en_US.UTF8'
 LC_CTYPE = 'en_US.UTF8'
 CONNECTION LIMIT = -1;
CREATE DATABASE

postgres=# \q
Stop the Sqoop 2 Server
$ sudo /sbin/service sqoop2-server stop
Configure Sqoop 2 to use PostgreSQL

Edit the sqoop.properties file (normally /etc/sqoop2/conf) as follows:

org.apache.sqoop.repository.jdbc.handler=org.apache.sqoop.repository.postgresql.PostgresqlRepositoryHandler
org.apache.sqoop.repository.jdbc.transaction.isolation=isolation level
org.apache.sqoop.repository.jdbc.maximum.connections=max connections
org.apache.sqoop.repository.jdbc.url=jdbc URL
org.apache.sqoop.repository.jdbc.driver=org.postgresql.Driver
org.apache.sqoop.repository.jdbc.user=username
org.apache.sqoop.repository.jdbc.password=password
org.apache.sqoop.repository.jdbc.properties.property=value
  Note:
  • Replace isolation level with a value such as READ_COMMITTED.
  • Replace max connections with a value such as 10.
  • Replace jdbc URL with the hostname on which you installed PostgreSQL.
  • Replace username with (in this example) sqoop
  • Replace password with (in this example) sqoop
  • Use org.apache.sqoop.repository.jdbc.properties.property to set each additional property you want to configure; see https://jdbc.postgresql.org/documentation/head/connect.html for details. For example, replace property with loglevel and value with 3
Restart the Sqoop 2 Server
$ sudo /sbin/service sqoop2-server start

Installing the JDBC Drivers

Sqoop 2 does not ship with third party JDBC drivers. You must download them separately and save them to the /var/lib/sqoop2/ directory on the server. The following sections show how to install the most common JDBC drivers. Once you have installed the JDBC drivers, restart the Sqoop 2 server so that the drivers are loaded.
  Note:

The JDBC drivers need to be installed only on the machine running Sqoop. You do not need to install them on all hosts in your Hadoop cluster.

Installing the MySQL JDBC Driver

Download the MySQL JDBC driver here. You must sign up for an account if you do not already have one, then log in before you can download the driver. Copy it to the /var/lib/sqoop2/ directory. For example:

$ sudo cp mysql-connector-java-version/mysql-connector-java-version-bin.jar /var/lib/sqoop2/
At the time of publication, version was 5.1.31, but the version might change by the time you read this.
  Important:

Make sure you have at least version 5.1.31. Some systems ship with an earlier version that might not work correctly with Sqoop.

Installing the Oracle JDBC Driver

You can download the JDBC Driver from the Oracle website, for example here. You must accept the license agreement before you can download the driver. Download the ojdbc6.jar file and copy it to /var/lib/sqoop2/ directory:

$ sudo cp ojdbc6.jar /var/lib/sqoop2/

Installing the Microsoft SQL Server JDBC Driver

Download the Microsoft SQL Server JDBC driver here and copy it to the /var/lib/sqoop2/ directory. For example:

$ curl -L 'http://download.microsoft.com/download/0/2/A/02AAE597-3865-456C-AE7F-613F99F850A8/sqljdbc_4.0.2206.100_enu.tar.gz' | tar xz
$ sudo cp sqljdbc_4.0/enu/sqljdbc4.jar /var/lib/sqoop2/

Installing the PostgreSQL JDBC Driver

Download the PostgreSQL JDBC driver here and copy it to the /var/lib/sqoop2/ directory. For example:

$ curl -L 'http://jdbc.postgresql.org/download/postgresql-9.2-1002.jdbc4.jar' -o postgresql-9.2-1002.jdbc4.jar
$ sudo cp postgresql-9.2-1002.jdbc4.jar /var/lib/sqoop2/

Syntax for Configuring JDBC Connection Strings

These are the JDBC connection strings for supported databases.

MySql Connection String

Syntax:

jdbc:mysql://<HOST>:<PORT>/<DATABASE_NAME>

Example:

jdbc:mysql://my_mysql_server_hostname:3306/my_database_name
Oracle Connection String

Syntax:

jdbc:oracle:thin:@<HOST>:<PORT>:<DATABASE_NAME>

Example:

jdbc:oracle:thin:@my_oracle_server_hostname:1521:my_database_name
PostgreSQL Connection String

Syntax:

jdbc:postgresql://<HOST>:<PORT>/<DATABASE_NAME>

Example:

jdbc:postgresql://my_postgres_server_hostname:5432/my_database_name
Netezza Connection String

Syntax:

jdbc:netezza://<HOST>:<PORT>/<DATABASE_NAME>
Example:
jdbc:netezza://my_netezza_server_hostname:5480/my_database_name
Teradata Connection String

Syntax:

jdbc:teradata://<HOST>/DBS_PORT=1025/DATABASE=<DATABASE_NAME>
Example:
jdbc:teradata://my_teradata_server_hostname/DBS_PORT=1025/DATABASE=my_database_name

Starting, Stopping, and Accessing the Sqoop 2 Server

Starting the Sqoop 2 Server

  Note: Sqoop 2 is deprecated and will be removed from CDH in a future release. Cloudera recommends using Sqoop 1.

After you have completed all of the required configuration steps, you can start Sqoop 2 server:

$ sudo /sbin/service sqoop2-server start

Stopping the Sqoop 2 Server

$ sudo /sbin/service sqoop2-server stop

Checking that the Sqoop 2 Server has Started

You can verify whether the server has started correctly by connecting to its HTTP interface. The simplest way is to get the server version using following command:

$ wget -qO - localhost:12000/sqoop/version

You should get a text fragment in JSON format similar to the following:

{"version":"1.99.2-cdh5.0.0",...}

Accessing the Sqoop 2 Server with the Sqoop 2 Client

Start the Sqoop 2 client:

sqoop2

Identify the host where your server is running (we will use localhost in this example):

sqoop:000> set server --host localhost

Test the connection by running the command show version --all to obtain the version number from server. You should see output similar to the following:

sqoop:000> show version --all
server version:
  Sqoop 1.99.2-cdh5.0.0 revision ...
  Compiled by jenkins on ...
client version:
  Sqoop 1.99.2-cdh5.0.0 revision ...
  Compiled by jenkins on ...
Protocol version:
  [1]

Viewing the Sqoop 2 Documentation

For more information about Sqoop 2, see Highlights of Sqoop 2 and https://archive.cloudera.com/cdh5/cdh/5/sqoop2.
  Note: Sqoop 2 is deprecated. Cloudera recommends you use Sqoop 1.
Page generated October 24, 2018.