Setting Up Apache Sqoop 2 Using the Command Line
Sqoop 2 is a server-based tool designed to transfer data between Hadoop and relational databases. You can use Sqoop 2 to import data from a relational database management system (RDBMS), such as MySQL or Oracle, into the Hadoop Distributed File System (HDFS), transform the data with Hadoop MapReduce, and then export it back into an RDBMS.

Sqoop 2 has three packaging options for installation:
- Tarball (.tgz) that contains both the Sqoop 2 server and the client.
- Separate RPM packages for Sqoop 2 server (sqoop2-server) and client (sqoop2-client)
- Separate Debian packages for Sqoop 2 server (sqoop2-server) and client (sqoop2-client)
These topics describe the steps to install Sqoop 2.
Feature Differences - Sqoop 1 and Sqoop 2

Feature | Sqoop 1 | Sqoop 2 |
---|---|---|
Connectors for all major RDBMS | Supported. |
Not supported. Workaround: Use the generic JDBC Connector which has been tested on the following databases: Microsoft SQL Server, PostgreSQL, MySQL and Oracle. This connector should work on any other JDBC compliant database. However, performance might not be comparable to that of specialized connectors in Sqoop. |
Kerberos Security Integration | Supported. |
Supported. |
Data transfer from RDBMS to Hive or HBase | Supported. |
Not supported. Workaround: Follow this two-step approach.
|
Data transfer from Hive or HBase to RDBMS | Not supported.
Workaround: Follow this two-step approach.
|
Not supported. Follow the same workaround as for Sqoop 1. |
Sqoop 2 Prerequisites

- An operating system supported by CDH 5.
- Oracle JDK.
-
Hadoop must be installed on the host that runs the Sqoop 2 server component.
- Services that you want to use with Sqoop, such as HBase, Hive HCatalog, and Accumulo. Sqoop checks for these services when you run it, and finds services that are installed and
configured. It logs warnings for services it does not find. These warnings, shown below, are harmless.
> Warning: /usr/lib/sqoop/../hbase does not exist! HBase imports will fail. > Please set $HBASE_HOME to the root of your HBase installation. > Warning: /usr/lib/sqoop/../hive-hcatalog does not exist! HCatalog jobs will fail. > Please set $HCAT_HOME to the root of your HCatalog installation. > Warning: /usr/lib/sqoop/../accumulo does not exist! Accumulo imports will fail. > Please set $ACCUMULO_HOME to the root of your Accumulo installation.
Installing Sqoop 2
Sqoop 2 is distributed as two separate packages: a client package (sqoop2-client) and a server package (sqoop2-server). Install the server package on one host in the cluster; because the Sqoop 2 server acts as a MapReduce client, this host must have Hadoop installed and configured.
Install the client package on each host that acts as a client. A Sqoop 2 client always connects to the Sqoop 2 server to perform any actions, so Hadoop does not need to be installed on the client hosts.
Depending on what you are planning to install, choose the appropriate package and install it using your preferred package manager application.

To install the Sqoop 2 server package on a RHEL-compatible system:

- Install the Cloudera yum, zypper/YaST or apt repository.
- Install or upgrade CDH 5 and make sure it is functioning correctly.
$ sudo yum install sqoop2-server
To install the Sqoop 2 client package on a RHEL-compatible system:
$ sudo yum install sqoop2-client
To install the Sqoop 2 server package on a SLES system:
$ sudo zypper install sqoop2-server
To install the Sqoop 2 client package on a SLES system:
$ sudo zypper install sqoop2-client
To install the Sqoop 2 server package on an Ubuntu or Debian system:
$ sudo apt-get install sqoop2-server
To install the Sqoop 2 client package on an Ubuntu or Debian system:
$ sudo apt-get install sqoop2-client

Installing the sqoop2-server package creates a sqoop-server service configured to start Sqoop 2 at system startup time.
Configuring Sqoop 2
This section explains how to configure the Sqoop 2 server.

Configuring which Hadoop Version to Use
The Sqoop 2 client does not interact directly with Hadoop MapReduce, and so it does not require any MapReduce configuration.
- To use YARN:
alternatives --set sqoop2-tomcat-conf /etc/sqoop2/tomcat-conf.dist
- To use MRv1:
alternatives --set sqoop2-tomcat-conf /etc/sqoop2/tomcat-conf.mr1

In earlier releases, the mechanism for setting the MapReduce version was the CATALINA_BASEvariable in the /etc/defaults/sqoop2-server file. This does not work as of CDH 5 Beta 2, and in fact could cause problems. Check your /etc/defaults/sqoop2-server file and make sure CATALINA_BASE is not set.
Configuring Sqoop 2 to Use PostgreSQL instead of Apache Derby
Deciding which Database to Use
- Derby runs in embedded mode and it is not possible to monitor its health.
- Though it might be possible, Cloudera currently has no live backup strategy for the embedded Derby database.
- Under load, Cloudera has observed locks and rollbacks with the embedded Derby database that do not happen with server-based databases.

Cloudera currently has no recommended way to migrate data from an existing Derby database into the new PostgreSQL database.
Use the procedure that follows to configure Sqoop 2 to use PostgreSQL instead of Apache Derby.
- Install PostgreSQL 8.4.x or 9.0.x
- Create the Sqoop User and Sqoop Database
- Stop the Sqoop 2 Server
- Configure Sqoop 2 to use PostgreSQL
- Restart the Sqoop 2 Server
Install PostgreSQL 8.4.x or 9.0.x
Create the Sqoop User and Sqoop Database
For example, using the PostgreSQL psql command-line tool:
$ psql -U postgres Password for user postgres: ***** postgres=# CREATE ROLE sqoop LOGIN ENCRYPTED PASSWORD 'sqoop' NOSUPERUSER INHERIT CREATEDB NOCREATEROLE; CREATE ROLE postgres=# CREATE DATABASE "sqoop" WITH OWNER = sqoop ENCODING = 'UTF8' TABLESPACE = pg_default LC_COLLATE = 'en_US.UTF8' LC_CTYPE = 'en_US.UTF8' CONNECTION LIMIT = -1; CREATE DATABASE postgres=# \q
Stop the Sqoop 2 Server
$ sudo /sbin/service sqoop2-server stop
Configure Sqoop 2 to use PostgreSQL
Edit the sqoop.properties file (normally /etc/sqoop2/conf) as follows:
org.apache.sqoop.repository.jdbc.handler=org.apache.sqoop.repository.postgresql.PostgresqlRepositoryHandler org.apache.sqoop.repository.jdbc.transaction.isolation=isolation level org.apache.sqoop.repository.jdbc.maximum.connections=max connections org.apache.sqoop.repository.jdbc.url=jdbc URL org.apache.sqoop.repository.jdbc.driver=org.postgresql.Driver org.apache.sqoop.repository.jdbc.user=username org.apache.sqoop.repository.jdbc.password=password org.apache.sqoop.repository.jdbc.properties.property=value

- Replace isolation level with a value such as READ_COMMITTED.
- Replace max connections with a value such as 10.
- Replace jdbc URL with the hostname on which you installed PostgreSQL.
- Replace username with (in this example) sqoop
- Replace password with (in this example) sqoop
- Use org.apache.sqoop.repository.jdbc.properties.property to set each additional property you want to configure; see https://jdbc.postgresql.org/documentation/head/connect.html for details. For example, replace property with loglevel and value with 3
Restart the Sqoop 2 Server
$ sudo /sbin/service sqoop2-server start
Installing the JDBC Drivers

The JDBC drivers need to be installed only on the machine running Sqoop. You do not need to install them on all hosts in your Hadoop cluster.
Installing the MySQL JDBC Driver
Download the MySQL JDBC driver here. You must sign up for an account if you do not already have one, then log in before you can download the driver. Copy it to the /var/lib/sqoop2/ directory. For example:
$ sudo cp mysql-connector-java-version/mysql-connector-java-version-bin.jar /var/lib/sqoop2/At the time of publication, version was 5.1.31, but the version might change by the time you read this.

Make sure you have at least version 5.1.31. Some systems ship with an earlier version that might not work correctly with Sqoop.
Installing the Oracle JDBC Driver
You can download the JDBC Driver from the Oracle website, for example here. You must accept the license agreement before you can download the driver. Download the ojdbc6.jar file and copy it to /var/lib/sqoop2/ directory:
$ sudo cp ojdbc6.jar /var/lib/sqoop2/
Installing the Microsoft SQL Server JDBC Driver
Download the Microsoft SQL Server JDBC driver here and copy it to the /var/lib/sqoop2/ directory. For example:
$ curl -L 'http://download.microsoft.com/download/0/2/A/02AAE597-3865-456C-AE7F-613F99F850A8/sqljdbc_4.0.2206.100_enu.tar.gz' | tar xz $ sudo cp sqljdbc_4.0/enu/sqljdbc4.jar /var/lib/sqoop2/
Installing the PostgreSQL JDBC Driver
Download the PostgreSQL JDBC driver here and copy it to the /var/lib/sqoop2/ directory. For example:
$ curl -L 'http://jdbc.postgresql.org/download/postgresql-9.2-1002.jdbc4.jar' -o postgresql-9.2-1002.jdbc4.jar $ sudo cp postgresql-9.2-1002.jdbc4.jar /var/lib/sqoop2/
Syntax for Configuring JDBC Connection Strings
These are the JDBC connection strings for supported databases.
MySql Connection String
Syntax:
jdbc:mysql://<HOST>:<PORT>/<DATABASE_NAME>
Example:
jdbc:mysql://my_mysql_server_hostname:3306/my_database_name
Oracle Connection String
Syntax:
jdbc:oracle:thin:@<HOST>:<PORT>:<DATABASE_NAME>
Example:
jdbc:oracle:thin:@my_oracle_server_hostname:1521:my_database_name
PostgreSQL Connection String
Syntax:
jdbc:postgresql://<HOST>:<PORT>/<DATABASE_NAME>
Example:
jdbc:postgresql://my_postgres_server_hostname:5432/my_database_name
Netezza Connection String
Syntax:
jdbc:netezza://<HOST>:<PORT>/<DATABASE_NAME>Example:
jdbc:netezza://my_netezza_server_hostname:5480/my_database_name
Teradata Connection String
Syntax:
jdbc:teradata://<HOST>/DBS_PORT=1025/DATABASE=<DATABASE_NAME>
jdbc:teradata://my_teradata_server_hostname/DBS_PORT=1025/DATABASE=my_database_name
Starting, Stopping, and Accessing the Sqoop 2 Server
Starting the Sqoop 2 Server

After you have completed all of the required configuration steps, you can start Sqoop 2 server:
$ sudo /sbin/service sqoop2-server start
Stopping the Sqoop 2 Server
$ sudo /sbin/service sqoop2-server stop
Checking that the Sqoop 2 Server has Started
You can verify whether the server has started correctly by connecting to its HTTP interface. The simplest way is to get the server version using following command:
$ wget -qO - localhost:12000/sqoop/version
You should get a text fragment in JSON format similar to the following:
{"version":"1.99.2-cdh5.0.0",...}
Accessing the Sqoop 2 Server with the Sqoop 2 Client
Start the Sqoop 2 client:
sqoop2
Identify the host where your server is running (we will use localhost in this example):
sqoop:000> set server --host localhost
Test the connection by running the command show version --all to obtain the version number from server. You should see output similar to the following:
sqoop:000> show version --all server version: Sqoop 1.99.2-cdh5.0.0 revision ... Compiled by jenkins on ... client version: Sqoop 1.99.2-cdh5.0.0 revision ... Compiled by jenkins on ... Protocol version: [1]
Viewing the Sqoop 2 Documentation

<< Setting Up Apache Sqoop Using the Command Line | ©2016 Cloudera, Inc. All rights reserved | Setting Up Apache Whirr Using the Command Line >> |
Terms and Conditions Privacy Policy |