Cloudera Enterprise 5.16.x

Setting Up Apache Pig Using the Command Line

Apache Pig lets you analyze large data sets using its query language, Pig Latin. Pig Latin queries run as distributed jobs on a Hadoop cluster.

Installing Pig

  Note: Install Cloudera Repository
Before using the instructions on this page to install or upgrade:
  • Install the Cloudera yum, zypper/YaST or apt repository.
  • Install or upgrade CDH 5 and make sure it is functioning correctly.
For instructions, see Installing and Deploying Unmanaged CDH Using the Command Line and Upgrading Unmanaged CDH Using the Command Line.

To install Pig on RHEL-compatible systems:

$ sudo yum install pig

To install Pig on SLES systems:

$ sudo zypper install pig

To install Pig on Ubuntu and other Debian systems:

$ sudo apt-get install pig
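
The three install commands above differ only in the package manager. As a minimal sketch, the helper below prints the matching command for whichever package manager is found on the PATH; it does not install anything. The function names `pig_install_command` and `detect_package_manager` are hypothetical and not part of CDH.

```shell
#!/bin/sh
# Hypothetical helper: map a package manager name to the install command
# listed in this section.
pig_install_command() {
  case "$1" in
    yum)     echo "sudo yum install pig" ;;
    zypper)  echo "sudo zypper install pig" ;;
    apt-get) echo "sudo apt-get install pig" ;;
    *)       echo "unsupported package manager: $1" >&2; return 1 ;;
  esac
}

# Hypothetical helper: print the first supported package manager on the PATH.
detect_package_manager() {
  for pm in yum zypper apt-get; do
    if command -v "$pm" >/dev/null 2>&1; then
      echo "$pm"
      return 0
    fi
  done
  return 1
}

# Print (but do not run) the command that applies to this host.
if pm=$(detect_package_manager); then
  pig_install_command "$pm"
else
  echo "No supported package manager (yum, zypper, apt-get) found."
fi
```

Printing the command rather than running it keeps the sketch safe to try on any host; review the output before executing it.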

  Note: Pig automatically uses the active Hadoop configuration (standalone, pseudo-distributed, or distributed). After installing the Pig package, you can start Pig.

To start Pig in interactive mode (YARN)

  Important:
  • For each user who will be submitting MapReduce jobs using MapReduce v2 (YARN), or running Pig, Hive, or Sqoop in a YARN installation, make sure that the HADOOP_MAPRED_HOME environment variable is set correctly, as follows:
    $ export HADOOP_MAPRED_HOME=/usr/lib/hadoop-mapreduce
  • For each user who will be submitting MapReduce jobs using MapReduce v1 (MRv1), or running Pig, Hive, or Sqoop in an MRv1 installation, set the HADOOP_MAPRED_HOME environment variable as follows:
    $ export HADOOP_MAPRED_HOME=/usr/lib/hadoop-0.20-mapreduce
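
The two settings above can be captured in one small shell function, for example in a user's login profile. This is a sketch only: the function name `mapred_home_for` and its `yarn`/`mrv1` arguments are hypothetical, and the paths are the ones given in the list above.

```shell
#!/bin/sh
# Hypothetical helper: print the HADOOP_MAPRED_HOME value for a framework.
mapred_home_for() {
  case "$1" in
    yarn) echo /usr/lib/hadoop-mapreduce ;;
    mrv1) echo /usr/lib/hadoop-0.20-mapreduce ;;
    *)    echo "usage: mapred_home_for yarn|mrv1" >&2; return 1 ;;
  esac
}

# Example: configure the current shell for a YARN installation.
HADOOP_MAPRED_HOME="$(mapred_home_for yarn)"
export HADOOP_MAPRED_HOME
echo "HADOOP_MAPRED_HOME=$HADOOP_MAPRED_HOME"
```

For an MRv1 installation, pass `mrv1` instead of `yarn`.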

To start Pig, use the following command.

$ pig

To start Pig in interactive mode (MRv1)

Use the following command:

$ pig

You should see output similar to the following:

2012-02-08 23:39:41,819 [main] INFO  org.apache.pig.Main - Logging error messages to: /home/arvind/pig-0.11.0-cdh5b1/bin/pig_1328773181817.log
2012-02-08 23:39:41,994 [main] INFO  org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop file system at: hdfs://localhost/
...
grunt>

Examples

To verify that the input and output directories from the YARN or MRv1 example grep job exist, list an HDFS directory from the Grunt Shell:
grunt> ls
hdfs://localhost/user/joe/input <dir>
hdfs://localhost/user/joe/output <dir>

To run a grep-style example job in Pig against the input directory:
grunt> A = LOAD 'input';
grunt> B = FILTER A BY $0 MATCHES '.*dfs[a-z.]+.*';
grunt> DUMP B;
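
To see which lines the FILTER expression keeps without running a job, the same pattern can be tried locally with grep. This is an approximation under two assumptions: the input lines contain no tab characters (so `$0`, the first tab-delimited field under Pig's default loader, is the whole line), and the sample lines below are made up for illustration. Pig's MATCHES operator must match the entire value, which `grep -x` mirrors.

```shell
#!/bin/sh
# Local approximation of: B = FILTER A BY $0 MATCHES '.*dfs[a-z.]+.*';
# -x requires the whole line to match; -E enables extended regular expressions.
filter_dfs_lines() {
  LC_ALL=C grep -xE '.*dfs[a-z.]+.*'
}

# Hypothetical sample input: only lines containing "dfs" followed by at
# least one lowercase letter or dot pass the filter.
printf '%s\n' \
  'dfs.replication' \
  'mapred.job.tracker' \
  'dfs' \
  'DFS.NAME' \
  'hdfs.audit' | filter_dfs_lines
```

With this sample input, only `dfs.replication` and `hdfs.audit` are printed: the bare `dfs` line has nothing after it to satisfy `[a-z.]+`, and the pattern is case-sensitive.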

  Note: To check the status of your job while it is running, look at the ResourceManager web console (YARN) or JobTracker web console (MRv1).

Page generated October 24, 2018.