org.apache.hadoop.zebra.mapreduce
Class TableInputFormat

java.lang.Object
  org.apache.hadoop.mapreduce.InputFormat<org.apache.hadoop.io.BytesWritable,Tuple>
    org.apache.hadoop.zebra.mapreduce.TableInputFormat

public class TableInputFormat
extends org.apache.hadoop.mapreduce.InputFormat<org.apache.hadoop.io.BytesWritable,Tuple>

InputFormat class for reading one or more BasicTables.
Usage Example:

In the main program, add the following code:

    job.setInputFormatClass(TableInputFormat.class);
    TableInputFormat.setInputPaths(jobContext, new Path("path/to/table1"), new Path("path/to/table2"));
    TableInputFormat.setProjection(jobContext, "Name, Salary, BonusPct");

The above code does the following things:

- Make TableInputFormat the InputFormat class for the job.
- Set the two BasicTables as the input; the InputFormat will produce splits on the union of the two tables.
- Set the projection so that only the columns "Name", "Salary", and "BonusPct" are delivered to the Mapper.
Also, implement your Mapper along these lines (K and V stand for the job's map output key and value types):

    static class MyMapClass extends Mapper<BytesWritable, Tuple, K, V> {
      // indices of various fields in the input Tuple.
      int idxName, idxSalary, idxBonusPct;

      @Override
      protected void setup(Context context) throws IOException {
        // determine the field indices from the projected schema.
        Schema projection = TableInputFormat.getSchema(context);
        idxName = projection.getColumnIndex("Name");
        idxSalary = projection.getColumnIndex("Salary");
        idxBonusPct = projection.getColumnIndex("BonusPct");
      }

      @Override
      public void map(BytesWritable key, Tuple value, Context context)
          throws IOException, InterruptedException {
        try {
          String name = (String) value.get(idxName);
          int salary = (Integer) value.get(idxSalary);
          double bonusPct = (Double) value.get(idxBonusPct);
          // do something with the input data
        } catch (ExecException e) {
          e.printStackTrace();
        }
      }
    }
A little bit more explanation on PIG Tuple objects: a Tuple is an ordered list of PIG datum objects. The permitted PIG datum types can be categorized as scalar types and composite types.

Supported scalar types include seven native Java types: Boolean, Byte, Integer, Long, Float, Double, and String, as well as one PIG class called DataByteArray that represents a type-less byte array.

Supported composite types include:

- Map: the same as the Java Map class, with the additional restriction that the key type must be one of the scalar types PIG recognizes, and the value type any of the scalar or composite types PIG understands.
- DataBag: a DataBag is a collection of Tuples.
- Tuple: yes, a Tuple itself can be a datum in another Tuple.
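For concreteness, here is a small, hypothetical sketch of building such a Tuple with the standard PIG TupleFactory and BagFactory APIs; the scalar values mirror the Name/Salary/BonusPct example above, and the class name TupleExample is a placeholder:

    import java.util.HashMap;
    import java.util.Map;

    import org.apache.pig.data.BagFactory;
    import org.apache.pig.data.DataBag;
    import org.apache.pig.data.DataByteArray;
    import org.apache.pig.data.Tuple;
    import org.apache.pig.data.TupleFactory;

    public class TupleExample {
      public static void main(String[] args) throws Exception {
        TupleFactory tf = TupleFactory.getInstance();
        Tuple row = tf.newTuple();

        // scalar datum types
        row.append("John Doe");                                 // String
        row.append(90000);                                      // Integer
        row.append(0.15);                                       // Double
        row.append(new DataByteArray("raw bytes".getBytes()));  // type-less byte array

        // composite datum types
        Map<String, Object> m = new HashMap<String, Object>();
        m.put("dept", "engineering");  // key is a scalar (String); value may be any supported type
        row.append(m);

        DataBag bag = BagFactory.getInstance().newDefaultBag();
        bag.add(tf.newTuple("nested"));                         // a DataBag is a collection of Tuples
        row.append(bag);

        row.append(tf.newTuple(42));                            // a Tuple nested inside another Tuple
        System.out.println(row);
      }
    }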
Nested Class Summary

Modifier and Type | Class
---|---
static class | TableInputFormat.SplitMode
Constructor Summary

TableInputFormat()
Method Summary

Modifier and Type | Method and Description
---|---
org.apache.hadoop.mapreduce.RecordReader<org.apache.hadoop.io.BytesWritable,Tuple> | createRecordReader(org.apache.hadoop.mapreduce.InputSplit split, org.apache.hadoop.mapreduce.TaskAttemptContext taContext)
static TableRecordReader | createTableRecordReader(org.apache.hadoop.mapreduce.JobContext jobContext, String projection) Get a TableRecordReader on a single split.
static String | getProjection(org.apache.hadoop.mapreduce.JobContext jobContext) Get the projection from the JobContext.
static Schema | getSchema(org.apache.hadoop.mapreduce.JobContext jobContext) Get the schema of a table expression.
static org.apache.hadoop.io.WritableComparable<?> | getSortedTableSplitComparable(org.apache.hadoop.mapreduce.InputSplit inputSplit) Get a comparable object from the given InputSplit object.
static SortInfo | getSortInfo(org.apache.hadoop.mapreduce.JobContext jobContext) Get the SortInfo object regarding a Zebra table.
List<org.apache.hadoop.mapreduce.InputSplit> | getSplits(org.apache.hadoop.mapreduce.JobContext jobContext)
static void | requireSortedTable(org.apache.hadoop.mapreduce.JobContext jobContext, ZebraSortInfo sortInfo) Deprecated.
static void | setInputPaths(org.apache.hadoop.mapreduce.JobContext jobContext, org.apache.hadoop.fs.Path... paths) Set the paths to the input tables.
static void | setMinSplitSize(org.apache.hadoop.mapreduce.JobContext jobContext, long minSize) Set the minimum split size.
static void | setProjection(org.apache.hadoop.mapreduce.JobContext jobContext, String projection) Deprecated. Use setProjection(JobContext, ZebraProjection) instead.
static void | setProjection(org.apache.hadoop.mapreduce.JobContext jobContext, ZebraProjection projection) Set the input projection in the JobContext object.
static void | setSplitMode(org.apache.hadoop.mapreduce.JobContext jobContext, TableInputFormat.SplitMode sm, ZebraSortInfo sortInfo)
void | validateInput(org.apache.hadoop.mapreduce.JobContext jobContext) Deprecated.
Methods inherited from class java.lang.Object

clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
Constructor Detail

TableInputFormat

public TableInputFormat()
Method Detail

setInputPaths

public static void setInputPaths(org.apache.hadoop.mapreduce.JobContext jobContext, org.apache.hadoop.fs.Path... paths)

Set the paths to the input tables.

Parameters:
jobContext - JobContext object.
paths - one or more paths to BasicTables. The InputFormat class will produce splits on the "union" of these BasicTables.

getSchema

public static Schema getSchema(org.apache.hadoop.mapreduce.JobContext jobContext) throws IOException

Get the schema of a table expression.

Parameters:
jobContext - JobContext object.
Throws:
IOException
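For illustration, a minimal driver-side sketch; the job name and table paths are placeholders, and since org.apache.hadoop.mapreduce.Job extends JobContext it can be passed to both calls:

    Job job = new Job(new Configuration(), "zebra-read");
    TableInputFormat.setInputPaths(job, new Path("path/to/table1"), new Path("path/to/table2"));
    // Inspect the schema of the input table expression before wiring up the Mapper.
    Schema schema = TableInputFormat.getSchema(job);
    System.out.println("input schema: " + schema);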
setProjection

public static void setProjection(org.apache.hadoop.mapreduce.JobContext jobContext, String projection) throws ParseException

Deprecated. Use setProjection(JobContext, ZebraProjection) instead.

Parameters:
jobContext - JobContext object.
projection - A comma-separated list of column names. To select all columns, pass projection == null. The syntax of the projection conforms to the Schema string.
Throws:
ParseException
setProjection

public static void setProjection(org.apache.hadoop.mapreduce.JobContext jobContext, ZebraProjection projection) throws ParseException

Set the input projection in the JobContext object.

Parameters:
jobContext - JobContext object.
projection - A ZebraProjection carrying a comma-separated list of column names. To select all columns, pass projection == null. The syntax of the projection conforms to the Schema string.
Throws:
ParseException
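A minimal sketch of this overload, assuming ZebraProjection exposes a createZebraProjection(String) factory as in the other Zebra mapreduce classes (verify against your Zebra version):

    // job is the driver's Job (a JobContext); the column list syntax follows the Schema string.
    ZebraProjection projection = ZebraProjection.createZebraProjection("Name, Salary, BonusPct");
    TableInputFormat.setProjection(job, projection);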
getProjection

public static String getProjection(org.apache.hadoop.mapreduce.JobContext jobContext) throws IOException, ParseException

Get the projection from the JobContext.

Parameters:
jobContext - The JobContext object.
Throws:
IOException
ParseException
getSortInfo

public static SortInfo getSortInfo(org.apache.hadoop.mapreduce.JobContext jobContext) throws IOException

Get the SortInfo object regarding a Zebra table.

Parameters:
jobContext - JobContext object.
Throws:
IOException
requireSortedTable

public static void requireSortedTable(org.apache.hadoop.mapreduce.JobContext jobContext, ZebraSortInfo sortInfo) throws IOException

Deprecated.

Parameters:
jobContext - JobContext object.
sortInfo - ZebraSortInfo object containing sorting information.
Throws:
IOException
setSplitMode

public static void setSplitMode(org.apache.hadoop.mapreduce.JobContext jobContext, TableInputFormat.SplitMode sm, ZebraSortInfo sortInfo) throws IOException

Parameters:
jobContext - JobContext object.
sm - Split mode: unsorted, globally sorted, or locally sorted. The default is unsorted.
sortInfo - ZebraSortInfo object containing sorting information. Ignored if the split mode is null or unsorted.
Throws:
IOException
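For example, a driver that wants globally ordered splits over a sorted table might do the following. This is a hedged sketch: the enum constant GLOBALLY_SORTED and the ZebraSortInfo.createZebraSortInfo(sortColumns, comparatorClass) factory are assumptions to be checked against TableInputFormat.SplitMode and ZebraSortInfo:

    // Ask for globally sorted splits, sorted on the "Name" column (no custom comparator).
    ZebraSortInfo sortInfo = ZebraSortInfo.createZebraSortInfo("Name", null);
    TableInputFormat.setSplitMode(job, TableInputFormat.SplitMode.GLOBALLY_SORTED, sortInfo);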
createRecordReader

public org.apache.hadoop.mapreduce.RecordReader<org.apache.hadoop.io.BytesWritable,Tuple> createRecordReader(org.apache.hadoop.mapreduce.InputSplit split, org.apache.hadoop.mapreduce.TaskAttemptContext taContext) throws IOException, InterruptedException

Specified by:
createRecordReader in class org.apache.hadoop.mapreduce.InputFormat<org.apache.hadoop.io.BytesWritable,Tuple>
Throws:
IOException
InterruptedException
See Also:
InputFormat.createRecordReader(InputSplit, TaskAttemptContext)
createTableRecordReader

public static TableRecordReader createTableRecordReader(org.apache.hadoop.mapreduce.JobContext jobContext, String projection) throws IOException, ParseException, InterruptedException

Get a TableRecordReader on a single split.

Parameters:
jobContext - JobContext object.
projection - Comma-separated column names in the projection; null means all columns in the projection.
Throws:
IOException
ParseException
InterruptedException
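A hedged sketch of driving such a reader by hand, using the standard org.apache.hadoop.mapreduce.RecordReader protocol; whether the returned reader needs an explicit initialize() call first is not stated here:

    // A null projection would mean all columns; here we read two columns.
    TableRecordReader reader = TableInputFormat.createTableRecordReader(job, "Name, Salary");
    try {
      while (reader.nextKeyValue()) {
        BytesWritable rowKey = reader.getCurrentKey();
        Tuple row = reader.getCurrentValue();
        // process row ...
      }
    } finally {
      reader.close();
    }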
setMinSplitSize

public static void setMinSplitSize(org.apache.hadoop.mapreduce.JobContext jobContext, long minSize)

Set the minimum split size.

Parameters:
jobContext - The JobContext object.
minSize - Minimum split size.

getSplits

public List<org.apache.hadoop.mapreduce.InputSplit> getSplits(org.apache.hadoop.mapreduce.JobContext jobContext) throws IOException

Specified by:
getSplits in class org.apache.hadoop.mapreduce.InputFormat<org.apache.hadoop.io.BytesWritable,Tuple>
Throws:
IOException
See Also:
InputFormat.getSplits(JobContext)
validateInput

@Deprecated
public void validateInput(org.apache.hadoop.mapreduce.JobContext jobContext) throws IOException

Deprecated.

Throws:
IOException
getSortedTableSplitComparable

public static org.apache.hadoop.io.WritableComparable<?> getSortedTableSplitComparable(org.apache.hadoop.mapreduce.InputSplit inputSplit)

Get a comparable object from the given InputSplit object.

Parameters:
inputSplit - An InputSplit instance; it should be of type SortedTableSplit.