org.apache.hadoop.zebra.mapreduce
Class TableInputFormat

java.lang.Object
  org.apache.hadoop.mapreduce.InputFormat<org.apache.hadoop.io.BytesWritable,Tuple>
    org.apache.hadoop.zebra.mapreduce.TableInputFormat

public class TableInputFormat
extends org.apache.hadoop.mapreduce.InputFormat<org.apache.hadoop.io.BytesWritable,Tuple>

InputFormat class for reading one or more BasicTables.
Usage Example:

In the main program, add the following code:

    job.setInputFormatClass(TableInputFormat.class);
    TableInputFormat.setInputPaths(jobContext, new Path("path/to/table1"), new Path("path/to/table2"));
    TableInputFormat.setProjection(jobContext, "Name, Salary, BonusPct");

The above code does the following things:

- Make TableInputFormat the InputFormat class for the job.
- Set the two BasicTables as the input; the InputFormat will produce splits on the union of the two tables.
- Set the projection so that only the columns "Name", "Salary", and "BonusPct" are delivered to the Mapper.
Also, implement your Mapper along these lines (K and V stand for the job's map output key and value types):

    static class MyMapClass extends Mapper<BytesWritable, Tuple, K, V> {
      // indices of various fields in the input Tuple.
      int idxName, idxSalary, idxBonusPct;

      @Override
      protected void setup(Context context) throws IOException {
        // determine the field indices from the projected schema.
        Schema projection = TableInputFormat.getSchema(context);
        idxName = projection.getColumnIndex("Name");
        idxSalary = projection.getColumnIndex("Salary");
        idxBonusPct = projection.getColumnIndex("BonusPct");
      }

      @Override
      public void map(BytesWritable key, Tuple value, Context context)
          throws IOException, InterruptedException {
        try {
          String name = (String) value.get(idxName);
          int salary = (Integer) value.get(idxSalary);
          double bonusPct = (Double) value.get(idxBonusPct);
          // do something with the input data
        } catch (ExecException e) {
          e.printStackTrace();
        }
      }
    }
A little bit more explanation on PIG Tuple objects: a Tuple is an ordered list of PIG datum objects. The permitted PIG datum types can be categorized as scalar types and composite types.

Supported scalar types include seven native Java types: Boolean, Byte, Integer, Long, Float, Double, and String, as well as one PIG class called DataByteArray that represents a type-less byte array.

Supported composite types include:

- Map: the same as the Java Map class, with the additional restriction that the key type must be one of the scalar types PIG recognizes, and the value type any of the scalar or composite types PIG understands.
- DataBag: a DataBag is a collection of Tuples.
- Tuple: yes, a Tuple itself can be a datum in another Tuple.
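For concreteness, here is a small, hypothetical sketch of building such a Tuple with the standard PIG TupleFactory and BagFactory APIs; the scalar values mirror the Name/Salary/BonusPct example above, and the class name TupleExample is a placeholder:

    import java.util.HashMap;
    import java.util.Map;

    import org.apache.pig.data.BagFactory;
    import org.apache.pig.data.DataBag;
    import org.apache.pig.data.DataByteArray;
    import org.apache.pig.data.Tuple;
    import org.apache.pig.data.TupleFactory;

    public class TupleExample {
      public static void main(String[] args) throws Exception {
        TupleFactory tf = TupleFactory.getInstance();
        Tuple row = tf.newTuple();

        // scalar datum types
        row.append("John Doe");                                 // String
        row.append(90000);                                      // Integer
        row.append(0.15);                                       // Double
        row.append(new DataByteArray("raw bytes".getBytes()));  // type-less byte array

        // composite datum types
        Map<String, Object> m = new HashMap<String, Object>();
        m.put("dept", "engineering");  // key is a scalar (String); value may be any supported type
        row.append(m);

        DataBag bag = BagFactory.getInstance().newDefaultBag();
        bag.add(tf.newTuple("nested"));                         // a DataBag is a collection of Tuples
        row.append(bag);

        row.append(tf.newTuple(42));                            // a Tuple nested inside another Tuple
        System.out.println(row);
      }
    }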
Nested Class Summary

Modifier and Type | Class
---|---
static class | TableInputFormat.SplitMode
Constructor Summary

TableInputFormat()
Method Summary

Modifier and Type | Method and Description
---|---
org.apache.hadoop.mapreduce.RecordReader<org.apache.hadoop.io.BytesWritable,Tuple> | createRecordReader(org.apache.hadoop.mapreduce.InputSplit split, org.apache.hadoop.mapreduce.TaskAttemptContext taContext)
static TableRecordReader | createTableRecordReader(org.apache.hadoop.mapreduce.JobContext jobContext, String projection) Get a TableRecordReader on a single split.
static String | getProjection(org.apache.hadoop.mapreduce.JobContext jobContext) Get the projection from the JobContext.
static Schema | getSchema(org.apache.hadoop.mapreduce.JobContext jobContext) Get the schema of a table expression.
static org.apache.hadoop.io.WritableComparable<?> | getSortedTableSplitComparable(org.apache.hadoop.mapreduce.InputSplit inputSplit) Get a comparable object from the given InputSplit object.
static SortInfo | getSortInfo(org.apache.hadoop.mapreduce.JobContext jobContext) Get the SortInfo object regarding a Zebra table.
List<org.apache.hadoop.mapreduce.InputSplit> | getSplits(org.apache.hadoop.mapreduce.JobContext jobContext)
static void | requireSortedTable(org.apache.hadoop.mapreduce.JobContext jobContext, ZebraSortInfo sortInfo) Deprecated.
static void | setInputPaths(org.apache.hadoop.mapreduce.JobContext jobContext, org.apache.hadoop.fs.Path... paths) Set the paths to the input tables.
static void | setMinSplitSize(org.apache.hadoop.mapreduce.JobContext jobContext, long minSize) Set the minimum split size.
static void | setProjection(org.apache.hadoop.mapreduce.JobContext jobContext, String projection) Deprecated. Use setProjection(JobContext, ZebraProjection) instead.
static void | setProjection(org.apache.hadoop.mapreduce.JobContext jobContext, ZebraProjection projection) Set the input projection in the JobContext object.
static void | setSplitMode(org.apache.hadoop.mapreduce.JobContext jobContext, TableInputFormat.SplitMode sm, ZebraSortInfo sortInfo)
void | validateInput(org.apache.hadoop.mapreduce.JobContext jobContext) Deprecated.
Methods inherited from class java.lang.Object

clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
Constructor Detail

TableInputFormat

public TableInputFormat()
Method Detail

setInputPaths

public static void setInputPaths(org.apache.hadoop.mapreduce.JobContext jobContext, org.apache.hadoop.fs.Path... paths)

Set the paths to the input tables.

Parameters:
jobContext - JobContext object.
paths - one or more paths to BasicTables. The InputFormat class will produce splits on the "union" of these BasicTables.

getSchema

public static Schema getSchema(org.apache.hadoop.mapreduce.JobContext jobContext) throws IOException

Get the schema of a table expression.

Parameters:
jobContext - JobContext object.
Throws:
IOException
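For illustration, a minimal driver-side sketch; the job name and table paths are placeholders, and since org.apache.hadoop.mapreduce.Job extends JobContext it can be passed to both calls:

    Job job = new Job(new Configuration(), "zebra-read");
    TableInputFormat.setInputPaths(job, new Path("path/to/table1"), new Path("path/to/table2"));
    // Inspect the schema of the input table expression before wiring up the Mapper.
    Schema schema = TableInputFormat.getSchema(job);
    System.out.println("input schema: " + schema);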
setProjection

public static void setProjection(org.apache.hadoop.mapreduce.JobContext jobContext, String projection) throws ParseException

Deprecated. Use setProjection(JobContext, ZebraProjection) instead.

Parameters:
jobContext - JobContext object.
projection - A comma-separated list of column names. To select all columns, pass projection == null. The syntax of the projection conforms to the Schema string.
Throws:
ParseException
setProjection

public static void setProjection(org.apache.hadoop.mapreduce.JobContext jobContext, ZebraProjection projection) throws ParseException

Set the input projection in the JobContext object.

Parameters:
jobContext - JobContext object.
projection - A ZebraProjection carrying a comma-separated list of column names. To select all columns, pass projection == null. The syntax of the projection conforms to the Schema string.
Throws:
ParseException
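A minimal sketch of this overload, assuming ZebraProjection exposes a createZebraProjection(String) factory as in the other Zebra mapreduce classes (verify against your Zebra version):

    // job is the driver's Job (a JobContext); the column list syntax follows the Schema string.
    ZebraProjection projection = ZebraProjection.createZebraProjection("Name, Salary, BonusPct");
    TableInputFormat.setProjection(job, projection);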
getProjection

public static String getProjection(org.apache.hadoop.mapreduce.JobContext jobContext) throws IOException, ParseException

Get the projection from the JobContext.

Parameters:
jobContext - The JobContext object.
Throws:
IOException
ParseException
getSortInfo

public static SortInfo getSortInfo(org.apache.hadoop.mapreduce.JobContext jobContext) throws IOException

Get the SortInfo object regarding a Zebra table.

Parameters:
jobContext - JobContext object.
Throws:
IOException
requireSortedTable

public static void requireSortedTable(org.apache.hadoop.mapreduce.JobContext jobContext, ZebraSortInfo sortInfo) throws IOException

Deprecated.

Parameters:
jobContext - JobContext object.
sortInfo - ZebraSortInfo object containing sorting information.
Throws:
IOException
setSplitMode

public static void setSplitMode(org.apache.hadoop.mapreduce.JobContext jobContext, TableInputFormat.SplitMode sm, ZebraSortInfo sortInfo) throws IOException

Parameters:
jobContext - JobContext object.
sm - Split mode: unsorted, globally sorted, or locally sorted. The default is unsorted.
sortInfo - ZebraSortInfo object containing sorting information. Ignored if the split mode is null or unsorted.
Throws:
IOException
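For example, a driver that wants globally ordered splits over a sorted table might do the following. This is a hedged sketch: the enum constant GLOBALLY_SORTED and the ZebraSortInfo.createZebraSortInfo(sortColumns, comparatorClass) factory are assumptions to be checked against TableInputFormat.SplitMode and ZebraSortInfo:

    // Ask for globally sorted splits, sorted on the "Name" column (no custom comparator).
    ZebraSortInfo sortInfo = ZebraSortInfo.createZebraSortInfo("Name", null);
    TableInputFormat.setSplitMode(job, TableInputFormat.SplitMode.GLOBALLY_SORTED, sortInfo);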
createRecordReader

public org.apache.hadoop.mapreduce.RecordReader<org.apache.hadoop.io.BytesWritable,Tuple> createRecordReader(org.apache.hadoop.mapreduce.InputSplit split, org.apache.hadoop.mapreduce.TaskAttemptContext taContext) throws IOException, InterruptedException

Specified by:
createRecordReader in class org.apache.hadoop.mapreduce.InputFormat<org.apache.hadoop.io.BytesWritable,Tuple>
Throws:
IOException
InterruptedException
See Also:
InputFormat.createRecordReader(InputSplit, TaskAttemptContext)
createTableRecordReader

public static TableRecordReader createTableRecordReader(org.apache.hadoop.mapreduce.JobContext jobContext, String projection) throws IOException, ParseException, InterruptedException

Get a TableRecordReader on a single split.

Parameters:
jobContext - JobContext object.
projection - Comma-separated column names in the projection; null means all columns in the projection.
Throws:
IOException
ParseException
InterruptedException
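A hedged sketch of driving such a reader by hand, using the standard org.apache.hadoop.mapreduce.RecordReader protocol; whether the returned reader needs an explicit initialize() call first is not stated here:

    // A null projection would mean all columns; here we read two columns.
    TableRecordReader reader = TableInputFormat.createTableRecordReader(job, "Name, Salary");
    try {
      while (reader.nextKeyValue()) {
        BytesWritable rowKey = reader.getCurrentKey();
        Tuple row = reader.getCurrentValue();
        // process row ...
      }
    } finally {
      reader.close();
    }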
setMinSplitSize

public static void setMinSplitSize(org.apache.hadoop.mapreduce.JobContext jobContext, long minSize)

Set the minimum split size.

Parameters:
jobContext - The JobContext object.
minSize - Minimum split size.

getSplits

public List<org.apache.hadoop.mapreduce.InputSplit> getSplits(org.apache.hadoop.mapreduce.JobContext jobContext) throws IOException

Specified by:
getSplits in class org.apache.hadoop.mapreduce.InputFormat<org.apache.hadoop.io.BytesWritable,Tuple>
Throws:
IOException
See Also:
InputFormat.getSplits(JobContext)
validateInput

@Deprecated
public void validateInput(org.apache.hadoop.mapreduce.JobContext jobContext) throws IOException

Deprecated.

Throws:
IOException
getSortedTableSplitComparable

public static org.apache.hadoop.io.WritableComparable<?> getSortedTableSplitComparable(org.apache.hadoop.mapreduce.InputSplit inputSplit)

Get a comparable object from the given InputSplit object.

Parameters:
inputSplit - An InputSplit instance; it should be of type SortedTableSplit.