org.apache.hadoop.zebra.mapreduce
Class BasicTableOutputFormat

java.lang.Object
  extended by org.apache.hadoop.mapreduce.OutputFormat<org.apache.hadoop.io.BytesWritable,Tuple>
      extended by org.apache.hadoop.zebra.mapreduce.BasicTableOutputFormat

public class BasicTableOutputFormat
extends org.apache.hadoop.mapreduce.OutputFormat<org.apache.hadoop.io.BytesWritable,Tuple>

OutputFormat class for creating a BasicTable. Usage Example:

In the main program, add the following code.

 job.setOutputFormatClass(BasicTableOutputFormat.class);
 Path outPath = new Path("path/to/the/BasicTable");
 BasicTableOutputFormat.setOutputPath(job, outPath);
 BasicTableOutputFormat.setSchema(job, "Name, Age, Salary, BonusPct");
 
The above code does the following things: To create multiple output paths. ZebraOutputPartitoner interface needs to be implemented
 String multiLocs = "commaSeparatedPaths"    
 job.setOutputFormatClass(BasicTableOutputFormat.class);
 BasicTableOutputFormat.setMultipleOutputPaths(job, multiLocs);
 job.setOutputFormat(BasicTableOutputFormat.class);
 BasicTableOutputFormat.setSchema(job, "Name, Age, Salary, BonusPct");
 BasicTableOutputFormat.setZebraOutputPartitionClass(
                job, MultipleOutputsTest.OutputPartitionerClass.class);
 
The user ZebraOutputPartitionClass should like this
 
   static class OutputPartitionerClass implements ZebraOutputPartition {
   @Override
          public int getOutputPartition(BytesWritable key, Tuple value) {                

        return someIndexInOutputParitionlist0;
          }
 
 
The user Reducer code (or similarly Mapper code if it is a Map-only job) should look like the following:
 static class MyReduceClass implements Reducer<K, V, BytesWritable, Tuple> {
   // keep the tuple object for reuse.
   Tuple outRow;
   // indices of various fields in the output Tuple.
   int idxName, idxAge, idxSalary, idxBonusPct;
 
   @Override
   public void configure(Job job) {
     Schema outSchema = BasicTableOutputFormat.getSchema(job);
     // create a tuple that conforms to the output schema.
     outRow = TypesUtils.createTuple(outSchema);
     // determine the field indices.
     idxName = outSchema.getColumnIndex("Name");
     idxAge = outSchema.getColumnIndex("Age");
     idxSalary = outSchema.getColumnIndex("Salary");
     idxBonusPct = outSchema.getColumnIndex("BonusPct");
   }
 
   @Override
   public void reduce(K key, Iterator<V> values,
       OutputCollector<BytesWritable, Tuple> output, Reporter reporter)
       throws IOException {
     String name;
     int age;
     int salary;
     double bonusPct;
     // ... Determine the value of the individual fields of the row to be inserted.
     try {
       outTuple.set(idxName, name);
       outTuple.set(idxAge, new Integer(age));
       outTuple.set(idxSalary, new Integer(salary));
       outTuple.set(idxBonusPct, new Double(bonusPct));
       output.collect(new BytesWritable(name.getBytes()), outTuple);
     }
     catch (ExecException e) {
       // should never happen
     }
   }
 
   @Override
   public void close() throws IOException {
     // no-op
   }
 
 }
 


Constructor Summary
BasicTableOutputFormat()
           
 
Method Summary
 void checkOutputSpecs(org.apache.hadoop.mapreduce.JobContext jobContext)
          Note: we perform the Initialization of the table here.
static void close(org.apache.hadoop.mapreduce.JobContext jobContext)
          Close the output BasicTable, No more rows can be added into the table.
 org.apache.hadoop.mapreduce.OutputCommitter getOutputCommitter(org.apache.hadoop.mapreduce.TaskAttemptContext taContext)
           
static String getOutputPartitionClassArguments(org.apache.hadoop.conf.Configuration conf)
          Get the output partition class arguments string from job configuration
static org.apache.hadoop.fs.Path getOutputPath(org.apache.hadoop.mapreduce.JobContext jobContext)
          Get the output path of the BasicTable from JobContext
static org.apache.hadoop.fs.Path[] getOutputPaths(org.apache.hadoop.mapreduce.JobContext jobContext)
          Get the multiple output paths of the BasicTable from JobContext
 org.apache.hadoop.mapreduce.RecordWriter<org.apache.hadoop.io.BytesWritable,Tuple> getRecordWriter(org.apache.hadoop.mapreduce.TaskAttemptContext taContext)
           
static Schema getSchema(org.apache.hadoop.mapreduce.JobContext jobContext)
          Get the table schema in JobContext.
static SortInfo getSortInfo(org.apache.hadoop.mapreduce.JobContext jobContext)
          Get the SortInfo object
static org.apache.hadoop.io.BytesWritable getSortKey(Object builder, Tuple t)
          Generates a BytesWritable key for the input key using keygenerate provided.
static Object getSortKeyGenerator(org.apache.hadoop.mapreduce.JobContext jobContext)
          Generates a zebra specific sort key generator which is used to generate BytesWritable key Sort Key(s) are used to generate this object
static String getStorageHint(org.apache.hadoop.mapreduce.JobContext jobContext)
          Get the table storage hint in JobContext.
static Class<? extends ZebraOutputPartition> getZebraOutputPartitionClass(org.apache.hadoop.mapreduce.JobContext jobContext)
           
static void setMultipleOutputs(org.apache.hadoop.mapreduce.JobContext jobContext, Class<? extends ZebraOutputPartition> theClass, org.apache.hadoop.fs.Path... paths)
          Set the multiple output paths of the BasicTable in JobContext
static void setMultipleOutputs(org.apache.hadoop.mapreduce.JobContext jobContext, Class<? extends ZebraOutputPartition> theClass, String arguments, org.apache.hadoop.fs.Path... paths)
          Set the multiple output paths of the BasicTable in JobContext
static void setMultipleOutputs(org.apache.hadoop.mapreduce.JobContext jobContext, String commaSeparatedLocations, Class<? extends ZebraOutputPartition> theClass)
          Deprecated. Use #setMultipleOutputs(JobContext, class, Path ...) instead.
static void setOutputPath(org.apache.hadoop.mapreduce.JobContext jobContext, org.apache.hadoop.fs.Path path)
          Set the output path of the BasicTable in JobContext
static void setSchema(org.apache.hadoop.mapreduce.JobContext jobContext, String schema)
          Deprecated. Use setStorageInfo(JobContext, ZebraSchema, ZebraStorageHint, ZebraSortInfo) instead.
static void setSortInfo(org.apache.hadoop.mapreduce.JobContext jobContext, String sortColumns)
          Deprecated. Use setStorageInfo(JobContext, ZebraSchema, ZebraStorageHint, ZebraSortInfo) instead.
static void setSortInfo(org.apache.hadoop.mapreduce.JobContext jobContext, String sortColumns, Class<? extends org.apache.hadoop.io.RawComparator<Object>> comparatorClass)
          Deprecated. Use setStorageInfo(JobContext, ZebraSchema, ZebraStorageHint, ZebraSortInfo) instead.
static void setStorageHint(org.apache.hadoop.mapreduce.JobContext jobContext, String storehint)
          Deprecated. Use setStorageInfo(JobContext, ZebraSchema, ZebraStorageHint, ZebraSortInfo) instead.
static void setStorageInfo(org.apache.hadoop.mapreduce.JobContext jobContext, ZebraSchema zSchema, ZebraStorageHint zStorageHint, ZebraSortInfo zSortInfo)
          Set the table storage info including ZebraSchema,
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

BasicTableOutputFormat

public BasicTableOutputFormat()
Method Detail

setMultipleOutputs

public static void setMultipleOutputs(org.apache.hadoop.mapreduce.JobContext jobContext,
                                      String commaSeparatedLocations,
                                      Class<? extends ZebraOutputPartition> theClass)
                               throws IOException
Deprecated. Use #setMultipleOutputs(JobContext, class, Path ...) instead.

Set the multiple output paths of the BasicTable in JobContext

Parameters:
jobContext - The JobContext object.
commaSeparatedLocations - The comma separated output paths to the tables. The path must either not existent, or must be an empty directory.
theClass - Zebra output partitioner class
Throws:
IOException

setMultipleOutputs

public static void setMultipleOutputs(org.apache.hadoop.mapreduce.JobContext jobContext,
                                      Class<? extends ZebraOutputPartition> theClass,
                                      org.apache.hadoop.fs.Path... paths)
                               throws IOException
Set the multiple output paths of the BasicTable in JobContext

Parameters:
jobContext - The JobContext object.
theClass - Zebra output partitioner class
paths - The list of paths The path must either not existent, or must be an empty directory.
Throws:
IOException

setMultipleOutputs

public static void setMultipleOutputs(org.apache.hadoop.mapreduce.JobContext jobContext,
                                      Class<? extends ZebraOutputPartition> theClass,
                                      String arguments,
                                      org.apache.hadoop.fs.Path... paths)
                               throws IOException
Set the multiple output paths of the BasicTable in JobContext

Parameters:
jobContext - The JobContext object.
theClass - Zebra output partitioner class
arguments - Arguments string to partitioner class
paths - The list of paths The path must either not existent, or must be an empty directory.
Throws:
IOException

getOutputPartitionClassArguments

public static String getOutputPartitionClassArguments(org.apache.hadoop.conf.Configuration conf)
Get the output partition class arguments string from job configuration

Parameters:
conf - The job configuration object.
Returns:
the output partition class arguments string.

getOutputPaths

public static org.apache.hadoop.fs.Path[] getOutputPaths(org.apache.hadoop.mapreduce.JobContext jobContext)
                                                  throws IOException
Get the multiple output paths of the BasicTable from JobContext

Parameters:
jobContext - The JobContext object.
Returns:
path The comma separated output paths to the tables. The path must either not existent, or must be an empty directory.
Throws:
IOException

getZebraOutputPartitionClass

public static Class<? extends ZebraOutputPartition> getZebraOutputPartitionClass(org.apache.hadoop.mapreduce.JobContext jobContext)
                                                                          throws IOException
Throws:
IOException

setOutputPath

public static void setOutputPath(org.apache.hadoop.mapreduce.JobContext jobContext,
                                 org.apache.hadoop.fs.Path path)
Set the output path of the BasicTable in JobContext

Parameters:
jobContext - The JobContext object.
path - The output path to the table. The path must either not existent, or must be an empty directory.

getOutputPath

public static org.apache.hadoop.fs.Path getOutputPath(org.apache.hadoop.mapreduce.JobContext jobContext)
Get the output path of the BasicTable from JobContext

Parameters:
jobContext - jobContext object
Returns:
The output path.

setSchema

public static void setSchema(org.apache.hadoop.mapreduce.JobContext jobContext,
                             String schema)
Deprecated. Use setStorageInfo(JobContext, ZebraSchema, ZebraStorageHint, ZebraSortInfo) instead.

Set the table schema in JobContext

Parameters:
jobContext - The JobContext object.
schema - The schema of the BasicTable to be created. For the initial implementation, the schema string is simply a comma separated list of column names, such as "Col1, Col2, Col3".

getSchema

public static Schema getSchema(org.apache.hadoop.mapreduce.JobContext jobContext)
                        throws ParseException
Get the table schema in JobContext.

Parameters:
jobContext - The JobContext object.
Returns:
The output schema of the BasicTable. If the schema is not defined in the jobContext object at the time of the call, null will be returned.
Throws:
ParseException

getSortKeyGenerator

public static Object getSortKeyGenerator(org.apache.hadoop.mapreduce.JobContext jobContext)
                                  throws IOException,
                                         ParseException
Generates a zebra specific sort key generator which is used to generate BytesWritable key Sort Key(s) are used to generate this object

Parameters:
jobContext - The JobContext object.
Returns:
Object of type zebra.pig.comaprator.KeyGenerator.
Throws:
IOException
ParseException

getSortKey

public static org.apache.hadoop.io.BytesWritable getSortKey(Object builder,
                                                            Tuple t)
                                                     throws Exception
Generates a BytesWritable key for the input key using keygenerate provided. Sort Key(s) are used to generate this object

Parameters:
builder - Opaque key generator created by getSortKeyGenerator() method
t - Tuple to create sort key from
Returns:
ByteWritable Key
Throws:
Exception

setStorageHint

public static void setStorageHint(org.apache.hadoop.mapreduce.JobContext jobContext,
                                  String storehint)
                           throws ParseException,
                                  IOException
Deprecated. Use setStorageInfo(JobContext, ZebraSchema, ZebraStorageHint, ZebraSortInfo) instead.

Set the table storage hint in JobContext, should be called after setSchema is called.

Note that the "secure by" feature is experimental now and subject to changes in the future.

Parameters:
jobContext - The JobContext object.
storehint - The storage hint of the BasicTable to be created. The format would be like "[f1, f2.subfld]; [f3, f4]".
Throws:
ParseException
IOException

getStorageHint

public static String getStorageHint(org.apache.hadoop.mapreduce.JobContext jobContext)
Get the table storage hint in JobContext.

Parameters:
jobContext - The JobContext object.
Returns:
The storage hint of the BasicTable. If the storage hint is not defined in the jobContext object at the time of the call, an empty string will be returned.

setSortInfo

public static void setSortInfo(org.apache.hadoop.mapreduce.JobContext jobContext,
                               String sortColumns,
                               Class<? extends org.apache.hadoop.io.RawComparator<Object>> comparatorClass)
Deprecated. Use setStorageInfo(JobContext, ZebraSchema, ZebraStorageHint, ZebraSortInfo) instead.

Set the sort info

Parameters:
jobContext - The JobContext object.
sortColumns - Comma-separated sort column names
comparatorClass - comparator class name; null for default

setSortInfo

public static void setSortInfo(org.apache.hadoop.mapreduce.JobContext jobContext,
                               String sortColumns)
Deprecated. Use setStorageInfo(JobContext, ZebraSchema, ZebraStorageHint, ZebraSortInfo) instead.

Set the sort info

Parameters:
jobContext - The JobContext object.
sortColumns - Comma-separated sort column names

setStorageInfo

public static void setStorageInfo(org.apache.hadoop.mapreduce.JobContext jobContext,
                                  ZebraSchema zSchema,
                                  ZebraStorageHint zStorageHint,
                                  ZebraSortInfo zSortInfo)
                           throws ParseException,
                                  IOException
Set the table storage info including ZebraSchema,

Parameters:
jobcontext - The JobContext object.
zSchema - The ZebraSchema object containing schema information.
zStorageHint - The ZebraStorageHint object containing storage hint information.
zSortInfo - The ZebraSortInfo object containing sorting information.
Throws:
ParseException
IOException

getSortInfo

public static SortInfo getSortInfo(org.apache.hadoop.mapreduce.JobContext jobContext)
                            throws IOException
Get the SortInfo object

Parameters:
jobContext - The JobContext object.
Returns:
SortInfo object; null if the Zebra table is unsorted
Throws:
IOException

checkOutputSpecs

public void checkOutputSpecs(org.apache.hadoop.mapreduce.JobContext jobContext)
                      throws IOException
Note: we perform the Initialization of the table here. So we expect this to be called before BasicTableOutputFormat#getRecordWriter(FileSystem, JobContext, String, Progressable)

Specified by:
checkOutputSpecs in class org.apache.hadoop.mapreduce.OutputFormat<org.apache.hadoop.io.BytesWritable,Tuple>
Throws:
IOException
See Also:
OutputFormat.checkOutputSpecs(JobContext)

getRecordWriter

public org.apache.hadoop.mapreduce.RecordWriter<org.apache.hadoop.io.BytesWritable,Tuple> getRecordWriter(org.apache.hadoop.mapreduce.TaskAttemptContext taContext)
                                                                                                   throws IOException
Specified by:
getRecordWriter in class org.apache.hadoop.mapreduce.OutputFormat<org.apache.hadoop.io.BytesWritable,Tuple>
Throws:
IOException
See Also:
OutputFormat.getRecordWriter(TaskAttemptContext)

close

public static void close(org.apache.hadoop.mapreduce.JobContext jobContext)
                  throws IOException
Close the output BasicTable, No more rows can be added into the table. A BasicTable is not visible for reading until it is "closed".

Parameters:
jobContext - The JobContext object.
Throws:
IOException

getOutputCommitter

public org.apache.hadoop.mapreduce.OutputCommitter getOutputCommitter(org.apache.hadoop.mapreduce.TaskAttemptContext taContext)
                                                               throws IOException,
                                                                      InterruptedException
Specified by:
getOutputCommitter in class org.apache.hadoop.mapreduce.OutputFormat<org.apache.hadoop.io.BytesWritable,Tuple>
Throws:
IOException
InterruptedException


Copyright © 2012 The Apache Software Foundation