BasicTableOutputFormat (Pig 0.9.3-SNAPSHOT API)

Overview

Package

Class

Use

Tree

Deprecated

Index

Help

PREV CLASS NEXT CLASS

FRAMES NO FRAMES

SUMMARY: NESTED | FIELD | CONSTR | METHOD

DETAIL: FIELD | CONSTR | METHOD

org.apache.hadoop.zebra.mapreduce
Class BasicTableOutputFormat

java.lang.Object
  org.apache.hadoop.mapreduce.OutputFormat<org.apache.hadoop.io.BytesWritable,Tuple>
      org.apache.hadoop.zebra.mapreduce.BasicTableOutputFormat

public class BasicTableOutputFormat
extends org.apache.hadoop.mapreduce.OutputFormat<org.apache.hadoop.io.BytesWritable,Tuple>
extends org.apache.hadoop.mapreduce.OutputFormat<org.apache.hadoop.io.BytesWritable,Tuple>

OutputFormat class for creating a BasicTable. Usage Example:

In the main program, add the following code.

 job.setOutputFormatClass(BasicTableOutputFormat.class);
 Path outPath = new Path("path/to/the/BasicTable");
 BasicTableOutputFormat.setOutputPath(job, outPath);
 BasicTableOutputFormat.setSchema(job, "Name, Age, Salary, BonusPct");

The above code does the following things:

Set the output format class to BasicTableOutputFormat.
Set the single path to the BasicTable to be created.
Set the schema of the BasicTable to be created. In this case, the to-be-created BasicTable contains three columns with names "Name", "Age", "Salary", "BonusPct".

To create multiple output paths. ZebraOutputPartitoner interface needs to be implemented

 String multiLocs = "commaSeparatedPaths"    
 job.setOutputFormatClass(BasicTableOutputFormat.class);
 BasicTableOutputFormat.setMultipleOutputPaths(job, multiLocs);
 job.setOutputFormat(BasicTableOutputFormat.class);
 BasicTableOutputFormat.setSchema(job, "Name, Age, Salary, BonusPct");
 BasicTableOutputFormat.setZebraOutputPartitionClass(
                job, MultipleOutputsTest.OutputPartitionerClass.class);

The user ZebraOutputPartitionClass should like this

 
   static class OutputPartitionerClass implements ZebraOutputPartition {
   @Override
          public int getOutputPartition(BytesWritable key, Tuple value) {                

        return someIndexInOutputParitionlist0;
          }

The user Reducer code (or similarly Mapper code if it is a Map-only job) should look like the following:

 static class MyReduceClass implements Reducer<K, V, BytesWritable, Tuple> {
   // keep the tuple object for reuse.
   Tuple outRow;
   // indices of various fields in the output Tuple.
   int idxName, idxAge, idxSalary, idxBonusPct;
 
   @Override
   public void configure(Job job) {
     Schema outSchema = BasicTableOutputFormat.getSchema(job);
     // create a tuple that conforms to the output schema.
     outRow = TypesUtils.createTuple(outSchema);
     // determine the field indices.
     idxName = outSchema.getColumnIndex("Name");
     idxAge = outSchema.getColumnIndex("Age");
     idxSalary = outSchema.getColumnIndex("Salary");
     idxBonusPct = outSchema.getColumnIndex("BonusPct");
   }
 
   @Override
   public void reduce(K key, Iterator<V> values,
       OutputCollector<BytesWritable, Tuple> output, Reporter reporter)
       throws IOException {
     String name;
     int age;
     int salary;
     double bonusPct;
     // ... Determine the value of the individual fields of the row to be inserted.
     try {
       outTuple.set(idxName, name);
       outTuple.set(idxAge, new Integer(age));
       outTuple.set(idxSalary, new Integer(salary));
       outTuple.set(idxBonusPct, new Double(bonusPct));
       output.collect(new BytesWritable(name.getBytes()), outTuple);
     }
     catch (ExecException e) {
       // should never happen
     }
   }
 
   @Override
   public void close() throws IOException {
     // no-op
   }
 
 }

Constructor Summary
`BasicTableOutputFormat()`

Method Summary
`void`	`checkOutputSpecs(org.apache.hadoop.mapreduce.JobContext jobContext)` Note: we perform the Initialization of the table here.
`static void`	`close(org.apache.hadoop.mapreduce.JobContext jobContext)` Close the output BasicTable, No more rows can be added into the table.
`org.apache.hadoop.mapreduce.OutputCommitter`	`getOutputCommitter(org.apache.hadoop.mapreduce.TaskAttemptContext taContext)`
`static String`	`getOutputPartitionClassArguments(org.apache.hadoop.conf.Configuration conf)` Get the output partition class arguments string from job configuration
`static org.apache.hadoop.fs.Path`	`getOutputPath(org.apache.hadoop.mapreduce.JobContext jobContext)` Get the output path of the BasicTable from JobContext
`static org.apache.hadoop.fs.Path[]`	`getOutputPaths(org.apache.hadoop.mapreduce.JobContext jobContext)` Get the multiple output paths of the BasicTable from JobContext
`org.apache.hadoop.mapreduce.RecordWriter<org.apache.hadoop.io.BytesWritable,Tuple>`	`getRecordWriter(org.apache.hadoop.mapreduce.TaskAttemptContext taContext)`
`static Schema`	`getSchema(org.apache.hadoop.mapreduce.JobContext jobContext)` Get the table schema in JobContext.
`static SortInfo`	`getSortInfo(org.apache.hadoop.mapreduce.JobContext jobContext)` Get the SortInfo object
`static org.apache.hadoop.io.BytesWritable`	`getSortKey(Object builder, Tuple t)` Generates a BytesWritable key for the input key using keygenerate provided.
`static Object`	`getSortKeyGenerator(org.apache.hadoop.mapreduce.JobContext jobContext)` Generates a zebra specific sort key generator which is used to generate BytesWritable key Sort Key(s) are used to generate this object
`static String`	`getStorageHint(org.apache.hadoop.mapreduce.JobContext jobContext)` Get the table storage hint in JobContext.
`static Class<? extends ZebraOutputPartition>`	`getZebraOutputPartitionClass(org.apache.hadoop.mapreduce.JobContext jobContext)`
`static void`	`setMultipleOutputs(org.apache.hadoop.mapreduce.JobContext jobContext, Class<? extends ZebraOutputPartition> theClass, org.apache.hadoop.fs.Path... paths)` Set the multiple output paths of the BasicTable in JobContext
`static void`	`setMultipleOutputs(org.apache.hadoop.mapreduce.JobContext jobContext, Class<? extends ZebraOutputPartition> theClass, String arguments, org.apache.hadoop.fs.Path... paths)` Set the multiple output paths of the BasicTable in JobContext
`static void`	`setMultipleOutputs(org.apache.hadoop.mapreduce.JobContext jobContext, String commaSeparatedLocations, Class<? extends ZebraOutputPartition> theClass)` Deprecated. Use `#setMultipleOutputs(JobContext, class, Path ...)` instead.
`static void`	`setOutputPath(org.apache.hadoop.mapreduce.JobContext jobContext, org.apache.hadoop.fs.Path path)` Set the output path of the BasicTable in JobContext
`static void`	`setSchema(org.apache.hadoop.mapreduce.JobContext jobContext, String schema)` Deprecated. Use `setStorageInfo(JobContext, ZebraSchema, ZebraStorageHint, ZebraSortInfo)` instead.
`static void`	`setSortInfo(org.apache.hadoop.mapreduce.JobContext jobContext, String sortColumns)` Deprecated. Use `setStorageInfo(JobContext, ZebraSchema, ZebraStorageHint, ZebraSortInfo)` instead.
`static void`	`setSortInfo(org.apache.hadoop.mapreduce.JobContext jobContext, String sortColumns, Class<? extends org.apache.hadoop.io.RawComparator<Object>> comparatorClass)` Deprecated. Use `setStorageInfo(JobContext, ZebraSchema, ZebraStorageHint, ZebraSortInfo)` instead.
`static void`	`setStorageHint(org.apache.hadoop.mapreduce.JobContext jobContext, String storehint)` Deprecated. Use `setStorageInfo(JobContext, ZebraSchema, ZebraStorageHint, ZebraSortInfo)` instead.
`static void`	`setStorageInfo(org.apache.hadoop.mapreduce.JobContext jobContext, ZebraSchema zSchema, ZebraStorageHint zStorageHint, ZebraSortInfo zSortInfo)` Set the table storage info including ZebraSchema,

Methods inherited from class java.lang.Object
`clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait`

Constructor Detail

BasicTableOutputFormat

public BasicTableOutputFormat()

Method Detail

setMultipleOutputs

public static void setMultipleOutputs(org.apache.hadoop.mapreduce.JobContext jobContext,
                                      String commaSeparatedLocations,
                                      Class<? extends ZebraOutputPartition> theClass)
                               throws IOException

Deprecated. Use #setMultipleOutputs(JobContext, class, Path ...) instead.

Set the multiple output paths of the BasicTable in JobContext

Parameters:: jobContext - The JobContext object.; commaSeparatedLocations - The comma separated output paths to the tables. The path must either not existent, or must be an empty directory.; theClass - Zebra output partitioner class
Throws:: IOException

setMultipleOutputs

public static void setMultipleOutputs(org.apache.hadoop.mapreduce.JobContext jobContext,
                                      Class<? extends ZebraOutputPartition> theClass,
                                      org.apache.hadoop.fs.Path... paths)
                               throws IOException

Set the multiple output paths of the BasicTable in JobContext

Parameters:: jobContext - The JobContext object.; theClass - Zebra output partitioner class; paths - The list of paths The path must either not existent, or must be an empty directory.
Throws:: IOException

setMultipleOutputs

public static void setMultipleOutputs(org.apache.hadoop.mapreduce.JobContext jobContext,
                                      Class<? extends ZebraOutputPartition> theClass,
                                      String arguments,
                                      org.apache.hadoop.fs.Path... paths)
                               throws IOException

Set the multiple output paths of the BasicTable in JobContext

Parameters:: jobContext - The JobContext object.; theClass - Zebra output partitioner class; arguments - Arguments string to partitioner class; paths - The list of paths The path must either not existent, or must be an empty directory.
Throws:: IOException

getOutputPartitionClassArguments

public static String getOutputPartitionClassArguments(org.apache.hadoop.conf.Configuration conf)

Get the output partition class arguments string from job configuration

Parameters:: conf - The job configuration object.
Returns:: the output partition class arguments string.

getOutputPaths

public static org.apache.hadoop.fs.Path[] getOutputPaths(org.apache.hadoop.mapreduce.JobContext jobContext)
                                                  throws IOException

Get the multiple output paths of the BasicTable from JobContext

Parameters:: jobContext - The JobContext object.
Returns:: path The comma separated output paths to the tables. The path must either not existent, or must be an empty directory.
Throws:: IOException

getZebraOutputPartitionClass

public static Class<? extends ZebraOutputPartition> getZebraOutputPartitionClass(org.apache.hadoop.mapreduce.JobContext jobContext)
                                                                          throws IOException

Throws:: IOException

setOutputPath

public static void setOutputPath(org.apache.hadoop.mapreduce.JobContext jobContext,
                                 org.apache.hadoop.fs.Path path)

Set the output path of the BasicTable in JobContext

Parameters:: jobContext - The JobContext object.; path - The output path to the table. The path must either not existent, or must be an empty directory.

getOutputPath

public static org.apache.hadoop.fs.Path getOutputPath(org.apache.hadoop.mapreduce.JobContext jobContext)

Get the output path of the BasicTable from JobContext

Parameters:: jobContext - jobContext object
Returns:: The output path.

setSchema

public static void setSchema(org.apache.hadoop.mapreduce.JobContext jobContext,
                             String schema)

Deprecated. Use setStorageInfo(JobContext, ZebraSchema, ZebraStorageHint, ZebraSortInfo) instead.

Set the table schema in JobContext

Parameters:: jobContext - The JobContext object.; schema - The schema of the BasicTable to be created. For the initial implementation, the schema string is simply a comma separated list of column names, such as "Col1, Col2, Col3".

getSchema

public static Schema getSchema(org.apache.hadoop.mapreduce.JobContext jobContext)
                        throws ParseException

Get the table schema in JobContext.

Parameters:: jobContext - The JobContext object.
Returns:: The output schema of the BasicTable. If the schema is not defined in the jobContext object at the time of the call, null will be returned.
Throws:: ParseException

getSortKeyGenerator

public static Object getSortKeyGenerator(org.apache.hadoop.mapreduce.JobContext jobContext)
                                  throws IOException,
                                         ParseException

Generates a zebra specific sort key generator which is used to generate BytesWritable key Sort Key(s) are used to generate this object

Parameters:: jobContext - The JobContext object.
Returns:: Object of type zebra.pig.comaprator.KeyGenerator.
Throws:: IOException; ParseException

getSortKey

public static org.apache.hadoop.io.BytesWritable getSortKey(Object builder,
                                                            Tuple t)
                                                     throws Exception

Generates a BytesWritable key for the input key using keygenerate provided. Sort Key(s) are used to generate this object

Parameters:: builder - Opaque key generator created by getSortKeyGenerator() method; t - Tuple to create sort key from
Returns:: ByteWritable Key
Throws:: Exception

setStorageHint

public static void setStorageHint(org.apache.hadoop.mapreduce.JobContext jobContext,
                                  String storehint)
                           throws ParseException,
                                  IOException

Deprecated. Use setStorageInfo(JobContext, ZebraSchema, ZebraStorageHint, ZebraSortInfo) instead.

Set the table storage hint in JobContext, should be called after setSchema is called.

Note that the "secure by" feature is experimental now and subject to changes in the future.

Parameters:: jobContext - The JobContext object.; storehint - The storage hint of the BasicTable to be created. The format would be like "[f1, f2.subfld]; [f3, f4]".
Throws:: ParseException; IOException

getStorageHint

public static String getStorageHint(org.apache.hadoop.mapreduce.JobContext jobContext)

Get the table storage hint in JobContext.

Parameters:: jobContext - The JobContext object.
Returns:: The storage hint of the BasicTable. If the storage hint is not defined in the jobContext object at the time of the call, an empty string will be returned.

setSortInfo

public static void setSortInfo(org.apache.hadoop.mapreduce.JobContext jobContext,
                               String sortColumns,
                               Class<? extends org.apache.hadoop.io.RawComparator<Object>> comparatorClass)

Deprecated. Use setStorageInfo(JobContext, ZebraSchema, ZebraStorageHint, ZebraSortInfo) instead.

Set the sort info

Parameters:: jobContext - The JobContext object.; sortColumns - Comma-separated sort column names; comparatorClass - comparator class name; null for default

setSortInfo

public static void setSortInfo(org.apache.hadoop.mapreduce.JobContext jobContext,
                               String sortColumns)

Deprecated. Use setStorageInfo(JobContext, ZebraSchema, ZebraStorageHint, ZebraSortInfo) instead.

Set the sort info

Parameters:: jobContext - The JobContext object.; sortColumns - Comma-separated sort column names

setStorageInfo

public static void setStorageInfo(org.apache.hadoop.mapreduce.JobContext jobContext,
                                  ZebraSchema zSchema,
                                  ZebraStorageHint zStorageHint,
                                  ZebraSortInfo zSortInfo)
                           throws ParseException,
                                  IOException

Set the table storage info including ZebraSchema,

Parameters:: jobcontext - The JobContext object.; zSchema - The ZebraSchema object containing schema information.; zStorageHint - The ZebraStorageHint object containing storage hint information.; zSortInfo - The ZebraSortInfo object containing sorting information.
Throws:: ParseException; IOException

getSortInfo

public static SortInfo getSortInfo(org.apache.hadoop.mapreduce.JobContext jobContext)
                            throws IOException

Get the SortInfo object

Parameters:: jobContext - The JobContext object.
Returns:: SortInfo object; null if the Zebra table is unsorted
Throws:: IOException

checkOutputSpecs

public void checkOutputSpecs(org.apache.hadoop.mapreduce.JobContext jobContext)
                      throws IOException

Note: we perform the Initialization of the table here. So we expect this to be called before BasicTableOutputFormat#getRecordWriter(FileSystem, JobContext, String, Progressable)

Specified by:: checkOutputSpecs in class org.apache.hadoop.mapreduce.OutputFormat<org.apache.hadoop.io.BytesWritable,Tuple>

Throws:: IOException
See Also:: OutputFormat.checkOutputSpecs(JobContext)

getRecordWriter

public org.apache.hadoop.mapreduce.RecordWriter<org.apache.hadoop.io.BytesWritable,Tuple> getRecordWriter(org.apache.hadoop.mapreduce.TaskAttemptContext taContext)
                                                                                                   throws IOException

Specified by:: getRecordWriter in class org.apache.hadoop.mapreduce.OutputFormat<org.apache.hadoop.io.BytesWritable,Tuple>

Throws:: IOException
See Also:: OutputFormat.getRecordWriter(TaskAttemptContext)

close

public static void close(org.apache.hadoop.mapreduce.JobContext jobContext)
                  throws IOException

Close the output BasicTable, No more rows can be added into the table. A BasicTable is not visible for reading until it is "closed".

Parameters:: jobContext - The JobContext object.
Throws:: IOException

getOutputCommitter

public org.apache.hadoop.mapreduce.OutputCommitter getOutputCommitter(org.apache.hadoop.mapreduce.TaskAttemptContext taContext)
                                                               throws IOException,
                                                                      InterruptedException

Specified by:: getOutputCommitter in class org.apache.hadoop.mapreduce.OutputFormat<org.apache.hadoop.io.BytesWritable,Tuple>

Throws:: IOException; InterruptedException

Overview

Package

Class

Use

Tree

Deprecated

Index

Help

PREV CLASS NEXT CLASS

FRAMES NO FRAMES

SUMMARY: NESTED | FIELD | CONSTR | METHOD

DETAIL: FIELD | CONSTR | METHOD

org.apache.hadoop.zebra.mapreduce Class BasicTableOutputFormat

BasicTableOutputFormat

setMultipleOutputs

setMultipleOutputs

setMultipleOutputs

getOutputPartitionClassArguments

getOutputPaths

getZebraOutputPartitionClass

setOutputPath

getOutputPath

setSchema

getSchema

getSortKeyGenerator

getSortKey

setStorageHint

getStorageHint

setSortInfo

setSortInfo

setStorageInfo

getSortInfo

checkOutputSpecs

getRecordWriter

close

getOutputCommitter

org.apache.hadoop.zebra.mapreduce
Class BasicTableOutputFormat