java.lang.Object
  org.apache.hadoop.zebra.io.BasicTable.Reader

public static class BasicTable.Reader

BasicTable reader.
Nested Class Summary | |
---|---|
static class | BasicTable.Reader.RangeSplit - A range-based split on the zebra table. The content of the split is implementation-dependent.
static class | BasicTable.Reader.RowSplit - A row-based split on the zebra table.
Constructor Summary | |
---|---|
BasicTable.Reader(org.apache.hadoop.fs.Path path, org.apache.hadoop.conf.Configuration conf) | Create a BasicTable reader.
BasicTable.Reader(org.apache.hadoop.fs.Path path, String[] deletedCGs, org.apache.hadoop.conf.Configuration conf) |
Method Summary | |
---|---|
void | close() - Close the BasicTable for reading.
BlockDistribution | getBlockDistribution(BasicTable.Reader.RangeSplit split) - Given a split range, calculate how the file data that fall into the range are distributed among hosts.
BlockDistribution | getBlockDistribution(BasicTable.Reader.RowSplit split) - Given a row-based split, calculate how the file data that fall into the split are distributed among hosts.
String | getDeletedCGs()
static String | getDeletedCGs(org.apache.hadoop.fs.Path path, org.apache.hadoop.conf.Configuration conf)
KeyDistribution | getKeyDistribution(int n, int nTables, BlockDistribution lastBd) - Collect some key samples and use them to partition the table.
DataInputStream | getMetaBlock(String name) - Obtain an input stream for reading a meta block.
String | getName(int i)
String | getPath() - Get the path to the table.
org.apache.hadoop.fs.PathFilter | getPathFilter(org.apache.hadoop.conf.Configuration conf) - Get the path filter used by the table.
int | getRowSplitCGIndex() - Get the index of the column group that will be used for row-based split.
TableScanner | getScanner(BasicTable.Reader.RangeSplit split, boolean closeReader) - Get a scanner that reads a consecutive number of rows as defined in the BasicTable.Reader.RangeSplit object, which should be obtained from previous calls of rangeSplit(int).
TableScanner | getScanner(boolean closeReader, BasicTable.Reader.RowSplit rowSplit) - Get a scanner that reads a consecutive number of rows as defined in the BasicTable.Reader.RowSplit object.
TableScanner | getScanner(org.apache.hadoop.io.BytesWritable beginKey, org.apache.hadoop.io.BytesWritable endKey, boolean closeReader) - Get a scanner that reads all rows whose row keys fall in a specific range.
Schema | getSchema() - Get the schema of the table.
static Schema | getSchema(org.apache.hadoop.fs.Path path, org.apache.hadoop.conf.Configuration conf) - Get the BasicTable schema without loading the full table index.
SortInfo | getSortInfo()
BasicTableStatus | getStatus() - Get the status of the BasicTable.
boolean | isSorted() - Is the table sorted?
List<BasicTable.Reader.RangeSplit> | rangeSplit(int n) - Split the table into at most n parts.
void | rearrangeFileIndices(org.apache.hadoop.fs.FileStatus[] fileStatus) - Rearrange the files according to the column group index ordering.
List<BasicTable.Reader.RowSplit> | rowSplit(long[] starts, long[] lengths, org.apache.hadoop.fs.Path[] paths, int splitCGIndex, int[] batchSizes, int numBatches) - We already use FileInputFormat to create byte offset-based input splits.
void | setProjection(String projection) - Set the projection for the reader.
Methods inherited from class java.lang.Object |
---|
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
Constructor Detail |
---|
public BasicTable.Reader(org.apache.hadoop.fs.Path path, org.apache.hadoop.conf.Configuration conf) throws IOException
Create a BasicTable reader.
Parameters:
path - The directory path to the BasicTable.
conf - Optional configuration parameters.
Throws:
IOException
public BasicTable.Reader(org.apache.hadoop.fs.Path path, String[] deletedCGs, org.apache.hadoop.conf.Configuration conf) throws IOException
Throws:
IOException
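A minimal usage sketch: the table path below is a placeholder, and a zebra table is assumed to already exist at that location (requires the Hadoop and zebra jars on the classpath):

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.zebra.io.BasicTable;

public class ReaderExample {
  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();
    // Hypothetical table location; replace with a real BasicTable directory.
    Path tablePath = new Path("/user/demo/employees");

    BasicTable.Reader reader = new BasicTable.Reader(tablePath, conf);
    try {
      System.out.println("sorted: " + reader.isSorted());
      System.out.println("schema: " + reader.getSchema());
    } finally {
      reader.close(); // release the resources held by the table
    }
  }
}
```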
Method Detail |
---|
public boolean isSorted()
public SortInfo getSortInfo()
public String getName(int i)
public void setProjection(String projection) throws ParseException, IOException
Set the projection for the reader.
Parameters:
projection - The projection on the BasicTable for subsequent read operations. For this version of the implementation, the projection is a comma-separated list of column names, such as "FirstName, LastName, Sex, Department". To select all columns, pass projection == null.
Throws:
IOException
ParseException
See Also:
getScanner(RangeSplit, boolean), getScanner(BytesWritable, BytesWritable, boolean), getStatus(), getSchema()
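A short sketch of setting a projection before scanning; `reader` is assumed to be an open BasicTable.Reader, and the column names are illustrative (they must exist in the table's schema):

```java
// Restrict subsequent reads to two columns.
reader.setProjection("FirstName, LastName");

// getSchema() now reflects the projection rather than the full table schema.
Schema projected = reader.getSchema();

// To go back to reading all columns, pass null.
reader.setProjection(null);
```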
public BasicTableStatus getStatus() throws IOException
Get the status of the BasicTable.
Throws:
IOException
public BlockDistribution getBlockDistribution(BasicTable.Reader.RangeSplit split) throws IOException
Given a split range, calculate how the file data that fall into the range are distributed among hosts.
Parameters:
split - The range-based split. Can be null to indicate the whole TFile.
Throws:
IOException
See Also:
rangeSplit(int)
public BlockDistribution getBlockDistribution(BasicTable.Reader.RowSplit split) throws IOException
Given a row-based split, calculate how the file data that fall into the split are distributed among hosts.
Parameters:
split - The row-based split. Cannot be null.
Throws:
IOException
public KeyDistribution getKeyDistribution(int n, int nTables, BlockDistribution lastBd) throws IOException
Collect some key samples and use them to partition the table. The returned KeyDistribution object also contains information on how data are distributed for each key-partitioned bucket.
Parameters:
n - Targeted size of the sampling.
nTables - Number of tables in union.
Throws:
IOException
public TableScanner getScanner(org.apache.hadoop.io.BytesWritable beginKey, org.apache.hadoop.io.BytesWritable endKey, boolean closeReader) throws IOException
Get a scanner that reads all rows whose row keys fall in a specific range.
Parameters:
beginKey - The begin key of the scan range. If null, start from the first row in the table.
endKey - The end key of the scan range. If null, scan till the last row in the table.
closeReader - Close the underlying Reader object when we close the scanner. Should be set to true if we have only one scanner on top of the reader, so that resources are released after the scanner is closed.
Throws:
IOException
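A key-range scan might look like the following sketch. It assumes `reader` is an open BasicTable.Reader on a key-sorted table; the iteration methods (`atEnd()`, `getKey(...)`, `getValue(...)`, `advance()`) and the `TypesUtils.createTuple(...)` helper are taken from the zebra TableScanner and types APIs, not from this page:

```java
BytesWritable begin = new BytesWritable("alice".getBytes());
BytesWritable end = new BytesWritable("bob".getBytes());

// closeReader = false: keep the reader usable after this scanner is closed.
TableScanner scanner = reader.getScanner(begin, end, false);
try {
  BytesWritable key = new BytesWritable();
  Tuple row = TypesUtils.createTuple(reader.getSchema());
  while (!scanner.atEnd()) {
    scanner.getKey(key);
    scanner.getValue(row);
    // process (key, row) ...
    scanner.advance();
  }
} finally {
  scanner.close();
}
```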
public TableScanner getScanner(BasicTable.Reader.RangeSplit split, boolean closeReader) throws IOException, ParseException
Get a scanner that reads a consecutive number of rows as defined in the BasicTable.Reader.RangeSplit object, which should be obtained from previous calls of rangeSplit(int).
Parameters:
split - The split range. If null, get a scanner to read the complete table.
closeReader - Close the underlying Reader object when we close the scanner. Should be set to true if we have only one scanner on top of the reader, so that resources are released after the scanner is closed.
Throws:
IOException
ParseException
public TableScanner getScanner(boolean closeReader, BasicTable.Reader.RowSplit rowSplit) throws IOException, ParseException
Get a scanner that reads a consecutive number of rows as defined in the BasicTable.Reader.RowSplit object.
Parameters:
closeReader - Close the underlying Reader object when we close the scanner. Should be set to true if we have only one scanner on top of the reader, so that resources are released after the scanner is closed.
rowSplit - Split based on row numbers.
Throws:
IOException
ParseException
public Schema getSchema()
Get the schema of the table. The returned schema may be different from getSchema(Path, Configuration) if a projection has been set on the table.
public static Schema getSchema(org.apache.hadoop.fs.Path path, org.apache.hadoop.conf.Configuration conf) throws IOException
Get the BasicTable schema without loading the full table index.
Parameters:
path - The path to the BasicTable.
conf -
Throws:
IOException
public String getPath()
public org.apache.hadoop.fs.PathFilter getPathFilter(org.apache.hadoop.conf.Configuration conf)
public List<BasicTable.Reader.RangeSplit> rangeSplit(int n) throws IOException
Split the table into at most n parts.
Parameters:
n - Maximum number of parts in the output list.
Throws:
IOException
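A sketch of dividing a table for parallel reads; `reader` is assumed to be an open BasicTable.Reader:

```java
// Ask for at most 4 range splits; the returned list may contain fewer.
List<BasicTable.Reader.RangeSplit> splits = reader.rangeSplit(4);

for (BasicTable.Reader.RangeSplit split : splits) {
  // closeReader = false, since the same reader backs every split's scanner.
  TableScanner scanner = reader.getScanner(split, false);
  try {
    // scan the rows of this split ...
  } finally {
    scanner.close();
  }
}
```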
public List<BasicTable.Reader.RowSplit> rowSplit(long[] starts, long[] lengths, org.apache.hadoop.fs.Path[] paths, int splitCGIndex, int[] batchSizes, int numBatches) throws IOException
We already use FileInputFormat to create byte offset-based input splits.
Parameters:
starts - array of starting bytes of fileSplits.
lengths - array of lengths of fileSplits.
paths - array of paths of fileSplits.
splitCGIndex - index of the column group that is used to create fileSplits.
Throws:
IOException
public void rearrangeFileIndices(org.apache.hadoop.fs.FileStatus[] fileStatus) throws IOException
Rearrange the files according to the column group index ordering.
Parameters:
fileStatus - array of FileStatus to be rearranged.
Throws:
IOException
public int getRowSplitCGIndex() throws IOException
Get the index of the column group that will be used for row-based split.
Throws:
IOException
public void close() throws IOException
Close the BasicTable for reading.
Specified by:
close in interface Closeable
Throws:
IOException
public String getDeletedCGs()
public static String getDeletedCGs(org.apache.hadoop.fs.Path path, org.apache.hadoop.conf.Configuration conf) throws IOException
Throws:
IOException
public DataInputStream getMetaBlock(String name) throws MetaBlockDoesNotExist, IOException
Obtain an input stream for reading a meta block.
Parameters:
name - The name of the meta block.
Throws:
IOException
MetaBlockDoesNotExist
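A sketch of reading a meta block; `reader` is assumed to be an open BasicTable.Reader, "table.stats" is a hypothetical meta block name, and the stream format is whatever the writer of that block chose:

```java
DataInputStream in = reader.getMetaBlock("table.stats");
try {
  // Illustrative only: assumes the block's writer wrote a single long.
  long rows = in.readLong();
} finally {
  in.close();
}
```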