org.apache.hadoop.zebra (Pig 0.9.3-SNAPSHOT API)

Package org.apache.hadoop.zebra

Hadoop Table - tabular data storage for Hadoop MapReduce and PIG.

Package org.apache.hadoop.zebra Description

Hadoop Table provides tabular data storage for the Hadoop MapReduce framework. Closer integration of Table with PIG is also planned.

For this release, the basic construct of Hadoop Table is called BasicTable. A BasicTable is a create-once, read-only persistent data storage entity. A BasicTable contains zero or more keyed rows.

The API uses Hadoop BytesWritable objects to represent row keys, and PIG Tuple objects to represent rows.

Each BasicTable maintains a Schema, which, for this release, is simply a collection of column names. Given a schema, we can deduce the integer index of a particular column and use it to extract (get) the desired datum from a PIG Tuple object (which only allows index-based access).
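The name-to-index lookup described above can be illustrated with a minimal sketch. It does not use the Zebra Schema class or the PIG Tuple class: ColumnIndexSketch is a hypothetical stand-in, and a plain java.util.List plays the role of a tuple that supports only index-based access.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a name-only schema; NOT the Zebra Schema class.
final class ColumnIndexSketch {
    // Maps each column name to its position in the row.
    private final Map<String, Integer> indexByName = new HashMap<>();

    ColumnIndexSketch(String... columnNames) {
        for (int i = 0; i < columnNames.length; i++) {
            if (indexByName.put(columnNames[i], i) != null) {
                throw new IllegalArgumentException("duplicate column: " + columnNames[i]);
            }
        }
    }

    // Deduce the integer index of a column from its name.
    int indexOf(String columnName) {
        Integer idx = indexByName.get(columnName);
        if (idx == null) {
            throw new IllegalArgumentException("unknown column: " + columnName);
        }
        return idx;
    }

    // Extract a datum from a row that only allows index-based access.
    // The List here is an assumption standing in for a PIG Tuple.
    Object get(List<Object> row, String columnName) {
        return row.get(indexOf(columnName));
    }

    public static void main(String[] args) {
        ColumnIndexSketch schema = new ColumnIndexSketch("user", "clicks", "country");
        List<Object> row = Arrays.<Object>asList("alice", 42, "NZ");
        System.out.println("clicks is column " + schema.indexOf("clicks"));
        System.out.println("country = " + schema.get(row, "country"));
    }
}
```

Resolving the index once and reusing it for every row avoids a per-row name lookup, which is presumably why the schema is consulted separately from the tuple.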

Typically, applications use BasicTableOutputFormat (which implements the Hadoop OutputFormat interface) to create BasicTables through MapReduce, and TableInputFormat (which implements the Hadoop InputFormat interface) to feed the data back in as MapReduce input.

The API is structured in three packages:

org.apache.hadoop.zebra.mapreduce : The MapReduce layer. It contains two classes: BasicTableOutputFormat for creating BasicTables, and TableInputFormat for reading tables.
org.apache.hadoop.zebra.types : Miscellaneous facilities that handle column types and tuple serialization. Currently, it is a placeholder that redirects to PIG serialization. Table does not manage type information for individual columns.
org.apache.hadoop.zebra.io : The internal IO layer. It manages the physical storage (files) of a BasicTable. It also provides facilities that help the MapReduce layer create splits, such as partitioning BasicTables for reading and reporting data block placement distributions based on range partitions or key partitions.