Package org.apache.pig.data

This package contains implementations of Pig-specific data types as well as support functions for reading, writing, and using all Pig data types.

Interface Summary
DataBag A collection of Tuples.
InterSedes A class to handle reading and writing of intermediate results of data types.
Tuple An ordered list of Data.
TupleRawComparator This interface is intended to compare Tuples.
 

Class Summary
AccumulativeBag  
AmendableTuple  
BagFactory Factory for constructing different types of bags.
BinInterSedes A class to handle reading and writing of intermediate results of data types.
BinInterSedes.BinInterSedesTupleRawComparator  
BinSedesTuple This tuple has a faster (de)serialization mechanism.
BinSedesTupleFactory Default implementation of TupleFactory.
DataByteArray An implementation of byte array.
DataReaderWriter This class was used to handle reading and writing of intermediate results of data types.
DataType A class of static final values used to encode data types, plus a number of static helper functions for manipulating data objects.
DefaultAbstractBag Default implementation of DataBag.
DefaultAbstractBag.BagDelimiterTuple  
DefaultAbstractBag.EndBag  
DefaultAbstractBag.StartBag  
DefaultBagFactory Default implementation of BagFactory.
DefaultDataBag An unordered collection of Tuples (possibly) with multiples.
DefaultTuple A default implementation of Tuple.
DefaultTuple.DefaultTupleRawComparator  
DefaultTupleFactory Deprecated. Use TupleFactory
DistinctDataBag An unordered collection of Tuples with no multiples.
FileList This class extends ArrayList to add a finalize() that calls delete on the files.
InternalCachedBag  
InternalDistinctBag An unordered collection of Tuples with no multiples.
InternalMap This class is an empty extension of Map.
InternalSortedBag An ordered collection of Tuples (possibly) with multiples.
InterSedesFactory Used to get hold of the single instance of InterSedes.
NonSpillableDataBag An unordered collection of Tuples (possibly) with multiples.
ReadOnceBag This bag is specifically created for use by POPackageLite.
SingleTupleBag A simple performant implementation of the DataBag interface which only holds a single tuple.
SortedDataBag An ordered collection of Tuples (possibly) with multiples.
SortedSpillBag Common functionality for proactively spilling bags that need to keep the data sorted.
TargetedTuple A tuple composed with the operators to which it needs to be attached.
TimestampedTuple  
TupleFactory A factory to construct tuples.
 

Package org.apache.pig.data Description

Whenever possible, Pig uses Java-provided data types. These include Integer, Long, Float, Double, Boolean, String, and Map. Tuple, Bag, and DataByteArray are implemented in this package.

Design

Pig uses Java-provided types for two main reasons. First, it minimizes the burden on UDF developers, who have full access to these types with no need to convert to and from Pig-specific types. Second, it lowers maintenance costs, since there are no Pig-specific data classes to implement and maintain for those types. The drawback is that the only common parent of all these types is Object. Pig must therefore often treat its data as Objects and implement static methods to manipulate them, rather than defining a PigDatum class with common functions.
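The trade-off described above can be illustrated with a minimal, standalone sketch. It does not use Pig's classes: ObjectTypeSketch, its type codes, and its ordering rules are all hypothetical names and choices made for this example. It only shows the general pattern of static helpers that inspect plain Java Objects in place of methods on a common base class.

```java
import java.util.Map;

/** Hypothetical stand-in for a static helper class; this is not Pig's DataType. */
final class ObjectTypeSketch {
    // Illustrative type codes. The values are arbitrary and are not Pig's.
    static final byte UNKNOWN = 0;
    static final byte INTEGER = 1;
    static final byte LONG = 2;
    static final byte DOUBLE = 3;
    static final byte STRING = 4;
    static final byte MAP = 5;

    private ObjectTypeSketch() {}

    /** Map a plain Java object to a type code by inspecting its runtime class. */
    static byte findType(Object o) {
        if (o instanceof Integer) return INTEGER;
        if (o instanceof Long) return LONG;
        if (o instanceof Double) return DOUBLE;
        if (o instanceof String) return STRING;
        if (o instanceof Map) return MAP;
        return UNKNOWN;
    }

    /** Order by type code first, then by natural order within a type. */
    @SuppressWarnings("unchecked")
    static int compare(Object a, Object b) {
        byte ta = findType(a);
        byte tb = findType(b);
        if (ta != tb) return Byte.compare(ta, tb);
        if (a instanceof Comparable) return ((Comparable<Object>) a).compareTo(b);
        return 0; // same-typed values with no natural order are treated as equal here
    }

    public static void main(String[] args) {
        System.out.println(findType(42) == INTEGER);
        System.out.println(compare("a", "b") < 0);
    }
}
```

Because the helpers accept Object, a caller can pass values straight from java.lang and java.util without wrapping them, at the cost of runtime type checks inside every helper.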

Three data types are implemented as Pig-specific classes: DataByteArray, Tuple, and DataBag.

DataByteArray represents an array of bytes, with no interpretation of those bytes provided or assumed. This could have been represented as byte[], but a separate class was constructed to provide common functions needed to manipulate these objects.
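To show why a wrapper class can be more convenient than a raw byte[], here is a small standalone sketch. ByteArraySketch is a hypothetical class written for this example and makes no claim about how DataByteArray is implemented. The unsigned lexicographic ordering and the defensive copies are assumptions of the sketch.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

/** Hypothetical byte-array wrapper; not Pig's DataByteArray. */
final class ByteArraySketch implements Comparable<ByteArraySketch> {
    private final byte[] data;

    ByteArraySketch(byte[] bytes) {
        this.data = bytes.clone(); // defensive copy (a choice made for this sketch)
    }

    ByteArraySketch(String s) {
        this(s.getBytes(StandardCharsets.UTF_8));
    }

    byte[] get() { return data.clone(); }

    int size() { return data.length; }

    // A raw byte[] uses identity equals/hashCode; the wrapper compares contents.
    @Override public boolean equals(Object o) {
        return o instanceof ByteArraySketch
                && Arrays.equals(data, ((ByteArraySketch) o).data);
    }

    @Override public int hashCode() { return Arrays.hashCode(data); }

    /** Lexicographic comparison, treating each byte as unsigned. */
    @Override public int compareTo(ByteArraySketch other) {
        int n = Math.min(data.length, other.data.length);
        for (int i = 0; i < n; i++) {
            int cmp = Integer.compare(data[i] & 0xff, other.data[i] & 0xff);
            if (cmp != 0) return cmp;
        }
        return Integer.compare(data.length, other.data.length);
    }

    public static void main(String[] args) {
        ByteArraySketch a = new ByteArraySketch("abc");
        ByteArraySketch b = new ByteArraySketch(new byte[] {'a', 'b', 'c'});
        System.out.println(a.equals(b));
    }
}
```

With content-based equals, hashCode, and compareTo, such a wrapper can serve as a map key or a sort key, which a bare byte[] cannot do directly.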

Tuple represents an ordered collection of data elements. Every field in a tuple can contain any Pig data type. Tuple is presented as an interface to allow differing implementations in cases where users have unique representations of their data that they wish to preserve in memory. TupleFactory is an abstract class, so that users who define their own tuples can provide a factory that creates them. Default implementations of Tuple and TupleFactory are provided and are used by default.
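The interface-plus-abstract-factory arrangement can be sketched in plain Java as follows. Every name here (TupleDesignSketch, SimpleTuple, SimpleTupleFactory, ListTuple, setInstance) is hypothetical and chosen for illustration. The sketch mirrors the shape of the design described above, not the actual signatures of Tuple or TupleFactory.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical illustration of an interface with a pluggable abstract factory. */
final class TupleDesignSketch {
    /** Callers program against this interface, never a concrete class. */
    interface SimpleTuple {
        int size();
        Object get(int fieldNum);
        void set(int fieldNum, Object val);
    }

    /** Abstract factory: a custom tuple type ships with its own subclass. */
    abstract static class SimpleTupleFactory {
        private static SimpleTupleFactory instance = new ListTupleFactory();

        static SimpleTupleFactory getInstance() { return instance; }

        // Hypothetical hook for swapping in a user-defined factory.
        static void setInstance(SimpleTupleFactory f) { instance = f; }

        abstract SimpleTuple newTuple(int size);
    }

    /** Default implementation backed by an ArrayList; any field may hold any type. */
    static final class ListTuple implements SimpleTuple {
        private final List<Object> fields;

        ListTuple(int size) {
            fields = new ArrayList<>(Collections.<Object>nCopies(size, null));
        }

        @Override public int size() { return fields.size(); }
        @Override public Object get(int fieldNum) { return fields.get(fieldNum); }
        @Override public void set(int fieldNum, Object val) { fields.set(fieldNum, val); }
    }

    static final class ListTupleFactory extends SimpleTupleFactory {
        @Override SimpleTuple newTuple(int size) { return new ListTuple(size); }
    }

    public static void main(String[] args) {
        SimpleTuple t = SimpleTupleFactory.getInstance().newTuple(2);
        t.set(0, "alice");
        t.set(1, 30);
        System.out.println(t.get(0) + " " + t.get(1));
    }
}
```

Code that obtains tuples only through the factory keeps working unchanged when a different factory is installed, which is the point of making the factory abstract.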

DataBag represents a collection of Tuples. DataBags can be of default type (no extra features), sorted (tuples are sorted according to a provided comparator function), or distinct (no duplicate tuples). As with Tuple, DataBag is presented as an interface, and BagFactory is an abstract class. Default implementations of DataBag, BagFactory, and all three types of bags are provided.
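The difference between the three kinds of bag can be sketched with standard collections. This is an illustration only: BagKindsSketch and its methods are hypothetical, tuples are modeled as List<Object>, and real bag implementations may behave differently (for example, in how they manage memory).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.Comparator;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical illustration of default, sorted, and distinct bag semantics. */
final class BagKindsSketch {
    private BagKindsSketch() {}

    /** Default: keeps every tuple, duplicates included, with no ordering promise. */
    static List<List<Object>> defaultBag(Collection<List<Object>> tuples) {
        return new ArrayList<>(tuples);
    }

    /** Sorted: keeps duplicates, ordered by the supplied comparator. */
    static List<List<Object>> sortedBag(Collection<List<Object>> tuples,
                                        Comparator<List<Object>> cmp) {
        List<List<Object>> out = new ArrayList<>(tuples);
        out.sort(cmp);
        return out;
    }

    /** Distinct: drops tuples that are equal to one already present. */
    static Set<List<Object>> distinctBag(Collection<List<Object>> tuples) {
        return new LinkedHashSet<>(tuples);
    }

    public static void main(String[] args) {
        List<List<Object>> input = Arrays.asList(
                Arrays.<Object>asList("b", 2),
                Arrays.<Object>asList("a", 1),
                Arrays.<Object>asList("b", 2));
        // Sort on the first field, assumed here to be a String.
        Comparator<List<Object>> byFirst =
                (x, y) -> ((String) x.get(0)).compareTo((String) y.get(0));

        System.out.println(defaultBag(input).size());
        System.out.println(sortedBag(input, byFirst).get(0));
        System.out.println(distinctBag(input).size());
    }
}
```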



Copyright © 2012 The Apache Software Foundation