Zebra and Streaming
Overview
Streaming allows you to write application logic in any language and to process large amounts of data using the Hadoop framework. Streaming, which traditionally works with text files, can now be used to process data stored as Zebra tables.
Configuration Variables
To use Zebra tables with your streaming applications, use the mapred.lib.table.input.projection variable to specify the Zebra columns (fields) to read.
bin/hadoop jar $streamingJar -D mapred.lib.table.input.projection="word, count"
Zebra Streaming Examples
In the following examples, TableInputFormat is used as the input format (the -inputformat option) and the default TextOutputFormat is used as the output format.
Creating a Zebra Table
Suppose a data file, testfile, contains space-separated records with four fields each.
en bbb1 1 1880
en bbb2 1 2000
You can use a simple Pig script to create a Zebra table, testfile-table. The table consists of one column group with four columns.
$ cat table-creator.pig
REGISTER $LOCATION/zebra-$version.jar;
testfile = LOAD 'testfile'
    USING PigStorage(' ') AS (language:chararray, page:chararray, count:int, size:long);
STORE testfile INTO 'testfile-table'
    USING org.apache.hadoop.zebra.pig.TableStorer('[language, page, count, size]');
Checking Serialization
This example is a map-only job that checks the serialization. Note that each output line starts with a tab because the key is an empty string for tables created by Pig (this changes with sorted tables).
$ bin/hadoop jar hadoop-0.20.2-dev-streaming.jar -D mapred.reduce.tasks=0 \
-input testfile-table -output output -mapper 'cat' \
-inputformat org.apache.hadoop.zebra.mapred.TableInputFormat
$ grep 'en' output/part-00000 | head
(en,bbb1,1,1880)
(en,bbb2,1,2000)
(en,bbb3,1,1950)
(en,bbb4,1,48900)
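A streaming mapper has to split each serialized line into the key and the tuple fields. The following Python sketch shows one way to do that for the layout above. The function name parse_zebra_line is hypothetical, and the sketch assumes field values contain no commas or parentheses, as in this sample data.

```python
def parse_zebra_line(line):
    """Split one streaming line into (key, fields).

    Assumes the layout shown above: a key, a tab, then a tuple such as
    (en,bbb1,1,1880). Field values are assumed to contain no commas.
    """
    key, sep, value = line.rstrip("\n").partition("\t")
    if not sep or not (value.startswith("(") and value.endswith(")")):
        return None  # not a serialized tuple; the caller can skip it
    return key, value[1:-1].split(",")


if __name__ == "__main__":
    # The key is empty for an unsorted table created by Pig.
    print(parse_zebra_line("\t(en,bbb1,1,1880)\n"))
```

All fields come back as strings, so a mapper that needs the count or size as a number has to convert them itself.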
Locating Frequently Visited Pages
This Perl script, table-mapper.pl, is the mapper for a job that sorts pages by page view count. It emits the count as a space-padded key so that string sorting produces the correct numeric order. The first tab separates the key from the value for Hadoop streaming.
use strict;
use warnings;

while (<>) {
    chomp;
    # Drop the key (if any) up to the first tab and strip the enclosing parentheses.
    s/^[^\t]*\t\((.*)\)$/$1/ or next;
    my @fields = split /,/;    # comma-separated list of column values
    # The key is the space-padded third column (count).
    printf("%8d\t%s\n", $fields[2], "@fields") if @fields == 4;    # without a projection
    # printf("%8d\t%s\n", shift @fields, join(',', @fields));      # with projection="count,page"
}
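To see why the key is space padded, the short Python sketch below compares string sorting of plain and padded counts. The sample counts are taken from the output in this section; the variable names are illustrative only.

```python
counts = [129, 14, 222, 45]

# Plain decimal strings sort character by character, so "129" < "14".
unpadded = sorted(str(c) for c in counts)

# Right-aligning to a fixed width (as "%8d" does) makes the
# string order match the numeric order for non-negative counts
# of up to eight digits.
padded = sorted("%8d" % c for c in counts)

if __name__ == "__main__":
    print(unpadded)
    print([p.strip() for p in padded])
```

The padding works because a space sorts before every digit, so shorter numbers always come first.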
Streaming command:
$ bin/hadoop jar hadoop-0.20.2-dev-streaming.jar \
-input testfile-table -output output -mapper table-mapper.pl -reducer cat \
-inputformat org.apache.hadoop.zebra.mapred.TableInputFormat
Pages are printed in increasing order of page view counts.
$ tail output/part-00000
10 fr bbb1 10 5883
14 de bbb2 14 2120
20 it bbb3 20 229
45 ja bbb4 45 75
47 de bbb5 47 43488
63 en bbb6 63 2404
73 de bbb7 73 1090
129 en bbb8 129 31
188 en bbb9 188 37
222 en bbb10 222 469
Projecting Columns
Use projection to read only a few columns (fields) of a very large table. In the table-mapper.pl script, replace the active printf line with the commented-out line marked "with projection", then run the following streaming command:
$ bin/hadoop jar hadoop-0.20.2-dev-streaming.jar -D mapred.lib.table.input.projection="count,page" \
-input testfile-table -output output -mapper table-mapper.pl -reducer cat \
-inputformat org.apache.hadoop.zebra.mapred.TableInputFormat
$ tail output/part-00000
10 bbb1
14 bbb2
20 bbb3
45 bbb4
47 bbb5
63 bbb6
73 bbb7
129 bbb8
188 bbb9
222 bbb10
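Because streaming accepts a mapper in any language, the projected job could also use a mapper such as the Python sketch below. It assumes each projected row arrives as a tab followed by a two-field tuple in projection order, for example (10,bbb1); the function name map_projected and the sample rows are hypothetical.

```python
def map_projected(line):
    # Turn a projected row such as <TAB>(10,bbb1) into a padded key,
    # a tab, and the page name. Returns None for any other row shape.
    _key, _sep, value = line.rstrip("\n").partition("\t")
    fields = value.strip("()").split(",")
    if len(fields) != 2:
        return None  # not a (count,page) row
    count, page = fields
    return "%8d\t%s" % (int(count), page)


if __name__ == "__main__":
    # Sample rows stand in for the lines a real mapper reads from stdin.
    for row in ["\t(10,bbb1)\n", "\t(222,bbb10)\n"]:
        print(map_projected(row))
```

In a real job the loop would iterate over sys.stdin instead of the sample list.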


