This document is intended for system administrators who need to configure HDFS compression on Linux.
HDFS supports the following compression codecs: GzipCodec, DefaultCodec, BZip2Codec, LzoCodec, and SnappyCodec. GzipCodec is typically used for HDFS compression. Use the following instructions to enable GzipCodec.
Option I: To use GzipCodec with a one-time job:

On the NameNode host machine, execute the following command as the hdfs user:

hadoop jar hadoop-examples-1.1.0-SNAPSHOT.jar sort \
  "-Dmapred.compress.map.output=true" \
  "-Dmapred.map.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec" \
  "-Dmapred.output.compress=true" \
  "-Dmapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec" \
  -outKey org.apache.hadoop.io.Text \
  -outValue org.apache.hadoop.io.Text \
  input output
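Once the job completes, you can spot-check that compression took effect. A minimal sketch, assuming the output directory is the `output` path used in the command above (the part-file name is illustrative and depends on the job):

```shell
# List the job output directory produced by the sort example
hadoop fs -ls output

# 'hadoop fs -text' decompresses recognized codecs and compressed
# SequenceFiles, so records print as plain text either way
hadoop fs -text output/part-00000 | head
```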
Option II: To enable GzipCodec as the default compression:
Edit the core-site.xml file on the NameNode host machine:

<property>
  <name>io.compression.codecs</name>
  <value>org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.DefaultCodec,com.hadoop.compression.lzo.LzoCodec,org.apache.hadoop.io.compress.SnappyCodec</value>
  <description>A list of the compression codec classes that can be used for compression/decompression.</description>
</property>

Edit the mapred-site.xml file on the JobTracker host machine:

<property>
  <name>mapred.compress.map.output</name>
  <value>true</value>
</property>
<property>
  <name>mapred.map.output.compression.codec</name>
  <value>org.apache.hadoop.io.compress.GzipCodec</value>
</property>
<property>
  <name>mapred.output.compression.type</name>
  <value>BLOCK</value>
</property>
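The same settings can also be applied per job in code rather than cluster-wide. A minimal sketch using the Hadoop 1.x mapred API (the class name GzipJobConfExample is a placeholder; the rest of the job setup is up to you):

```java
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.SequenceFileOutputFormat;

public class GzipJobConfExample {
    public static JobConf configureCompression(JobConf conf) {
        // Compress intermediate map output with gzip
        // (mapred.compress.map.output / mapred.map.output.compression.codec)
        conf.setCompressMapOutput(true);
        conf.setMapOutputCompressorClass(GzipCodec.class);

        // Compress the final job output as well
        // (mapred.output.compress / mapred.output.compression.codec)
        FileOutputFormat.setCompressOutput(conf, true);
        FileOutputFormat.setOutputCompressorClass(conf, GzipCodec.class);

        // Use BLOCK compression for SequenceFile output
        // (mapred.output.compression.type)
        SequenceFileOutputFormat.setOutputCompressionType(
                conf, SequenceFile.CompressionType.BLOCK);
        return conf;
    }
}
```

Settings made this way override the mapred-site.xml defaults for that job only.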
[Optional] - Set the following two configuration parameters to also compress the final job output.
Edit the mapred-site.xml file on the JobTracker host machine:

<property>
  <name>mapred.output.compress</name>
  <value>true</value>
</property>
<property>
  <name>mapred.output.compression.codec</name>
  <value>org.apache.hadoop.io.compress.GzipCodec</value>
</property>
Restart the cluster using the instructions provided on this page.
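The exact restart procedure depends on your distribution and on how the services are managed. On a stock Apache Hadoop 1.x installation, the bundled control scripts look roughly like this (a sketch; run them as the appropriate service users):

```shell
# Stop MapReduce first, then HDFS
stop-mapred.sh
stop-dfs.sh

# Start HDFS first, then MapReduce, so the JobTracker
# comes up against a live NameNode
start-dfs.sh
start-mapred.sh
```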

