Configuring CDP Services for HDFS Encryption
This page contains recommendations for setting up HDFS Transparent Encryption with various CDP services.
HBase
Recommendations
Make /hbase an encryption zone. Do not create encryption zones as
subdirectories under /hbase, because HBase may need to rename files
across those subdirectories. When you create the encryption zone, name the key
hbase-key to take advantage of auto-generated KMS ACLs (see Configuring
KMS Access Control Lists (ACLs)).
Steps
On a cluster without HBase currently installed, create the
/hbase directory and make it an encryption zone.

On a cluster with HBase already installed:
- Stop the HBase service.
- Move data from the /hbase directory to /hbase-tmp.
- Create an empty /hbase directory and make it an encryption zone.
- DistCp all data from /hbase-tmp to /hbase, preserving user-group permissions and extended attributes.
- Start the HBase service and verify that it is working as expected.
- Remove the /hbase-tmp directory.
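The migration steps above can be sketched from the command line. This is a minimal outline, not a definitive procedure: it assumes the key hbase-key already exists in the KMS, that the commands run with HDFS superuser privileges, and that the DistCp preservation flags shown fit your release (verify on a test directory first).

```shell
# Sketch: migrate an existing /hbase into a new encryption zone.
# Assumes hbase-key already exists in the KMS and HBase is stopped.

# Set the existing data aside.
hdfs dfs -mv /hbase /hbase-tmp

# Recreate /hbase and make it an encryption zone keyed by hbase-key.
hdfs dfs -mkdir /hbase
hdfs crypto -createZone -keyName hbase-key -path /hbase

# Copy the data back, preserving user, group, permissions, and extended
# attributes (-pugpx). The glob copies the directory contents into /hbase.
hadoop distcp -pugpx '/hbase-tmp/*' /hbase

# After starting HBase and verifying it works, remove the temporary copy.
hdfs dfs -rm -r -skipTrash /hbase-tmp
```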
KMS ACL Configuration for HBase
In the KMS ACLs (see Configuring KMS Access Control Lists (ACLs)), grant the
hbase user and group DECRYPT_EEK permission for the
HBase key:
<property>
<name>key.acl.hbase-key.DECRYPT_EEK</name>
<value>hbase hbase</value>
</property>
Hive
HDFS encryption has been designed so that files cannot be moved from
one encryption zone to another or from encryption zones to unencrypted
directories. Therefore, the landing zone for data when using the
LOAD DATA INPATH command must always be inside the
destination encryption zone.
To use HDFS encryption with Hive, ensure you are using one of the following configurations:
Single Encryption Zone
With this configuration, you can use HDFS encryption by having all
Hive data inside the same encryption zone. In Cloudera Manager,
configure the Hive Scratch Directory
(hive.exec.scratchdir) to be inside the encryption
zone.
Recommended HDFS Path:
/user/hive
To use the auto-generated KMS ACLs (see Configuring KMS Access Control Lists (ACLs)),
make sure you name the encryption key hive-key.
For example, to configure a single encryption zone for the entire
Hive warehouse, you can rename /user/hive to
/user/hive-old, create an encryption zone at
/user/hive, and then distcp all
the data from /user/hive-old to
/user/hive.
In Cloudera Manager, configure the Hive Scratch Directory
(hive.exec.scratchdir) to be inside the encryption
zone by setting it to /user/hive/tmp, ensuring that
permissions are 1777 on
/user/hive/tmp.
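The warehouse migration described above can be sketched as follows. This is an outline under stated assumptions: the key hive-key already exists in the KMS, the commands run with HDFS superuser privileges, and DistCp details (preservation flags, glob handling, checksum behavior across encryption zones) should be verified on your release first.

```shell
# Sketch: single encryption zone for the entire Hive warehouse.
hdfs dfs -mv /user/hive /user/hive-old        # set the existing warehouse aside
hdfs dfs -mkdir /user/hive
hdfs crypto -createZone -keyName hive-key -path /user/hive

# Copy the data into the zone, preserving user, group, permissions, and xattrs.
hadoop distcp -pugpx '/user/hive-old/*' /user/hive

# Scratch directory inside the zone (set hive.exec.scratchdir to this path in
# Cloudera Manager), with the 1777 permissions noted above.
hdfs dfs -mkdir -p /user/hive/tmp
hdfs dfs -chmod 1777 /user/hive/tmp
```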
Multiple Encryption Zones
With this configuration, you can use encrypted databases or tables with different encryption keys. To read data from read-only encrypted tables, users must have access to a temporary directory that is encrypted at least as strongly as the table.
For example:
- Configure two encrypted tables, ezTbl1 and ezTbl2.
- Create two new encryption zones, /data/ezTbl1 and /data/ezTbl2.
- Load data to the tables in Hive using LOAD statements.
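Creating the two zones can be sketched as follows; the key names ezTbl1-key and ezTbl2-key are hypothetical and must already exist in the KMS.

```shell
# Sketch: one encryption zone per encrypted table, each with its own key.
hdfs dfs -mkdir -p /data/ezTbl1 /data/ezTbl2
hdfs crypto -createZone -keyName ezTbl1-key -path /data/ezTbl1
hdfs crypto -createZone -keyName ezTbl2-key -path /data/ezTbl2

# Verify the configured zones.
hdfs crypto -listZones
```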
For more information, see Changed Behavior after HDFS Encryption is Enabled.
Other Encrypted Directories
- LOCALSCRATCHDIR: The MapJoin optimization in Hive writes HDFS tables to a local directory and then uploads them to the distributed cache. To ensure these files are encrypted, either disable MapJoin by setting hive.auto.convert.join to false, or encrypt the local Hive Scratch directory (hive.exec.local.scratchdir) using Cloudera Navigator Encrypt.
- DOWNLOADED_RESOURCES_DIR: JARs that are added to a user session and stored in HDFS are downloaded to hive.downloaded.resources.dir on the HiveServer2 local filesystem. To encrypt these JAR files, configure Cloudera Navigator Encrypt to encrypt the directory specified by hive.downloaded.resources.dir.
- NodeManager Local Directory List: Hive stores JARs and MapJoin files in the distributed cache. To use MapJoin or encrypt JARs and other resource files, the yarn.nodemanager.local-dirs YARN configuration property must be configured to a set of encrypted local directories on all nodes.
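As a sketch, the YARN property can point at directories on encrypted mounts; the /encrypted/... mount points below are hypothetical paths that would be protected by Cloudera Navigator Encrypt or disk-level encryption.

```xml
<!-- yarn-site.xml sketch: hypothetical local directories on encrypted mounts -->
<property>
<name>yarn.nodemanager.local-dirs</name>
<value>/encrypted/disk1/yarn/nm,/encrypted/disk2/yarn/nm</value>
</property>
```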
Changed Behavior after HDFS Encryption is Enabled
- Loading data from one encryption zone to another results in a copy of the data. DistCp is used to speed up the process if the size of the files being copied is higher than the value specified by HIVE_EXEC_COPYFILE_MAXSIZE. The minimum size limit for HIVE_EXEC_COPYFILE_MAXSIZE is 32 MB, which you can modify by changing the value for the hive.exec.copyfile.maxsize configuration property.
- When loading data to encrypted tables, Cloudera strongly recommends using a landing zone inside the same encryption zone as the table.
  - Example 1: Loading unencrypted data to an encrypted table. Use one of the following methods:
    - If you are loading new unencrypted data to an encrypted table, use the LOAD DATA ... statement. Because the source data is not inside the encryption zone, the LOAD statement results in a copy. For this reason, Cloudera recommends landing data that you need to encrypt inside the destination encryption zone. You can use distcp to speed up the copying process if your data is inside HDFS.
    - If the data to be loaded is already inside a Hive table, you can create a new table with a LOCATION inside an encryption zone as follows:

      CREATE TABLE encrypted_table [STORED AS] LOCATION ... AS SELECT * FROM <unencrypted_table>

      The location specified in the CREATE TABLE statement must be inside an encryption zone. Creating a table pointing LOCATION to an unencrypted directory does not encrypt your source data. You must copy your data to an encryption zone, and then point LOCATION to that zone.
  - Example 2: Loading encrypted data to an encrypted table. If the data is already encrypted, use the CREATE TABLE statement pointing LOCATION to the encrypted source directory containing the data. This is the fastest way to create encrypted tables:

      CREATE TABLE encrypted_table [STORED AS] LOCATION ... AS SELECT * FROM <encrypted_source_directory>
- Users reading data from encrypted tables that are read-only must have access to a temporary directory which is encrypted with at least as strong encryption as the table.
- Temporary data is now written to a directory named .hive-staging in each table or partition.
- Previously, an INSERT OVERWRITE on a partitioned table inherited permissions for new data from the existing partition directory. With encryption enabled, permissions are inherited from the table.
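The CREATE TABLE ... AS SELECT pattern can be driven from the shell with beeline. This is a hypothetical example: the connection URL, storage format, table names, and location are placeholders, and the LOCATION shown only encrypts data if it sits inside an encryption zone.

```shell
# Hypothetical example: create an encrypted copy of an unencrypted table.
# /user/hive/warehouse must be (or sit inside) an encryption zone for the
# resulting files to actually be encrypted at rest.
beeline -u "jdbc:hive2://hs2-host.example.com:10000/default" -e "
CREATE TABLE encrypted_table
STORED AS PARQUET
LOCATION '/user/hive/warehouse/encrypted_table'
AS SELECT * FROM unencrypted_table;
"
```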
KMS ACL Configuration for Hive
When Hive joins tables, it compares the encryption key strength for each table. For this
operation to succeed, you must configure the KMS ACLs (see Configuring KMS Access Control
Lists (ACLs)) to allow the hive user and group
READ access to the Hive key:
<property>
<name>key.acl.hive-key.READ</name>
<value>hive hive</value>
</property>
If you have restricted access to the GET_METADATA
operation, you must grant permission for it to the
hive user or group:
<property>
<name>hadoop.kms.acl.GET_METADATA</name>
<value>hive hive</value>
</property>
If you have disabled HiveServer2 impersonation (see HiveServer2 Security
Configuration), you must configure the
KMS ACLs to grant DECRYPT_EEK permissions to the hive
user, as well as any user accessing data in the Hive warehouse.
Cloudera recommends creating a group containing all Hive users, and
granting DECRYPT_EEK access to that group.
For example, suppose user jdoe (home directory
/user/jdoe) is a Hive user and a member of the
group hive-users. The encryption zone (EZ) key for
/user/jdoe is named jdoe-key, and
the EZ key for /user/hive is
hive-key. The following ACL example demonstrates
the required permissions:
<property>
<name>key.acl.hive-key.DECRYPT_EEK</name>
<value>hive hive-users</value>
</property>
<property>
<name>key.acl.jdoe-key.DECRYPT_EEK</name>
<value>jdoe,hive</value>
</property>
If you have enabled HiveServer2 impersonation, data is accessed by
the user submitting the query or job, and the user account
(jdoe in this example) may still need to access
data in their home directory. In this scenario, the required
permissions are as follows:
<property>
<name>key.acl.hive-key.DECRYPT_EEK</name>
<value>nobody hive-users</value>
</property>
<property>
<name>key.acl.jdoe-key.DECRYPT_EEK</name>
<value>jdoe</value>
</property>
Hue
Recommendations
Make /user/hue an encryption zone because Oozie workflows and other
Hue-specific data are stored there by default. When you create the encryption zone, name
the key hue-key to take advantage of auto-generated KMS ACLs
(see Configuring KMS Access Control Lists (ACLs)).
Steps
On a cluster without Hue currently installed, create the
/user/hue directory and make it an encryption
zone.
On a cluster with Hue already installed:
- Create an empty /user/hue-tmp directory.
- Make /user/hue-tmp an encryption zone.
- DistCp all data from /user/hue into /user/hue-tmp.
- Remove /user/hue and rename /user/hue-tmp to /user/hue.
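The migration above can be sketched as follows, assuming the key hue-key already exists in the KMS, the commands run with HDFS superuser privileges, and the DistCp flags suit your release.

```shell
# Sketch: migrate existing Hue data into a new encryption zone.
hdfs dfs -mkdir /user/hue-tmp
hdfs crypto -createZone -keyName hue-key -path /user/hue-tmp
hadoop distcp -pugpx '/user/hue/*' /user/hue-tmp
hdfs dfs -rm -r -skipTrash /user/hue
hdfs dfs -mv /user/hue-tmp /user/hue
```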
KMS ACL Configuration for Hue
In the KMS ACLs (see Configuring KMS Access Control Lists (ACLs)), grant the
hue and oozie users and groups
DECRYPT_EEK permission for the Hue key:
<property>
<name>key.acl.hue-key.DECRYPT_EEK</name>
<value>oozie,hue oozie,hue</value>
</property>
Impala
Recommendations
- If HDFS encryption is enabled, configure Impala to encrypt data spilled to local disk.
- In releases lower than Impala 2.2.0 / CDH 5.4.0, Impala does not support the LOAD DATA statement when the source and destination are in different encryption zones. If you are running an affected release and need to use LOAD DATA with HDFS encryption enabled, copy the data to the table's encryption zone prior to running the statement.
- Use Cloudera Navigator to lock down the local directory where Impala UDFs are copied during execution. By default, Impala copies UDFs into /tmp, and you can configure this location through the --local_library_dir startup flag for the impalad daemon.
- Limit the rename operations for internal tables once encryption zones are set up. Impala cannot do an ALTER TABLE RENAME operation to move an internal table from one database to another, if the root directories for those databases are in different encryption zones. If the encryption zone covers a table directory but not the parent directory associated with the database, Impala cannot do an ALTER TABLE RENAME operation to rename an internal table, even within the same database.
- Avoid structuring partitioned tables where different partitions reside in different encryption zones, or where any partitions reside in an encryption zone that is different from the root directory for the table. Impala cannot do an INSERT operation into any partition that is not in the same encryption zone as the root directory of the overall table.
- If the data files for a table or partition are in a different encryption zone than the HDFS trashcan, use the PURGE keyword at the end of the DROP TABLE or ALTER TABLE DROP PARTITION statement to delete the HDFS data files immediately. Otherwise, the data files are left behind if they cannot be moved to the trashcan because of differing encryption zones. This syntax is available in Impala 2.3 / CDH 5.5 and higher.
Steps
Start every impalad process with the
--disk_spill_encryption=true flag set. This
encrypts all spilled data using AES-256-CFB. Set this flag by
selecting the Disk Spill Encryption checkbox in
the Impala service configuration in Cloudera Manager.
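One way to confirm the flag took effect is to check the impalad debug web UI, which lists the daemon's startup flags; the host below is hypothetical and 25000 is the default debug web UI port, which may differ in your deployment.

```shell
# Hypothetical host; the /varz page of the impalad debug web UI lists the
# flags the daemon was started with, including disk_spill_encryption.
curl -s http://impalad-host.example.com:25000/varz | grep disk_spill_encryption
```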
KMS ACL Configuration for Impala
Cloudera recommends making the impala user a member
of the hive group, and following the ACL
recommendations in KMS ACL Configuration for Hive.
MapReduce and YARN
MapReduce v1
Recommendations
MRv1 stores both history and logs on local disks by default. Even if you do configure history to be stored on HDFS, the files are not renamed. Hence, no special configuration is required.
MapReduce v2 (YARN)
Recommendations
Make /user/history a single encryption zone, because history files are
moved between the intermediate and done directories,
and HDFS encryption does not allow moving encrypted files across encryption zones. When
you create the encryption zone, name the key mapred-key to take
advantage of auto-generated KMS ACLs (see Configuring KMS Access Control Lists (ACLs)).
Steps
On a cluster with MRv2 (YARN) installed, create the
/user/history directory and make that an
encryption zone.
If /user/history already exists and is not
empty:
- Create an empty /user/history-tmp directory.
- Make /user/history-tmp an encryption zone.
- DistCp all data from /user/history into /user/history-tmp.
- Remove /user/history and rename /user/history-tmp to /user/history.
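The migration above can be sketched as follows, assuming the key mapred-key already exists in the KMS, the commands run with HDFS superuser privileges, and the DistCp flags suit your release.

```shell
# Sketch: migrate existing job history data into a new encryption zone.
hdfs dfs -mkdir /user/history-tmp
hdfs crypto -createZone -keyName mapred-key -path /user/history-tmp
hadoop distcp -pugpx '/user/history/*' /user/history-tmp
hdfs dfs -rm -r -skipTrash /user/history
hdfs dfs -mv /user/history-tmp /user/history
```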
KMS ACL Configuration for MapReduce
In the KMS ACLs (see Configuring KMS Access Control Lists (ACLs)), grant
DECRYPT_EEK permission for the MapReduce key to the
mapred and yarn users and the
hadoop group:
<property>
<name>key.acl.mapred-key.DECRYPT_EEK</name>
<value>mapred,yarn hadoop</value>
</property>
Search
Recommendations
Make /solr an encryption zone. When you create the encryption zone, name
the key solr-key to take advantage of auto-generated KMS ACLs
(see Configuring KMS Access Control Lists (ACLs)).
Steps
On a cluster without Solr currently installed, create the
/solr directory and make that an encryption
zone.
On a cluster with Solr already installed:
- Create an empty /solr-tmp directory.
- Make /solr-tmp an encryption zone.
- DistCp all data from /solr into /solr-tmp.
- Remove /solr, and rename /solr-tmp to /solr.
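The migration above can be sketched as follows, assuming the key solr-key already exists in the KMS, the commands run with HDFS superuser privileges, and the DistCp flags suit your release.

```shell
# Sketch: migrate existing Solr data into a new encryption zone.
hdfs dfs -mkdir /solr-tmp
hdfs crypto -createZone -keyName solr-key -path /solr-tmp
hadoop distcp -pugpx '/solr/*' /solr-tmp
hdfs dfs -rm -r -skipTrash /solr
hdfs dfs -mv /solr-tmp /solr
```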
KMS ACL Configuration for Search
In the KMS ACLs (see Configuring KMS Access Control Lists (ACLs)), grant the
solr user and group DECRYPT_EEK permission for the
Solr key:
<property>
<name>key.acl.solr-key.DECRYPT_EEK</name>
<value>solr solr</value>
</property>
Spark
Recommendations
- By default, application event logs are stored at /user/spark/applicationHistory, which can be made into an encryption zone.
- Spark also optionally caches its JAR file at /user/spark/share/lib (by default), but encrypting this directory is not required.
- Spark does not encrypt shuffle data. To do so, configure the Spark local directory, spark.local.dir (in Standalone mode), to reside on an encrypted disk. For YARN mode, make the corresponding YARN configuration changes.
KMS ACL Configuration for Spark
In the KMS ACLs (see Configuring KMS Access Control Lists (ACLs)), grant
DECRYPT_EEK permission for the Spark key to the spark
user and any groups that can submit Spark jobs:
<property>
<name>key.acl.spark-key.DECRYPT_EEK</name>
<value>spark spark-users</value>
</property>
Sqoop
Recommendations
- For Hive support: Ensure that you are using Sqoop with the --target-dir parameter set to a directory that is inside the Hive encryption zone. For more details, see Hive.
- For append/incremental support: Make sure that the sqoop.test.import.rootDir property points to the same encryption zone as the --target-dir argument.
- For HCatalog support: No special configuration is required.
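As a sketch, a Sqoop import targeting the Hive encryption zone might look like the following; the JDBC URL, credentials, table, and target path are all hypothetical placeholders.

```shell
# Hypothetical import: --target-dir points inside the /user/hive encryption
# zone, so the imported data is encrypted at rest from the start.
sqoop import \
  --connect jdbc:mysql://db-host.example.com/sales \
  --username sqoop_user \
  --table orders \
  --target-dir /user/hive/warehouse/orders
```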