Configuring Access to S3 on CDP Private Cloud Base

For Apache Hadoop applications to interact with Amazon S3, they must know the AWS access key and the secret key. This can be achieved in three different ways: through configuration properties, environment variables, or instance metadata. The first two options can be used when accessing S3 from a cluster running in your own data center. If your cluster is running on EC2, use IAM roles, which rely on instance metadata, to control access to AWS resources.

Table 1. Authentication Options for Different Deployment Scenarios

Deployment Scenario                     Authentication Options
Cluster runs on EC2                     Use IAM roles to control access to your AWS resources. If you configure role-based access, instance metadata is automatically used to authenticate.
Cluster runs in your own data center    Use the configuration properties described below to authenticate. You can set the configuration properties globally or per bucket.

Temporary security credentials, also known as "session credentials", can be issued. These consist of a secret key with a limited lifespan, along with a session token: another secret that must be known and used alongside the access key. The secret key is never passed to AWS services directly. Instead, it is used to sign the URL and headers of each HTTP request.
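As an illustration, one way to obtain session credentials is the AWS CLI's sts get-session-token command; the values it returns map onto the S3A configuration properties described later in this section:

# Request temporary credentials valid for one hour (3600 seconds).
# The JSON response contains an AccessKeyId, a SecretAccessKey, and a SessionToken.
aws sts get-session-token --duration-seconds 3600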

By default, the S3A filesystem client uses the following authentication chain:

  1. If login details were provided in the filesystem URI, a warning is printed and the username and password are extracted as the AWS key and secret, respectively. Embedding credentials in URIs is insecure; prefer per-bucket authentication credentials instead.

  2. The fs.s3a.access.key and fs.s3a.secret.key properties are looked up in the Hadoop configuration.

  3. The AWS environment variables are checked next.

  4. An attempt is made to query the Amazon EC2 Instance Metadata Service to retrieve credentials published to EC2 VMs.
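The chain can also be constrained explicitly. As a sketch, setting the fs.s3a.aws.credentials.provider property to org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider restricts S3A to session credentials taken from the configuration properties:

<property>
  <!-- Accept only session credentials from fs.s3a.access.key,
       fs.s3a.secret.key, and fs.s3a.session.token. -->
  <name>fs.s3a.aws.credentials.provider</name>
  <value>org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider</value>
</property>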

Using Configuration Properties to Authenticate

To configure authentication with S3, explicitly declare the credentials in a configuration file such as core-site.xml:

<property>
  <name>fs.s3a.access.key</name>
  <value>ACCESS-KEY</value>
</property>

<property>
  <name>fs.s3a.secret.key</name>
  <value>SECRET-KEY</value>
</property>

If you are using AWS session credentials for authentication, the secret key must be the one issued with the session, and the fs.s3a.session.token option must be set to your session token.

<property>
  <name>fs.s3a.session.token</name>
  <value>SESSION-TOKEN</value>
</property>

This configuration can also be added for a specific bucket, as described in the next section. To validate that you can successfully authenticate with S3, try referencing S3 in a URL, as shown below.
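For example, listing the root of a bucket with the hadoop fs command is a quick check; your-bucket below is a placeholder for a bucket your credentials can access:

# A failure here usually indicates an authentication or permissions problem.
hadoop fs -ls s3a://your-bucket/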

Using Per-Bucket Credentials to Authenticate

S3A supports per-bucket configuration, which can be used to declare different authentication credentials and authentication mechanisms for different buckets.

For example, a bucket s3a://nightly/ used for nightly data can be configured with a session key:

<property>
  <name>fs.s3a.bucket.nightly.access.key</name>
  <value>AKAACCESSKEY-2</value>
</property>

<property>
  <name>fs.s3a.bucket.nightly.secret.key</name>
  <value>SESSIONSECRETKEY</value>
</property>

Similarly, you can set a session token for a specific bucket:

<property>
  <name>fs.s3a.bucket.nightly.session.token</name>
  <value>SESSION-TOKEN</value>
</property>

This technique is useful for working with external sources of data, or when copying data between buckets belonging to different accounts.
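As a sketch of the copy case, once per-bucket credentials are configured for both buckets, a single DistCp invocation can move data between accounts; the production bucket name below is illustrative:

# Each bucket authenticates with its own fs.s3a.bucket.<bucket>.* credentials.
hadoop distcp s3a://nightly/data s3a://production/data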

Using Environment Variables to Authenticate

The AWS CLI supports authentication through environment variables. Hadoop uses the same environment variables if no configuration properties are set.

The environment variables are:

Environment Variable        Description
AWS_ACCESS_KEY_ID           Access key
AWS_SECRET_ACCESS_KEY       Secret key
AWS_SESSION_TOKEN           Session token (only when using session authentication)
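A minimal sketch of client-side use, assuming a placeholder bucket name; because environment variables are only visible to processes that inherit them, this approach is best suited to command-line access:

export AWS_ACCESS_KEY_ID=ACCESS-KEY
export AWS_SECRET_ACCESS_KEY=SECRET-KEY
# Only required when using session credentials:
export AWS_SESSION_TOKEN=SESSION-TOKEN

# The Hadoop client launched from this shell picks up the variables above.
hadoop fs -ls s3a://your-bucket/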