Configuring Access to S3 on CDP Private Cloud Base
For Apache Hadoop applications to be able to interact with Amazon S3, they must know the AWS access key and the secret key. This can be achieved in three different ways: through configuration properties, environment variables, or instance metadata. The first two options can be used when accessing S3 from a cluster running in your own data center; if your cluster is running on EC2, IAM roles, which use instance metadata, should be used to control access to AWS resources instead.
| Deployment Scenario | Authentication Options |
|---|---|
| Cluster runs on EC2 | Use IAM roles to control access to your AWS resources. If you configure role-based access, instance metadata is automatically used to authenticate. |
| Cluster runs in your own data center | Use the configuration properties described below to authenticate. You can set the configuration properties globally or per bucket. |
Temporary security credentials, also known as "session credentials", can be issued. These consist of a secret key with a limited lifespan, along with a session token: an additional secret that must be known and used alongside the access key. The secret key is never passed to AWS services directly. Instead, it is used to sign the URL and headers of the HTTP request.
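Session credentials can be obtained from the AWS Security Token Service, for example with the AWS CLI (the one-hour duration shown is an arbitrary choice):

```shell
# Request temporary credentials from AWS STS. The JSON response contains
# an access key, a secret key, and a session token, plus an expiry time.
aws sts get-session-token --duration-seconds 3600
```

The returned values map onto fs.s3a.access.key, fs.s3a.secret.key, and fs.s3a.session.token respectively.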
By default, the S3A filesystem client uses the following authentication chain:

- If login details were provided in the filesystem URI, a warning is printed, and the username and password are extracted as the AWS key and secret respectively. You may authenticate using per-bucket authentication credentials.
- The fs.s3a.access.key and fs.s3a.secret.key properties are looked for in the Hadoop configuration.
- The AWS environment variables are looked for next.
- An attempt is made to query the Amazon EC2 Instance Metadata Service to retrieve credentials published to EC2 VMs.
Using Configuration Properties to Authenticate
To configure authentication with S3, explicitly declare the credentials in a configuration file such as core-site.xml:
<property>
<name>fs.s3a.access.key</name>
<value>ACCESS-KEY</value>
</property>
<property>
<name>fs.s3a.secret.key</name>
<value>SECRET-KEY</value>
</property>
If using AWS session credentials for authentication, the secret key must be that of the session, and the fs.s3a.session.token option must be set to your session token.
<property>
<name>fs.s3a.session.token</name>
<value>SESSION-TOKEN</value>
</property>
This configuration can be added for a specific bucket. To validate that you can successfully authenticate with S3, try referencing S3 in a URL, for example by listing the contents of a bucket.
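For example, a simple listing exercises the configured credentials end to end (the bucket name below is a placeholder, not one from this document):

```shell
# List the root of an S3 bucket through the S3A connector.
# "mybucket" is illustrative; substitute a bucket your credentials can read.
# An authentication failure surfaces here as an AccessDenied error.
hadoop fs -ls s3a://mybucket/
```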
Using Per-Bucket Credentials to Authenticate
S3A supports per-bucket configuration, which can be used to declare different authentication credentials and authentication mechanisms for different buckets.
For example, a bucket s3a://nightly/ used for nightly data can be configured with a session key:
<property>
<name>fs.s3a.bucket.nightly.access.key</name>
<value>AKAACCESSKEY-2</value>
</property>
<property>
<name>fs.s3a.bucket.nightly.secret.key</name>
<value>SESSIONSECRETKEY</value>
</property>
Similarly, you can set a session token for a specific bucket:
<property>
<name>fs.s3a.bucket.nightly.session.token</name>
<value>SESSION-TOKEN</value>
</property>
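As a sketch, once per-bucket credentials are in place for two buckets, DistCp can copy between them even though each bucket authenticates differently; the second bucket and both paths below are illustrative:

```shell
# Copy data between two buckets that use different credentials.
# "archive" and the paths are illustrative names; each bucket resolves
# its own fs.s3a.bucket.<name>.* settings from the configuration.
hadoop distcp s3a://nightly/logs/ s3a://archive/logs/
```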
This technique is useful for working with external sources of data, or when copying data between buckets belonging to different accounts.
Using Environment Variables to Authenticate
The AWS CLI supports authentication through environment variables. Hadoop uses these same environment variables if no configuration properties are set.
The environment variables are:
| Environment Variable | Description |
|---|---|
| AWS_ACCESS_KEY_ID | Access key |
| AWS_SECRET_ACCESS_KEY | Secret key |
| AWS_SESSION_TOKEN | Session token (only if using session authentication) |
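A minimal sketch of setting these variables in a shell before launching a Hadoop command; the values are placeholders, not real credentials:

```shell
# Placeholder credentials; replace with real values issued by AWS.
export AWS_ACCESS_KEY_ID="AKIAEXAMPLEKEY"
export AWS_SECRET_ACCESS_KEY="exampleSecretKey"
# Only needed when using temporary (session) credentials.
export AWS_SESSION_TOKEN="exampleSessionToken"
```

Because these are ordinary environment variables, they apply only to processes started from that shell session.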