# Deploy Hamilton Functions as an AWS Glue Job
[AWS Glue](https://aws.amazon.com/glue/) is a serverless data integration service. This guide demonstrates deploying a "hello-world" [processing job](https://docs.aws.amazon.com/glue/latest/dg/add-job-python.html) using Hamilton functions on AWS Glue.
## Prerequisites
- **AWS CLI Setup**: Make sure the AWS CLI is installed and configured on your machine. If it isn't yet, follow the [Quick Start guide](https://docs.aws.amazon.com/cli/latest/userguide/getting-started-quickstart.html) for setup instructions.
## Step-by-Step Guide
### 1. Build wheel with Hamilton functions
AWS Glue jobs run a single Python script, but you can include external code (such as our Hamilton functions) by packaging it as a Python wheel. Let's package the code so the job can import it.
- **Install build package:**
This command installs the `build` package, which we'll use to create the Python wheel.
```shell
pip install build
```
- **Build the Python wheel:**
```shell
cd app \
&& python -m build --wheel --skip-dependency-check \
&& cd ..
```
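Before uploading, it can help to confirm that the wheel was actually produced and contains the expected modules. The following is a minimal sketch, assuming the wheel file name used in the upload step of this guide; a wheel is a zip archive, so Python's standard-library `zipfile` command-line interface can list its contents.

```shell
# Path of the wheel produced by the build step (file name taken from the upload step).
WHEEL="app/dist/hamilton_functions-0.1-py3-none-any.whl"

if [ -f "$WHEEL" ]; then
  # List the files packaged inside the wheel.
  python -m zipfile -l "$WHEEL"
else
  # Report the problem without aborting the shell session.
  echo "Wheel not found at $WHEEL; check the output of the build step." >&2
fi
```

If the listing does not show your Hamilton function modules, the package configuration in `app` is the first place to look.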
### 2. Upload all necessary files to S3
- **Upload the wheel file to S3:**
Replace `<YOUR_PATH_TO_WHL>` with your specific S3 bucket and path:
```shell
aws s3 cp \
app/dist/hamilton_functions-0.1-py3-none-any.whl \
s3://<YOUR_PATH_TO_WHL>/hamilton_functions-0.1-py3-none-any.whl
```
- **Upload the main Python script to S3:**
Replace `<YOUR_PATH_TO_SCRIPT>` with your specific S3 bucket and path:
```shell
aws s3 cp \
processing.py \
s3://<YOUR_PATH_TO_SCRIPT>/processing.py
```
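- **Upload the input data to S3:**
Replace `<YOUR_PATH_TO_INPUT_DATA>` with your specific S3 bucket and path. Because the destination has no trailing slash, the value is used as the object key itself, and the same value is passed to the job as `--input-table` in step 4: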
```shell
aws s3 cp \
data/input_table.csv \
s3://<YOUR_PATH_TO_INPUT_DATA>
```
### 3. Create a simple role for AWS Glue job execution
- **Create the Role**:
```shell
aws iam create-role \
--role-name GlueProcessorRole \
--assume-role-policy-document '{"Version": "2012-10-17", "Statement": [{ "Effect": "Allow", "Principal": { "Service": "glue.amazonaws.com"}, "Action": "sts:AssumeRole"}]}'
```
- **Attach Policies to the Role**:
Here we grant full access to S3 as an example. For production environments it's important to restrict access appropriately.
```shell
aws iam attach-role-policy \
--role-name GlueProcessorRole \
--policy-arn arn:aws:iam::aws:policy/AmazonS3FullAccess
aws iam attach-role-policy \
--role-name GlueProcessorRole \
--policy-arn arn:aws:iam::aws:policy/service-role/AWSGlueServiceRole
```
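For a setup closer to production, one option is to replace `AmazonS3FullAccess` with an inline policy scoped to a single bucket. The sketch below is an illustration under assumptions, not a vetted policy: the helper `s3_scoped_policy`, the policy name `GlueProcessorS3Scoped`, and the bucket name `my-example-bucket` are all hypothetical, and the list of actions assumes the job only needs to read the script, wheel, and input data and to write its output.

```shell
# Hypothetical helper: print an IAM policy document limited to one bucket.
s3_scoped_policy() {
  bucket="$1"
  cat <<EOF
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:PutObject"],
      "Resource": "arn:aws:s3:::${bucket}/*"
    },
    {
      "Effect": "Allow",
      "Action": "s3:ListBucket",
      "Resource": "arn:aws:s3:::${bucket}"
    }
  ]
}
EOF
}

# Example call (commented out; bucket and policy names are placeholders):
# aws iam put-role-policy \
#   --role-name GlueProcessorRole \
#   --policy-name GlueProcessorS3Scoped \
#   --policy-document "$(s3_scoped_policy my-example-bucket)"
```

Generating the document from a function keeps the bucket name in one place; adjust the actions if your job needs more (or less) than read and write access.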
### 4. Create and run the job
- **Create a job:**
Replace the `<YOUR_PATH_TO_SCRIPT>` and `<YOUR_PATH_TO_WHL>` placeholders with the S3 paths you used in step 2:
```shell
aws glue create-job \
--name test_hamilton_script \
--role GlueProcessorRole \
--command '{"Name" : "pythonshell", "PythonVersion": "3.9", "ScriptLocation" : "s3://<YOUR_PATH_TO_SCRIPT>/processing.py"}' \
--max-capacity 0.0625 \
--default-arguments '{"--extra-py-files" : "s3://<YOUR_PATH_TO_WHL>/hamilton_functions-0.1-py3-none-any.whl", "--additional-python-modules" : "sf-hamilton"}'
```
- **Run the job:**
Replace the placeholders with your actual S3 paths. `<YOUR_PATH_TO_OUTPUT_DATA>` is the S3 location where the job writes its result:
```shell
aws glue start-job-run \
--job-name test_hamilton_script \
--arguments '{"--input-table" : "s3://<YOUR_PATH_TO_INPUT_DATA>", "--output-table" : "s3://<YOUR_PATH_TO_OUTPUT_DATA>"}'
```
Once the job run has finished successfully, you should see an output file at `s3://<YOUR_PATH_TO_OUTPUT_DATA>`.
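`start-job-run` returns as soon as the run is submitted, so the output may take a little while to appear. As a rough sketch, the run can be polled with `aws glue get-job-run` until it reaches a final state. The helper names `is_terminal_state` and `wait_for_job_run` and the 15-second interval are assumptions made for illustration.

```shell
# Return 0 if the given Glue job run state is final, 1 otherwise.
is_terminal_state() {
  case "$1" in
    SUCCEEDED|FAILED|STOPPED|TIMEOUT|ERROR|EXPIRED) return 0 ;;
    *) return 1 ;;
  esac
}

# Poll a job run until it finishes; succeed only if the run SUCCEEDED.
wait_for_job_run() {
  job_name="$1"
  run_id="$2"
  while :; do
    state=$(aws glue get-job-run \
      --job-name "$job_name" \
      --run-id "$run_id" \
      --query 'JobRun.JobRunState' \
      --output text) || return 1
    echo "Job run state: $state"
    if is_terminal_state "$state"; then
      break
    fi
    sleep 15
  done
  [ "$state" = "SUCCEEDED" ]
}

# Example usage (commented out; requires AWS credentials and an existing job).
# The run ID is the JobRunId value printed by `aws glue start-job-run`:
# wait_for_job_run test_hamilton_script "<JOB_RUN_ID>"
```

If the run ends in a state other than `SUCCEEDED`, the job run details in the AWS Glue console are a good starting point for troubleshooting.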