AWS Glue is a serverless data integration service. This guide demonstrates deploying a “hello-world” processing job using Hamilton functions on AWS Glue.
First things first: an AWS Glue job runs a single Python script, but you can include external code (like our Hamilton functions) by packaging it as a Python wheel. So, let's package our code and get it ready for action.
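For readers new to Hamilton, the sketch below shows the general shape of the code that goes into the wheel. It is a hypothetical module, not the one shipped with this example: the file path, function names, and the `spend` input are all assumptions made for illustration. The idea it shows is that each function defines one output, and each parameter name refers either to an input or to another function of the same name.

```python
# Hypothetical contents of a module inside the wheel, e.g.
# app/hamilton_functions/functions.py (path and names are assumptions).
from typing import List


def spend_mean(spend: List[float]) -> float:
    """Average of the `spend` input column."""
    return sum(spend) / len(spend)


def spend_zero_mean(spend: List[float], spend_mean: float) -> List[float]:
    """`spend` shifted so that its mean is zero.

    The `spend_mean` parameter refers to the function of the same name above.
    """
    return [value - spend_mean for value in spend]
```

Because these are plain Python functions, they can be imported from the wheel by the main script and also unit-tested on their own.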
Install build package:
This command installs the ‘build’ package, which we'll use to create our Python wheel.
pip install build
Build the Python wheel:
cd app \
  && python -m build --wheel --skip-dependency-check \
  && cd ..
Upload the wheel file to S3:
Replace <YOUR_PATH_TO_WHL> with your specific S3 bucket and path:
aws s3 cp \
  app/dist/hamilton_functions-0.1-py3-none-any.whl \
  s3://<YOUR_PATH_TO_WHL>/hamilton_functions-0.1-py3-none-any.whl
Upload the main Python script to S3:
Replace <YOUR_PATH_TO_SCRIPT> with your specific S3 bucket and path:
aws s3 cp \
  processing.py \
  s3://<YOUR_PATH_TO_SCRIPT>/processing.py
Upload the input data to S3:
Replace <YOUR_PATH_TO_INPUT_DATA> with your specific S3 bucket and path:
aws s3 cp \
  data/input_table.csv \
  s3://<YOUR_PATH_TO_INPUT_DATA>
Create the Role:
aws iam create-role \
  --role-name GlueProcessorRole \
  --assume-role-policy-document '{"Version": "2012-10-17", "Statement": [{"Effect": "Allow", "Principal": {"Service": "glue.amazonaws.com"}, "Action": "sts:AssumeRole"}]}'
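The inline trust policy in the command above is hard to read on one line. As a minimal sketch, the same document can be built as a Python dictionary and serialized with `json`; the helper name `glue_trust_policy` is an assumption made for illustration, while the policy contents are copied from the command.

```python
import json


def glue_trust_policy() -> dict:
    """Trust policy that lets the AWS Glue service assume the role."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                # The service principal that is allowed to assume the role.
                "Principal": {"Service": "glue.amazonaws.com"},
                "Action": "sts:AssumeRole",
            }
        ],
    }


if __name__ == "__main__":
    # Print the policy as indented JSON, e.g. to redirect into a file.
    print(json.dumps(glue_trust_policy(), indent=2))
```

If you save the printed JSON to a file, the AWS CLI can read it with the `file://` prefix, for example `--assume-role-policy-document file://trust-policy.json` (the file name here is only an example).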
Attach Policies to the Role:
Here we grant full access to S3 as an example. For production environments it's important to restrict access appropriately.
aws iam attach-role-policy \
  --role-name GlueProcessorRole \
  --policy-arn arn:aws:iam::aws:policy/AmazonS3FullAccess
aws iam attach-role-policy \
  --role-name GlueProcessorRole \
  --policy-arn arn:aws:iam::aws:policy/service-role/AWSGlueServiceRole
Create a job:
Make sure every placeholder path is replaced with your actual S3 location:
aws glue create-job \
  --name test_hamilton_script \
  --role GlueProcessorRole \
  --command '{"Name": "pythonshell", "PythonVersion": "3.9", "ScriptLocation": "s3://<YOUR_PATH_TO_SCRIPT>/processing.py"}' \
  --max-capacity 0.0625 \
  --default-arguments '{"--extra-py-files": "s3://<YOUR_PATH_TO_WHL>/hamilton_functions-0.1-py3-none-any.whl", "--additional-python-modules": "sf-hamilton"}'
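The two JSON arguments of the create-job command are easy to mistype when edited by hand. The sketch below generates both strings from the two S3 prefixes. The function name `glue_job_payloads` and the example prefixes are assumptions made for illustration; the keys and values mirror the command above.

```python
import json


def glue_job_payloads(script_prefix: str, wheel_prefix: str) -> dict:
    """Build the JSON strings for --command and --default-arguments.

    Both prefixes are 'bucket/path' strings without the s3:// scheme.
    """
    command = {
        "Name": "pythonshell",  # a Python shell job rather than a Spark job
        "PythonVersion": "3.9",
        "ScriptLocation": f"s3://{script_prefix}/processing.py",
    }
    default_arguments = {
        # The wheel that contains the Hamilton functions.
        "--extra-py-files": (
            f"s3://{wheel_prefix}/hamilton_functions-0.1-py3-none-any.whl"
        ),
        # Extra package installed before the script starts.
        "--additional-python-modules": "sf-hamilton",
    }
    return {
        "command": json.dumps(command),
        "default_arguments": json.dumps(default_arguments),
    }


if __name__ == "__main__":
    # Hypothetical prefixes; substitute your own bucket and paths.
    payloads = glue_job_payloads("my-bucket/scripts", "my-bucket/wheels")
    print(payloads["command"])
    print(payloads["default_arguments"])
```

Each printed string can be pasted, inside single quotes, as the value of the corresponding CLI option.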
Run the job:
Make sure every placeholder path is replaced with your actual S3 location:
aws glue start-job-run \
  --job-name test_hamilton_script \
  --arguments '{"--input-table": "s3://<YOUR_PATH_TO_INPUT_DATA>", "--output-table": "s3://<YOUR_PATH_TO_OUTPUT_DATA>"}'
Once the job run has completed, you should see an output file at s3://<YOUR_PATH_TO_OUTPUT_DATA>.
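The `--input-table` and `--output-table` values passed to the job run reach the script as command-line arguments. The sketch below shows one way a script could read them with the standard library. It is an assumption about how this might be done, not the actual contents of processing.py, and `parse_job_args` is a hypothetical helper name.

```python
import argparse
from typing import List, Optional


def parse_job_args(argv: Optional[List[str]] = None) -> argparse.Namespace:
    """Read the two job arguments, ignoring anything else on the command line."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--input-table", required=True)
    parser.add_argument("--output-table", required=True)
    # parse_known_args tolerates extra arguments that the runtime may add,
    # instead of exiting with an "unrecognized arguments" error.
    args, _unknown = parser.parse_known_args(argv)
    return args


if __name__ == "__main__":
    args = parse_job_args()
    # argparse converts the hyphens in option names to underscores.
    print(f"reading from {args.input_table}, writing to {args.output_table}")
```

Using `parse_known_args` rather than `parse_args` is a defensive choice here: the script keeps working even if it receives arguments it did not declare.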