 # Deploy Hamilton Functions as an AWS Glue Job

 [AWS Glue](https://aws.amazon.com/glue/) is a serverless data integration service. This guide demonstrates deploying a "hello-world" [processing job](https://docs.aws.amazon.com/glue/latest/dg/add-job-python.html) using Hamilton functions on AWS Glue.

 ## Prerequisites

 - **AWS CLI Setup**: Make sure the AWS CLI is installed and configured on your machine. If you haven't done this yet, follow the [Quick Start guide](https://docs.aws.amazon.com/cli/latest/userguide/getting-started-quickstart.html) for setup instructions.

 ## Step-by-Step Guide

 ### 1. Build wheel with Hamilton functions

 AWS Glue jobs run a single Python script, but you can include external code (like our Hamilton functions) by packaging it as a Python wheel. So, let's package our code first.

 - **Install the `build` package:**

     The following command installs the `build` package, which we'll use to create our Python wheel.

     ```shell
     pip install build
     ```

 - **Build the Python wheel:**

     ```shell
     cd app \
         && python -m build --wheel --skip-dependency-check \
         && cd ..
     ```
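Before uploading, it can help to confirm that the build actually produced a wheel. The helper below is a minimal sketch, not part of the example repository: `find_wheel` is a hypothetical function name, and it assumes the wheel lands in `app/dist`, the path used by the upload command in the next step.

```shell
# Print the path of the first wheel in a dist directory, or fail if none exists.
# find_wheel is a hypothetical helper; app/dist is assumed as the default location.
find_wheel() {
    dist_dir="${1:-app/dist}"
    for whl in "$dist_dir"/*.whl; do
        # If the glob matched nothing, the literal pattern is returned,
        # so check that the candidate is a real file.
        if [ -f "$whl" ]; then
            echo "$whl"
            return 0
        fi
    done
    echo "No wheel found in $dist_dir" >&2
    return 1
}
```

Calling `find_wheel` with no arguments checks `app/dist`; pass another directory as the first argument if your build output lives elsewhere.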

 ### 2. Upload all necessary files to S3

 - **Upload the wheel file to S3:**

     Replace `<YOUR_PATH_TO_WHL>` with your specific S3 bucket and path:

     ```shell
     aws s3 cp \
         app/dist/hamilton_functions-0.1-py3-none-any.whl \
         s3://<YOUR_PATH_TO_WHL>/hamilton_functions-0.1-py3-none-any.whl
     ```

 - **Upload the main Python script to S3:**

     Replace `<YOUR_PATH_TO_SCRIPT>` with your specific S3 bucket and path:

     ```shell
     aws s3 cp \
         processing.py \
         s3://<YOUR_PATH_TO_SCRIPT>/processing.py
     ```

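 - **Upload the input data to S3:**

     Replace `<YOUR_PATH_TO_INPUT_DATA>` with your S3 bucket and the full object path (for example, one ending in `input_table.csv`). The same value is passed to the job as `--input-table` in step 4: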

     ```shell
     aws s3 cp \
         data/input_table.csv \
         s3://<YOUR_PATH_TO_INPUT_DATA>
     ```
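The same placeholders reappear when creating and running the job, so one option is to define them once as shell variables. This is only a sketch: the bucket name, prefixes, and output file name below are made-up examples, not values from the Hamilton repository.

```shell
# Hypothetical bucket and prefixes (assumptions); replace with your own.
BUCKET="my-example-bucket"
WHL_URI="s3://${BUCKET}/glue/wheels/hamilton_functions-0.1-py3-none-any.whl"
SCRIPT_URI="s3://${BUCKET}/glue/scripts/processing.py"
INPUT_URI="s3://${BUCKET}/glue/data/input_table.csv"
OUTPUT_URI="s3://${BUCKET}/glue/data/output_table.csv"

# Print the resolved locations so you can check them before uploading.
printf '%s\n' "$WHL_URI" "$SCRIPT_URI" "$INPUT_URI" "$OUTPUT_URI"
```

With these set, an upload becomes, for example, `aws s3 cp processing.py "$SCRIPT_URI"`.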

 ### 3. Create a simple role for AWS Glue job execution

 - **Create the Role**:

     ```shell
     aws iam create-role \
         --role-name GlueProcessorRole \
         --assume-role-policy-document '{"Version": "2012-10-17", "Statement": [{ "Effect": "Allow", "Principal": { "Service": "glue.amazonaws.com"}, "Action": "sts:AssumeRole"}]}'
     ```

 - **Attach Policies to the Role**:

     Here we grant full access to S3 as an example. For production environments, restrict the role to the specific buckets and actions the job needs.

     ```shell
     aws iam attach-role-policy \
         --role-name GlueProcessorRole \
         --policy-arn arn:aws:iam::aws:policy/AmazonS3FullAccess
     aws iam attach-role-policy \
         --role-name GlueProcessorRole \
         --policy-arn arn:aws:iam::aws:policy/service-role/AWSGlueServiceRole
     ```
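As one way to narrow the S3 permissions, the sketch below writes an inline policy limited to a single bucket and wraps the `aws iam put-role-policy` call in a function. The bucket name, the file name, the policy name `GlueS3ScopedAccess`, and the function name are all assumptions made for illustration, and the action list may need adjusting for your job.

```shell
# Hypothetical bucket name (assumption); replace with the bucket that holds your files.
BUCKET="my-example-bucket"
POLICY_FILE="glue-s3-scoped-policy.json"

# Write an inline policy that allows object reads and writes in one bucket only.
cat > "$POLICY_FILE" <<EOF
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:PutObject"],
      "Resource": "arn:aws:s3:::${BUCKET}/*"
    },
    {
      "Effect": "Allow",
      "Action": "s3:ListBucket",
      "Resource": "arn:aws:s3:::${BUCKET}"
    }
  ]
}
EOF

# Attach the file as an inline policy on the role.
# Wrapped in a function so nothing is sent to AWS until you call it.
attach_scoped_policy() {
    aws iam put-role-policy \
        --role-name GlueProcessorRole \
        --policy-name GlueS3ScopedAccess \
        --policy-document "file://${POLICY_FILE}"
}
```

If you use a scoped policy like this, run `attach_scoped_policy` and skip attaching `AmazonS3FullAccess`.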

 ### 4. Create and run the job

 - **Create a job:**

     Replace the placeholder paths with the S3 locations you used in step 2:

     ```shell
     aws glue create-job \
         --name test_hamilton_script \
         --role GlueProcessorRole \
         --command '{"Name" :  "pythonshell", "PythonVersion": "3.9", "ScriptLocation" : "s3://<YOUR_PATH_TO_SCRIPT>/processing.py"}' \
         --max-capacity 0.0625 \
         --default-arguments '{"--extra-py-files" : "s3://<YOUR_PATH_TO_WHL>/hamilton_functions-0.1-py3-none-any.whl", "--additional-python-modules" : "sf-hamilton"}'
     ```

 - **Run the job:**

     Replace the placeholders with your input location from step 2 and the S3 location where the job should write its output:

     ```shell
     aws glue start-job-run \
         --job-name test_hamilton_script \
         --arguments '{"--input-table" : "s3://<YOUR_PATH_TO_INPUT_DATA>", "--output-table" : "s3://<YOUR_PATH_TO_OUTPUT_DATA>"}'
     ```

     Once the job run has completed successfully, you should see an output file at `s3://<YOUR_PATH_TO_OUTPUT_DATA>`.
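`aws glue start-job-run` returns a `JobRunId` right away while the job keeps running in the background. A rough sketch of waiting for the result is shown below; `wait_for_job_run` is a hypothetical helper name, and the 15-second default polling interval is an arbitrary choice.

```shell
# Poll a Glue job run until it reaches a terminal state.
# Usage: wait_for_job_run <job-name> <run-id> [poll-interval-seconds]
# wait_for_job_run is a hypothetical helper, not part of the AWS CLI.
wait_for_job_run() {
    job_name="$1"
    run_id="$2"
    interval="${3:-15}"
    while true; do
        # Ask Glue for the current state of this run as plain text.
        state=$(aws glue get-job-run \
            --job-name "$job_name" \
            --run-id "$run_id" \
            --query 'JobRun.JobRunState' \
            --output text) || return 1
        echo "Job run state: $state"
        case "$state" in
            SUCCEEDED) return 0 ;;
            FAILED|STOPPED|TIMEOUT|ERROR|EXPIRED) return 1 ;;
        esac
        sleep "$interval"
    done
}
```

Pass the `JobRunId` printed by `start-job-run` as the second argument, for example `wait_for_job_run test_hamilton_script <YOUR_JOB_RUN_ID>`.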