This Git repository contains several Terraform configurations:

- `shared_state` - creates Terraform state storage in either Azure or AWS, which is a prerequisite for the Terraform configurations in `aws` or `azure`.
  - `shared_state/aws` - creates an AWS S3 bucket and DynamoDB table that are a prerequisite for the Terraform configuration in `aws`.
  - `shared_state/azure` - creates an Azure resource group and storage account that are a prerequisite for the Terraform configuration in `azure`.
- `aws` - creates the AWS resources for an Accumulo cluster (described below).
- `azure` - creates the Azure resources for an Accumulo cluster (described below).

You will need to download and install the correct Terraform CLI for your platform and put the `terraform` binary on your `PATH`. You can optionally install Terraform Docs if you want to be able to generate documentation or an example variables file for the `shared_state`, `aws`, or `azure` configurations.
The `shared_state` directory contains Terraform configurations for creating either an AWS S3 bucket and DynamoDB table, or an Azure resource group, storage account, and container. These objects only need to be created once and are used for sharing the Terraform state with a team. To read more about this, see remote state. The AWS shared state instructions are based on this article.

To generate the storage, run `terraform init` followed by `terraform apply`. Note that the shell working directory must be the `shared_state/aws` or `shared_state/azure` directory when you run the terraform commands for shared state creation.
The default AWS configuration generates the S3 bucket name when `terraform apply` is run. This ensures that a globally unique S3 bucket name is used. It is not required to set any variables for the shared state. However, if you wish to override any variable values, this can be done by creating an `aws.auto.tfvars` file in the `shared_state/aws` directory. For example:

```shell
cd shared_state/aws
cat > aws.auto.tfvars << EOF
bucket_force_destroy = true
EOF
```
Assuming the bucket variable is not overridden, the generated S3 bucket name will appear in the `terraform apply` output, as in the following example:

```
Outputs:

bucket_name = "terraform-20220209131315353700000001"
```

This value should be supplied to `terraform init` in the `aws` directory as described below. Using the example above, the init command for the `aws` directory would be:

```shell
terraform init -backend-config=bucket=terraform-20220209131315353700000001
```
If you change any of the backend storage configuration parameters from their defaults, you will need to override them when you initialize Terraform for the `aws` or `azure` configuration below. For example, if you change the region where the S3 bucket is deployed from `us-east-1` to `us-west-2`, then you would need to run `terraform init` in the `aws` directory (not the shared_state initialization, but the main `aws` directory initialization) with:

```shell
terraform init -backend-config=region=us-west-2
```
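If you have several such overrides, they can also be collected in a partial backend configuration file and passed to `terraform init` in a single `-backend-config=<path>` option. The sketch below assumes this approach; the file name is arbitrary and the values are placeholders:

```shell
# Collect backend overrides in a partial configuration file instead of
# repeating -backend-config=<name>=<value> flags (file name is arbitrary).
cat > aws.s3.tfbackend << 'EOF'
bucket = "terraform-20220209131315353700000001"
region = "us-west-2"
EOF

# Then, in the aws directory, pass the whole file to init:
# terraform init -backend-config=aws.s3.tfbackend
```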
The following backend configuration can be overridden with `-backend-config=<name>=<value>` options to `terraform init`. This prevents the need to modify the `backend` sections in `aws/main.tf` or `azure/main.tf`.

For AWS:

- `-backend-config=bucket=<bucket_name>`: Override the S3 bucket name
- `-backend-config=key=<key_name>`: Override the key in the S3 bucket
- `-backend-config=region=<region>`: Override the AWS region
- `-backend-config=dynamodb_table=<dynamodb_table_name>`: Override the DynamoDB table name

For Azure:

- `-backend-config=resource_group_name=<resource_group_name>`: Override the resource group where the storage account is located
- `-backend-config=storage_account_name=<storage_account_name>`: Override the name of the Azure storage account holding Terraform state
- `-backend-config=container_name=<container_name>`: Override the name of the container within the storage account that holds Terraform state
- `-backend-config=key=<blob_name>`: Override the name of the blob within the container that holds Terraform state

The `aws` and `azure` directories contain Terraform configurations for creating an Accumulo cluster on AWS or Azure, respectively. They contain the following Terraform configuration items:
- `cloud-init-config` - contains templates used to generate a cloud-init configuration that sets up AWS instances or Azure VMs with the necessary Linux packages, user accounts, etc.
- `config-files` - contains template configuration files for the various components of the cluster (e.g., HDFS, Accumulo, Grafana) as well as helper scripts to install the software components that cannot be installed via cloud-init.
- `upload-software` - if pre-built binaries for the downloaded software components (Hadoop, Accumulo, ZooKeeper, Maven) are included, this module uploads them to the cluster.
- `configure-nodes` - this module is responsible for executing scripts on the cluster to install and configure the software, initialize the HDFS filesystem and Accumulo cluster, and start them.

The table below lists the variables and their default values that are used in the `aws` configuration.
Name | Description | Type | Default | Required |
---|---|---|---|---|
accumulo_branch_name | The name of the branch to build and install | string | "main" | no |
accumulo_dir | The Accumulo directory on each EC2 node | string | "/data/accumulo" | no |
accumulo_instance_name | The accumulo instance name. | string | "accumulo-testing" | no |
accumulo_repo | URL of the Accumulo git repo | string | "https://github.com/apache/accumulo.git" | no |
accumulo_root_password | The password for the accumulo root user. A randomly generated password will be used if none is specified here. | string | null | no |
accumulo_testing_branch_name | The name of the branch to build and install | string | "main" | no |
accumulo_testing_repo | URL of the Accumulo Testing git repo | string | "https://github.com/apache/accumulo-testing.git" | no |
accumulo_version | The branch of Accumulo to download and install | string | "2.1.0-SNAPSHOT" | no |
ami_name_pattern | The pattern of the name of the AMI to use | any | n/a | yes |
ami_owner | The id of the AMI owner | any | n/a | yes |
authorized_ssh_key_files | List of SSH public key files for the developers that will log into the cluster | list(string) | [] | no |
authorized_ssh_keys | List of SSH keys for the developers that will log into the cluster | list(string) | n/a | yes |
cloudinit_merge_type | Describes the merge behavior for overlapping config blocks in cloud-init. | string | null | no |
create_route53_records | Indicates whether or not route53 records will be created | bool | false | no |
hadoop_dir | The Hadoop directory on each EC2 node | string | "/data/hadoop" | no |
hadoop_version | The version of Hadoop to download and install | string | "3.3.4" | no |
instance_count | The number of EC2 instances to create | string | "2" | no |
instance_type | The type of EC2 instances to create | string | "m5.2xlarge" | no |
local_sources_dir | Directory on local machine that contains Maven, ZooKeeper or Hadoop binary distributions or Accumulo source tarball | string | "" | no |
maven_version | The version of Maven to download and install | string | "3.8.6" | no |
optional_cloudinit_config | An optional config block for the cloud-init script. If you set this, you should consider setting cloudinit_merge_type to handle merging with the default script as you need. | string | null | no |
private_network | Indicates whether or not the user is on a private network and access to hosts should be through the private IP addresses rather than public ones. | bool | false | no |
root_volume_gb | The size, in GB, of the EC2 instance root volume | string | "300" | no |
route53_zone | The name of the Route53 zone in which to create DNS addresses | any | n/a | yes |
security_group | The Security Group to use when creating AWS objects | any | n/a | yes |
software_root | The full directory root where software will be installed | string | "/opt/accumulo-testing" | no |
us_east_1b_subnet | The AWS subnet id for the us-east-1b subnet | any | n/a | yes |
us_east_1e_subnet | The AWS subnet id for the us-east-1e subnet | any | n/a | yes |
zookeeper_dir | The ZooKeeper directory on each EC2 node | string | "/data/zookeeper" | no |
zookeeper_version | The version of ZooKeeper to download and install | string | "3.8.0" | no |
The following outputs are returned by the `aws` Terraform configuration.
Name | Description |
---|---|
accumulo_root_password | The supplied, or automatically generated Accumulo root user password. |
manager_ip | The IP address of the manager instance. |
worker_ips | The IP addresses of the worker instances. |
The table below lists the variables and their default values that are used in the `azure` configuration.
Name | Description | Type | Default | Required |
---|---|---|---|---|
accumulo_branch_name | The name of the branch to build and install | string | "main" | no |
accumulo_dir | The Accumulo directory on each node | string | "/data/accumulo" | no |
accumulo_instance_name | The accumulo instance name. | string | "accumulo-testing" | no |
accumulo_repo | URL of the Accumulo git repo | string | "https://github.com/apache/accumulo.git" | no |
accumulo_root_password | The password for the accumulo root user. A randomly generated password will be used if none is specified here. | string | null | no |
accumulo_testing_branch_name | The name of the branch to build and install | string | "main" | no |
accumulo_testing_repo | URL of the Accumulo Testing git repo | string | "https://github.com/apache/accumulo-testing.git" | no |
accumulo_version | The branch of Accumulo to download and install | string | "2.1.0-SNAPSHOT" | no |
admin_username | The username of the admin user, that can be authenticated with the first public ssh key. | string | "azureuser" | no |
authorized_ssh_key_files | List of SSH public key files for the developers that will log into the cluster | list(string) | [] | no |
authorized_ssh_keys | List of SSH keys for the developers that will log into the cluster | list(string) | n/a | yes |
cloudinit_merge_type | Describes the merge behavior for overlapping config blocks in cloud-init. | string | null | no |
create_resource_group | Indicates whether or not resource_group_name should be created or is an existing resource group. | bool | true | no |
hadoop_dir | The Hadoop directory on each node | string | "/data/hadoop" | no |
hadoop_version | The version of Hadoop to download and install | string | "3.3.4" | no |
local_sources_dir | Directory on local machine that contains Maven, ZooKeeper or Hadoop binary distributions or Accumulo source tarball | string | "" | no |
location | The Azure region where resources are to be created. If an existing resource group is specified, this value is ignored and the resource group's location is used. | string | n/a | yes |
managed_disk_configuration | Optional managed disk configuration. If supplied, the managed disks on each VM will be combined into an LVM volume mounted at the named mount point. | object({ mount_point = string disk_count = number storage_account_type = string disk_size_gb = number }) | null | no |
maven_version | The version of Maven to download and install | string | "3.8.6" | no |
network_address_space | The network address space to use for the virtual network. | list(string) | [ "10.0.0.0/16" ] | no |
optional_cloudinit_config | An optional config block for the cloud-init script. If you set this, you should consider setting cloudinit_merge_type to handle merging with the default script as you need. | string | null | no |
os_disk_caching | The type of caching to use for the OS disk. Possible values are None, ReadOnly, and ReadWrite. | string | "ReadOnly" | no |
os_disk_size_gb | The size, in GB, of the OS disk | number | 300 | no |
os_disk_type | The disk type to use for OS disks. Possible values are Standard_LRS, StandardSSD_LRS, and Premium_LRS. | string | "Standard_LRS" | no |
resource_group_name | The name of the resource group to create or reuse. If not specified, the name is generated based on resource_name_prefix. | string | "" | no |
resource_name_prefix | A prefix applied to all resource names created by this template. | string | "accumulo-testing" | no |
software_root | The full directory root where software will be installed | string | "/opt/accumulo-testing" | no |
subnet_address_prefixes | The subnet address prefixes to use for the accumulo testing subnet. | list(string) | [ "10.0.2.0/24" ] | no |
vm_image | n/a | object({ publisher = string offer = string sku = string version = string }) | { "offer": "0001-com-ubuntu-server-focal", "publisher": "Canonical", "sku": "20_04-lts-gen2", "version": "latest" } | no |
vm_sku | The SKU of Azure VMs to create | string | "Standard_D8s_v4" | no |
worker_count | The number of worker VMs to create | number | 1 | no |
zookeeper_dir | The ZooKeeper directory on each node | string | "/data/zookeeper" | no |
zookeeper_version | The version of ZooKeeper to download and install | string | "3.8.0" | no |
The following outputs are returned by the `azure` Terraform configuration.
Name | Description |
---|---|
accumulo_root_password | The user-supplied or automatically generated Accumulo root user password. |
manager_ip | The public IP address of the manager VM. |
worker_ips | The public IP addresses of the worker VMs. |
When using either the `aws` or `azure` configuration, you will need to supply values for required variables that have no default value. There are several ways to do this. If you installed Terraform Docs, it can generate the file for you. You can then edit the generated file to configure values as desired:

```shell
CLOUD=<enter either aws or azure>
cd $CLOUD
terraform-docs tfvars hcl . > ${CLOUD}.auto.tfvars
# If you prefer JSON over HCL, then the command would be
# terraform-docs tfvars json . > ${CLOUD}.auto.tfvars.json
```

Note that these generated variable files will include values for all variables; those with defaults will be set to their default value. You can also refer to the tables above and simply add the values that are required (and have no default, or have a default that you wish to change). Below is an example JSON file containing configuration for `aws`. This content can be customized and placed in the `aws` directory in a file whose name ends with `.auto.tfvars.json`. Any variable files whose names end in `.auto.tfvars` or `.auto.tfvars.json` are automatically included when `terraform` commands are executed.

```json
{
  "security_group": "sg-ABCDEF001",
  "route53_zone": "some.domain.com",
  "us_east_1b_subnet": "subnet-ABCDEF123",
  "us_east_1e_subnet": "subnet-ABCDEF124",
  "ami_owner": "000000000001",
  "ami_name_pattern": "MY_AMI_*",
  "authorized_ssh_keys": [
    "ssh-rsa dev_key_1",
    "ssh-rsa dev_key_2"
  ]
}
```
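The JSON example above covers `aws`. A comparable HCL sketch for the `azure` configuration, placed in the `azure` directory in a file whose name ends with `.auto.tfvars`, might look like the following; all values are placeholders to adjust for your environment:

```hcl
# Required variables (no defaults) -- placeholder values.
location            = "eastus"
authorized_ssh_keys = [
  "ssh-rsa dev_key_1",
  "ssh-rsa dev_key_2"
]

# Optional: combine managed disks into an LVM volume mounted at /data on each VM.
managed_disk_configuration = {
  mount_point          = "/data"
  disk_count           = 2
  storage_account_type = "StandardSSD_LRS"
  disk_size_gb         = 256
}
```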
The cloud-init template can be found in `cloud-init.tftpl`. If you need to customize this configuration, one method is to use the Terraform variable `optional_cloudinit_config` to supply your own additional configuration. For example, some CentOS 7 images are out of date and will need software packages to be updated before the rest of the software download/install will work. This can be accomplished by adding the following to your `.auto.tfvars` file:

```hcl
optional_cloudinit_config = <<-EOT
package_upgrade: true
EOT
```
You can add any other cloud-init configuration that you wish here. One factor to consider is the cloud-init merging behavior for sections that overlap with the default template. The merging behavior can be controlled by setting the `cloudinit_merge_type` variable to your desired merge algorithm. The default is `dict(recurse_array,no_replace)+list(append)`, which attempts to keep all lists from the default configuration rather than having new ones overwrite them.
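As a sketch, both variables can be set together in your `.auto.tfvars` file. The merge algorithm below is just the documented default written out explicitly, and `htop` is a hypothetical extra package:

```hcl
# The default merge algorithm, written out explicitly.
cloudinit_merge_type = "dict(recurse_array,no_replace)+list(append)"

# With list(append), this packages list is appended to the default
# template's package list rather than replacing it. "htop" is only an example.
optional_cloudinit_config = <<-EOT
packages:
  - htop
EOT
```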
Another factor to consider is the size of the generated cloud-init template. Cloud providers place a limit on the size of this file. AWS limits this content to 16KB, before Base64 encoding, and Azure limits it to 64KB after Base64 encoding.
This Terraform configuration creates:

- `${instance_count}` EC2 nodes of `${instance_type}`, using the latest AMI matching `${ami_name_pattern}` from `${ami_owner}`. Each EC2 node will have a `${root_volume_gb}` GB root volume. The EFS filesystem is NFS-mounted on each node at `${software_root}`.

This Terraform configuration:

- Downloads the Apache Maven `${maven_version}` binary tarball to `${software_root}/sources`, then untars it to `${software_root}/apache-maven/apache-maven-${maven_version}`
- Downloads the Apache ZooKeeper `${zookeeper_version}` binary tarball to `${software_root}/sources`, then untars it to `${software_root}/zookeeper/apache-zookeeper-${zookeeper_version}-bin`
- Downloads the Apache Hadoop `${hadoop_version}` binary tarball to `${software_root}/sources`, then untars it to `${software_root}/hadoop/hadoop-${hadoop_version}`
- Clones `${accumulo_repo}` into `${software_root}/sources/accumulo-repo`, switches to the `${accumulo_branch_name}` branch, and builds the software using Maven, then untars the binary tarball to `${software_root}/accumulo/accumulo-${accumulo_version}`
- Downloads the OpenTelemetry Java agent to `${software_root}/accumulo/accumulo-${accumulo_version}/lib/opentelemetry-javaagent-1.19.0.jar`
- Copies the Accumulo `test` jar to `${software_root}/accumulo/accumulo-${accumulo_version}/lib` so that `org.apache.accumulo.test.metrics.TestStatsDRegistryFactory` is on the classpath
- Downloads the Micrometer StatsD registry to `${software_root}/accumulo/accumulo-${accumulo_version}/lib/micrometer-registry-statsd-1.9.5.jar`
- Clones `${accumulo_testing_repo}` into `${software_root}/sources/accumulo-testing-repo`, switches to the `${accumulo_testing_branch_name}` branch, and builds the software using Maven.

If you want to supply your own Apache Maven, Apache ZooKeeper, Apache Hadoop, Apache Accumulo, or Apache Accumulo Testing binary tar files, you can put them into a directory on your local machine and set the `${local_sources_dir}` variable to the full path of that directory. These files will be uploaded to `${software_root}/sources`, and the installation script will use them instead of downloading them. If the version of a supplied binary tarball differs from the default version, then you will also need to override that version property. Supplying your own binary tarballs does speed up the deployment. However, if you provide the Apache Accumulo binary tarball, it will be harder to update the software on the cluster.
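For instance, to use locally supplied tarballs you might add something like the following to your `.auto.tfvars` file; the path and version here are placeholders:

```hcl
# Directory containing the pre-downloaded binary tarballs (placeholder path).
local_sources_dir = "/home/me/sources"

# Override the default only if the supplied tarball is a different version
# (the version shown is a placeholder).
hadoop_version = "3.3.5"
```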
NOTE: If you supply your own binary tarball of Accumulo, then you will need to copy the `accumulo-test-${accumulo_version}.jar` file to the `lib` directory manually, as it's not part of the binary tarball.

If you did not provide an Accumulo binary tarball, then you can update the Accumulo software running on the cluster by doing the following and then restarting Accumulo:

```shell
cd ${software_root}/sources/accumulo-repo
git pull
mvn clean package -DskipTests -DskipITs
# Back up the Accumulo configs
mkdir -p ~/accumulo-config-backup
cp ${software_root}/accumulo/accumulo-${accumulo_version}/conf/* ~/accumulo-config-backup/.
# Lay down the updated Accumulo distribution
tar zxf assemble/target/accumulo-${accumulo_version}-bin.tar.gz -C ${software_root}/accumulo
# Restore the Accumulo configs
cp ~/accumulo-config-backup/* ${software_root}/accumulo/accumulo-${accumulo_version}/conf/.
# Sync the Accumulo changes with the worker nodes
pdsh -R exec -g worker rsync -az ${software_root}/accumulo/ %h:${software_root}/accumulo/
```

Likewise, if you did not provide an Accumulo Testing binary tarball, you can update the accumulo-testing software on the cluster with:

```shell
cd ${software_root}/sources/accumulo-testing-repo
git pull
mvn clean package -DskipTests -DskipITs
```
The first node that is created is called the `manager`; the others are `worker` nodes. The following components will run on the `manager` node:

The following components will run on the `worker` nodes:
The logs for each service (ZooKeeper, Hadoop, Accumulo) are located in the service's local directory on each node (`/data/${service}/logs`, unless you changed the properties).
The `aws` Terraform configuration creates DNS entries of the following form:

```
<node_name>-<branch_name>-<workspace_name>.${route53_zone}
```

For example:

The `azure` configuration does not currently create public DNS entries for the nodes; it is recommended that the public IP addresses be used instead.
1. Change to the `aws` or `azure` directory in your shell. This must be the current directory when you run the following `terraform` commands.
2. After creating your `.auto.tfvars` file, or setting the properties some other way, run `terraform init`. If you have modified the shared_state backend configuration from the defaults, you can override the values here. For example, the following command updates the `resource_group_name` and `storage_account_name` for the `azurerm` backend:

   ```shell
   terraform init -backend-config=resource_group_name=my-tfstate-resource-group -backend-config=storage_account_name=mystorageaccountname
   ```

   Once values are supplied to `terraform init`, they are stored in the local state, and it is not necessary to supply these overrides to the `terraform apply` or `terraform destroy` commands.
3. Ensure that the private key corresponding to one of the public keys supplied via `authorized_ssh_keys` or `authorized_ssh_key_files` in your `.auto.tfvars` file is loaded into your SSH agent. During resource creation, Terraform will connect to the newly created VMs using SSH in order to copy files and configure the VMs to run Accumulo. If the appropriate private key is not available to your SSH agent, then the connection will fail and resource creation will eventually fail.
4. Run `terraform apply` to create the AWS/Azure resources.
5. Run `terraform destroy` to tear down the AWS/Azure resources.

NOTE: If you are working with `aws` and get an Access Denied error, then try setting the AWS short-term access keys in your environment.
For an `aws` cluster, you can access the software configuration/management web pages here:
The `azure` cluster creates a network security group that limits public access to port 22 (SSH). Therefore, to access the configuration/management web pages, you should create a SOCKS proxy and use a browser plugin such as FoxyProxy Standard to point the browser at the SOCKS proxy. Create the proxy with:

```shell
ssh -C2qTnNf -D 9876 hadoop@<manager-public-ip-address>
```

Configure FoxyProxy (or your browser directly) to connect to the proxy on localhost port 9876 (change the port specified in the `-D` option above to use a different proxy port). If you configure FoxyProxy with a SOCKS 5 proxy matching the URL regex patterns `https?://manager:.*` and `https?://worker[0-9]+:.*`, then you can leave FoxyProxy set to "Use proxies based on their pre-defined patterns and priorities" and access the web pages through the proxy while other web pages will not use the proxy.
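If you recreate the proxy often, an equivalent `~/.ssh/config` entry saves retyping the options. This is a sketch: the host alias is hypothetical and the IP is a placeholder:

```
Host accumulo-proxy
    HostName <manager-public-ip-address>
    User hadoop
    DynamicForward 9876
    Compression yes
```

With this entry in place, `ssh -qTnNf accumulo-proxy` opens the same SOCKS proxy.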
The cloud-init configuration applied to each AWS instance or Azure VM creates a `hadoop` user. Any public SSH keys specified in the Terraform configuration variable `authorized_ssh_keys` (or any public key files named in `authorized_ssh_key_files`) will be included in the cloud-init template as authorized keys for the `hadoop` user.

If you wish to use your default SSH key, typically stored in `~/.ssh/id_rsa.pub`, you would add the following to your HCL `.auto.tfvars` file:

```hcl
authorized_ssh_key_files = [ "~/.ssh/id_rsa.pub" ]
```

Then, when the cluster is created, you can log in to a node with `ssh hadoop@<node-public-ip-address>`.
The `/etc/hosts` file on each node has been updated with the names (manager, worker0, worker1, etc.) and IP addresses of the nodes. `pdsh` has been installed and `/etc/genders` has been configured. You should be able to `ssh` to any node as the `hadoop` user without a password. Likewise, you should be able to run `pdsh` commands against groups of nodes as the `hadoop` user. The pdsh genders group `manager` specifies the manager node, and the `worker` group specifies all worker nodes.
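For reference, the generated `/etc/genders` for a cluster with two workers would look something like this sketch (assuming the node names from `/etc/hosts` above):

```
manager   manager
worker0   worker
worker1   worker
```

With that in place, `pdsh -g worker uptime` runs `uptime` on every worker node.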
Once the cluster is created, you can simply stop or start the nodes from the AWS console or Azure portal; Terraform is only for creating, updating, or destroying the resources. ZooKeeper and Hadoop are set up to use systemd service files, but Accumulo is not. You could log into the manager node and run `accumulo-cluster stop` before stopping the nodes, or you could just shut them down and force Accumulo to recover (which might be good for testing). When restarting the nodes from the AWS console/Azure portal, ZooKeeper and Hadoop should start on their own. For Accumulo, you should only need to run `accumulo-cluster start` on the manager node.