Advanced Deployment Configuration

This guide provides detailed instructions for advanced deployment scenarios and customizations of Apache Ambari and Hadoop ecosystem components.

Configuration Workflow

When using non-Docker deployment with advanced configurations:

  1. After modifying conf/base_conf.yml, always regenerate the advanced configuration:

    source setup_pypath.sh
    python3 deploy_py/main.py -generate-conf
    
  2. Modify the generated conf/conf.yml according to your needs

  3. Start deployment:

    nohup python3 deploy_py/main.py -deploy &
    tail -f logs/ansible-playbook.log
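    # Optional: once the run completes, check the final recap for failures
    # (assumes the log is standard ansible-playbook output, as the filename suggests)
    grep -A 5 'PLAY RECAP' logs/ansible-playbook.log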
    

Important Note: Running python3 deploy_py/main.py -generate-conf again will overwrite your customized conf/conf.yml.
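
If you need to rerun -generate-conf later, preserve your edits first. A minimal precaution (the backup filename is arbitrary):

    # Back up the customized configuration before regenerating
    cp conf/conf.yml conf/conf.yml.bak
    source setup_pypath.sh
    python3 deploy_py/main.py -generate-conf
    # Compare and merge your customizations back as needed
    diff conf/conf.yml.bak conf/conf.yml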

Advanced Configuration Scenarios

1. Customizing Cluster Topology

The cluster topology is primarily controlled by two configuration sections in conf.yml: host_groups and group_services.

Host Groups Configuration

host_groups:
  group0: [server0]
  group1: [server1]
  group2: [server2, server3]

Each group contains a set of hostnames that will be treated as a unit for service deployment.

Service Groups Configuration

group_services:
  group0: [AMBARI_SERVER, NAMENODE, ZKFC, ...]
  group1: [NAMENODE, ZKFC, JOURNALNODE, ...]
  group2: [ZOOKEEPER_SERVER, JOURNALNODE, ...]

2. Component Reference Guide

Here's a list of the most commonly used components and their functions; a sample layout that satisfies the HA rules follows the list:

HDFS Components

  • NAMENODE (HA capable)
    • Core component managing filesystem metadata
    • Deploy 2 instances for HA mode
  • DATANODE
    • Stores actual data blocks
  • JOURNALNODE
    • Required for HA: minimum 3 instances (odd number)
  • ZKFC
    • Must be co-located with NAMENODE
    • Required for HA: 2 instances

YARN Components

  • RESOURCEMANAGER (HA capable)
    • Central resource management
    • Deploy 2 instances for HA mode
  • NODEMANAGER
    • Manages resources on individual nodes
  • HISTORYSERVER
    • Stores job history
  • APP_TIMELINE_SERVER
    • Application timeline tracking

HBase Components

  • HBASE_MASTER (HA capable)
    • Cluster management and coordination
    • Deploy 2 instances for HA mode
  • HBASE_REGIONSERVER
    • Data storage and management

Hive Components

  • HIVE_METASTORE
    • Metadata storage
  • HIVE_SERVER (HA capable)
    • Query processing
  • WEBHCAT_SERVER
    • REST interface

Other Core Components

  • ZOOKEEPER_SERVER
    • Must be deployed in an odd number of instances (quorum requirement)
  • KAFKA_BROKER
    • Message broker
  • SPARK_JOBHISTORYSERVER
    • Spark job history
  • FLINK_HISTORYSERVER
    • Flink job history
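
Putting the HA rules above together, here is a sketch of a three-group layout that satisfies them; the hostnames and exact service mix are placeholders to adapt:

host_groups:
  group0: [node0]
  group1: [node1]
  group2: [node2]

group_services:
  # Two NAMENODEs, each co-located with a ZKFC; RESOURCEMANAGER also deployed in a pair
  group0: [NAMENODE, ZKFC, JOURNALNODE, ZOOKEEPER_SERVER, RESOURCEMANAGER]
  group1: [NAMENODE, ZKFC, JOURNALNODE, ZOOKEEPER_SERVER, RESOURCEMANAGER, DATANODE, NODEMANAGER]
  # Third JOURNALNODE and ZOOKEEPER_SERVER keep both counts odd
  group2: [JOURNALNODE, ZOOKEEPER_SERVER, DATANODE, NODEMANAGER]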

3. Using External Databases

To use external databases, you need to:

  1. Create necessary users and databases manually:

    -- PostgreSQL example
    CREATE USER hive WITH PASSWORD 'hive';
    CREATE DATABASE hive OWNER hive;
    GRANT ALL PRIVILEGES ON DATABASE hive TO hive;
    
    CREATE USER ranger WITH PASSWORD 'ranger';
    CREATE DATABASE ranger OWNER ranger;
    GRANT ALL PRIVILEGES ON DATABASE ranger TO ranger;
    
  2. Configure database settings in conf.yml:

    database: 'postgres'
    database_options:
      external_hostname: 'your-db-host'
      hive_db_name: 'hive'
      hive_db_username: 'hive'
      hive_db_password: 'your-password'
      rangeradmin_db_name: 'ranger'
      rangeradmin_db_username: 'ranger'
      rangeradmin_db_password: 'your-password'
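
Before starting the deployment, confirm that connectivity and credentials work from a cluster node; a quick check with the standard PostgreSQL client, using the placeholder host and databases from above:

    # Should print the connection details if host, user, and password are correct
    psql -h your-db-host -p 5432 -U hive -d hive -c '\conninfo'
    psql -h your-db-host -p 5432 -U ranger -d ranger -c '\conninfo'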
    

4. Customizing Directory Locations

You can customize various directory locations:

data_dirs: ["/data/sdv1"]  # Base data directory

# HDFS directories
hdfs_dfs_namenode_name_dir: "{{ hadoop_base_dir }}/hdfs/namenode"
hdfs_dfs_datanode_data_dir: "{% for dr in data_dirs %}{{ dr }}/hadoop/hdfs/data{% if not loop.last %},{% endif %}{% endfor %}"

# YARN directories
yarn_nodemanager_local_dirs: "{{ hadoop_base_dir }}/yarn/local"
yarn_nodemanager_log_dirs: "{{ hadoop_base_dir }}/yarn/log"

# Other service directories
zookeeper_data_dir: "{{ hadoop_base_dir }}/zookeeper"
kafka_log_base_dir: "{% for dr in data_dirs %}{{ dr }}/kafka-logs{% if not loop.last %},{% endif %}{% endfor %}"
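
As a worked expansion of the loop templates above, assume two data directories; each templated value then renders to a comma-separated path list:

data_dirs: ["/data/sdv1", "/data/sdv2"]
# hdfs_dfs_datanode_data_dir renders to:
#   /data/sdv1/hadoop/hdfs/data,/data/sdv2/hadoop/hdfs/data
# kafka_log_base_dir renders to:
#   /data/sdv1/kafka-logs,/data/sdv2/kafka-logs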

5. Enabling Ranger

To enable Ranger and its plugins:

ranger_options:
  enable_plugins: yes

ranger_security_options:
  ranger_admin_password: "your-password"
  ranger_keyadmin_password: "your-password"
  kms_master_key_password: "your-password"

6. Customizing Ambari Configuration

Modify Ambari-specific settings:

ambari_options:
  ambari_agent_run_user: 'ambari'
  ambari_server_run_user: 'ambari'
  ambari_admin_user: 'admin'
  ambari_admin_password: 'your-password'
  config_recommendation_strategy: 'ALWAYS_APPLY'

Best Practices

  1. High Availability Setup

    • Deploy NAMENODE, RESOURCEMANAGER, and HBASE_MASTER in pairs for HA
    • Use an odd number of JOURNALNODE and ZOOKEEPER_SERVER instances
    • Co-locate ZKFC with NAMENODE
  2. Resource Planning

    • Distribute memory-intensive services across nodes
    • Consider network topology when placing services
    • Plan for future scaling
  3. Security Considerations

    • Use external databases for production
    • Enable Ranger for centralized security
    • Configure proper authentication mechanisms

Complete Configuration Reference

Below is a complete configuration example with detailed explanations for each section:

############################
## Host Groups & Services ##
############################
# Define machine groups and their members
host_groups:
  group0: [e977d8bea74d.bigtop.apache.org]
  group1: [b76d76c80f15.bigtop.apache.org]
  group2: [8874239dee4b.bigtop.apache.org]

# Define which services run on which groups
group_services:
  group0: [AMBARI_SERVER, NAMENODE, ZKFC, JOURNALNODE, RESOURCEMANAGER, ZOOKEEPER_SERVER,
    HBASE_MASTER, HIVE_METASTORE, SPARK_THRIFTSERVER, FLINK_HISTORYSERVER, HISTORYSERVER,
    RANGER_TAGSYNC, RANGER_USERSYNC]
  group1: [NAMENODE, ZKFC, JOURNALNODE, RESOURCEMANAGER, ZOOKEEPER_SERVER, HBASE_MASTER,
    DATANODE, NODEMANAGER, APP_TIMELINE_SERVER, RANGER_ADMIN, METRICS_GRAFANA, SPARK_JOBHISTORYSERVER,
    INFRA_SOLR]
  group2: [ZOOKEEPER_SERVER, JOURNALNODE, DATANODE, NODEMANAGER, TIMELINE_READER,
    YARN_REGISTRY_DNS, METRICS_COLLECTOR, HBASE_REGIONSERVER, HIVE_SERVER, WEBHCAT_SERVER,
    INFRA_SOLR]

############################
## Basic Configuration    ##
############################
# Default password for all services
default_password: B767610qa4Z

# Data directories for all components
# Can specify multiple directories for HDFS DataNode storage
data_dirs: [/data/sdv1]

# Repository configuration
# Option 1: Use existing repository
repos: null
# Option 2: Use local package directory
repo_pkgs_dir: /data1/apache/ambari-3.0_pkgs

# Stack version for Ambari
stack_version: 3.3.0

# Cluster naming
cluster_name: cluster
hdfs_ha_name: ambari-cluster

# Network configuration
ansible_ssh_port: 22
ambari_server_port: 8083
http_repo_port: 8881

############################
## Docker Configuration   ##
############################
docker_options:
  # Minimum 3 instances required for HA
  instance_num: 3
  # Memory limit per container
  memory_limit: 8g
  # Enable local repository
  enable_local_repo: true
  # Port mappings for accessing services
  components_port_map: {AMBARI_SERVER: 8083}
  # Container distribution settings
  distro: {name: centos, version: 8}
  # Components to install in Docker environment
  components: [hbase, hdfs, yarn, hive, zookeeper, ambari, spark, flink, ranger, infra_solr,
    ambari_metrics]
  default_password: B767610qa4Z

############################
## Memory Configuration  ##
############################
# Component memory settings (in MB)
hbase_heapsize: 1024
hadoop_heapsize: 1024
hive_heapsize: 1024
infra_solr_memory: 1024
spark_daemon_memory: 1024
zookeeper_heapsize: 1024
yarn_heapsize: 1024
alluxio_memory: 1024

############################
## Repository Settings   ##
############################
skip_cluster_clear: true
local_repo_ipaddress: 172.30.0.3
create_http_repo_for_local_pkgs: false

############################
## Deployment Control    ##
############################
deploy_ambari_only: false
prepare_nodes_only: false
backup_old_repo: no
should_deploy_ambari_mpack: false

############################
## Database Configuration ##
############################
# Database type selection
database: 'postgres'                                      # Options: 'postgres', 'mysql'
postgres_port: 5432
mysql_port: 3306

# Database options for all components
database_options:
  # External database configuration
  repo_url: ''
  external_hostname: ''                                   # Empty for local database installation

  # Ambari database
  ambari_db_name: 'ambari'
  ambari_db_username: 'ambari'
  ambari_db_password: '{{ default_password }}'

  # Hive database
  hive_db_name: 'hive'
  hive_db_username: 'hive'
  hive_db_password: '{{ default_password }}'

  # Ranger databases
  rangeradmin_db_name: 'ranger'
  rangeradmin_db_username: 'ranger'
  rangeradmin_db_password: '{{ default_password }}'
  rangerkms_db_name: 'rangerkms'
  rangerkms_db_username: 'rangerkms'
  rangerkms_db_password: '{{ default_password }}'

  # Other component databases
  dolphin_db_name: 'dolphinscheduler'
  dolphin_db_username: 'dolphin'
  dolphin_db_password: '{{ default_password }}'

  superset_db_name: 'superset'
  superset_db_username: 'superset'
  superset_db_password: '{{ default_password }}'

  cloudbeaver_db_name: 'cloudbeaver'
  cloudbeaver_db_username: 'cloudbeaver'
  cloudbeaver_db_password: '{{ default_password }}'

  nightingale_db_name: 'nightingale'
  nightingale_db_username: 'n9e'
  nightingale_db_password: '{{ default_password }}'

############################
## Security Configuration ##
############################
# Security type
security: 'none'                                         # Options: 'none', 'mit-kdc'

# Kerberos security options
security_options:
  external_hostname: ''                                   # Empty for local KDC installation
  external_hostip: ''                                    # For /etc/hosts DNS lookup
  realm: 'MY-REALM.COM'
  admin_principal: 'admin/admin'                         # Kerberos admin principal
  admin_password: "{{ default_password }}"
  kdc_master_key: "{{ default_password }}"               # Only for 'mit-kdc'
  http_authentication: yes                               # Enable HTTP authentication
  manage_krb5_conf: yes                                  # Set to no for FreeIPA/IdM

############################
## Ambari Configuration  ##
############################
ambari_options:
  # Run users
  ambari_agent_run_user: 'ambari'
  ambari_server_run_user: 'ambari'
  # Admin user settings
  ambari_admin_user: 'admin'
  ambari_admin_password: '{{ default_password }}'
  ambari_admin_default_password: 'admin'
  # Configuration strategy
  config_recommendation_strategy: 'ALWAYS_APPLY'          # Options: 'NEVER_APPLY', 'ONLY_STACK_DEFAULTS_APPLY', 
                                                        # 'ALWAYS_APPLY', 'ALWAYS_APPLY_DONT_OVERRIDE_CUSTOM_VALUES'

############################
## Ranger Configuration  ##
############################
# Ranger plugin options
ranger_options:
  enable_plugins: no                                     # Enable plugins for installed services

# Ranger security settings
ranger_security_options:
  ranger_admin_password: "{{ default_password }}"        # Password for admin users
  ranger_keyadmin_password: "{{ default_password }}"     # Password for keyadmin (HDP3 only)
  kms_master_key_password: "{{ default_password }}"      # Master key encryption password

############################
## General Configuration ##
############################
# System settings
external_dns: yes                                        # Use existing DNS or update /etc/hosts
disable_firewall: yes                                    # Disable local firewall service
timezone: Asia/Shanghai

# NTP configuration
external_ntp_server_hostname: ''                         # Empty for local NTP server

# Additional settings
packages_need_install: []
registry_dns_bind_port: "54"
blueprint_name: 'blueprint'                              # Blueprint name in Ambari
wait: true                                              # Wait for cluster installation
wait_timeout: 60
accept_gpl: yes                                         # Accept GPL licenses

############################
## Directory Configuration ##
############################
# Base directories
base_log_dir: "/var/log"
base_tmp_dir: "/tmp"

# Service data directories
kafka_log_base_dir: "{% for dr in data_dirs %}{{ dr }}/kafka-logs{% if not loop.last %},{% endif %}{% endfor %}"
ams_base_dir: "/var/lib"
ranger_audit_hdfs_filespool_base_dir: "{{ base_log_dir }}"
ranger_audit_solr_filespool_base_dir: "{{ base_log_dir }}"

# HDFS directories
hdfs_dfs_namenode_checkpoint_dir: "{{ hadoop_base_dir }}/hdfs/namesecondary"
hdfs_dfs_namenode_name_dir: "{{ hadoop_base_dir }}/hdfs/namenode"
hdfs_dfs_journalnode_edits_dir: "{{ hadoop_base_dir }}/hdfs/journalnode"
hdfs_dfs_datanode_data_dir: "{% for dr in data_dirs %}{{ dr }}/hadoop/hdfs/data{% if not loop.last %},{% endif %}{% endfor %}"

# YARN directories
yarn_nodemanager_local_dirs: "{{ hadoop_base_dir }}/yarn/local"
yarn_nodemanager_log_dirs: "{{ hadoop_base_dir }}/yarn/log"
yarn_timeline_leveldb_dir: "{{ hadoop_base_dir }}/yarn/timeline"

# Other service directories
zookeeper_data_dir: "{{ hadoop_base_dir }}/zookeeper"
infra_solr_datadir: "{{ hadoop_base_dir }}/ambari-infra-solr/data"
heap_dump_location: "{{ base_tmp_dir }}"
hive_downloaded_resources_dir: "{{ base_tmp_dir }}/hive/${hive.session.id}_resources"

# Temporary directories
ansible_tmp_dir: /tmp/ansible
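
After editing, a quick parse check can catch YAML syntax mistakes before a long deployment run; a minimal sketch, assuming PyYAML is available (the deploy tooling itself is Python):

python3 -c "import yaml; yaml.safe_load(open('conf/conf.yml')); print('conf.yml parses OK')"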

Configuration Notes

Host Groups and Services

  • Several services can be configured for high availability (HA):
    • NAMENODE: Deploy 2 instances for HA
    • RESOURCEMANAGER: Deploy 2 instances for HA
    • HBASE_MASTER: Deploy 2 instances for HA
    • HIVE_SERVER: Deploy multiple instances for HA
    • ZOOKEEPER_SERVER: Deploy an odd number of instances

Database Configuration

When using external databases:

  1. Create users and databases manually before deployment
  2. Ensure the database is accessible from all nodes (a quick check follows this list)
  3. Configure connection details in database_options
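
For step 2, a quick reachability sweep from the deployment host; a sketch assuming nc is installed, with the placeholder hostnames used earlier in this guide:

# Check that every node can reach the database port (PostgreSQL default shown)
for host in server0 server1 server2; do
  ssh "$host" nc -z -w 5 your-db-host 5432 && echo "$host: ok" || echo "$host: FAILED"
done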

Directory Configuration

  • data_dirs: Primary configuration for all data storage
    • First directory is used for logs and default paths
    • All directories are used for HDFS DataNode storage
    • Multiple directories improve I/O performance

Security Configuration

  • Kerberos setup requires proper DNS resolution (a quick check follows this list)
  • Ranger plugins can be enabled for all compatible services
  • HTTP authentication can be enabled for web UIs
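
For the DNS requirement, a quick per-node sanity check; the FQDN should resolve, and the result should be consistent on every node:

# Forward resolution of the node's own FQDN should succeed
hostname -f
getent hosts "$(hostname -f)"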

Memory Configuration

  • All memory settings are in MB
  • Configure based on available hardware
  • Consider service co-location when setting values
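
For example, heaps can be raised on better-provisioned nodes; the values below are illustrative only (in MB, matching the settings above):

hadoop_heapsize: 4096
hbase_heapsize: 4096
hive_heapsize: 4096
zookeeper_heapsize: 2048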