import { InstallationEnvs } from '../components/TableData.jsx';
To successfully install StreamPark, the following environments need to be prepared:

<InstallationEnvs />
You can download the latest version of StreamPark directly from the official website. The version used in this document is 2.1.5, available at: https://streampark.apache.org/download
After the download is complete, extract the package:
```shell
# Extract the StreamPark installation package
tar -zxvf apache-streampark_2.12-2.1.5-incubating-bin.tar.gz
```
The extracted directory will look like this:
```
├── bin
│    ├── startup.sh                  // Startup script
│    ├── shutdown.sh                 // Shutdown script
│    └── ......
├── conf
│    ├── config.yaml                 // Project configuration file
│    └── logback-spring.xml          // Logging configuration file
├── lib
│    └── *.jar                       // Project JAR files
├── logs                             // Log directory
├── script
│    ├── data
│    │    ├── mysql-data.sql         // MySQL initialization data
│    │    └── pgsql-data.sql         // PostgreSQL initialization data
│    ├── schema
│    │    ├── mysql-schema.sql       // MySQL DDL (table creation) script
│    │    └── pgsql-schema.sql       // PostgreSQL DDL (table creation) script
│    └── upgrade
│         ├── mysql
│         │    ├── 1.2.3.sql         // SQL script for upgrading to version 1.2.3
│         │    ├── 2.0.0.sql         // SQL script for upgrading to version 2.0.0
│         │    └── ......
│         └── pgsql
│              └── ......
└── temp                             // Temporary directory for internal use, do not delete
```
Navigate to the bin directory under the installation directory and start the program:
```shell
# Go to the bin directory under the installation directory
cd bin
# Start the program
./startup.sh
```
:::danger Warning
After starting the program, you may encounter the following error:

```
streampark.workspace.local: "/tmp/streampark" is an invalid path, please reconfigure in xxx/conf/config.yaml
```
:::
This is because the local workspace directory for StreamPark does not exist or is not set to a valid path. The solution is simple: you can use the default configuration and create the streampark directory under /tmp:
```shell
mkdir -p /tmp/streampark
```
Alternatively, configure a valid local path in the streampark.workspace.local property in the conf/config.yaml file.
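For example, the relevant section of conf/config.yaml would look like the following sketch (the path shown is illustrative; any writable directory on the local machine works):

```yaml
streampark:
  workspace:
    # Local workspace directory; it must exist and be writable by the StreamPark process
    local: /opt/streampark_workspace   # illustrative path, adjust to your environment
```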
Run startup.sh again, and the program should start successfully.
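To confirm from the terminal, you can follow the log output (a sketch; the exact log file name under logs/ is an assumption and may differ by version):

```shell
# Follow the service log; the file name "streampark.out" is an assumption
tail -f logs/streampark.out
```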
Visit StreamPark at: http://127.0.0.1:10000
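If the page does not load, you can first check that the port responds from a shell (assuming curl is available):

```shell
# Print the HTTP status code returned by the StreamPark web UI
curl -s -o /dev/null -w "%{http_code}\n" http://127.0.0.1:10000
```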
:::tip Tip
Username and password: admin / streampark
:::
Log in with these credentials, and you should see Login successful!
At this point, StreamPark is up and running. The installation above uses the default local H2 database. Next, let's look at how to integrate an external database (MySQL or PostgreSQL).
If you want to use an external database, you need to modify the conf/config.yaml file in the installation directory. The core modification is as follows (using MySQL as an example):
```yaml
datasource:
  dialect: mysql                       # Change this to mysql; the default is h2, supports mysql and pgsql
  h2-data-dir: ~/streampark/h2-data    # Only used when dialect is h2; for mysql/pgsql, fill in the connection info below instead
  username:                            # Username for the data source connection
  password:                            # Password for the data source connection
  url:                                 # JDBC connection URL, e.g., jdbc:mysql://localhost:3306/streampark?useUnicode=true&characterEncoding=UTF-8&useJDBCCompliantTimezoneShift=true&useLegacyDatetimeCode=false&serverTimezone=GMT%2B8
```
Next, connect to the external database manually and run the table creation script (location: installation_directory/script/schema/mysql-schema.sql), followed by the data initialization script (location: installation_directory/script/data/mysql-data.sql).
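With the mysql command-line client, this could look like the following sketch (host, user, and database name are assumptions; the database name streampark matches the example JDBC URL above):

```shell
# Create the target database; the name must match the one in your JDBC URL
mysql -h localhost -u root -p -e "CREATE DATABASE IF NOT EXISTS streampark DEFAULT CHARACTER SET utf8mb4;"
# Run the table creation (DDL) script, then the data initialization script
mysql -h localhost -u root -p streampark < script/schema/mysql-schema.sql
mysql -h localhost -u root -p streampark < script/data/mysql-data.sql
```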
Once the above steps are complete, start the service, and it will automatically connect and use the external database.
If you need to deploy jobs to YARN, you must configure the Hadoop environment variables. If you are using a Hadoop environment installed via CDH, the relevant environment variables can be configured as follows:
```shell
export HADOOP_HOME=/opt/cloudera/parcels/CDH/lib/hadoop   # Hadoop installation directory
export HADOOP_CONF_DIR=/etc/hadoop/conf
export HIVE_HOME=$HADOOP_HOME/../hive
export HBASE_HOME=$HADOOP_HOME/../hbase
export HADOOP_HDFS_HOME=$HADOOP_HOME/../hadoop-hdfs
export HADOOP_MAPRED_HOME=$HADOOP_HOME/../hadoop-mapreduce
export HADOOP_YARN_HOME=$HADOOP_HOME/../hadoop-yarn
```
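A quick sanity check that these variables resolve correctly (a sketch; it assumes the Hadoop client binaries are present and HDFS is reachable):

```shell
# Print the Hadoop version using the configured HADOOP_HOME
$HADOOP_HOME/bin/hadoop version
# List the HDFS root to confirm connectivity
hdfs dfs -ls /
```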
:::tip Tip
StreamPark will automatically read the Hadoop configuration from the environment variables, connect to Hadoop, upload resources to HDFS, and deploy jobs to YARN.
:::
In addition, you may need to modify the conf/config.yaml file (for example, to specify the Hadoop user or Kerberos authentication settings). The core modifications are as follows:
```yaml
streampark:
  workspace:
    # Root path for storing resources in HDFS
    remote: hdfs:///streampark/
  proxy:
    # Hadoop YARN proxy URL, e.g., a Knox process address: https://streampark.com:8443/proxy/yarn
    yarn-url:
  yarn:
    # Authentication type, default is simple; also supports kerberos
    http-auth: 'simple'
    # HADOOP_USER_NAME for Flink on YARN or Spark on YARN
    hadoop-user-name: hdfs

# If Kerberos authentication is enabled, configure the following:
security:
  kerberos:
    login:
      debug: false
      enable: false
      keytab:
      krb5:
      principal:
```
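For reference, a filled-in Kerberos section might look like the following (all values are placeholders; substitute your own keytab, krb5.conf, and principal):

```yaml
security:
  kerberos:
    login:
      debug: false
      enable: true                                      # turn Kerberos login on
      keytab: /etc/security/keytabs/streampark.keytab   # placeholder keytab path
      krb5: /etc/krb5.conf                              # placeholder krb5 configuration path
      principal: streampark@EXAMPLE.COM                 # placeholder principal
```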
Finally, here is a detailed reference for the config.yaml configuration, which also helps when integrating features such as SSO or LDAP:
```yaml
########################################### Logging Configuration ###########################################
logging:
  level:
    root: info

########################################### Basic Service Configuration ###########################################
server:
  port: 10000
  session:
    ttl: 2h                  # User login session expiration time; users are logged out automatically after this period
  undertow:                  # Undertow server settings
    buffer-size: 1024
    direct-buffers: true
    threads:
      io: 16
      worker: 256

########################################### Database Configuration ###########################################
datasource:
  dialect: h2                          # Database dialect, supports h2, mysql, pgsql
  h2-data-dir: ~/streampark/h2-data    # Only used when dialect is h2
  username:
  password:
  url:                                 # Database connection URL, e.g., jdbc:mysql://localhost:3306/streampark?......

########################################### Project Configuration ###########################################
streampark:
  ## Workspace configuration: local and HDFS workspaces for different types of resources
  workspace:
    local: /tmp/streampark
    remote: hdfs:///streampark/
  ## Proxy settings
  proxy:
    lark-url:                # Feishu (Lark) proxy URL
    yarn-url:                # YARN proxy URL
  ## YARN-related configuration
  yarn:
    http-auth: 'simple'      # Authentication method
    hadoop-user-name: hdfs   # Hadoop username
  ## Project management settings
  project:
    max-build: 16            # Maximum number of concurrent project builds
  ## Development interface settings
  openapi:
    white-list:              # Whitelist for the open APIs

########################################### Kerberos Authentication Configuration ###########################################
security:
  kerberos:
    login:
      debug: false
      enable: false
      keytab:
      krb5:
      principal:
    ttl: 2h
```
We look forward to your further exploration!