Concepts

Cluster

Definition

A cluster is a logical unit made up of a group of physical or virtual hosts that together provide the distributed runtime environment for big data components. Each cluster has its own configuration space and resource isolation boundary.

Key Features

  • Multi-Cluster: A single Server instance can manage multiple clusters simultaneously (e.g., production and test clusters).
  • Host Binding: Each Host can belong to only one cluster.
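
The Host Binding rule can be illustrated with a minimal sketch. The `ClusterRegistry` class and `HostAlreadyBoundError` exception are hypothetical names invented for this example, not part of the product's API; the sketch only shows one way a "one cluster per host" constraint could be enforced.

```python
class HostAlreadyBoundError(Exception):
    """Raised when a host is added to a second cluster."""


class ClusterRegistry:
    """Tracks which cluster each host belongs to (hypothetical helper)."""

    def __init__(self):
        self._cluster_of = {}  # hostname -> cluster name

    def add_host(self, cluster, hostname):
        current = self._cluster_of.get(hostname)
        # Re-adding a host to its own cluster is allowed; moving it is not.
        if current is not None and current != cluster:
            raise HostAlreadyBoundError(
                f"{hostname} already belongs to cluster {current!r}"
            )
        self._cluster_of[hostname] = cluster

    def hosts_of(self, cluster):
        return sorted(h for h, c in self._cluster_of.items() if c == cluster)


registry = ClusterRegistry()
registry.add_host("production", "host-01")
registry.add_host("test", "host-02")
```

Calling `registry.add_host("test", "host-01")` at this point would raise `HostAlreadyBoundError`, because host-01 is already bound to the production cluster.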

Stack

Stack

Definition

A predefined, standardized collection of services, together with their installation scripts, configuration templates, and dependency descriptions.

Stack List

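| Stack  | Description                                                                |
|--------|----------------------------------------------------------------------------|
| Infra  | Services shared by all clusters, such as the monitoring system Prometheus |
| Bigtop | Services provided by Apache Bigtop, such as Hadoop, Hive, and Spark       |
| Extra  | Community-provided or custom services, such as SeaTunnel                  |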

Service

Definition

A unit running on a cluster that represents a specific big data service, such as Hadoop, Hive, or Spark.

Management

Configuration Management
  • Snapshots: Supports creating and managing configuration snapshots.
  • Templates: Uses FreeMarker syntax to render configuration files dynamically.
Status Monitoring
  • Heartbeat: The Agent reports service health status every 30 seconds.
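
As a rough illustration of how a fixed heartbeat interval can be used for status monitoring, the sketch below flags a host whose last report is too old. Only the 30-second interval comes from the text; the tolerance of three missed heartbeats and the function name `is_heartbeat_stale` are assumptions made for this example.

```python
HEARTBEAT_INTERVAL_SECONDS = 30  # reporting interval stated in the text
MISSED_BEATS_ALLOWED = 3         # assumption: tolerance before flagging a host


def is_heartbeat_stale(last_seen, now,
                       interval=HEARTBEAT_INTERVAL_SECONDS,
                       allowed_misses=MISSED_BEATS_ALLOWED):
    """Return True when no heartbeat arrived within the tolerated window.

    last_seen and now are timestamps in seconds.
    """
    return (now - last_seen) > interval * allowed_misses


# A report 60 s ago is still within the 90 s window; 120 s ago is not.
recent = is_heartbeat_stale(last_seen=0, now=60)
overdue = is_heartbeat_stale(last_seen=0, now=120)
```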

Component

Definition

A runtime instance within a service, corresponding to a specific process or functional module. Component-level operations (start, stop, and so on) are executed by the Agent.

Component Examples

  • Hadoop: NameNode, DataNode, ResourceManager
  • Kafka: Kafka Broker
  • Solr: Solr Instance
  • SeaTunnel: SeaTunnel Master, SeaTunnel Worker, SeaTunnel Client

Job

Job

Definition

The smallest schedulable unit that a user can initiate, representing one complete operations goal. Examples: "Start Hadoop service" or "Update Spark configuration and restart".

Features

  • Atomicity: A Job ends in exactly one of two states: success or failure.
  • Scope: Acts on a single cluster.
  • Lifecycle: Records the complete course of an operation, from creation to its final state change.

Stage

Definition

A logical execution unit that a Job is decomposed into, corresponding to one independent operation step on a service component. For example:
"Start Hadoop" Job → "Start NameNode" Stage, "Start DataNode" Stage, etc.

Division Principles

  • Dependency: Components with startup order constraints must be split into independent Stages (e.g., NameNode needs to start before DataNode).
  • Isolation: Operations of different component categories must be executed in isolation.
  • Parallelism: Allows parallel execution of Tasks within the same Stage.
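
The Dependency principle amounts to ordering Stages topologically. The sketch below uses Python's standard `graphlib` module for that; the `START_DEPENDENCIES` map and the `plan_start_stages` function are hypothetical, and only the NameNode-before-DataNode constraint is taken from the text.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical dependency map: component -> components that must start first.
# Only the NameNode -> DataNode constraint comes from the text.
START_DEPENDENCIES = {
    "NameNode": set(),
    "DataNode": {"NameNode"},
}


def plan_start_stages(dependencies):
    """Return one Stage name per component, in a dependency-respecting order."""
    order = TopologicalSorter(dependencies).static_order()
    return [f"Start {component}" for component in order]


stages = plan_start_stages(START_DEPENDENCIES)
print(stages)  # ['Start NameNode', 'Start DataNode']
```

A cyclic dependency map would make `static_order()` raise `graphlib.CycleError`, which is a convenient place to reject an invalid Stack definition in a sketch like this.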

Task

Definition

An execution instance of a Stage on a specific host, representing the finest-grained operation instruction. For example:
"Start NameNode" Stage → "Start NameNode on host-01" Task, "Start NameNode on host-02" Task.
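
The Stage-to-Task relationship can be sketched as a simple expansion over hosts. The `Task` dataclass and `expand_stage` function are illustrative names, assumed for this example rather than taken from the actual code base.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    """One Stage applied to one host (hypothetical model)."""
    stage: str
    host: str

    @property
    def name(self):
        return f"{self.stage} on {self.host}"


def expand_stage(stage, hosts):
    """Create one Task per host that runs the Stage's component."""
    return [Task(stage, host) for host in hosts]


tasks = expand_stage("Start NameNode", ["host-01", "host-02"])
for task in tasks:
    print(task.name)
# Start NameNode on host-01
# Start NameNode on host-02
```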

Job Scheduling Process

Job Generates Stages and Tasks

After a user submits an operation request via the REST API, the Server:

  • Parses the request and validates it.
  • Generates a Stage DAG based on the dependencies between components.
  • Generates a set of host-level Tasks for each Stage.
  • Persists the Job, Stage, and Task metadata to the database, with each status initialized to PENDING.

Stage Scheduling Phase

The scheduler executes Stages in DAG order:

  • Checks the status of the preceding Stages; a Stage is triggered only when all of them have succeeded.
  • Extracts the set of Tasks in the Stage.
  • Dispatches the Tasks in batches to the corresponding hosts, where the Agent executes them.
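
The scheduling rule above can be sketched as a loop over the Stage DAG in topological order. The function `run_job`, the `run_stage` callback (a stand-in for dispatching Tasks to Agents), and the choice to mark downstream Stages as CANCELED after a failure are all assumptions made for this illustration.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def run_job(stage_dag, run_stage):
    """Run Stages in DAG order; skip Stages whose predecessors did not succeed.

    stage_dag maps each Stage name to the set of Stages that must succeed
    first. run_stage is a callable returning True on success.
    """
    status = {stage: "PENDING" for stage in stage_dag}
    for stage in TopologicalSorter(stage_dag).static_order():
        predecessors = stage_dag.get(stage, ())
        if all(status[p] == "SUCCESSFUL" for p in predecessors):
            status[stage] = "SUCCESSFUL" if run_stage(stage) else "FAILED"
        else:
            # Assumption: Stages behind a failed Stage are canceled.
            status[stage] = "CANCELED"
    return status


dag = {
    "Start NameNode": set(),
    "Start DataNode": {"Start NameNode"},
}
# Simulate a failure of the first Stage.
result = run_job(dag, lambda stage: stage != "Start NameNode")
print(result)
# {'Start NameNode': 'FAILED', 'Start DataNode': 'CANCELED'}
```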

Task Execution Phase

After the Agent receives a Task, it proceeds as follows:

  • Resource Pre-Check: Verifies the installation status and dependencies of the target component.
  • Script Execution: Invokes the component operation script predefined in the Stack.
  • Status Feedback: Writes task logs in real time and reports the Task status to the Server.

Execution Guarantee Mechanisms:

  • Timeout: A Task that runs longer than the timeout (30 minutes by default) is automatically marked as failed.
  • Retry: A Task that fails because of a network error can be retried automatically, up to 3 times.
  • Idempotency: Re-executing a Task that has already succeeded causes no side effects.
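
A minimal sketch of the Retry and Timeout guarantees follows. It assumes that "network error" maps to Python's built-in `ConnectionError` and that "up to 3 times" means three retries after the first attempt; the names `run_with_retry` and `is_timed_out` are hypothetical.

```python
MAX_RETRIES = 3                    # from the text: up to 3 automatic retries
TASK_TIMEOUT_SECONDS = 30 * 60     # from the text: default 30 minutes


def is_timed_out(started_at, now, timeout=TASK_TIMEOUT_SECONDS):
    """Return True when a Task has been running longer than the timeout."""
    return (now - started_at) > timeout


def run_with_retry(action, max_retries=MAX_RETRIES):
    """Call action(); retry only on ConnectionError, up to max_retries times.

    Returns a (status, attempts) tuple.
    """
    attempts = 0
    while True:
        attempts += 1
        try:
            action()
            return "SUCCESSFUL", attempts
        except ConnectionError:
            # Network failure: retry unless the retry budget is used up.
            if attempts > max_retries:
                return "FAILED", attempts
        except Exception:
            # Any other failure is final and is not retried.
            return "FAILED", attempts


calls = {"count": 0}


def flaky_action():
    """Fails with a network error twice, then succeeds."""
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("agent unreachable")


result = run_with_retry(flaky_action)
print(result)  # ('SUCCESSFUL', 3)
```

Automatic retries are only safe because of the Idempotency guarantee: if an earlier attempt actually succeeded but its response was lost, running the Task again must not change the outcome.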

State Management Mechanism

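| State Type        | Trigger Condition                               | Handling Strategy         |
|-------------------|-------------------------------------------------|---------------------------|
| PENDING           | Task created but not scheduled                  | Wait for invocation       |
| RUNNING           | Task in execution                               | Monitor timeout threshold |
| SUCCESSFUL/FAILED | Task execution result                           | Update component status   |
| CANCELED          | Task canceled (only exists for Stage/Task)      | Cancel subsequent tasks   |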
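
The states above can be modeled as a small state machine. The source lists the states but not the allowed moves between them, so the `ALLOWED` transition map below is an assumption, as are the names `State` and `transition`.

```python
from enum import Enum


class State(Enum):
    PENDING = "PENDING"
    RUNNING = "RUNNING"
    SUCCESSFUL = "SUCCESSFUL"
    FAILED = "FAILED"
    CANCELED = "CANCELED"


# Assumed transitions; the source does not spell these out.
ALLOWED = {
    State.PENDING: {State.RUNNING, State.CANCELED},
    State.RUNNING: {State.SUCCESSFUL, State.FAILED, State.CANCELED},
    State.SUCCESSFUL: set(),
    State.FAILED: set(),
    State.CANCELED: set(),
}


def transition(current, target):
    """Return the new state, or raise ValueError for a disallowed move."""
    if target not in ALLOWED[current]:
        raise ValueError(f"illegal transition {current.name} -> {target.name}")
    return target


state = transition(State.PENDING, State.RUNNING)
state = transition(state, State.SUCCESSFUL)
```

Treating SUCCESSFUL, FAILED, and CANCELED as terminal keeps the sketch simple; a model that covers automatic retries would need an extra path from a failed attempt back to RUNNING.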