# Concepts
## Cluster
### Definition
A cluster is a logical unit composed of a group of physical or virtual hosts that hosts the distributed runtime environment for big data components. Each cluster has its own configuration space and resource isolation boundary.
### Key Features
* **Multi Cluster**: A single Server instance can manage multiple clusters simultaneously (e.g., production clusters, test clusters, etc.).
* **Host Binding**: Each Host can only belong to one cluster.
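The host binding rule can be illustrated with a minimal sketch. The `ClusterRegistry` class and `HostAlreadyBoundError` below are hypothetical names chosen for illustration and are not part of the Server's actual data model; the sketch only shows one way to reject a host that already belongs to another cluster.

```python
class HostAlreadyBoundError(Exception):
    """Raised when a host is added to a second cluster."""


class ClusterRegistry:
    """Hypothetical in-memory registry, not the Server's real implementation."""

    def __init__(self):
        self._clusters = {}          # cluster name -> set of hostnames
        self._host_to_cluster = {}   # hostname -> owning cluster name

    def create_cluster(self, name):
        # One registry can hold many clusters (e.g. production and test).
        self._clusters.setdefault(name, set())

    def add_host(self, cluster, hostname):
        if cluster not in self._clusters:
            raise KeyError(f"unknown cluster: {cluster}")
        owner = self._host_to_cluster.get(hostname)
        if owner is not None and owner != cluster:
            # Host binding: a host can only belong to one cluster.
            raise HostAlreadyBoundError(f"{hostname} already belongs to {owner}")
        self._clusters[cluster].add(hostname)
        self._host_to_cluster[hostname] = cluster

    def hosts(self, cluster):
        return sorted(self._clusters[cluster])
```

Adding `host-01` to a cluster named `prod` and then to a cluster named `test` would raise `HostAlreadyBoundError` on the second call.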
### Stack
#### Definition
A predefined standardized service collection that includes installation scripts, configuration templates, and dependency relationship descriptions.
#### Stack List
| Stack | Description |
|------------|---------------------------------------------------------------------------|
| **Infra** | Services shared by all clusters, such as the monitoring system Prometheus |
| **Bigtop** | Services provided by Apache Bigtop, such as Hadoop/Hive/Spark, etc. |
| **Extra** | Community-provided or custom services, such as SeaTunnel |
### Service
#### Definition
A service unit running on a cluster, representing a specific big data service (such as Hadoop, Hive, or Spark).
#### Management
##### Configuration Management
* **Snapshots**: Supports configuration snapshot creation and management.
* **Templates**: Uses Freemarker syntax to dynamically render configuration files.
##### Status Monitoring
* **Heartbeat**: The Agent reports service health status every 30 seconds.
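As a rough illustration of how a 30-second heartbeat could be turned into a health decision, the sketch below flags a service as stale once several heartbeats in a row are missing. The function name `is_stale` and the tolerance of three missed heartbeats are assumptions made for this example; the source only states the reporting interval.

```python
HEARTBEAT_INTERVAL_SECONDS = 30  # reporting interval stated in the text
MISSED_BEATS_ALLOWED = 3         # assumption: tolerance before flagging a service


def is_stale(last_heartbeat, now,
             interval=HEARTBEAT_INTERVAL_SECONDS,
             missed_allowed=MISSED_BEATS_ALLOWED):
    """Return True when no heartbeat arrived within the tolerated window.

    Both timestamps are seconds on the same clock (e.g. time.monotonic()).
    """
    return (now - last_heartbeat) > interval * missed_allowed
```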
### Component
#### Definition
A runtime instance within a service, corresponding to specific processes or functional modules. Component-level operations (start/stop, etc.) are executed by the Agent.
#### Component Examples
```mermaid
graph TB
Hadoop-->NameNode
Hadoop-->DataNode
Hadoop-->ResourceManager
Kafka-->KB[Kafka Broker]
Solr-->SI[Solr Instance]
SeaTunnel-->SM[SeaTunnel Master]
SeaTunnel-->SW[SeaTunnel Worker]
SeaTunnel-->SL[SeaTunnel Client]
```
## Job
### Job
#### Definition
The smallest schedulable unit initiated by users, representing one complete operations and maintenance goal. For example:
`Start Hadoop service`, `Update Spark configuration and restart`, etc.
#### Features
* **Atomicity**: The execution result of a Job has only two states: success or failure.
* **Scope**: Acts on a single cluster.
* **Lifecycle**: Covers the complete operation history, from creation to the final state.
### Stage
#### Definition
A logical execution unit decomposed from a Job, corresponding to an independent operation step on a service component. For example:
The `Start Hadoop` Job is decomposed into a `Start NameNode` Stage, a `Start DataNode` Stage, etc.
#### Division Principles
* **Dependency**: Components with startup order constraints must be split into independent Stages (e.g., NameNode needs to start before DataNode).
* **Isolation**: Operations of different component categories must be executed in isolation.
* **Parallelism**: Allows parallel execution of Tasks within the same Stage.
### Task
#### Definition
An execution instance of a Stage on a specific host, representing the smallest granularity of operation instructions. For example:
The `Start NameNode` Stage is decomposed into a `Start NameNode on host-01` Task and a `Start NameNode on host-02` Task.
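The Job, Stage, and Task hierarchy can be pictured as nested records. The dataclasses and the `build_stage` helper below are hypothetical and only mirror the naming used in the examples above; the cluster name `demo` and host `host-03` are made up for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Task:
    name: str
    host: str
    state: str = "PENDING"


@dataclass
class Stage:
    name: str
    tasks: list = field(default_factory=list)


@dataclass
class Job:
    name: str
    cluster: str  # a Job acts on a single cluster
    stages: list = field(default_factory=list)


def build_stage(action, component, hosts):
    """Create one Stage for a component and one Task per target host."""
    stage_name = f"{action} {component}"
    tasks = [Task(name=f"{stage_name} on {host}", host=host) for host in hosts]
    return Stage(name=stage_name, tasks=tasks)


job = Job(name="Start Hadoop", cluster="demo")
job.stages.append(build_stage("Start", "NameNode", ["host-01", "host-02"]))
job.stages.append(build_stage("Start", "DataNode", ["host-03"]))
```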
### Job Scheduling Process
#### Job Generates Stages and Tasks
After users submit operation requests via the REST API:
* The Server parses the request and validates its legitimacy.
* Generates a Stage DAG based on component dependency relationships.
* Generates a set of host-level Tasks for each Stage.
* Persists Job/Stage/Task metadata to the database (status initialized to `PENDING`).
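Deriving a Stage order from component dependencies is a topological sort. The sketch below uses Python's standard `graphlib.TopologicalSorter` (Python 3.9+) as a stand-in; it is not the Server's implementation. The dependency map is illustrative: only the NameNode-before-DataNode constraint comes from this document, and ResourceManager is assumed to have no start dependency.

```python
from graphlib import TopologicalSorter

# component -> components that must be started first (illustrative data)
START_DEPENDENCIES = {
    "NameNode": set(),
    "DataNode": {"NameNode"},
    "ResourceManager": set(),  # assumption: no start dependency
}


def stage_batches(dependencies, action="Start"):
    """Group Stages into batches; each batch depends only on earlier batches."""
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()  # raises graphlib.CycleError on circular dependencies
    batches = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # sorted for a deterministic order
        batches.append([f"{action} {component}" for component in ready])
        sorter.done(*ready)
    return batches
```

With the map above, `stage_batches(START_DEPENDENCIES)` yields a first batch containing `Start NameNode` and `Start ResourceManager`, followed by a batch containing `Start DataNode`.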
#### Stage Scheduling Phase
The scheduler executes Stages in DAG order:
* Checks the status of preceding Stages (triggered only when all preceding Stages succeed).
* Extracts the set of Tasks in the Stage.
* Dispatches the Tasks in batches to the corresponding hosts for execution by the Agent.
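The gating rule (a Stage is triggered only when all preceding Stages succeed) can be sketched as a simple loop. The `run_stages` function and its `run_stage` callback are hypothetical; marking skipped Stages as `CANCELED` is an assumption based on the state names used elsewhere in this document.

```python
def run_stages(stages, predecessors, run_stage):
    """Run Stages in order, skipping any Stage whose predecessors did not succeed.

    stages:       Stage names, already in topological (DAG) order
    predecessors: Stage name -> set of Stage names that must succeed first
    run_stage:    callable(name) -> True on success, False on failure
    """
    states = {name: "PENDING" for name in stages}
    for name in stages:
        required = predecessors.get(name, ())
        if all(states[p] == "SUCCESSFUL" for p in required):
            states[name] = "SUCCESSFUL" if run_stage(name) else "FAILED"
        else:
            # Assumption: a Stage that can no longer run is canceled.
            states[name] = "CANCELED"
    return states
```

For example, if `Start NameNode` fails, a `Start DataNode` Stage that depends on it is never dispatched and ends up `CANCELED`.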
#### Task Execution Phase
Processing flow after the Agent receives a Task:
* **Resource Pre-Check**: Verifies the installation status and dependencies of the target component.
* **Script Execution**: Invokes the predefined component operation script in the Stack.
* **Status Feedback**: Writes task logs in real time and reports the Task status to the Server.
Execution Guarantee Mechanisms:
* **Timeout**: A Task that runs longer than the timeout (default 30 minutes) is automatically marked as failed.
* **Retry**: Tasks that fail because of network exceptions can be retried automatically (up to 3 times).
* **Idempotency**: Re-executing a successful Task causes no side effects.
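The timeout and retry guarantees can be sketched as a small wrapper. This is an illustration under assumptions, not the Agent's code: `execute_with_retry` is a hypothetical name, Python's built-in `ConnectionError` stands in for "network exception", and the deadline is only checked between attempts, so a hanging action is not interrupted here.

```python
import time

MAX_RETRIES = 3            # "up to 3 times" from the text
TIMEOUT_SECONDS = 30 * 60  # default 30 minutes from the text


def execute_with_retry(action, max_retries=MAX_RETRIES,
                       timeout=TIMEOUT_SECONDS, clock=time.monotonic):
    """Run action(); retry network failures; return (state, attempts)."""
    deadline = clock() + timeout
    attempts = 0
    while True:
        attempts += 1
        try:
            action()
            return "SUCCESSFUL", attempts
        except ConnectionError:
            # Retry only network failures, and only within the limits.
            if attempts > max_retries or clock() >= deadline:
                return "FAILED", attempts
        except Exception:
            # Any other failure is final and is not retried.
            return "FAILED", attempts
```

Because a Task may run more than once under this scheme, the idempotency guarantee is what makes automatic retries safe.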
#### State Management Mechanism
| State Type | Trigger Condition | Handling Strategy |
|-------------------|--------------------------------------------|---------------------------|
| PENDING | Task created but not scheduled | Wait for invocation |
| RUNNING | Task in execution | Monitor timeout threshold |
| SUCCESSFUL/FAILED | Task execution result | Update component status |
| CANCELED | Task canceled (only exists for Stage/Task) | Cancel subsequent tasks |
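The states in the table can be modeled as a small state machine. The table names the states but does not spell out which transitions are legal, so the `ALLOWED_TRANSITIONS` map below is an assumption made for illustration, as is the `transition` helper.

```python
from enum import Enum


class State(Enum):
    PENDING = "PENDING"
    RUNNING = "RUNNING"
    SUCCESSFUL = "SUCCESSFUL"
    FAILED = "FAILED"
    CANCELED = "CANCELED"


# Assumed transitions: terminal states have no outgoing edges.
ALLOWED_TRANSITIONS = {
    State.PENDING: {State.RUNNING, State.CANCELED},
    State.RUNNING: {State.SUCCESSFUL, State.FAILED, State.CANCELED},
    State.SUCCESSFUL: set(),
    State.FAILED: set(),
    State.CANCELED: set(),
}


def transition(current, target):
    """Return the new state, or raise ValueError for an illegal move."""
    if target not in ALLOWED_TRANSITIONS[current]:
        raise ValueError(f"illegal transition: {current.value} -> {target.value}")
    return target
```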