Operations Guide

This guide covers monitoring, troubleshooting, backup & recovery, and operational procedures for HugeGraph Store in production.

Monitoring and Metrics

Metrics Endpoints

Store Node Metrics:

# Health check
curl http://<store-host>:8520/actuator/health

# All metrics
curl http://<store-host>:8520/actuator/metrics

# Specific metric
curl http://<store-host>:8520/actuator/metrics/jvm.memory.used

PD Metrics:

curl http://<pd-host>:8620/actuator/metrics

Key Metrics to Monitor

1. Raft Metrics

Metric: raft.leader.election.count

  • Description: Number of leader elections
  • Normal: 0-1 per hour (initial election)
  • Warning: >5 per hour (network issues or node instability)

Metric: raft.log.apply.latency

  • Description: Time to apply Raft log entries (ms)
  • Normal: <10ms (p99)
  • Warning: >50ms (disk I/O bottleneck)

Metric: raft.snapshot.create.duration

  • Description: Snapshot creation time (ms)
  • Normal: <30,000ms (30 seconds)
  • Warning: >60,000ms (large partition or slow disk)

2. RocksDB Metrics

Metric: rocksdb.read.latency

  • Description: RocksDB read latency (microseconds)
  • Normal: <1000μs (1ms) for p99
  • Warning: >5000μs (5ms) - check compaction or cache hit rate

Metric: rocksdb.write.latency

  • Description: RocksDB write latency (microseconds)
  • Normal: <2000μs (2ms) for p99
  • Warning: >10000μs (10ms) - check compaction backlog

Metric: rocksdb.compaction.pending

  • Description: Number of pending compactions
  • Normal: 0-2
  • Warning: >5 (write stall likely)

Metric: rocksdb.block.cache.hit.rate

  • Description: Block cache hit rate (%)
  • Normal: >90%
  • Warning: <70% (increase cache size)
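For a quick sanity check, the hit rate can be recomputed from the raw hit/miss counters (the counter values below are illustrative):

```shell
# Block cache hit rate = hits / (hits + misses), as a percentage.
# Sample counters; in practice read them from the metrics endpoint.
hits=940000
misses=60000
hit_rate=$(awk -v h="$hits" -v m="$misses" 'BEGIN { printf "%.1f", 100 * h / (h + m) }')
echo "block cache hit rate: ${hit_rate}%"
# -> block cache hit rate: 94.0%
```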

3. Partition Metrics

Metric: partition.count

  • Description: Number of partitions on this Store node
  • Normal: Evenly distributed across nodes
  • Warning: >2x average (rebalancing needed)

Metric: partition.leader.count

  • Description: Number of Raft leaders on this node
  • Normal: ~partitionCount / 3 (for 3 replicas)
  • Warning: 0 (node cannot serve writes)

Queries:

# Check partition distribution
curl http://localhost:8620/v1/partitionsAndStats

# Example output (truncated):
# {
#   "partitions": { ... },
#   "partitionStats": { ... }
# }
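A rough per-store tally can be pulled out of a saved response with standard tools; the `storeId` field name here is an assumption, so match it against the actual JSON your PD version returns:

```shell
# Tally partitions per store from a saved partitionsAndStats response.
# NOTE: the "storeId" field name is an assumption -- check it against
# the real response shape before relying on this.
cat > /tmp/parts.json <<'EOF'
{"partitions":[{"id":1,"storeId":101},{"id":2,"storeId":101},{"id":3,"storeId":102}]}
EOF
# Count occurrences of each storeId, busiest store first.
grep -o '"storeId":[0-9]*' /tmp/parts.json | sort | uniq -c | sort -rn
```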

4. gRPC Metrics

Metric: grpc.request.qps

  • Description: Requests per second
  • Normal: Depends on workload
  • Warning: Sudden drops (connection issues)

Metric: grpc.request.latency

  • Description: gRPC request latency (ms)
  • Normal: <20ms for p99
  • Warning: >100ms (network or processing bottleneck)

Metric: grpc.error.rate

  • Description: Error rate (errors/sec)
  • Normal: <1% of QPS
  • Warning: >5% (investigate errors)

5. System Metrics

Disk Usage:

# Check Store data directory
df -h | grep storage

# Recommended: <80% full
# Warning: >90% full
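That check can be scripted; this sketch filters `df -P` output for filesystems above a threshold (sample output is inlined so the filter is easy to verify):

```shell
# Print filesystems above a usage threshold (80% here).
# In production, pipe real output instead: df -P | awk ...
df_sample='Filesystem 1024-blocks Used Available Capacity Mounted-on
/dev/sda1 1000 900 100 90% /data/storage
/dev/sdb1 1000 500 500 50% /var'
echo "$df_sample" | awk -v t=80 'NR > 1 { gsub(/%/, "", $5); if ($5 + 0 > t) print $6, $5 "%" }'
# -> /data/storage 90%
```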

Memory Usage:

# JVM heap usage
curl http://192.168.1.20:8520/actuator/metrics/jvm.memory.used

# RocksDB memory (block cache + memtables)
curl http://192.168.1.20:8520/actuator/metrics/rocksdb.memory.usage

CPU Usage:

# Overall CPU
top -p $(pgrep -f hugegraph-store)

# Recommended: <70% average
# Warning: >90% sustained

Prometheus Integration

Configure Prometheus (prometheus.yml):

scrape_configs:
  - job_name: 'hugegraph-store'
    static_configs:
      - targets:
          - '192.168.1.20:8520'
          - '192.168.1.21:8520'
          - '192.168.1.22:8520'
    metrics_path: '/actuator/prometheus'
    scrape_interval: 15s

Grafana Dashboard: Import HugeGraph Store dashboard (JSON available in project)

Alert Rules

Example Prometheus Alerts (alerts.yml):

groups:
  - name: hugegraph-store
    rules:
      # Raft leader elections too frequent
      - alert: FrequentLeaderElections
        expr: rate(raft_leader_election_count[5m]) > 0.01
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Frequent Raft leader elections on {{ $labels.instance }}"

      # RocksDB write stall
      - alert: RocksDBWriteStall
        expr: rocksdb_compaction_pending > 10
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "RocksDB write stall on {{ $labels.instance }}"

      # Disk usage high
      - alert: HighDiskUsage
        expr: disk_used_percent > 85
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Disk usage >85% on {{ $labels.instance }}"

      # Store node down
      - alert: StoreNodeDown
        expr: up{job="hugegraph-store"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Store node {{ $labels.instance }} is down"
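As a sanity check on the FrequentLeaderElections threshold, a per-second rate over the 5-minute window translates into elections per window like this:

```shell
# rate(raft_leader_election_count[5m]) > 0.01 elections/sec
# corresponds to more than ~3 elections per 5-minute (300 s) window.
awk 'BEGIN { printf "%d elections per 5m window\n", 0.01 * 300 }'
# -> 3 elections per 5m window
```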

Common Issues and Troubleshooting

Issue 1: Raft Leader Election Failures

Symptoms:

  • Write requests fail with “No leader”
  • Frequent leader elections in logs
  • raft.leader.election.count metric increasing rapidly

Diagnosis:

# Check Store logs
tail -f logs/hugegraph-store.log | grep "Raft election"

# Check network latency between Store nodes
ping 192.168.1.21
ping 192.168.1.22

# Check Raft status (via PD)
curl http://192.168.1.10:8620/pd/v1/partitions | jq '.[] | select(.leader == null)'

Root Causes:

  1. Network Partition: Store nodes cannot communicate
  2. High Latency: Network latency >50ms between nodes
  3. Disk I/O Stall: Raft log writes timing out
  4. Clock Skew: System clocks out of sync

Solutions:

  1. Fix Network: Check switches, firewalls, routing
  2. Reduce Latency: Deploy nodes in same datacenter/zone
  3. Check Disk: Use iostat -x 1 to check disk I/O
  4. Sync Clocks: Use NTP to synchronize system clocks
    ntpdate -u pool.ntp.org
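Clock skew can be spotted from sampled epoch timestamps; the two readings below are illustrative (collect real ones with `date +%s` on each node):

```shell
# Flag clock skew between two sampled timestamps (epoch seconds).
t1=1738140000   # sample reading from node A
t2=1738140002   # sample reading from node B
awk -v a="$t1" -v b="$t2" 'BEGIN {
  d = (a > b) ? a - b : b - a
  if (d > 1) print "SKEW " d "s"; else print "OK"
}'
# -> SKEW 2s
```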
    

Issue 2: Partition Imbalance

Symptoms:

  • Some Store nodes have 2x more partitions than others
  • Uneven disk usage across Store nodes
  • Some nodes overloaded, others idle

Diagnosis:

# Check partition distribution
curl http://localhost:8620/v1/partitionsAndStats

# Example output (truncated; compare partition counts per store):
# {
#   "partitions": { ... },
#   "partitionStats": { ... }
# }

Root Causes:

  1. New Store Added: Partitions not yet rebalanced
  2. PD Patrol Disabled: Auto-rebalancing not running
  3. Rebalancing Too Slow: patrol-interval too high

Solutions:

  1. Trigger Manual Rebalance (via PD API):

    curl -X POST http://192.168.1.10:8620/v1/balanceLeaders
    
  2. Reduce Patrol Interval (in PD application.yml):

    pd:
      patrol-interval: 600  # Rebalance every 10 minutes (instead of 30)
    
  3. Check PD Logs:

    tail -f logs/hugegraph-pd.log | grep "balance"
    
  4. Wait: Rebalancing is gradual (may take hours for large datasets)
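To quantify the imbalance before and after rebalancing, compute the max-to-average ratio of per-store partition counts (sample counts shown):

```shell
# Imbalance ratio = max partition count / average partition count.
# >2.0 matches the ">2x average" warning threshold used above.
counts="12 4 2"   # sample per-store partition counts
echo "$counts" | awk '{
  max = 0; sum = 0
  for (i = 1; i <= NF; i++) { sum += $i; if ($i > max) max = $i }
  printf "imbalance ratio: %.2f\n", max / (sum / NF)
}'
# -> imbalance ratio: 2.00
```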


Issue 3: Data Migration Slow

Symptoms:

  • Partition migration takes hours
  • Raft snapshot transfer stalled
  • High network traffic but low progress

Diagnosis:

# Check Raft snapshot status
tail -f logs/hugegraph-store.log | grep snapshot

# Check network throughput
iftop -i eth0

# Check disk I/O during snapshot
iostat -x 1

Root Causes:

  1. Large Partitions: Partitions >10GB take long to transfer
  2. Network Bandwidth: Limited bandwidth (<100Mbps)
  3. Disk I/O: Slow disk on target Store

Solutions:

  1. Increase Snapshot Interval (reduce snapshot size):

    raft:
      snapshotInterval: 900  # Snapshot every 15 minutes
    
  2. Increase Network Bandwidth: Use 1Gbps+ network

  3. Parallelize Migration: PD migrates one partition at a time by default

    • Edit PD configuration to allow concurrent migrations (advanced)
  4. Monitor Progress:

    # Check partition state transitions
    curl http://192.168.1.10:8620/v1/partitions | grep -i migrating
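For a rough expectation of migration time, divide partition size by usable bandwidth; for example, a 10 GB partition over ~100 MB/s (a 1 Gbps link minus protocol overhead):

```shell
# Rough transfer-time estimate: partition size / usable bandwidth.
# 10 GB = 10 * 1024 MB, over ~100 MB/s effective throughput.
awk 'BEGIN { printf "%.0f seconds\n", (10 * 1024) / 100 }'
# -> 102 seconds
```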
    

Issue 4: RocksDB Performance Degradation

Symptoms:

  • Query latency increasing over time
  • rocksdb.read.latency >5ms
  • rocksdb.compaction.pending >5

Diagnosis:

# Check Store logs for compaction
tail -f logs/hugegraph-store.log | grep compaction

Root Causes:

  1. Write Amplification: Too many compactions
  2. Low Cache Hit Rate: Block cache too small
  3. SST File Proliferation: Too many SST files in L0

Solutions:

  1. Increase Block Cache (in the Store's conf/application.yml):

    rocksdb:
      block_cache_size: 32000000000  # 32GB (from 16GB)
    
  2. Increase Write Buffer (reduce L0 files):

    rocksdb:
      write_buffer_size: 268435456  # 256MB (from 128MB)
      max_write_buffer_number: 8    # More memtables
    
  3. Restart Store Node (last resort, triggers compaction on startup):

    bin/stop-hugegraph-store.sh
    bin/start-hugegraph-store.sh
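Before raising the block cache or write buffers, check that the combined off-heap budget still fits the node; with the example values above:

```shell
# RocksDB off-heap budget ~= block cache + (memtables x write buffer).
# Example values from above: 32 GB cache + 8 x 256 MB memtables.
awk 'BEGIN { printf "%.0f GB off-heap\n", 32 + (8 * 256) / 1024 }'
# -> 34 GB off-heap
```

This must fit alongside the JVM heap within physical RAM, or the OOM killer becomes a risk.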
    

Issue 5: Store Node Unresponsive

Symptoms:

  • gRPC requests timing out
  • Health check fails
  • CPU or memory at 100%

Diagnosis:

# Check if process is alive
ps aux | grep hugegraph-store

# Check CPU/memory
top -p $(pgrep -f hugegraph-store)

# Check logs
tail -100 logs/hugegraph-store.log

# Check for OOM killer
dmesg | grep -i "out of memory"

# Check disk space
df -h

Root Causes:

  1. Out of Memory (OOM): JVM heap exhausted
  2. Disk Full: No space for Raft logs or RocksDB writes
  3. Thread Deadlock: Internal deadlock in Store code
  4. Network Saturation: Too many concurrent requests

Solutions:

  1. OOM:

    • Increase JVM heap: Edit start-hugegraph-store.sh and raise the heap size (e.g. -Xmx32g)
    • Restart Store node
  2. Disk Full:

    • Clean up old Raft snapshots:
      rm -rf storage/raft/partition-*/snapshot/*  # Caution: removes ALL snapshots; Raft creates a fresh one at the next snapshot interval
      
    • Add more disk space
  3. Thread Deadlock:

    • Take thread dump:
      jstack $(pgrep -f hugegraph-store) > threaddump.txt
      
    • Restart Store node
    • Report to HugeGraph team with thread dump
  4. Network Saturation:

    • Check connection count:
      netstat -an | grep :8500 | wc -l
      
    • Reduce store.max_sessions in Server config
    • Add more Store nodes to distribute load

Backup and Recovery

Backup Strategies

Strategy 1: Snapshot-Based Backup

Frequency: Daily or weekly

Process:

# On each Store node
cd storage

# Create snapshot (Raft snapshots)
# Snapshots are automatically created by Raft every `snapshotInterval` seconds
# Locate latest snapshot:
find raft/partition-*/snapshot -name "snapshot_*" -type d | sort | tail -5

# Copy to backup location
tar -czf backup-store1-$(date +%Y%m%d).tar.gz raft/partition-*/snapshot/*

# Upload to remote storage
scp backup-store1-*.tar.gz backup-server:/backups/
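Before shipping an archive off-node, verify that it round-trips; this sketch uses sample paths, so point it at the real snapshot directories in production:

```shell
# Verify an archive round-trips before relying on it (sample paths;
# substitute the real raft/partition-*/snapshot directories).
mkdir -p /tmp/bk/raft/partition-0/snapshot
echo "sample" > /tmp/bk/raft/partition-0/snapshot/snapshot_0001
tar -czf /tmp/bk.tar.gz -C /tmp/bk raft
# List the archive and confirm the snapshot entry is present.
tar -tzf /tmp/bk.tar.gz | grep snapshot_0001 && echo "backup OK"
```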

Pros:

  • Fast backup (no downtime)
  • Point-in-time recovery

Cons:

  • Requires all Store nodes to be backed up
  • May miss recent writes (since last snapshot)

Disaster Recovery Procedures

Scenario 1: Single Store Node Failure

Impact: Partitions with replicas on this node lose one replica

Action:

  1. No immediate action needed: Remaining replicas continue serving

  2. Monitor: Check if Raft leaders re-elected

    curl http://192.168.1.10:8620/v1/partitions | grep leader
    
  3. Replace Failed Node:

    • Deploy new Store node with same configuration
    • PD automatically assigns partitions to new node
    • Wait for data replication (may take hours)
  4. Verify: Check partition distribution

     curl http://localhost:8620/v1/partitionsAndStats
    

Scenario 2: Complete Store Cluster Failure

Impact: All data inaccessible

Action:

  1. Restore PD Cluster (if also failed):

    • Deploy 3 new PD nodes
    • Restore PD metadata from backup
    • Start PD nodes
  2. Restore Store Cluster:

    • Deploy 3 new Store nodes
    • Extract backup on each node:
      cd storage
      tar -xzf /backups/backup-store1-20250129.tar.gz
      
  3. Start Store Nodes:

    bin/start-hugegraph-store.sh
    
  4. Verify Data:

    # Check via Server
    curl http://192.168.1.30:8080/graphspaces/{graphspaces_name}/graphs/{graph_name}/vertices?limit=10
    

Scenario 3: Data Corruption

Impact: RocksDB corruption on one or more partitions

Action:

  1. Identify Corrupted Partition:

    # Check logs for corruption errors
    tail -f logs/hugegraph-store.log | grep -i corrupt
    
  2. Stop Store Node:

    bin/stop-hugegraph-store.sh
    
  3. Delete Corrupted Partition Data:

    # Assuming partition 5 is corrupted
    rm -rf storage/raft/partition-5
    
  4. Restart Store Node:

    bin/start-hugegraph-store.sh
    
  5. Re-replicate Data:

    • Raft automatically re-replicates from healthy replicas
    • Monitor replication progress:
      tail -f logs/hugegraph-store.log | grep "snapshot install"
      

Capacity Management

Monitoring Capacity

Disk Usage:

# Per Store node
du -sh storage/

# Expected growth rate: Track over weeks

Partition Count:

# Current partition count
curl http://192.168.1.10:8620/v1/partitionsAndStats

# Recommendation: 3-5x Store node count
# Example: 6 Store nodes → 18-30 partitions
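The 3-5x guideline as a one-liner (node count is a sample value):

```shell
# Recommended partition count: 3-5x the number of Store nodes.
nodes=6
awk -v n="$nodes" 'BEGIN { print n * 3 "-" n * 5 " partitions" }'
# -> 18-30 partitions
```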

Adding Store Nodes

When to Add:

  • Disk usage >80% on existing nodes
  • CPU usage >70% sustained
  • Query latency increasing

Process:

  1. Deploy New Store Node:

    # Follow deployment guide
    tar -xzf apache-hugegraph-store-incubating-1.7.0.tar.gz
    cd apache-hugegraph-store-incubating-1.7.0
    
    # Configure and start
    vi conf/application.yml
    bin/start-hugegraph-store.sh
    
  2. Verify Registration:

    curl http://192.168.1.10:8620/v1/stores
    # New Store should appear
    
  3. Trigger Rebalancing (optional):

    curl -X POST http://192.168.1.10:8620/v1/balanceLeaders
    
  4. Monitor Rebalancing:

    # Watch partition distribution
    watch -n 10 'curl http://192.168.1.10:8620/v1/partitionsAndStats'
    
  5. Verify: Wait for even distribution (may take hours)

Removing Store Nodes

When to Remove:

  • Decommissioning hardware
  • Downsizing cluster (off-peak hours)

Process:

  1. Mark Store for Removal (via PD API):

    curl --location --request POST 'http://localhost:8080/store/123' \
      --header 'Content-Type: application/json' \
      --data-raw '{
        "storeState": "Off"
      }'
    

    Refer to API definition in StoreAPI::setStore

  2. Wait for Migration:

    • PD migrates all partitions off this Store
  3. Stop Store Node:

    bin/stop-hugegraph-store.sh
    
  4. Remove from PD (optional):


Rolling Upgrades

Upgrade Strategy

Goal: Upgrade cluster with zero downtime

Prerequisites:

  • Version compatibility: Check release notes
  • Backup: Take full backup before upgrade
  • Testing: Test upgrade in staging environment

Upgrade Procedure

Step 1: Upgrade Store Nodes (one at a time)

Node 1:

# Stop Store node
bin/stop-hugegraph-store.sh

# Backup current version
mv apache-hugegraph-store-incubating-1.7.0 apache-hugegraph-store-incubating-1.7.0-backup

# Extract new version
tar -xzf apache-hugegraph-store-incubating-1.8.0.tar.gz
cd apache-hugegraph-store-incubating-1.8.0

# Copy configuration from backup
cp ../apache-hugegraph-store-incubating-1.7.0-backup/conf/application.yml conf/

# Start new version
bin/start-hugegraph-store.sh

# Verify
curl http://192.168.1.20:8520/actuator/health
tail -f logs/hugegraph-store.log

Wait 5-10 minutes, then repeat for Node 2, then Node 3.

Step 2: Upgrade PD Nodes (one at a time)

Same process as Store, but upgrade PD cluster first or last (check release notes).

Step 3: Upgrade Server Nodes (one at a time)

# Stop Server
bin/stop-hugegraph.sh

# Upgrade and restart
# (same process as Store)

bin/start-hugegraph.sh

Rollback Procedure

If upgrade fails:

# Stop new version
bin/stop-hugegraph-store.sh

# Restore backup
rm -rf apache-hugegraph-store-incubating-1.8.0
mv apache-hugegraph-store-incubating-1.7.0-backup apache-hugegraph-store-incubating-1.7.0
cd apache-hugegraph-store-incubating-1.7.0

# Restart old version
bin/start-hugegraph-store.sh

For performance tuning, see Best Practices.

For development and debugging, see Development Guide.