Operations Guide

This guide covers monitoring, troubleshooting, backup & recovery, and operational procedures for HugeGraph Store in production.

Monitoring and Metrics

Metrics Endpoints

Store Node Metrics:

# Health check
curl http://<store-host>:8520/actuator/health

# All metrics
curl http://<store-host>:8520/actuator/metrics

# Specific metric
curl http://<store-host>:8520/actuator/metrics/jvm.memory.used

PD Metrics:

curl http://<pd-host>:8620/actuator/metrics

Key Metrics to Monitor

1. Raft Metrics

Metric: raft.leader.election.count

  • Description: Number of leader elections
  • Normal: 0-1 per hour (initial election)
  • Warning: >5 per hour (network issues or node instability)

Metric: raft.log.apply.latency

  • Description: Time to apply Raft log entries (ms)
  • Normal: <10ms (p99)
  • Warning: >50ms (disk I/O bottleneck)

Metric: raft.snapshot.create.duration

  • Description: Snapshot creation time (ms)
  • Normal: <30,000ms (30 seconds)
  • Warning: >60,000ms (large partition or slow disk)

2. RocksDB Metrics

Metric: rocksdb.read.latency

  • Description: RocksDB read latency (microseconds)
  • Normal: <1000μs (1ms) for p99
  • Warning: >5000μs (5ms) - check compaction or cache hit rate

Metric: rocksdb.write.latency

  • Description: RocksDB write latency (microseconds)
  • Normal: <2000μs (2ms) for p99
  • Warning: >10000μs (10ms) - check compaction backlog

Metric: rocksdb.compaction.pending

  • Description: Number of pending compactions
  • Normal: 0-2
  • Warning: >5 (write stall likely)

Metric: rocksdb.block.cache.hit.rate

  • Description: Block cache hit rate (%)
  • Normal: >90%
  • Warning: <70% (increase cache size)
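For a quick sanity check, the hit rate can be recomputed from the raw hit/miss counters (the counter values below are illustrative):

```shell
# Block cache hit rate = hits / (hits + misses), as a percentage.
# Sample counters; in practice read them from the metrics endpoint.
hits=940000
misses=60000
hit_rate=$(awk -v h="$hits" -v m="$misses" 'BEGIN { printf "%.1f", 100 * h / (h + m) }')
echo "block cache hit rate: ${hit_rate}%"
# -> block cache hit rate: 94.0%
```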

3. Partition Metrics

Metric: partition.count

  • Description: Number of partitions on this Store node
  • Normal: Evenly distributed across nodes
  • Warning: >2x average (rebalancing needed)

Metric: partition.leader.count

  • Description: Number of Raft leaders on this node
  • Normal: ~partitionCount / 3 (for 3 replicas)
  • Warning: 0 (node cannot serve writes)

Queries:

# Check partition distribution
curl http://localhost:8620/v1/partitionsAndStats

# Example output (truncated):
# {
#   "partitions": { ... },
#   "partitionStats": { ... }
# }
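A rough per-store tally can be pulled out of a saved response with standard tools; the `storeId` field name here is an assumption, so match it against the actual JSON your PD version returns:

```shell
# Tally partitions per store from a saved partitionsAndStats response.
# NOTE: the "storeId" field name is an assumption -- check it against
# the real response shape before relying on this.
cat > /tmp/parts.json <<'EOF'
{"partitions":[{"id":1,"storeId":101},{"id":2,"storeId":101},{"id":3,"storeId":102}]}
EOF
# Count occurrences of each storeId, busiest store first.
grep -o '"storeId":[0-9]*' /tmp/parts.json | sort | uniq -c | sort -rn
```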

4. gRPC Metrics

Metric: grpc.request.qps

  • Description: Requests per second
  • Normal: Depends on workload
  • Warning: Sudden drops (connection issues)

Metric: grpc.request.latency

  • Description: gRPC request latency (ms)
  • Normal: <20ms for p99
  • Warning: >100ms (network or processing bottleneck)

Metric: grpc.error.rate

  • Description: Error rate (errors/sec)
  • Normal: <1% of QPS
  • Warning: >5% (investigate errors)

5. System Metrics

Disk Usage:

# Check Store data directory
df -h | grep storage

# Recommended: <80% full
# Warning: >90% full
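That check can be scripted; this sketch filters `df -P` output for filesystems above a threshold (sample output is inlined so the filter is easy to verify):

```shell
# Print filesystems above a usage threshold (80% here).
# In production, pipe real output instead: df -P | awk ...
df_sample='Filesystem 1024-blocks Used Available Capacity Mounted-on
/dev/sda1 1000 900 100 90% /data/storage
/dev/sdb1 1000 500 500 50% /var'
echo "$df_sample" | awk -v t=80 'NR > 1 { gsub(/%/, "", $5); if ($5 + 0 > t) print $6, $5 "%" }'
# -> /data/storage 90%
```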

Memory Usage:

# JVM heap usage
curl http://192.168.1.20:8520/actuator/metrics/jvm.memory.used

# RocksDB memory (block cache + memtables)
curl http://192.168.1.20:8520/actuator/metrics/rocksdb.memory.usage

CPU Usage:

# Overall CPU
top -p $(pgrep -f hugegraph-store)

# Recommended: <70% average
# Warning: >90% sustained

Prometheus Integration

Configure Prometheus (prometheus.yml):

scrape_configs:
  - job_name: 'hugegraph-store'
    static_configs:
      - targets:
          - '192.168.1.20:8520'
          - '192.168.1.21:8520'
          - '192.168.1.22:8520'
    metrics_path: '/actuator/prometheus'
    scrape_interval: 15s

Grafana Dashboard: Import HugeGraph Store dashboard (JSON available in project)

Alert Rules

Example Prometheus Alerts (alerts.yml):

groups:
  - name: hugegraph-store
    rules:
      # Raft leader elections too frequent
      - alert: FrequentLeaderElections
        expr: rate(raft_leader_election_count[5m]) > 0.01
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Frequent Raft leader elections on {{ $labels.instance }}"

      # RocksDB write stall
      - alert: RocksDBWriteStall
        expr: rocksdb_compaction_pending > 10
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "RocksDB write stall on {{ $labels.instance }}"

      # Disk usage high
      - alert: HighDiskUsage
        expr: disk_used_percent > 85
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Disk usage >85% on {{ $labels.instance }}"

      # Store node down
      - alert: StoreNodeDown
        expr: up{job="hugegraph-store"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Store node {{ $labels.instance }} is down"
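As a sanity check on the FrequentLeaderElections threshold, a per-second rate over the 5-minute window translates into elections per window like this:

```shell
# rate(raft_leader_election_count[5m]) > 0.01 elections/sec
# corresponds to more than ~3 elections per 5-minute (300 s) window.
awk 'BEGIN { printf "%d elections per 5m window\n", 0.01 * 300 }'
# -> 3 elections per 5m window
```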

Common Issues and Troubleshooting

Issue 1: Raft Leader Election Failures

Symptoms:

  • Write requests fail with “No leader”
  • Frequent leader elections in logs
  • raft.leader.election.count metric increasing rapidly

Diagnosis:

# Check Store logs
tail -f logs/hugegraph-store.log | grep "Raft election"

# Check network latency between Store nodes
ping 192.168.1.21
ping 192.168.1.22

# Check Raft status (via PD)
curl http://192.168.1.10:8620/pd/v1/partitions | jq '.[] | select(.leader == null)'

Root Causes:

  1. Network Partition: Store nodes cannot communicate
  2. High Latency: Network latency >50ms between nodes
  3. Disk I/O Stall: Raft log writes timing out
  4. Clock Skew: System clocks out of sync

Solutions:

  1. Fix Network: Check switches, firewalls, routing
  2. Reduce Latency: Deploy nodes in same datacenter/zone
  3. Check Disk: Use iostat -x 1 to check disk I/O
  4. Sync Clocks: Use NTP to synchronize system clocks
    ntpdate -u pool.ntp.org
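Clock skew can be spotted from sampled epoch timestamps; the two readings below are illustrative (collect real ones with `date +%s` on each node):

```shell
# Flag clock skew between two sampled timestamps (epoch seconds).
t1=1738140000   # sample reading from node A
t2=1738140002   # sample reading from node B
awk -v a="$t1" -v b="$t2" 'BEGIN {
  d = (a > b) ? a - b : b - a
  if (d > 1) print "SKEW " d "s"; else print "OK"
}'
# -> SKEW 2s
```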
    

Issue 2: Partition Imbalance

Symptoms:

  • Some Store nodes have 2x more partitions than others
  • Uneven disk usage across Store nodes
  • Some nodes overloaded, others idle

Diagnosis:

# Check partition distribution
curl http://localhost:8620/v1/partitionsAndStats

# Example output (truncated; compare partition counts per store):
# {
#   "partitions": { ... },
#   "partitionStats": { ... }
# }

Root Causes:

  1. New Store Added: Partitions not yet rebalanced
  2. PD Patrol Disabled: Auto-rebalancing not running
  3. Rebalancing Too Slow: patrol-interval too high

Solutions:

  1. Trigger Manual Rebalance (via PD API):

    curl -X POST http://192.168.1.10:8620/v1/balanceLeaders
    
  2. Reduce Patrol Interval (in PD application.yml):

    pd:
      patrol-interval: 600  # Rebalance every 10 minutes (instead of 30)
    
  3. Check PD Logs:

    tail -f logs/hugegraph-pd.log | grep "balance"
    
  4. Wait: Rebalancing is gradual (may take hours for large datasets)
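To quantify the imbalance before and after rebalancing, compute the max-to-average ratio of per-store partition counts (sample counts shown):

```shell
# Imbalance ratio = max partition count / average partition count.
# >2.0 matches the ">2x average" warning threshold used above.
counts="12 4 2"   # sample per-store partition counts
echo "$counts" | awk '{
  max = 0; sum = 0
  for (i = 1; i <= NF; i++) { sum += $i; if ($i > max) max = $i }
  printf "imbalance ratio: %.2f\n", max / (sum / NF)
}'
# -> imbalance ratio: 2.00
```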


Issue 3: Data Migration Slow

Symptoms:

  • Partition migration takes hours
  • Raft snapshot transfer stalled
  • High network traffic but low progress

Diagnosis:

# Check Raft snapshot status
tail -f logs/hugegraph-store.log | grep snapshot

# Check network throughput
iftop -i eth0

# Check disk I/O during snapshot
iostat -x 1

Root Causes:

  1. Large Partitions: Partitions >10GB take long to transfer
  2. Network Bandwidth: Limited bandwidth (<100Mbps)
  3. Disk I/O: Slow disk on target Store

Solutions:

  1. Increase Snapshot Interval (reduce snapshot size):

    raft:
      snapshotInterval: 900  # Snapshot every 15 minutes
    
  2. Increase Network Bandwidth: Use 1Gbps+ network

  3. Parallelize Migration: PD migrates one partition at a time by default

    • Edit PD configuration to allow concurrent migrations (advanced)
  4. Monitor Progress:

    # Check partition state transitions
    curl http://192.168.1.10:8620/v1/partitions | grep -i migrating
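For a rough expectation of migration time, divide partition size by usable bandwidth; for example, a 10 GB partition over ~100 MB/s (a 1 Gbps link minus protocol overhead):

```shell
# Rough transfer-time estimate: partition size / usable bandwidth.
# 10 GB = 10 * 1024 MB, over ~100 MB/s effective throughput.
awk 'BEGIN { printf "%.0f seconds\n", (10 * 1024) / 100 }'
# -> 102 seconds
```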
    

Issue 4: RocksDB Performance Degradation

Symptoms:

  • Query latency increasing over time
  • rocksdb.read.latency >5ms
  • rocksdb.compaction.pending >5

Diagnosis:

# Check Store logs for compaction
tail -f logs/hugegraph-store.log | grep compaction

Root Causes:

  1. Write Amplification: Too many compactions
  2. Low Cache Hit Rate: Block cache too small
  3. SST File Proliferation: Too many SST files in L0

Solutions:

  1. Increase Block Cache (in the Store's conf/application.yml):

    rocksdb:
      block_cache_size: 32000000000  # 32GB (from 16GB)
    
  2. Increase Write Buffer (reduce L0 files):

    rocksdb:
      write_buffer_size: 268435456  # 256MB (from 128MB)
      max_write_buffer_number: 8    # More memtables
    
  3. Restart Store Node (last resort, triggers compaction on startup):

    bin/stop-hugegraph-store.sh
    bin/start-hugegraph-store.sh
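Before raising the block cache or write buffers, check that the combined off-heap budget still fits the node; with the example values above:

```shell
# RocksDB off-heap budget ~= block cache + (memtables x write buffer).
# Example values from above: 32 GB cache + 8 x 256 MB memtables.
awk 'BEGIN { printf "%.0f GB off-heap\n", 32 + (8 * 256) / 1024 }'
# -> 34 GB off-heap
```

This must fit alongside the JVM heap within physical RAM, or the OOM killer becomes a risk.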
    

Issue 5: Store Node Unresponsive

Symptoms:

  • gRPC requests timing out
  • Health check fails
  • CPU or memory at 100%

Diagnosis:

# Check if process is alive
ps aux | grep hugegraph-store

# Check CPU/memory
top -p $(pgrep -f hugegraph-store)

# Check logs
tail -100 logs/hugegraph-store.log

# Check for OOM killer
dmesg | grep -i "out of memory"

# Check disk space
df -h

Root Causes:

  1. Out of Memory (OOM): JVM heap exhausted
  2. Disk Full: No space for Raft logs or RocksDB writes
  3. Thread Deadlock: Internal deadlock in Store code
  4. Network Saturation: Too many concurrent requests

Solutions:

  1. OOM:

    • Increase JVM heap: Edit start-hugegraph-store.sh and raise the heap size (e.g. -Xmx32g)
    • Restart Store node
  2. Disk Full:

    • Clean up old Raft snapshots:
      rm -rf storage/raft/partition-*/snapshot/*  # Caution: removes ALL snapshots; Raft creates a fresh one at the next snapshot interval
      
    • Add more disk space
  3. Thread Deadlock:

    • Take thread dump:
      jstack $(pgrep -f hugegraph-store) > threaddump.txt
      
    • Restart Store node
    • Report to HugeGraph team with thread dump
  4. Network Saturation:

    • Check connection count:
      netstat -an | grep :8500 | wc -l
      
    • Reduce store.max_sessions in Server config
    • Add more Store nodes to distribute load

Backup and Recovery

Backup Strategies

Strategy 1: Snapshot-Based Backup

Frequency: Daily or weekly

Process:

# On each Store node
cd storage

# Create snapshot (Raft snapshots)
# Snapshots are automatically created by Raft every `snapshotInterval` seconds
# Locate latest snapshot:
find raft/partition-*/snapshot -name "snapshot_*" -type d | sort | tail -5

# Copy to backup location
tar -czf backup-store1-$(date +%Y%m%d).tar.gz raft/partition-*/snapshot/*

# Upload to remote storage
scp backup-store1-*.tar.gz backup-server:/backups/
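Before shipping an archive off-node, verify that it round-trips; this sketch uses sample paths, so point it at the real snapshot directories in production:

```shell
# Verify an archive round-trips before relying on it (sample paths;
# substitute the real raft/partition-*/snapshot directories).
mkdir -p /tmp/bk/raft/partition-0/snapshot
echo "sample" > /tmp/bk/raft/partition-0/snapshot/snapshot_0001
tar -czf /tmp/bk.tar.gz -C /tmp/bk raft
# List the archive and confirm the snapshot entry is present.
tar -tzf /tmp/bk.tar.gz | grep snapshot_0001 && echo "backup OK"
```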

Pros:

  • Fast backup (no downtime)
  • Point-in-time recovery

Cons:

  • Requires all Store nodes to be backed up
  • May miss recent writes (since last snapshot)

Disaster Recovery Procedures

Scenario 1: Single Store Node Failure

Impact: Partitions with replicas on this node lose one replica

Action:

  1. No immediate action needed: Remaining replicas continue serving

  2. Monitor: Check if Raft leaders re-elected

    curl http://192.168.1.10:8620/v1/partitions | grep leader
    
  3. Replace Failed Node:

    • Deploy new Store node with same configuration
    • PD automatically assigns partitions to new node
    • Wait for data replication (may take hours)
  4. Verify: Check partition distribution

     curl http://localhost:8620/v1/partitionsAndStats
    

Scenario 2: Complete Store Cluster Failure

Impact: All data inaccessible

Action:

  1. Restore PD Cluster (if also failed):

    • Deploy 3 new PD nodes
    • Restore PD metadata from backup
    • Start PD nodes
  2. Restore Store Cluster:

    • Deploy 3 new Store nodes
    • Extract backup on each node:
      cd storage
      tar -xzf /backups/backup-store1-20250129.tar.gz
      
  3. Start Store Nodes:

    bin/start-hugegraph-store.sh
    
  4. Verify Data:

    # Check via Server
    curl http://192.168.1.30:8080/graphspaces/{graphspaces_name}/graphs/{graph_name}/vertices?limit=10
    

Scenario 3: Data Corruption

Impact: RocksDB corruption on one or more partitions

Action:

  1. Identify Corrupted Partition:

    # Check logs for corruption errors
    tail -f logs/hugegraph-store.log | grep -i corrupt
    
  2. Stop Store Node:

    bin/stop-hugegraph-store.sh
    
  3. Delete Corrupted Partition Data:

    # Assuming partition 5 is corrupted
    rm -rf storage/raft/partition-5
    
  4. Restart Store Node:

    bin/start-hugegraph-store.sh
    
  5. Re-replicate Data:

    • Raft automatically re-replicates from healthy replicas
    • Monitor replication progress:
      tail -f logs/hugegraph-store.log | grep "snapshot install"
      

Capacity Management

Monitoring Capacity

Disk Usage:

# Per Store node
du -sh storage/

# Expected growth rate: Track over weeks

Partition Count:

# Current partition count
curl http://192.168.1.10:8620/v1/partitionsAndStats

# Recommendation: 3-5x Store node count
# Example: 6 Store nodes → 18-30 partitions
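The 3-5x guideline as a one-liner (node count is a sample value):

```shell
# Recommended partition count: 3-5x the number of Store nodes.
nodes=6
awk -v n="$nodes" 'BEGIN { print n * 3 "-" n * 5 " partitions" }'
# -> 18-30 partitions
```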

Adding Store Nodes

When to Add:

  • Disk usage >80% on existing nodes
  • CPU usage >70% sustained
  • Query latency increasing

Process:

  1. Deploy New Store Node:

    # Follow deployment guide
    tar -xzf apache-hugegraph-store-incubating-1.7.0.tar.gz
    cd apache-hugegraph-store-incubating-1.7.0
    
    # Configure and start
    vi conf/application.yml
    bin/start-hugegraph-store.sh
    
  2. Verify Registration:

    curl http://192.168.1.10:8620/v1/stores
    # New Store should appear
    
  3. Trigger Rebalancing (optional):

    curl -X POST http://192.168.1.10:8620/v1/balanceLeaders
    
  4. Monitor Rebalancing:

    # Watch partition distribution
    watch -n 10 'curl http://192.168.1.10:8620/v1/partitionsAndStats'
    
  5. Verify: Wait for even distribution (may take hours)

Removing Store Nodes

When to Remove:

  • Decommissioning hardware
  • Downsizing cluster (off-peak hours)

Process:

  1. Mark Store for Removal (via PD API):

    curl --location --request POST 'http://localhost:8080/store/123' \
      --header 'Content-Type: application/json' \
      --data-raw '{
        "storeState": "Off"
      }'
    

    Refer to API definition in StoreAPI::setStore

  2. Wait for Migration:

    • PD migrates all partitions off this Store
  3. Stop Store Node:

    bin/stop-hugegraph-store.sh
    
  4. Remove from PD (optional):


Rolling Upgrades

Upgrade Strategy

Goal: Upgrade cluster with zero downtime

Prerequisites:

  • Version compatibility: Check release notes
  • Backup: Take full backup before upgrade
  • Testing: Test upgrade in staging environment

Upgrade Procedure

Step 1: Upgrade Store Nodes (one at a time)

Node 1:

# Stop Store node
bin/stop-hugegraph-store.sh

# Backup current version
mv apache-hugegraph-store-incubating-1.7.0 apache-hugegraph-store-incubating-1.7.0-backup

# Extract new version
tar -xzf apache-hugegraph-store-incubating-1.8.0.tar.gz
cd apache-hugegraph-store-incubating-1.8.0

# Copy configuration from backup
cp ../apache-hugegraph-store-incubating-1.7.0-backup/conf/application.yml conf/

# Start new version
bin/start-hugegraph-store.sh

# Verify
curl http://192.168.1.20:8520/actuator/health
tail -f logs/hugegraph-store.log

Wait 5-10 minutes, then repeat for Node 2, then Node 3.

Step 2: Upgrade PD Nodes (one at a time)

Same process as Store, but upgrade PD cluster first or last (check release notes).

Step 3: Upgrade Server Nodes (one at a time)

# Stop Server
bin/stop-hugegraph.sh

# Upgrade and restart
# (same process as Store)

bin/start-hugegraph.sh

Rollback Procedure

If upgrade fails:

# Stop new version
bin/stop-hugegraph-store.sh

# Restore backup
rm -rf apache-hugegraph-store-incubating-1.8.0
mv apache-hugegraph-store-incubating-1.7.0-backup apache-hugegraph-store-incubating-1.7.0
cd apache-hugegraph-store-incubating-1.7.0

# Restart old version
bin/start-hugegraph-store.sh

For performance tuning, see Best Practices.

For development and debugging, see Development Guide.