This guide covers monitoring, troubleshooting, backup & recovery, and operational procedures for HugeGraph Store in production.
Store Node Metrics:
```bash
# Health check
curl http://<store-host>:8520/actuator/health

# All metrics
curl http://<store-host>:8520/actuator/metrics

# Specific metric
curl http://<store-host>:8520/actuator/metrics/jvm.memory.used
```
PD Metrics:
```bash
curl http://<pd-host>:8620/actuator/metrics
```
Key metrics:

- `raft.leader.election.count`
- `raft.log.apply.latency`
- `raft.snapshot.create.duration`
- `rocksdb.read.latency`
- `rocksdb.write.latency`
- `rocksdb.compaction.pending`
- `rocksdb.block.cache.hit.rate`
- `partition.count`
- `partition.leader.count`
Queries:
```bash
# Check partition distribution
curl http://localhost:8620/v1/partitionsAndStats

# Example output (imbalanced):
# {
#   "partitions": {},
#   "partitionStats": {}
# }
```
gRPC metrics:

- `grpc.request.qps`
- `grpc.request.latency`
- `grpc.error.rate`
Disk Usage:
```bash
# Check Store data directory
df -h | grep storage

# Recommended: <80% full
# Warning: >90% full
```
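The thresholds above can be wrapped in a small helper for cron-based checks. This is a hedged sketch: the `disk_usage_status` function is illustrative, not a shipped tool; only the thresholds come from the guidance above.

```bash
#!/bin/bash
# Hypothetical helper: classify a disk-usage percentage against the
# thresholds above (<80% ok, 80-89% warning, >=90% critical).
disk_usage_status() {
  local pct=$1  # integer percent, e.g. from: df --output=pcent /path | tr -dc '0-9'
  if [ "$pct" -ge 90 ]; then
    echo "critical"
  elif [ "$pct" -ge 80 ]; then
    echo "warning"
  else
    echo "ok"
  fi
}

disk_usage_status 75   # -> ok
```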
Memory Usage:
```bash
# JVM heap usage
curl http://192.168.1.20:8520/actuator/metrics/jvm.memory.used

# RocksDB memory (block cache + memtables)
curl http://192.168.1.20:8520/actuator/metrics/rocksdb.memory.usage
```
CPU Usage:
```bash
# Overall CPU
top -p $(pgrep -f hugegraph-store)

# Recommended: <70% average
# Warning: >90% sustained
```
Configure Prometheus (prometheus.yml):
```yaml
scrape_configs:
  - job_name: 'hugegraph-store'
    static_configs:
      - targets:
          - '192.168.1.20:8520'
          - '192.168.1.21:8520'
          - '192.168.1.22:8520'
    metrics_path: '/actuator/prometheus'
    scrape_interval: 15s
```
Grafana Dashboard: Import HugeGraph Store dashboard (JSON available in project)
Example Prometheus Alerts (alerts.yml):
```yaml
groups:
  - name: hugegraph-store
    rules:
      # Raft leader elections too frequent
      - alert: FrequentLeaderElections
        expr: rate(raft_leader_election_count[5m]) > 0.01
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Frequent Raft leader elections on {{ $labels.instance }}"

      # RocksDB write stall
      - alert: RocksDBWriteStall
        expr: rocksdb_compaction_pending > 10
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "RocksDB write stall on {{ $labels.instance }}"

      # Disk usage high
      - alert: HighDiskUsage
        expr: disk_used_percent > 85
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Disk usage >85% on {{ $labels.instance }}"

      # Store node down
      - alert: StoreNodeDown
        expr: up{job="hugegraph-store"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Store node {{ $labels.instance }} is down"
```
Symptoms:
- `raft.leader.election.count` metric increasing rapidly

Diagnosis:

```bash
# Check Store logs
tail -f logs/hugegraph-store.log | grep "Raft election"

# Check network latency between Store nodes
ping 192.168.1.21
ping 192.168.1.22

# Check Raft status (via PD)
curl http://192.168.1.10:8620/pd/v1/partitions | jq '.[] | select(.leader == null)'
```
Root Causes:
Solutions:
- Run `iostat -x 1` to check disk I/O
- Sync clocks: `ntpdate -u pool.ntp.org`
Symptoms:
Diagnosis:
```bash
# Check partition distribution
curl http://localhost:8620/v1/partitionsAndStats

# Example output (imbalanced):
# {
#   "partitions": {},
#   "partitionStats": {}
# }
```
Root Causes:
- `patrol-interval` too high

Solutions:
Trigger Manual Rebalance (via PD API):
```bash
curl http://192.168.1.10:8620/v1/balanceLeaders
```
Reduce Patrol Interval (in PD application.yml):
```yaml
pd:
  patrol-interval: 600  # Rebalance every 10 minutes (instead of 30)
```
Check PD Logs:
```bash
tail -f logs/hugegraph-pd.log | grep "balance"
```
Wait: Rebalancing is gradual (may take hours for large datasets)
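Convergence can be judged from the per-store leader counts that PD reports. A hedged sketch: the `leaders_balanced` helper and its tolerance are illustrative, not part of HugeGraph; feed it leader counts extracted from the partition API output.

```bash
#!/bin/bash
# Hypothetical helper: given per-store Raft leader counts, report whether
# the spread (max - min) is within a tolerance, i.e. leaders look balanced.
leaders_balanced() {
  local tolerance=$1; shift
  local min="" max="" c
  for c in "$@"; do
    if [ -z "$min" ] || [ "$c" -lt "$min" ]; then min=$c; fi
    if [ -z "$max" ] || [ "$c" -gt "$max" ]; then max=$c; fi
  done
  if [ $(( max - min )) -le "$tolerance" ]; then
    echo "balanced"
  else
    echo "rebalancing"
  fi
}

leaders_balanced 1 10 10 11   # -> balanced
leaders_balanced 1 2 10 18    # -> rebalancing
```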
Symptoms:
Diagnosis:
```bash
# Check Raft snapshot status
tail -f logs/hugegraph-store.log | grep snapshot

# Check network throughput
iftop -i eth0

# Check disk I/O during snapshot
iostat -x 1
```
Root Causes:
Solutions:
Increase Snapshot Interval (reduce snapshot size):
```yaml
raft:
  snapshotInterval: 900  # Snapshot every 15 minutes
```
Increase Network Bandwidth: Use 1Gbps+ network
Parallelize Migration: PD migrates one partition at a time by default
Monitor Progress:
```bash
# Check partition state transitions
curl http://192.168.1.10:8620/v1/partitions | grep -i migrating
```
Symptoms:
- `rocksdb.read.latency` > 5ms
- `rocksdb.compaction.pending` > 5

Diagnosis:

```bash
# Check Store logs for compaction
tail -f logs/hugegraph-store.log | grep compaction
```
Root Causes:
Solutions:
Increase Block Cache (in application-pd.yml):
```yaml
rocksdb:
  block_cache_size: 32000000000  # 32GB (from 16GB)
```
Increase Write Buffer (reduce L0 files):
```yaml
rocksdb:
  write_buffer_size: 268435456   # 256MB (from 128MB)
  max_write_buffer_number: 8     # More memtables
```
Restart Store Node (last resort, triggers compaction on startup):
```bash
bin/stop-hugegraph-store.sh
bin/start-hugegraph-store.sh
```
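When sizing the write-buffer settings above, note that RocksDB can hold up to `write_buffer_size * max_write_buffer_number` of memtable memory per column family, on top of the block cache. Rough arithmetic with the values from this section:

```bash
#!/bin/bash
# Memtable memory budget implied by the settings above, per column family:
# write_buffer_size * max_write_buffer_number (block cache is separate).
write_buffer_size=268435456     # 256 MiB
max_write_buffer_number=8
echo "$(( write_buffer_size * max_write_buffer_number / 1024 / 1024 )) MiB"  # -> 2048 MiB
```

Add this budget to the block cache size when checking that the node's total memory is not oversubscribed.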
Symptoms:
Diagnosis:
```bash
# Check if process is alive
ps aux | grep hugegraph-store

# Check CPU/memory
top -p $(pgrep -f hugegraph-store)

# Check logs
tail -100 logs/hugegraph-store.log

# Check for OOM killer
dmesg | grep -i "out of memory"

# Check disk space
df -h
```
Root Causes:
Solutions:
OOM:
- In `start-hugegraph-store.sh`, set `-Xmx32g`

Disk Full:

```bash
rm -rf storage/raft/partition-*/snapshot/*  # Caution: removes all local snapshots; Raft recreates them
```
Thread Deadlock:
```bash
jstack $(pgrep -f hugegraph-store) > threaddump.txt
```
Network Saturation:
```bash
netstat -an | grep :8500 | wc -l
```
- Adjust `store.max_sessions` in the Server config

Frequency: Daily or weekly
Process:
```bash
# On each Store node
cd storage

# Create snapshot (Raft snapshots)
# Snapshots are automatically created by Raft every `snapshotInterval` seconds

# Locate latest snapshot:
find raft/partition-*/snapshot -name "snapshot_*" -type d | sort | tail -5

# Copy to backup location
tar -czf backup-store1-$(date +%Y%m%d).tar.gz raft/partition-*/snapshot/*

# Upload to remote storage
scp backup-store1-*.tar.gz backup-server:/backups/
```
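With daily archives, the backup directory grows without bound unless old archives are pruned. A hedged sketch: the `backups_to_prune` helper is illustrative and assumes filenames are passed newest-first (as `ls -1t` would produce).

```bash
#!/bin/bash
# Hypothetical retention helper: given archive names ordered newest-first,
# print the ones outside the retention window (candidates for deletion).
backups_to_prune() {
  local keep=$1; shift
  printf '%s\n' "$@" | tail -n +"$(( keep + 1 ))"
}

# Keep the 2 newest; only the oldest archive is printed for deletion.
backups_to_prune 2 backup-store1-20250131.tar.gz \
                   backup-store1-20250130.tar.gz \
                   backup-store1-20250129.tar.gz
# -> backup-store1-20250129.tar.gz
```

Once verified against real filenames, the output can be piped to `xargs -r rm --`.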
Pros:
Cons:
Impact: Partitions with replicas on this node lose one replica
Action:
No immediate action needed: Remaining replicas continue serving
Monitor: Check if Raft leaders re-elected
```bash
curl http://192.168.1.10:8620/v1/partitions | grep leader
```
Replace Failed Node:
Verify: Check partition distribution
```bash
curl http://localhost:8620/v1/partitionsAndStats
```
Impact: All data inaccessible
Action:
Restore PD Cluster (if also failed):
Restore Store Cluster:
```bash
cd storage
tar -xzf /backups/backup-store1-20250129.tar.gz
```
Start Store Nodes:
```bash
bin/start-hugegraph-store.sh
```
Verify Data:
```bash
# Check via Server
curl http://192.168.1.30:8080/graphspaces/{graphspaces_name}/graphs/{graph_name}/vertices?limit=10
```
Impact: RocksDB corruption on one or more partitions
Action:
Identify Corrupted Partition:
```bash
# Check logs for corruption errors
tail -f logs/hugegraph-store.log | grep -i corrupt
```
Stop Store Node:
```bash
bin/stop-hugegraph-store.sh
```
Delete Corrupted Partition Data:
```bash
# Assuming partition 5 is corrupted
rm -rf storage/raft/partition-5
```
Restart Store Node:
```bash
bin/start-hugegraph-store.sh
```
Re-replicate Data:
```bash
tail -f logs/hugegraph-store.log | grep "snapshot install"
```
Disk Usage:
```bash
# Per Store node
du -sh storage/

# Expected growth rate: Track over weeks
```
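Tracking growth over weeks reduces to comparing two `du` samples taken some time apart. A minimal sketch; the `growth_percent` helper name is illustrative:

```bash
#!/bin/bash
# Hypothetical helper: percentage growth between two du samples
# (bytes, KiB, or any consistent unit). Integer arithmetic, truncated.
growth_percent() {
  local old=$1 new=$2
  echo $(( (new - old) * 100 / old ))
}

growth_percent 400 500   # 400 -> 500 units is 25 (percent)
```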
Partition Count:
```bash
# Current partition count
curl http://192.168.1.10:8620/v1/partitionsAndStatus

# Recommendation: 3-5x Store node count
# Example: 6 Store nodes → 18-30 partitions
```
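The sizing rule in the comments above (3-5x the Store node count) as a quick calculation; the helper is illustrative:

```bash
#!/bin/bash
# Recommended partition range: 3-5x the Store node count (per the note above).
recommended_partitions() {
  local nodes=$1
  echo "$(( nodes * 3 ))-$(( nodes * 5 ))"
}

recommended_partitions 6   # -> 18-30, matching the example above
```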
When to Add:
Process:
Deploy New Store Node:
```bash
# Follow deployment guide
tar -xzf apache-hugegraph-store-incubating-1.7.0.tar.gz
cd apache-hugegraph-store-incubating-1.7.0

# Configure and start
vi conf/application.yml
bin/start-hugegraph-store.sh
```
Verify Registration:
```bash
curl http://192.168.1.10:8620/v1/stores
# New Store should appear
```
Trigger Rebalancing (optional):
```bash
curl -X POST http://192.168.1.10:8620/v1/balanceLeaders
```
Monitor Rebalancing:
```bash
# Watch partition distribution
watch -n 10 'curl http://192.168.1.10:8620/v1/partitionsAndStatus'
```
Verify: Wait for even distribution (may take hours)
When to Remove:
Process:
Mark Store for Removal (via PD API):
```bash
curl --location --request POST 'http://localhost:8080/store/123' \
  --header 'Content-Type: application/json' \
  --data-raw '{
    "storeState": "Off"
  }'
```
Refer to API definition in StoreAPI::setStore
Wait for Migration:
Stop Store Node:
```bash
bin/stop-hugegraph-store.sh
```
Remove from PD (optional):
Goal: Upgrade cluster with zero downtime
Prerequisites:
Node 1:
```bash
# Stop Store node
bin/stop-hugegraph-store.sh

# Backup current version
mv apache-hugegraph-store-incubating-1.7.0 apache-hugegraph-store-incubating-1.7.0-backup

# Extract new version
tar -xzf apache-hugegraph-store-incubating-1.8.0.tar.gz
cd apache-hugegraph-store-incubating-1.8.0

# Copy configuration from backup
cp ../apache-hugegraph-store-incubating-1.7.0-backup/conf/application.yml conf/

# Start new version
bin/start-hugegraph-store.sh

# Verify
curl http://192.168.1.20:8520/v1/health
tail -f logs/hugegraph-store.log
```
Wait 5-10 minutes, then repeat for Node 2, then Node 3.
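Rather than a fixed wait, each step can be gated on the health check. A hedged sketch: the exact JSON returned by the health endpoint is not documented here, so the `"status":"UP"` string is an assumption to adapt to the real response.

```bash
#!/bin/bash
# Hypothetical gate for the rolling upgrade: proceed to the next node only
# if the health-check response looks healthy. The expected "UP" status
# string is an assumption about the endpoint's output format.
safe_to_proceed() {
  case "$1" in
    *'"status":"UP"'*) echo "proceed" ;;
    *)                 echo "wait" ;;
  esac
}

safe_to_proceed '{"status":"UP"}'     # -> proceed
safe_to_proceed '{"status":"DOWN"}'   # -> wait
```

In practice: `safe_to_proceed "$(curl -s http://192.168.1.20:8520/v1/health)"` in a retry loop before touching the next node.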
Same process as Store, but upgrade PD cluster first or last (check release notes).
```bash
# Stop Server
bin/stop-hugegraph.sh

# Upgrade and restart (same process as Store)
bin/start-hugegraph.sh
```
If upgrade fails:
```bash
# Stop new version
bin/stop-hugegraph-store.sh

# Restore backup
rm -rf apache-hugegraph-store-incubating-1.8.0
mv apache-hugegraph-store-incubating-1.7.0-backup apache-hugegraph-store-incubating-1.7.0
cd apache-hugegraph-store-incubating-1.7.0

# Restart old version
bin/start-hugegraph-store.sh
```
For performance tuning, see Best Practices.
For development and debugging, see Development Guide.