blob: 773d49f94cb9f3a9effbcbaad4005ed150e0e459 [file] [log] [blame] [view]
# Operations Guide
This guide covers monitoring, troubleshooting, backup & recovery, and operational procedures for HugeGraph Store in production.
## Table of Contents
- [Monitoring and Metrics](#monitoring-and-metrics)
- [Common Issues and Troubleshooting](#common-issues-and-troubleshooting)
- [Backup and Recovery](#backup-and-recovery)
- [Capacity Management](#capacity-management)
- [Rolling Upgrades](#rolling-upgrades)
---
## Monitoring and Metrics
### Metrics Endpoints
**Store Node Metrics**:
```bash
# Health check
curl http://<store-host>:8520/actuator/health
# All metrics
curl http://<store-host>:8520/actuator/metrics
# Specific metric
curl http://<store-host>:8520/actuator/metrics/jvm.memory.used
```
**PD Metrics**:
```bash
curl http://<pd-host>:8620/actuator/metrics
```
### Key Metrics to Monitor
#### 1. Raft Metrics
**Metric**: `raft.leader.election.count`
- **Description**: Number of leader elections
- **Normal**: 0-1 per hour (initial election)
- **Warning**: >5 per hour (network issues or node instability)
**Metric**: `raft.log.apply.latency`
- **Description**: Time to apply Raft log entries (ms)
- **Normal**: <10ms (p99)
- **Warning**: >50ms (disk I/O bottleneck)
**Metric**: `raft.snapshot.create.duration`
- **Description**: Snapshot creation time (ms)
- **Normal**: <30,000ms (30 seconds)
- **Warning**: >60,000ms (large partition or slow disk)
#### 2. RocksDB Metrics
**Metric**: `rocksdb.read.latency`
- **Description**: RocksDB read latency (microseconds)
- **Normal**: <1000μs (1ms) for p99
- **Warning**: >5000μs (5ms) - check compaction or cache hit rate
**Metric**: `rocksdb.write.latency`
- **Description**: RocksDB write latency (microseconds)
- **Normal**: <2000μs (2ms) for p99
- **Warning**: >10000μs (10ms) - check compaction backlog
**Metric**: `rocksdb.compaction.pending`
- **Description**: Number of pending compactions
- **Normal**: 0-2
- **Warning**: >5 (write stall likely)
**Metric**: `rocksdb.block.cache.hit.rate`
- **Description**: Block cache hit rate (%)
- **Normal**: >90%
- **Warning**: <70% (increase cache size)
#### 3. Partition Metrics
**Metric**: `partition.count`
- **Description**: Number of partitions on this Store node
- **Normal**: Evenly distributed across nodes
- **Warning**: >2x average (rebalancing needed)
**Metric**: `partition.leader.count`
- **Description**: Number of Raft leaders on this node
- **Normal**: ~partitionCount / 3 (for 3 replicas)
- **Warning**: 0 (node cannot serve writes)
**Queries**:
```bash
# Check partition distribution
curl http://localhost:8620/v1/partitionsAndStats
# Example output (imbalanced):
# {
# {
# "partitions": {},
# "partitionStats: {}"
# }
# }
```
#### 4. gRPC Metrics
**Metric**: `grpc.request.qps`
- **Description**: Requests per second
- **Normal**: Depends on workload
- **Warning**: Sudden drops (connection issues)
**Metric**: `grpc.request.latency`
- **Description**: gRPC request latency (ms)
- **Normal**: <20ms for p99
- **Warning**: >100ms (network or processing bottleneck)
**Metric**: `grpc.error.rate`
- **Description**: Error rate (errors/sec)
- **Normal**: <1% of QPS
- **Warning**: >5% (investigate errors)
#### 5. System Metrics
**Disk Usage**:
```bash
# Check Store data directory
df -h | grep storage
# Recommended: <80% full
# Warning: >90% full
```
**Memory Usage**:
```bash
# JVM heap usage
curl http://192.168.1.20:8520/actuator/metrics/jvm.memory.used
# RocksDB memory (block cache + memtables)
curl http://192.168.1.20:8520/actuator/metrics/rocksdb.memory.usage
```
**CPU Usage**:
```bash
# Overall CPU
top -p $(pgrep -f hugegraph-store)
# Recommended: <70% average
# Warning: >90% sustained
```
### Prometheus Integration
**Configure Prometheus** (`prometheus.yml`):
```yaml
scrape_configs:
- job_name: 'hugegraph-store'
static_configs:
- targets:
- '192.168.1.20:8520'
- '192.168.1.21:8520'
- '192.168.1.22:8520'
metrics_path: '/actuator/prometheus'
scrape_interval: 15s
```
**Grafana Dashboard**: Import HugeGraph Store dashboard (JSON available in project)
### Alert Rules
**Example Prometheus Alerts** (`alerts.yml`):
```yaml
groups:
- name: hugegraph-store
rules:
# Raft leader elections too frequent
- alert: FrequentLeaderElections
expr: rate(raft_leader_election_count[5m]) > 0.01
for: 5m
labels:
severity: warning
annotations:
summary: "Frequent Raft leader elections on {{ $labels.instance }}"
# RocksDB write stall
- alert: RocksDBWriteStall
expr: rocksdb_compaction_pending > 10
for: 2m
labels:
severity: critical
annotations:
summary: "RocksDB write stall on {{ $labels.instance }}"
# Disk usage high
- alert: HighDiskUsage
expr: disk_used_percent > 85
for: 5m
labels:
severity: warning
annotations:
summary: "Disk usage >85% on {{ $labels.instance }}"
# Store node down
- alert: StoreNodeDown
expr: up{job="hugegraph-store"} == 0
for: 1m
labels:
severity: critical
annotations:
summary: "Store node {{ $labels.instance }} is down"
```
---
## Common Issues and Troubleshooting
### Issue 1: Raft Leader Election Failures
**Symptoms**:
- Write requests fail with "No leader"
- Frequent leader elections in logs
- `raft.leader.election.count` metric increasing rapidly
**Diagnosis**:
```bash
# Check Store logs
tail -f logs/hugegraph-store.log | grep "Raft election"
# Check network latency between Store nodes
ping 192.168.1.21
ping 192.168.1.22
# Check Raft status (via PD)
curl http://192.168.1.10:8620/pd/v1/partitions | jq '.[] | select(.leader == null)'
```
**Root Causes**:
1. **Network Partition**: Store nodes cannot communicate
2. **High Latency**: Network latency >50ms between nodes
3. **Disk I/O Stall**: Raft log writes timing out
4. **Clock Skew**: System clocks out of sync
**Solutions**:
1. **Fix Network**: Check switches, firewalls, routing
2. **Reduce Latency**: Deploy nodes in same datacenter/zone
3. **Check Disk**: Use `iostat -x 1` to check disk I/O
4. **Sync Clocks**: Use NTP to synchronize system clocks
```bash
ntpdate -u pool.ntp.org
```
---
### Issue 2: Partition Imbalance
**Symptoms**:
- Some Store nodes have 2x more partitions than others
- Uneven disk usage across Store nodes
- Some nodes overloaded, others idle
**Diagnosis**:
```bash
# Check partition distribution
curl http://localhost:8620/v1/partitionsAndStats
# Example output (imbalanced):
# {
# {
# "partitions": {},
# "partitionStats: {}"
# }
# }
```
**Root Causes**:
1. **New Store Added**: Partitions not yet rebalanced
2. **PD Patrol Disabled**: Auto-rebalancing not running
3. **Rebalancing Too Slow**: `patrol-interval` too high
**Solutions**:
1. **Trigger Manual Rebalance** (via PD API):
```bash
curl http://192.168.1.10:8620/v1/balanceLeaders
```
2. **Reduce Patrol Interval** (in PD `application.yml`):
```yaml
pd:
patrol-interval: 600 # Rebalance every 10 minutes (instead of 30)
```
3. **Check PD Logs**:
```bash
tail -f logs/hugegraph-pd.log | grep "balance"
```
4. **Wait**: Rebalancing is gradual (may take hours for large datasets)
---
### Issue 3: Data Migration Slow
**Symptoms**:
- Partition migration takes hours
- Raft snapshot transfer stalled
- High network traffic but low progress
**Diagnosis**:
```bash
# Check Raft snapshot status
tail -f logs/hugegraph-store.log | grep snapshot
# Check network throughput
iftop -i eth0
# Check disk I/O during snapshot
iostat -x 1
```
**Root Causes**:
1. **Large Partitions**: Partitions >10GB take long to transfer
2. **Network Bandwidth**: Limited bandwidth (<100Mbps)
3. **Disk I/O**: Slow disk on target Store
**Solutions**:
1. **Increase Snapshot Interval** (reduce snapshot size):
```yaml
raft:
snapshotInterval: 900 # Snapshot every 15 minutes
```
2. **Increase Network Bandwidth**: Use 1Gbps+ network
3. **Parallelize Migration**: PD migrates one partition at a time by default
- Edit PD configuration to allow concurrent migrations (advanced)
4. **Monitor Progress**:
```bash
# Check partition state transitions
curl http://192.168.1.10:8620/v1/partitions | grep -i migrating
```
---
### Issue 4: RocksDB Performance Degradation
**Symptoms**:
- Query latency increasing over time
- `rocksdb.read.latency` >5ms
- `rocksdb.compaction.pending` >5
**Diagnosis**:
```bash
# Check Store logs for compaction
tail -f logs/hugegraph-store.log | grep compaction
```
**Root Causes**:
1. **Write Amplification**: Too many compactions
2. **Low Cache Hit Rate**: Block cache too small
3. **SST File Proliferation**: Too many SST files in L0
**Solutions**:
1. **Increase Block Cache** (in `application-pd.yml`):
```yaml
rocksdb:
block_cache_size: 32000000000 # 32GB (from 16GB)
```
2. **Increase Write Buffer** (reduce L0 files):
```yaml
rocksdb:
write_buffer_size: 268435456 # 256MB (from 128MB)
max_write_buffer_number: 8 # More memtables
```
3. **Restart Store Node** (last resort, triggers compaction on startup):
```bash
bin/stop-hugegraph-store.sh
bin/start-hugegraph-store.sh
```
---
### Issue 5: Store Node Unresponsive
**Symptoms**:
- gRPC requests timing out
- Health check fails
- CPU or memory at 100%
**Diagnosis**:
```bash
# Check if process is alive
ps aux | grep hugegraph-store
# Check CPU/memory
top -p $(pgrep -f hugegraph-store)
# Check logs
tail -100 logs/hugegraph-store.log
# Check for OOM killer
dmesg | grep -i "out of memory"
# Check disk space
df -h
```
**Root Causes**:
1. **Out of Memory (OOM)**: JVM heap exhausted
2. **Disk Full**: No space for Raft logs or RocksDB writes
3. **Thread Deadlock**: Internal deadlock in Store code
4. **Network Saturation**: Too many concurrent requests
**Solutions**:
1. **OOM**:
- Increase JVM heap: Edit `start-hugegraph-store.sh`, set `Xmx32g`
- Restart Store node
2. **Disk Full**:
- Clean up old Raft snapshots:
```bash
rm -rf storage/raft/partition-*/snapshot/* # Keep only latest
```
- Add more disk space
3. **Thread Deadlock**:
- Take thread dump:
```bash
jstack $(pgrep -f hugegraph-store) > threaddump.txt
```
- Restart Store node
- Report to HugeGraph team with thread dump
4. **Network Saturation**:
- Check connection count:
```bash
netstat -an | grep :8500 | wc -l
```
- Reduce `store.max_sessions` in Server config
- Add more Store nodes to distribute load
---
## Backup and Recovery
### Backup Strategies
#### Strategy 1: Snapshot-Based Backup
**Frequency**: Daily or weekly
**Process**:
```bash
# On each Store node
cd storage
# Create snapshot (Raft snapshots)
# Snapshots are automatically created by Raft every `snapshotInterval` seconds
# Locate latest snapshot:
find raft/partition-*/snapshot -name "snapshot_*" -type d | sort | tail -5
# Copy to backup location
tar -czf backup-store1-$(date +%Y%m%d).tar.gz raft/partition-*/snapshot/*
# Upload to remote storage
scp backup-store1-*.tar.gz backup-server:/backups/
```
**Pros**:
- Fast backup (no downtime)
- Point-in-time recovery
**Cons**:
- Requires all Store nodes to be backed up
- May miss recent writes (since last snapshot)
### Disaster Recovery Procedures
#### Scenario 1: Single Store Node Failure
**Impact**: Partitions with replicas on this node lose one replica
**Action**:
1. **No immediate action needed**: Remaining replicas continue serving
2. **Monitor**: Check if Raft leaders re-elected
```bash
curl http://192.168.1.10:8620/v1/partitions | grep leader
```
3. **Replace Failed Node**:
- Deploy new Store node with same configuration
- PD automatically assigns partitions to new node
- Wait for data replication (may take hours)
4. **Verify**: Check partition distribution
```bash
curl http://localhost:8620/v1/partitionsAndStats
```
#### Scenario 2: Complete Store Cluster Failure
**Impact**: All data inaccessible
**Action**:
1. **Restore PD Cluster** (if also failed):
- Deploy 3 new PD nodes
- Restore PD metadata from backup
- Start PD nodes
2. **Restore Store Cluster**:
- Deploy 3 new Store nodes
- Extract backup on each node:
```bash
cd storage
tar -xzf /backups/backup-store1-20250129.tar.gz
```
3. **Start Store Nodes**:
```bash
bin/start-hugegraph-store.sh
```
4. **Verify Data**:
```bash
# Check via Server
curl http://192.168.1.30:8080/graphspaces/{graphspaces_name}/graphs/{graph_name}/vertices?limit=10
```
#### Scenario 3: Data Corruption
**Impact**: RocksDB corruption on one or more partitions
**Action**:
1. **Identify Corrupted Partition**:
```bash
# Check logs for corruption errors
tail -f logs/hugegraph-store.log | grep -i corrupt
```
2. **Stop Store Node**:
```bash
bin/stop-hugegraph-store.sh
```
3. **Delete Corrupted Partition Data**:
```bash
# Assuming partition 5 is corrupted
rm -rf storage/raft/partition-5
```
4. **Restart Store Node**:
```bash
bin/start-hugegraph-store.sh
```
5. **Re-replicate Data**:
- Raft automatically re-replicates from healthy replicas
- Monitor replication progress:
```bash
tail -f logs/hugegraph-store.log | grep "snapshot install"
```
---
## Capacity Management
### Monitoring Capacity
**Disk Usage**:
```bash
# Per Store node
du -sh storage/
# Expected growth rate: Track over weeks
```
**Partition Count**:
```bash
# Current partition count
curl http://192.168.1.10:8620/v1/partitionsAndStatus
# Recommendation: 3-5x Store node count
# Example: 6 Store nodes → 18-30 partitions
```
### Adding Store Nodes
**When to Add**:
- Disk usage >80% on existing nodes
- CPU usage >70% sustained
- Query latency increasing
**Process**:
1. **Deploy New Store Node**:
```bash
# Follow deployment guide
tar -xzf apache-hugegraph-store-1.7.0.tar.gz
cd apache-hugegraph-store-1.7.0
# Configure and start
vi conf/application.yml
bin/start-hugegraph-store.sh
```
2. **Verify Registration**:
```bash
curl http://192.168.1.10:8620/v1/stores
# New Store should appear
```
3. **Trigger Rebalancing** (optional):
```bash
curl -X POST http://192.168.1.10:8620/v1/balanceLeaders
```
4. **Monitor Rebalancing**:
```bash
# Watch partition distribution
watch -n 10 'curl http://192.168.1.10:8620/v1/partitionsAndStatus'
```
5. **Verify**: Wait for even distribution (may take hours)
### Removing Store Nodes
**When to Remove**:
- Decommissioning hardware
- Downsizing cluster (off-peak hours)
**Process**:
1. **Mark Store for Removal** (via PD API):
```bash
curl --location --request POST 'http://localhost:8080/store/123' \
--header 'Content-Type: application/json' \
--data-raw '{
"storeState": "Off"
}'
```
Refer to API definition in `StoreAPI::setStore`
2. **Wait for Migration**:
- PD migrates all partitions off this Store
3. **Stop Store Node**:
```bash
bin/stop-hugegraph-store.sh
```
4. **Remove from PD** (optional):
---
## Rolling Upgrades
### Upgrade Strategy
**Goal**: Upgrade cluster with zero downtime
**Prerequisites**:
- Version compatibility: Check release notes
- Backup: Take full backup before upgrade
- Testing: Test upgrade in staging environment
### Upgrade Procedure
#### Step 1: Upgrade Store Nodes (one at a time)
**Node 1**:
```bash
# Stop Store node
bin/stop-hugegraph-store.sh
# Backup current version
mv apache-hugegraph-store-1.7.0 apache-hugegraph-store-1.7.0-backup
# Extract new version
tar -xzf apache-hugegraph-store-1.8.0.tar.gz
cd apache-hugegraph-store-1.8.0
# Copy configuration from backup
cp ../apache-hugegraph-store-1.7.0-backup/conf/application.yml conf/
# Start new version
bin/start-hugegraph-store.sh
# Verify
curl http://192.168.1.20:8520/v1/health
tail -f logs/hugegraph-store.log
```
**Wait 5-10 minutes**, then repeat for Node 2, then Node 3.
#### Step 2: Upgrade PD Nodes (one at a time)
Same process as Store, but upgrade PD cluster first or last (check release notes).
#### Step 3: Upgrade Server Nodes (one at a time)
```bash
# Stop Server
bin/stop-hugegraph.sh
# Upgrade and restart
# (same process as Store)
bin/start-hugegraph.sh
```
### Rollback Procedure
If upgrade fails:
```bash
# Stop new version
bin/stop-hugegraph-store.sh
# Restore backup
rm -rf apache-hugegraph-store-1.8.0
mv apache-hugegraph-store-1.7.0-backup apache-hugegraph-store-1.7.0
cd apache-hugegraph-store-1.7.0
# Restart old version
bin/start-hugegraph-store.sh
```
---
For performance tuning, see [Best Practices](best-practices.md).
For development and debugging, see [Development Guide](development-guide.md).