blob: 17b8857a3558316e6cf28d0e06d912845e188fe7 [file] [view]
Title: DevOps CMDB and Incident Knowledge Graph Language-Agnostic Specification
Summary
A vendor/framework/language-neutral specification for a graph-based Configuration Management Database (CMDB) and Incident Knowledge Graph suitable for building example applications and upgrade exercises on Apache TinkerPop. This document defines the domain model (vertices, edges, and properties), recommended identifiers and indexing guidelines, data quality constraints, example workloads and traversals, and non-functional requirements. The goal is to make it easy for an LLM (or a developer) to generate an implementation in any language, with any web framework, on any supported TinkerPop/graph provider. Assumption: a Gremlin Servercompliant system (or equivalent protocol-compatible service) is available for remote traversal submission.
Goals
- Provide a realistic, moderately complex domain to exercise graph modeling, traversals, and API design.
- Encourage use of rich edge properties (temporal, metrics, provenance) to demonstrate graph advantages over relational models.
- Keep the spec portable across programming languages, application frameworks, and TinkerPop versions/providers.
- Support common operational queries: blast radius, dependency discovery, root-cause hints, deployment/incident correlation, ownership, and timelines.
Non-Goals
- Prescribe a specific programming language, framework, or persistence backend.
- Require a particular TinkerPop version or feature outside the broadly available core.
- Provide exhaustive production hardening (security/compliance) beyond reasonable examples and guidance.
Conceptual Overview
The graph models services, applications, infrastructure, deployments, dependencies, teams, alerts, and incidents. It enables:
- Traversal across service/application/runtime dependencies.
- Temporal reasoning about deployments, alerts, and incidents.
- Impact analysis (blast radius) and ownership discovery.
- Correlation between runtime signals and change events.
Canonical Labels and Properties
Notes
- Use explicit property names; no import wildcards in generated code.
- Prefer stable string identifiers (natural keys) over auto-increment integers.
- Property types are suggestions; adapt to your graph providers supported types.
- Cardinality defaults to single unless specified as set or list.
Vertex Labels
1) service
- serviceId: String (unique, stable) e.g., svc:payments
- name: String
- tier: String (e.g., critical, high, medium, low)
- ownerTeamId: String (FK to team.teamId)
- sla: String (e.g., "99.9%/30d")
- env: String (e.g., prod, staging, dev)
- tags: Set<String>
- criticality: Int (1..5)
- createdAt: Instant
- updatedAt: Instant
2) application
- appId: String (unique) e.g., app:payments-api
- name: String
- language: String (free-form; do not assume a particular runtime)
- repo: String (URL or VCS path)
- version: String (semantic version or build hash)
- runtime: String (e.g., jvm, dotnet, nodejs, python)
- env: String
- createdAt: Instant
- updatedAt: Instant
3) host
- hostId: String (unique) hostname, node id, or cloud instance id
- hostname: String
- provider: String (e.g., aws, gcp, onprem)
- region: String
- availabilityZone: String
- env: String
- os: String
- instanceType: String
- createdAt: Instant
- updatedAt: Instant
4) deployment
- deployId: String (unique)
- version: String
- initiatedBy: String (user/bot)
- deployedAt: Instant
- status: String (e.g., success, failed, partial, inProgress)
- changeSet: List<String> (commit ids/PRs/tickets)
- env: String
5) dependency (external resource)
- depId: String (unique)
- type: String (db, cache, queue, api, storage, dns, secret, other)
- name: String
- vendor: String
- criticality: Int (1..5)
- endpoint: String (URL, ARN, etc.)
- env: String
- createdAt: Instant
- updatedAt: Instant
6) team
- teamId: String (unique)
- name: String
- onCall: Boolean
- pager: String (optional)
- slack: String (optional)
- createdAt: Instant
- updatedAt: Instant
7) incident
- incidentId: String (unique)
- severity: String (sev0, sev1, sev2, sev3)
- state: String (open, investigating, mitigated, closed)
- openedAt: Instant
- closedAt: Instant? (nullable)
- summary: String
- description: String
- env: String
- createdBy: String
- updatedAt: Instant
8) alert
- alertId: String (unique)
- source: String (monitoring system)
- metric: String (e.g., errorRate, latency)
- firedAt: Instant
- clearedAt: Instant? (nullable)
- severity: String
- env: String
- fingerprint: String (grouping key)
- createdAt: Instant
- updatedAt: Instant
Edge Labels (camelCase) with Rich Properties
Note: Edge properties are central to this model; use them to encode time, confidence, quantitative metrics, and provenance.
A) runsOn (application host)
- since: Instant when the app started running on the host
- status: String (active, draining, retired)
- instanceId: String (container/pod id, process id)
- lastSeen: Instant last heartbeat
- cpuPct: Double
- memPct: Double
- confidence: Double (0..1) if relationship inferred
B) deployedAs (service application)
- since: Instant when the service started being represented by the app
- strategy: String (blueGreen, canary, rolling, recreate)
- trafficPct: Double percent traffic routed
- status: String (active, deprecated)
- notes: String
C) deploymentOf (deployment application)
- relationCreatedAt: Instant record creation time
- targetVersion: String duplicate for quick reads
- result: String (success, failed, partial)
- durationMs: Long (elapsed time)
- changeTicket: String (optional external id)
D) deployedTo (deployment host)
- at: Instant when the deployment reached this host
- result: String (success, failed, skipped)
- attempts: Int
- durationMs: Long
- rollback: Boolean
E) dependsOn (service dependency)
- since: Instant when the dependency was added
- required: Boolean hard vs soft dependency
- direction: String (read, write, both)
- latencyP50Ms: Long (observed median)
- errorRatePct: Double
- weight: Double (relative importance, 0..1)
- lastVerifiedAt: Instant
F) ownedBy (service team)
- since: Instant
- role: String (primary, secondary, consulting)
- escalationPolicy: String
G) triggered (alert incident)
- at: Instant correlation time
- correlationScore: Double (0..1)
- reason: String (short explanation or rule id)
- evidenceId: String (pointer to logs/trace)
H) affects (incident service)
- at: Instant first known impact
- impactType: String (degradation, outage, dataLoss, security)
- scope: String (percent traffic affected, region subset)
- slaImpactPct: Double
- confidence: Double (0..1)
I) relatedTo (incident incident) [undirected semantics; model with two directed edges if needed]
- correlationScore: Double
- windowStart: Instant
- windowEnd: Instant
- relationType: String (duplicateCause, followOn, sharedDependency, coOccurrence)
J) calls (application application)
- since: Instant
- protocol: String (http, grpc, mq, rpc)
- method: String (GET/POST or operation name)
- avgLatencyMs: Long
- errorRatePct: Double
- sampleRatePct: Double
- lastSeen: Instant
Identifiers and Cardinality
- Vertex IDs: Use stable string keys (e.g., serviceId) as custom IDs where supported; otherwise store as a dedicated property and map internal IDs as needed.
- Edge multiplicity: Allow multiple edges between the same endpoints if they differ by context/time (e.g., calls over protocols). Consider a uniqueness policy per (outV, inV, label, keyPropertiesTuple).
- Timestamps: Use Instant (UTC). Where Instant unsupported, use epochMillis (Long).
- Sets vs Lists: tags are Set; changeSet is List to preserve order.
Indexing Guidance (Provider-Agnostic)
- Equality/lookup indices:
- Vertex: service.serviceId, application.appId, host.hostId, deployment.deployId, dependency.depId, team.teamId, incident.incidentId, alert.alertId.
- Common filters: env, name, severity, state, version.
- Edge-centric indices (where supported):
- dependsOn.since, dependsOn.required
- calls.protocol, calls.method
- affected edges by at, impactType
- Full-text (optional): incident.summary/description for search.
- Composite vs Mixed: choose based on provider; ensure exact match queries benefit.
Data Quality Constraints (Logical)
- Uniqueness: All *Id properties are globally unique per label.
- Referential integrity: Edge endpoints must exist; deletion should consider cascading or soft-delete.
- Enum validation: severity, state, protocol, strategy, impactType, result should match allowed sets.
- Temporal consistency: openedAt closedAt; since lastSeen; windowStart windowEnd.
- Env consistency: Cross-env edges generally discouraged except where intended (documented exceptions only).
- Confidence bounds: 0 confidence 1.0; 0 errorRatePct 100; 0 trafficPct 100.
Minimum Viable Dataset (for examples/tests)
A deterministic sample dataset is provided as a Gremlin script at docs/upgrade/spec/cmdb-data.gremlin. Each line is a single mutation (one vertex or one edge) terminated with iterate() for stable, idempotent loading via a Gremlin Servercompliant system.
Example Traversal Patterns (Gremlin-like Pseudocode)
Note: The following are conceptual and avoid version-specific constructs. Adapt to your providers traversal DSL. Variable names are illustrative.
1) Blast radius from an incident (impacted services and upstream relations)
- Input: incidentId
- Idea: From an incident, traverse affects to services, then explore transitive dependencies and deployments.
g.V().has('incident','incidentId', iid)
.out('affects')
.emit()
.repeat(out('dependsOn','deployedAs').in('deploymentOf'))
.path()
2) Root-cause candidates (upstream first failing dependency)
- Input: incidentId
g.V().has('incident','incidentId', iid)
.out('affects')
.repeat(out('dependsOn')).emit()
.where(outE('dependsOn').values('errorRatePct').is(gt(5.0)))
.limit(50)
3) Deployment incident correlation
- Input: deployId
g.V().has('deployment','deployId', did)
.out('deployedTo').in('runsOn').in('deployedAs').dedup()
.in('affects').hasLabel('incident')
.order().by('openedAt', decr).limit(20)
4) Ownership and on-call for an impacted service
- Input: incidentId
g.V().has('incident','incidentId', iid)
.out('affects')
.out('ownedBy')
.valueMap(true)
5) Hot paths between applications (runtime call graph)
- Input: appId
g.V().has('application','appId', aid)
.repeat(outE('calls').order().by('avgLatencyMs', decr).inV()).times(3)
.path()
6) Services at risk due to a failing dependency
- Input: depId
g.V().has('dependency','depId', did)
.inE('dependsOn').has('required', true).outV()
.dedup()
7) Time-bounded alert to incident correlation
- Inputs: fingerprint, windowStart, windowEnd
g.V().has('alert','fingerprint', fp)
.has('firedAt', between(windowStart, windowEnd))
.outE('triggered')
.has('correlationScore', gt(0.6))
.inV()
Suggested API Surfaces (Language-Neutral)
These are example resources for a REST/RPC/GraphQL interface. They are intentionally generic.
- GET /incidents/{id}/blastRadius returns affected services and transitive dependencies with paths.
- GET /services/{id}/dependencies?depth=N returns direct and transitive dependencies.
- GET /deployments/{id}/timeline returns path from deployment to impacted services/incidents.
- GET /services/{id}/owners returns team ownership and escalation info.
- POST /alerts upsert alert vertices and (optionally) triggered edges.
- POST /incidents create/modify incident vertices and affects edges.
- POST /deployments create deployment vertices and link to application/host.
Operational Concerns (General)
- Transactions: Group related vertex/edge mutations atomically when supported; otherwise implement idempotency keys on the write path.
- Idempotency: Upserts keyed by natural IDs; avoid accidental duplication.
- Pagination: Use limit/offset or cursor; cap server-side to protect the graph.
- Validation: Enforce enum bounds, temporal checks, and numeric ranges before writes.
- Timestamps: Use UTC; prefer Instant or epochMillis consistently.
- Observability: Consider storing minimal audit metadata on edges (createdBy, updatedAt) if provider supports meta-properties.
Testing Guidance
This section specifies exact black-box tests to run against an implementation that follows this spec, assuming the deterministic dataset at docs/upgrade/spec/cmdb-data.gremlin has been loaded into a Gremlin Servercompliant system (traversal source g). Tests are language- and framework-agnostic. For REST, substitute your base URL; for other transports the inputs/outputs are equivalent.
General setup
- Precondition: Seed the dataset by submitting each non-empty, non-comment line from docs/upgrade/spec/cmdb-data.gremlin. Ensure a clean graph (e.g., drop all vertices first) so counts match exactly.
- Conventions: All timestamps are epochMillis; string IDs are exact and case sensitive.
Smoke and readiness
- Call: GET /health/ready
- Expect: HTTP 200 with body containing "ready" (or equivalent readiness signal).
Service dependencies (GET /services/{id}/dependencies?depth=N)
1) svc:orders, depth=2
- Call: GET /services/svc:orders/dependencies?depth=2
- Expect: HTTP 200; an ordered or unordered list of vertex value maps including exactly these two vertices:
- service svc:orders (properties include serviceId: "svc:orders")
- dependency dep:orders-db (properties include depId: "dep:orders-db")
- Count: 2 total elements. No other services or dependencies should appear for this service in the provided dataset.
2) svc:payments, depth=1
- Call: GET /services/svc:payments/dependencies?depth=1
- Expect: HTTP 200; list of value maps for:
- service svc:payments
- dependency dep:payments-db
- dependency dep:payments-queue
- Count: 3 total elements.
3) svc:inventory, depth=3
- Call: GET /services/svc:inventory/dependencies?depth=3
- Expect: HTTP 200; list includes:
- service svc:inventory
- dependency dep:inventory-cache
- Count: 2 total elements (no additional transitive dependencies in the sample data).
Service owners (GET /services/{id}/owners)
4) svc:payments
- Call: GET /services/svc:payments/owners
- Expect: HTTP 200; list includes exactly one team vertex value map with:
- teamId: "team:payments"
- name: "Payments"
- onCall: true
- Count: 1 total element.
5) svc:api-gateway
- Call: GET /services/svc:api-gateway/owners
- Expect: HTTP 200; list includes exactly one team with teamId: "team:platform" and name: "Platform Engineering"
- Count: 1 total element.
Deployment timeline (GET /deployments/{id}/timeline)
6) depoy:payments-20231113.2 (partial deployment that failed on at least one host)
- Call: GET /deployments/depoy:payments-20231113.2/timeline
- Expect: HTTP 200; list of incident vertex value maps related via the traversal (deployedTo -> runsOn -> deployedAs -> affects). Must include exactly:
- incidentId: "inc:sev1-payments-001" (severity sev1)
- Count: 1 total element.
7) depoy:payments-20231114.1 (successful follow-up)
- Call: GET /deployments/depoy:payments-20231114.1/timeline
- Expect: HTTP 200; list includes the same incident as above:
- incidentId: "inc:sev1-payments-001"
- Count: 1 total element.
8) depoy:api-gateway-20231114.1
- Call: GET /deployments/depoy:api-gateway-20231114.1/timeline
- Expect: HTTP 200; list includes:
- incidentId: "inc:sev1-api-001"
- Count: 1 total element.
Incident blast radius paths (GET /incidents/{id}/blastRadius?depth=N)
Notes: Responses are typically represented as a collection of paths where each object is a valueMap(true) of the vertex at that step. Order within a path follows traversal order. Exact path ordering across the response may vary by provider, but membership assertions below must hold.
9) inc:sev1-payments-001, depth=1
- Call: GET /incidents/inc:sev1-payments-001/blastRadius?depth=1
- Expect: HTTP 200; at least two paths starting with the incident then the affected service svc:payments and then one hop of the repeat step. The set of terminal vertices across all returned paths must contain:
- dependency dep:payments-db
- dependency dep:payments-queue
- deployment vertices reachable via deployedAs/deploymentOf for the payments service: depoy:payments-20231113.2 and depoy:payments-20231114.1
- Minimal membership checks:
- There exists a path whose last vertex has depId = "dep:payments-db"
- There exists a path whose last vertex has depId = "dep:payments-queue"
- There exists a path whose last vertex has deployId = "depoy:payments-20231113.2"
- There exists a path whose last vertex has deployId = "depoy:payments-20231114.1"
10) inc:sev1-api-001, depth=1
- Call: GET /incidents/inc:sev1-api-001/blastRadius?depth=1
- Expect: HTTP 200; paths must begin with incident inc:sev1-api-001 then service svc:api-gateway. The set of terminal vertices across all returned paths must include:
- deployment depoy:api-gateway-20231113.1
- deployment depoy:api-gateway-20231114.1
- Minimal membership checks (as above) for both deployIds.
Negative and edge cases
11) Unknown serviceId
- Call: GET /services/svc:does-not-exist/dependencies?depth=2
- Expect: HTTP 200 with an empty list (no results).
12) Service with optional dependency (required=false)
- Call: GET /services/svc:inventory/dependencies?depth=1
- Expect: HTTP 200 with exactly two elements: svc:inventory and dep:inventory-cache. Verify that dependency existence is returned regardless of the required flag (edge property), as this endpoint enumerates topology not policy.
13) Readiness under graph reset
- Procedure: Drop all vertices, reload the dataset, then call GET /health/ready
- Expect: HTTP 200 and subsequent endpoint tests pass identically to the above, demonstrating determinism.
Determinism requirements
- Use the exact IDs in this section. Counts and membership must match precisely for a conforming implementation using the provided dataset.
- If your transport or framework changes ordering, assert on membership and counts rather than list order.
Data Loading and Fixtures
- Provide a deterministic data generator creating the Minimum Viable Dataset with:
- Multiple envs (prod, staging) and regions to exercise filters.
- Multi-hop dependsOn and calls chains (depth 3).
- Mixed success/failed deployments around incident windows for correlation.
- Loading order (suggested): teams services applications hosts dependencies deployments incidents alerts edges linking them.
Upgrade/Portability Considerations
- Exercise a broad set of Gremlin steps and traversal patterns (e.g., aggregate, project, path, repeat/emit, barrier, dedup, select/sack, coalesce, optional, groupCount, order(local), by(), math, where with predicates, subgraph, mergeV/E, profile) to intentionally surface potential upgrade differences; document provider/version nuances and, where practical, supply optional fallbacks.
- Avoid provider-specific features unless guarded with fallbacks.
- Keep label and property names stable; version and evolve via additive changes.
- When deprecating: retain old properties/edges, populate new ones in parallel, and document migration traversals.
Glossary
- CMDB: Configuration Management Database
- Blast Radius: The set of entities affected by an incident/change
- Root Cause Candidates: Upstream dependencies most likely responsible for impact
- Provenance: Metadata indicating how/when a relation was inferred or observed
Appendix: Quick Reference of Labels
Vertices: service, application, host, deployment, dependency, team, incident, alert
Edges (camelCase): runsOn, deployedAs, deploymentOf, deployedTo, dependsOn, ownedBy, triggered, affects, relatedTo, calls
This specification is intentionally framework-agnostic and should be sufficient for an LLM to generate a complete example project in any language and on any version/provider compatible with Apache TinkerPop.