Observability Guide¶

Metrics, tracing, and health monitoring for RCT Platform deployments.

Overview¶

RCT Platform exposes observability through three layers:

Layer	What it measures	Endpoint / Method
Health probes	Liveness + readiness	`GET /health`, `GET /health/detailed`
FDIA metrics	Scoring throughput, avg score, tier distribution	`GET /metrics`
Audit trail	Full signed decision chain per intent	`GET /audit/{intent_id}`

Health Endpoints¶

Liveness — `GET /health`¶

Fast probe suitable for Kubernetes livenessProbe:

curl http://localhost:8000/health

{
  "status": "healthy",
  "timestamp": "2026-04-21T10:00:00+00:00"
}

HTTP 200 = healthy. HTTP 503 = degraded.

Readiness — `GET /health/detailed`¶

Full component check suitable for Kubernetes readinessProbe:

curl http://localhost:8000/health/detailed

{
  "status": "healthy",
  "version": "1.0.2a0",
  "components": {
    "fdia_engine": "ok",
    "delta_engine": "ok",
    "signedai_registry": "ok",
    "intent_compiler": "ok",
    "dsl_parser": "ok",
    "policy_engine": "ok"
  },
  "timestamp": "2026-04-21T10:00:00+00:00"
}

Any component with "error" status causes the overall status to become "degraded".

CLI Metrics — `rct metrics`¶

The CLI exposes runtime metrics from the in-memory control plane:

rct metrics
rct metrics --output json

Table output:

Metric                        Value
────────────────────────────────────────
Intents processed             1,247
FDIA avg score                0.8734
FDIA min score                0.1200
FDIA max score                0.9980
Warm recall hit rate          89.3%
Consensus tier S2 calls       187
Consensus tier S4 calls       847
Consensus tier S5 calls       213
Policy violations (blocked)   2
Active intents                5
Uptime                        14d 06h 23m

JSON output (for dashboards):

{
  "intents_processed": 1247,
  "fdia_avg": 0.8734,
  "warm_recall_hit_rate": 0.893,
  "policy_violations": 2,
  "uptime_seconds": 1231380
}

Prometheus Integration¶

The Control Plane exposes a Prometheus-compatible metrics endpoint when the ENABLE_PROMETHEUS=1 environment variable is set.

ENABLE_PROMETHEUS=1 rct serve --port 8000

Metrics are available at GET /metrics in Prometheus text format.

Key metrics¶

Metric name	Type	Description
`rct_intents_total`	Counter	Total intents processed
`rct_fdia_score`	Histogram	FDIA score distribution (buckets: 0.1 steps)
`rct_warm_recall_total`	Counter	Number of warm (cached) recalls
`rct_cold_start_total`	Counter	Number of cold-start (full compute) recalls
`rct_consensus_tier_calls`	Counter	Calls per SignedAI tier (label: `tier`)
`rct_policy_violations_total`	Counter	Constitutional blocks (A=0 events)
`rct_processing_duration_ms`	Histogram	Intent processing latency

Prometheus scrape config¶

# prometheus.yml
scrape_configs:
  - job_name: rct-platform
    static_configs:
      - targets: ["localhost:8000"]
    metrics_path: /metrics
    scrape_interval: 15s

Grafana Dashboard¶

Import the sample dashboard from docs/assets/grafana-dashboard.json (included in v1.1.0+ releases) or build your own using the metrics above.

Recommended panels:

Intent throughput — rate(rct_intents_total[5m])
FDIA score heatmap — histogram_quantile(0.95, rct_fdia_score)
Warm vs cold recall — rct_warm_recall_total / (rct_warm_recall_total + rct_cold_start_total)
Policy violations — increase(rct_policy_violations_total[1h])

Kubernetes Probe Configuration¶

# k8s deployment snippet
livenessProbe:
  httpGet:
    path: /health
    port: 8000
  initialDelaySeconds: 10
  periodSeconds: 10
  failureThreshold: 3

readinessProbe:
  httpGet:
    path: /health/detailed
    port: 8000
  initialDelaySeconds: 15
  periodSeconds: 5
  failureThreshold: 2

Audit Trail¶

Every intent generates a signed audit chain accessible via:

rct audit INTENT_ID
rct audit INTENT_ID --output json

The audit record contains:

{
  "intent_id": "a1b2c3d4-...",
  "user_id": "alice",
  "intent_text": "Deploy microservice to staging",
  "created_at": "2026-04-21T10:00:00+00:00",
  "completed_at": "2026-04-21T10:00:02+00:00",
  "fdia_score": 0.8734,
  "data_quality": 0.92,
  "intent_precision": 0.88,
  "architect_score": 0.95,
  "consensus_tier": "S4",
  "consensus_models": ["claude-opus", "kimi-k2", "minimax-m2", "gemini-flash"],
  "policy_status": "COMPLIANT",
  "audit_hash": "sha256:abc123..."
}

The audit_hash is a SHA-256 of the complete record — enables tamper detection.

Structured Logging¶

The Control Plane emits structured JSON logs by default:

{
  "timestamp": "2026-04-21T10:00:00+00:00",
  "level": "INFO",
  "service": "rct-control-plane",
  "intent_id": "a1b2c3d4-...",
  "event": "intent_compiled",
  "fdia_score": 0.8734,
  "duration_ms": 12.4
}

Log level configuration:

# Environment variable
RCT_LOG_LEVEL=DEBUG rct serve

# CLI flag
rct serve --verbose

Level	When to use
`DEBUG`	Development — shows all internal events
`INFO`	Production — intent lifecycle events
`WARNING`	Degraded component, low FDIA score
`ERROR`	Policy violations, consensus failures

Setting Up Local Monitoring Stack¶

# Start full local stack with Prometheus + Grafana
docker compose -f docker-compose.dev.yml -f docker-compose.monitoring.yml up

# Prometheus UI: http://localhost:9090
# Grafana: http://localhost:3000 (admin/admin)

Note

docker-compose.monitoring.yml is planned for v1.1.0. For v1.0.2a0, use rct metrics and the REST endpoints for observability.