Monitoring the verification endpoint

This guide covers what metrics to expose from the verification endpoint, what alert conditions to configure, and how to integrate with Prometheus and OpenTelemetry.


Key metrics

Request counters

| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| agent_manifest_verifications_total | Counter | result (VALID/MISMATCH/EXPIRED/REVOKED/INCOMPLETE/ATTESTATION_UNAVAILABLE/ERROR) | Total verification requests |
| agent_manifest_revocation_checks_total | Counter | result (hit/miss/error) | CRL checks performed |
| agent_manifest_manifests_active | Gauge | attestation_level | Manifests in the store by attestation level |

Latency histograms

| Metric | Buckets (seconds) | Description |
| --- | --- | --- |
| agent_manifest_verification_duration_seconds | [0.001, 0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0] | End-to-end verification latency |
| agent_manifest_revocation_check_duration_seconds | [0.001, 0.005, 0.01, 0.05, 0.1, 0.5] | CRL fetch + lookup latency |

Adding Prometheus instrumentation

Install prometheus-client and wrap the verification router:

from fastapi import FastAPI, Request, Response
from prometheus_client import Counter, Histogram, Gauge, generate_latest, CONTENT_TYPE_LATEST
import time

VERIFICATIONS = Counter(
    "agent_manifest_verifications_total",
    "Total verification requests",
    ["result"],
)
VERIFICATION_LATENCY = Histogram(
    "agent_manifest_verification_duration_seconds",
    "End-to-end verification latency",
    buckets=[0.001, 0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0],
)
REVOCATION_LATENCY = Histogram(
    "agent_manifest_revocation_check_duration_seconds",
    "CRL fetch + lookup latency",
    buckets=[0.001, 0.005, 0.01, 0.05, 0.1, 0.5],
)
ACTIVE_MANIFESTS = Gauge(
    "agent_manifest_manifests_active",
    "Active manifests by attestation level",
    ["attestation_level"],
)

app = FastAPI()

@app.middleware("http")
async def record_verification_metrics(request: Request, call_next):
    # Only time the verification route; pass everything else straight through.
    if request.url.path != "/verify":
        return await call_next(request)
    start = time.perf_counter()
    try:
        return await call_next(request)
    finally:
        # Record latency even when the handler raises.
        VERIFICATION_LATENCY.observe(time.perf_counter() - start)

@app.get("/metrics")
def metrics():
    return Response(generate_latest(), media_type=CONTENT_TYPE_LATEST)

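In your verification handler, call VERIFICATIONS.labels(result=result.result.value).inc() once each verification completes. The middleware above records only end-to-end latency; REVOCATION_LATENCY and ACTIVE_MANIFESTS are defined but must be updated from your own code.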
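The metric tables also list agent_manifest_revocation_checks_total, and neither the revocation histogram nor the active-manifest gauge is fed by the middleware. The following is a minimal sketch of how those might be updated, assuming prometheus-client is installed. The helpers check_revocation and refresh_active_manifests, the in-memory revoked_ids set, and the reading of "hit" as "manifest found in the CRL" are illustrative assumptions, not part of the verifier's API.

```python
from collections import Counter as Tally

from prometheus_client import Counter, Gauge, Histogram

# Metric names and labels follow the tables in this guide.
REVOCATION_CHECKS = Counter(
    "agent_manifest_revocation_checks_total",
    "CRL checks performed",
    ["result"],
)
REVOCATION_LATENCY = Histogram(
    "agent_manifest_revocation_check_duration_seconds",
    "CRL fetch + lookup latency",
    buckets=[0.001, 0.005, 0.01, 0.05, 0.1, 0.5],
)
ACTIVE_MANIFESTS = Gauge(
    "agent_manifest_manifests_active",
    "Active manifests by attestation level",
    ["attestation_level"],
)


def check_revocation(manifest_id: str, revoked_ids: set) -> bool:
    """Hypothetical CRL lookup; returns True when the manifest is revoked."""
    try:
        # Histogram.time() observes the elapsed seconds when the block exits.
        with REVOCATION_LATENCY.time():
            revoked = manifest_id in revoked_ids
    except Exception:
        REVOCATION_CHECKS.labels(result="error").inc()
        raise
    # Assumption: "hit" means the manifest was found in the CRL.
    REVOCATION_CHECKS.labels(result="hit" if revoked else "miss").inc()
    return revoked


def refresh_active_manifests(levels: list) -> None:
    """Set the gauge from a hypothetical list holding one attestation level per manifest."""
    for level, count in Tally(levels).items():
        ACTIVE_MANIFESTS.labels(attestation_level=str(level)).set(count)
```

If this is merged into the module that already defines REVOCATION_LATENCY and ACTIVE_MANIFESTS, reuse those objects rather than registering the same metric names twice. Note that refresh_active_manifests only touches levels present in the input, so a level that drops to zero keeps its last value unless it is explicitly set to 0.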


OpenTelemetry integration

If your stack uses OpenTelemetry instead of direct Prometheus:

from opentelemetry import metrics
from opentelemetry.sdk.metrics import MeterProvider

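# With no metric_readers configured, measurements are recorded but never exported.
# Pass metric_readers=[...] to MeterProvider to ship them to a backend.
provider = MeterProvider()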
metrics.set_meter_provider(provider)
meter = metrics.get_meter("agent-manifest")

verifications = meter.create_counter(
    "agent_manifest.verifications",
    description="Total verification requests",
)
verification_latency = meter.create_histogram(
    "agent_manifest.verification.duration",
    unit="s",
    description="End-to-end verification latency",
)

# In your handler (result and duration come from the verification call):
verifications.add(1, {"result": result.result.value})
verification_latency.record(duration, {"result": result.result.value})

Alert conditions

Critical alerts (page immediately)

| Condition | PromQL | Meaning |
| --- | --- | --- |
| MISMATCH spike | rate(agent_manifest_verifications_total{result="MISMATCH"}[5m]) > 0.1 | Possible artifact tampering or replay attack |
| REVOKED spike | rate(agent_manifest_verifications_total{result="REVOKED"}[5m]) > 0.5 | Active incident - multiple manifests being revoked |
| Verifier unreachable | absent(agent_manifest_verifications_total) | Verification sidecar is down |
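The rate thresholds are per-second averages over the range window, which makes them hard to read at a glance. As a rough worked example (the helper name events_in_window is hypothetical, and Prometheus extrapolates at window edges, so the counts are approximate), the thresholds translate into event counts like this:

```python
def events_in_window(rate_per_second: float, window_seconds: int) -> float:
    """Approximate number of events in the window that yields the given average rate."""
    return round(rate_per_second * window_seconds, 6)


FIVE_MINUTES = 5 * 60

# MISMATCH threshold: 0.1 req/s averaged over 5 minutes.
print(events_in_window(0.1, FIVE_MINUTES))  # 30.0
# REVOKED threshold: 0.5 req/s averaged over 5 minutes.
print(events_in_window(0.5, FIVE_MINUTES))  # 150.0
```

So the MISMATCH condition needs roughly 30 mismatched verifications within five minutes, and the REVOKED condition roughly 150 REVOKED results. If a single mismatch should already be treated as suspicious, a lower threshold may be worth considering.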

Warning alerts (review next business day)

| Condition | PromQL | Meaning |
| --- | --- | --- |
| High p99 latency | histogram_quantile(0.99, rate(agent_manifest_verification_duration_seconds_bucket[5m])) > 0.2 | CRL or Rekor lookup is slow |
| EXPIRED manifests accumulating | rate(agent_manifest_verifications_total{result="EXPIRED"}[1h]) > 0.01 | Issuance pipeline not refreshing manifests |
| Level 0 agents above threshold | agent_manifest_manifests_active{attestation_level="0"} > 5 | Unattested agents in production |
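The p99 latency condition compares an estimate, not a measured value: histogram_quantile finds the bucket that contains the requested rank and interpolates linearly inside it. The sketch below re-implements that idea in plain Python to show the effect. The function is a simplified illustration rather than Prometheus's own code, and the cumulative counts are made up.

```python
import math


def histogram_quantile(q: float, buckets: list) -> float:
    """Estimate a quantile from (upper_bound, cumulative_count) pairs.

    The pairs must be sorted by bound and end with the +Inf bucket.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    lower_bound, lower_count = 0.0, 0.0
    for upper_bound, count in buckets:
        if count >= rank:
            if math.isinf(upper_bound):
                # The quantile lies beyond the last finite bucket.
                return lower_bound
            # Assume observations are spread evenly inside the bucket.
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return math.nan


# Hypothetical cumulative counts for the verification latency buckets.
observed = [
    (0.001, 0), (0.005, 100), (0.01, 400), (0.025, 800), (0.05, 950),
    (0.1, 980), (0.25, 1000), (0.5, 1000), (1.0, 1000), (math.inf, 1000),
]
p99 = histogram_quantile(0.99, observed)
print(f"p99 ~ {p99:.3f}s")  # p99 ~ 0.175s
```

Here the 99th-percentile rank falls in the 0.1 to 0.25 bucket, so the estimate is interpolated between those bounds regardless of where in that range the requests actually landed. Because the 0.2 alert threshold is not itself a bucket boundary, the alert depends on this interpolation; adding a 0.2 bucket is one option if a sharper signal is needed.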

Prometheus alert rules

groups:
  - name: agent-manifest
    rules:
      - alert: ManifestTamperingDetected
        expr: rate(agent_manifest_verifications_total{result="MISMATCH"}[5m]) > 0.1
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: Manifest artifact mismatch rate elevated
          description: Possible artifact tampering or key compromise. Rate = {{ $value }} req/s.

      - alert: ManifestRevocationSpike
        expr: rate(agent_manifest_verifications_total{result="REVOKED"}[5m]) > 0.5
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: High revocation rate detected
          description: Multiple manifests being revoked. Rate = {{ $value }} req/s. Check for active incident.

      - alert: VerificationLatencyHigh
        expr: histogram_quantile(0.99, rate(agent_manifest_verification_duration_seconds_bucket[5m])) > 0.2
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: Verification p99 latency above 200ms
          description: Check CRL endpoint availability and Rekor response times.

Grafana dashboard

Key panels for a verification endpoint dashboard:

Row 1: Request health
- Panel: sum by (result) (rate(agent_manifest_verifications_total[5m])) - stacked area chart
- Panel: sum(rate(agent_manifest_verifications_total{result!="VALID"}[5m])) - single stat with alert threshold

Row 2: Latency
- Panel: histogram_quantile(q, rate(agent_manifest_verification_duration_seconds_bucket[5m])), one query each for q = 0.50, 0.95, and 0.99 - line chart
- Panel: histogram_quantile(0.99, rate(agent_manifest_revocation_check_duration_seconds_bucket[5m])) - single stat

Row 3: Fleet health
- Panel: sum by (attestation_level) (agent_manifest_manifests_active) - bar gauge
- Panel: rate(agent_manifest_verifications_total{result="EXPIRED"}[1h]) - single stat

SLO targets

| Metric | Target |
| --- | --- |
| Verification success rate (VALID) | ≥ 99.5% |
| p99 verification latency | < 50 ms (local CRL) / < 200 ms (remote CRL) |
| Revocation propagation time | < 30 s |
| Uptime (verifier reachable) | 99.9% |
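The success-rate target can be checked directly from the per-result verification counts. The sketch below assumes the SLO is computed as VALID results divided by all results over the chosen window. The function names and the sample counts are hypothetical.

```python
SUCCESS_TARGET = 0.995  # from the SLO table: VALID >= 99.5%


def success_rate(counts: dict) -> float:
    """Fraction of verifications that returned VALID."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no verifications recorded")
    return counts.get("VALID", 0) / total


def error_budget_remaining(counts: dict, target: float = SUCCESS_TARGET) -> float:
    """Share of the allowed non-VALID results still unspent (negative when overspent)."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no verifications recorded")
    allowed_failures = (1 - target) * total
    failures = total - counts.get("VALID", 0)
    return 1 - failures / allowed_failures


# Hypothetical per-result counter increases over the SLO window.
window = {"VALID": 99_700, "EXPIRED": 200, "MISMATCH": 50, "ERROR": 50}
print(f"success rate: {success_rate(window):.3%}")
print(f"error budget remaining: {error_budget_remaining(window):.0%}")
```

With these sample counts the success rate is 99.7%, which meets the target, and about 40% of the error budget is left. Whether expected REVOKED results should count against the budget is a policy decision this sketch does not settle.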

What each non-VALID result means operationally

| Result | Frequency in healthy system | Cause | Response |
| --- | --- | --- | --- |
| MISMATCH | Rare (< 0.01%) | Artifact changed after issuance | Investigate - possible tampering |
| EXPIRED | Low (< 0.1%) | Manifest not refreshed before expiry | Fix issuance pipeline |
| REVOKED | Rare (near zero) | Expected after revocation event | Confirm revocation was intentional |
| INCOMPLETE | None | Human-in-the-loop (HITL) approval required but missing | Fix approval workflow |
| ATTESTATION_UNAVAILABLE | Rare in production | Hardware provider unavailable | Check attestation hardware |
| ERROR | Near zero | Malformed manifest or unexpected exception | Check logs |