Monitoring the verification endpoint¶

This guide covers what metrics to expose from the verification endpoint, what alert conditions to configure, and how to integrate with Prometheus and OpenTelemetry.

Key metrics¶

Request counters¶

Metric	Type	Labels	Description
`agent_manifest_verifications_total`	Counter	`result` (VALID/MISMATCH/EXPIRED/REVOKED/INCOMPLETE/ERROR)	Total verification requests
`agent_manifest_revocation_checks_total`	Counter	`result` (hit/miss/error)	CRL checks performed
`agent_manifest_manifests_active`	Gauge	`attestation_level`	Manifests in the store by attestation level

Latency histograms¶

Metric	Buckets	Description
`agent_manifest_verification_duration_seconds`	`[0.001, 0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0]`	End-to-end verification latency
`agent_manifest_revocation_check_duration_seconds`	`[0.001, 0.005, 0.01, 0.05, 0.1, 0.5]`	CRL fetch + lookup latency

Adding Prometheus instrumentation¶

Install prometheus-client and wrap the verification router:

from fastapi import FastAPI, Request, Response
from prometheus_client import Counter, Histogram, Gauge, generate_latest, CONTENT_TYPE_LATEST
import time

VERIFICATIONS = Counter(
    "agent_manifest_verifications_total",
    "Total verification requests",
    ["result"],
)
VERIFICATION_LATENCY = Histogram(
    "agent_manifest_verification_duration_seconds",
    "End-to-end verification latency",
    buckets=[0.001, 0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0],
)
REVOCATION_LATENCY = Histogram(
    "agent_manifest_revocation_check_duration_seconds",
    "CRL lookup latency",
    buckets=[0.001, 0.005, 0.01, 0.05, 0.1, 0.5],
)
ACTIVE_MANIFESTS = Gauge(
    "agent_manifest_manifests_active",
    "Active manifests by attestation level",
    ["attestation_level"],
)

app = FastAPI()

@app.middleware("http")
async def record_verification_metrics(request: Request, call_next):
    if request.url.path == "/verify":
        start = time.perf_counter()
        response = await call_next(request)
        duration = time.perf_counter() - start
        VERIFICATION_LATENCY.observe(duration)
        return response
    return await call_next(request)

@app.get("/metrics")
def metrics():
    return Response(generate_latest(), media_type=CONTENT_TYPE_LATEST)

Call VERIFICATIONS.labels(result=result.result.value).inc() in your verification handler after each call.

OpenTelemetry integration¶

If your stack uses OpenTelemetry instead of direct Prometheus:

from opentelemetry import metrics
from opentelemetry.sdk.metrics import MeterProvider

provider = MeterProvider()
metrics.set_meter_provider(provider)
meter = metrics.get_meter("agent-manifest")

verifications = meter.create_counter(
    "agent_manifest.verifications",
    description="Total verification requests",
)
verification_latency = meter.create_histogram(
    "agent_manifest.verification.duration",
    unit="s",
    description="End-to-end verification latency",
)

# In your handler:
verifications.add(1, {"result": result.result.value})
verification_latency.record(duration, {"result": result.result.value})

Alert conditions¶

Critical alerts (page immediately)¶

Condition	PromQL	Meaning
INVALID spike	`rate(agent_manifest_verifications_total{result="MISMATCH"}[5m]) > 0.1`	Possible artifact tampering or replay attack
REVOKED spike	`rate(agent_manifest_verifications_total{result="REVOKED"}[5m]) > 0.5`	Active incident - multiple manifests being revoked
Verifier unreachable	`absent(agent_manifest_verifications_total)`	Verification sidecar is down

Warning alerts (page next business day)¶

Condition	PromQL	Meaning
High p99 latency	`histogram_quantile(0.99, rate(agent_manifest_verification_duration_seconds_bucket[5m])) > 0.2`	CRL or Rekor lookup is slow
EXPIRED manifests accumulating	`rate(agent_manifest_verifications_total{result="EXPIRED"}[1h]) > 0.01`	Issuance pipeline not refreshing manifests
Level 0 agents above threshold	`agent_manifest_manifests_active{attestation_level="0"} > 5`	Unattested agents in production

Prometheus alert rules¶

groups:
  - name: agent-manifest
    rules:
      - alert: ManifestTamperingDetected
        expr: rate(agent_manifest_verifications_total{result="MISMATCH"}[5m]) > 0.1
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: Manifest artifact mismatch rate elevated
          description: Possible artifact tampering or key compromise. Rate = {{ $value }} req/s.

      - alert: ManifestRevocationSpike
        expr: rate(agent_manifest_verifications_total{result="REVOKED"}[5m]) > 0.5
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: High revocation rate detected
          description: Multiple manifests being revoked. Rate = {{ $value }} req/s. Check for active incident.

      - alert: VerificationLatencyHigh
        expr: histogram_quantile(0.99, rate(agent_manifest_verification_duration_seconds_bucket[5m])) > 0.2
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: Verification p99 latency above 200ms
          description: Check CRL endpoint availability and Rekor response times.

Grafana dashboard¶

Key panels for a verification endpoint dashboard:

Row 1: Request health - Panel: rate(agent_manifest_verifications_total[5m]) by result - stacked area chart - Panel: rate(agent_manifest_verifications_total{result!="VALID"}[5m]) - single stat with alert threshold

Row 2: Latency - Panel: histogram_quantile(0.50|0.95|0.99, rate(agent_manifest_verification_duration_seconds_bucket[5m])) - line chart - Panel: histogram_quantile(0.99, rate(agent_manifest_revocation_check_duration_seconds_bucket[5m])) - single stat

Row 3: Fleet health - Panel: agent_manifest_manifests_active by attestation_level - bar gauge - Panel: rate(agent_manifest_verifications_total{result="EXPIRED"}[1h]) - single stat

SLO targets

Metric	Target
Verification success rate (VALID)	≥ 99.5%
p99 verification latency	< 50ms (local CRL) / < 200ms (remote CRL)
Revocation propagation time	< 30s
Uptime (verifier reachable)	99.9%

What each non-VALID result means operationally¶

Result	Frequency in healthy system	Cause	Response
MISMATCH	Rare (< 0.01%)	Artifact changed after issuance	Investigate - possible tamper
EXPIRED	Low (< 0.1%)	Manifest not refreshed before expiry	Fix issuance pipeline
REVOKED	Rare (near zero)	Expected after revocation event	Confirm revocation was intentional
INCOMPLETE	None	HITL required but missing	Fix approval workflow
ATTESTATION_UNAVAILABLE	Rare in production	Hardware provider unavailable	Check attestation hardware
ERROR	Near zero	Malformed manifest or unexpected exception	Check logs