Operations
The EDK containers ship with the observability and operational endpoints expected of any production Java service: health, readiness, build metadata, Prometheus metrics, structured logging, distributed tracing through OpenTelemetry, and tamper-evident audit events. This page covers how those surfaces work in practice and what an operator typically wires into them.
Health, Readiness, and Build Metadata
Each container exposes:
- GET /health: liveness. Returns 200 OK while the container is alive and able to serve. Used by Kubernetes liveness probes and container orchestrators.
- GET /ready: readiness. Returns 200 OK only when the Postgres connection pool is healthy, the KMS dependency is reachable, and the tenant resolver cache has loaded its initial state. Used by Kubernetes readiness probes and load-balancer health checks.
- GET /version: build metadata. Returns the container build version, the Git commit SHA, the build timestamp, and the IDK version range it was built against. Useful for verifying which version is actually running after a rolling deploy.
Readiness is more stringent than liveness on purpose. A container with a healthy /health but a failing /ready is alive but not yet able to serve traffic. The deployment template uses this to keep traffic off a freshly-started replica until its dependencies are confirmed reachable.
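The liveness/readiness distinction above can be sketched as a small operator-side check. This is a minimal illustration, not part of the EDK: the function names and state labels are hypothetical, and the only assumption taken from the text is that both endpoints answer 200 OK when healthy. Any other status, or no answer at all, is treated as failing.

```python
from urllib import error, request


def probe(base_url: str, path: str, timeout: float = 2.0) -> int:
    """Return the HTTP status for base_url + path, or 0 if the call fails outright."""
    try:
        with request.urlopen(base_url + path, timeout=timeout) as resp:
            return resp.status
    except error.HTTPError as exc:
        return exc.code  # non-2xx responses still carry a status code
    except (error.URLError, OSError):
        return 0  # connection refused, DNS failure, timeout, and similar


def classify(health_status: int, ready_status: int) -> str:
    """Map the two probe results onto the states described above."""
    if health_status != 200:
        return "down"       # liveness failing
    if ready_status != 200:
        return "not-ready"  # alive, but dependencies are not confirmed yet
    return "serving"


def replica_state(base_url: str) -> str:
    """Probe one replica (hypothetical base URL) and classify it."""
    return classify(probe(base_url, "/health"), probe(base_url, "/ready"))
```

A replica that reports "not-ready" should be left alone rather than restarted; it is the case the readiness probe exists to keep out of rotation.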
OpenTelemetry
The EDK telemetry module is wired into every container. Spans propagate through the command transport layer, so a single trace ID covers the API gateway, the issuer or verifier protocol handler, the attribute pipeline phases, the KMS signing call, and the webhook dispatch.
The container reads OpenTelemetry configuration from the standard env vars: OTEL_EXPORTER_OTLP_ENDPOINT, OTEL_EXPORTER_OTLP_HEADERS, OTEL_SERVICE_NAME (defaulted by the container to enterprise-issuer, enterprise-verifier, and so on), and the standard sampler configuration. When OTEL is not configured, the telemetry module is a no-op.
Metrics are exposed on /metrics in Prometheus exposition format. The metrics set includes the standard JVM metrics (heap, GC, threads), HTTP server metrics (request rate, latency histograms by route, error rates), command execution metrics (per command id), pipeline metrics on the issuer (phase duration, source duration, deferral rates), DCQL query metrics on the verifier, and KMS call metrics on every data-plane container.
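As an illustration of consuming the /metrics output without a full Prometheus stack, the sketch below parses the text exposition format and derives an error ratio. The metric name and label names in the sample are hypothetical; the actual names exported by the containers are not listed here. The parser is deliberately simplified (it skips comment lines and does not unescape label values).

```python
import re

# name, optional {labels}, value, optional timestamp
SAMPLE_RE = re.compile(
    r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)(?:\s+-?\d+)?$'
)
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_exposition(text):
    """Return a list of (metric_name, labels_dict, value) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # blank line, or HELP/TYPE metadata
        match = SAMPLE_RE.match(line)
        if not match:
            continue  # ignore anything this simplified parser cannot read
        name, raw_labels, value = match.groups()
        labels = dict(LABEL_RE.findall(raw_labels or ""))
        samples.append((name, labels, float(value)))
    return samples


def error_ratio(samples, metric):
    """Share of a counter's total that carries a 5xx 'status' label."""
    total = sum(v for n, _, v in samples if n == metric)
    errors = sum(
        v for n, labels, v in samples
        if n == metric and labels.get("status", "").startswith("5")
    )
    return errors / total if total else 0.0


# Hypothetical scrape output, for illustration only.
EXAMPLE = """\
# TYPE http_requests_total counter
http_requests_total{route="/credential",status="200"} 97
http_requests_total{route="/credential",status="500"} 3
"""
```

In a real deployment the same ratio would normally be computed by the metrics backend from rates over a time window rather than from raw counter values.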
W3C Trace Context propagation is on by default. The traceparent and tracestate headers flow through the EDK transport layer, so a trace started at the API gateway or upstream caller stays continuous through the issuer/verifier/AS/KMS chain.
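For an upstream caller that wants its own trace to continue into the containers, the traceparent header has the form version-traceid-parentid-flags. The sketch below parses and derives such headers following the W3C Trace Context layout for version 00. It is a simplified illustration with hypothetical function names, not the EDK transport code, and it rejects any header that does not match the four-field layout.

```python
import re
import secrets

TRACEPARENT_RE = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$"
)


def parse_traceparent(header):
    """Return the header's fields as a dict, or None if it is invalid."""
    match = TRACEPARENT_RE.match(header.strip())
    if not match:
        return None
    version, trace_id, parent_id, flags = match.groups()
    # The spec forbids version ff and all-zero trace or parent ids.
    if version == "ff" or trace_id == "0" * 32 or parent_id == "0" * 16:
        return None
    return {
        "version": version,
        "trace_id": trace_id,
        "parent_id": parent_id,
        "flags": flags,
        "sampled": bool(int(flags, 16) & 0x01),
    }


def child_traceparent(header):
    """Build the header for an outgoing call: same trace id, new span id."""
    parent = parse_traceparent(header)
    if parent is None:
        raise ValueError("invalid traceparent header")
    span_id = secrets.token_hex(8)  # 16 hex characters
    return f"00-{parent['trace_id']}-{span_id}-{parent['flags']}"
```

Keeping the trace id and replacing only the parent id is what makes the trace continuous across hops: every service contributes its own span under the same trace.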
Structured Logging
Logging is JSON to stdout by default. Each log entry carries the standard severity, message, and exception fields plus EDK-specific structured fields:
- tenant_id: the resolved tenant on the call, when applicable.
- correlation_id: the cross-request correlation identifier.
- command_id: the EDK command id, when the log entry was emitted inside command execution.
- trace_id and span_id: the OpenTelemetry trace context.
- principal_id: the authenticated principal, when applicable.
For sensitive operations (a credential issuance, an OID4VP presentation, a federation handshake), the log message itself is deliberately abstract; the structured fields carry the operational detail. This keeps the log stream useful for debugging without leaking credential subject claims, federation user attributes, or other PII into a downstream log aggregator.
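Because the operational detail lives in the structured fields, a log stream can be sliced without parsing message text. The following is a minimal sketch of filtering JSON log lines by tenant_id and grouping them by trace_id. The field names come from the list above; the sample entries, the function names, and the exact key names for severity and message are assumptions for illustration.

```python
import json


def entries_for_tenant(lines, tenant_id):
    """Yield parsed log entries whose tenant_id matches."""
    for line in lines:
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip non-JSON lines such as startup banners
        if isinstance(entry, dict) and entry.get("tenant_id") == tenant_id:
            yield entry


def group_by_trace(entries):
    """Group entries by trace_id so one request can be read end to end."""
    grouped = {}
    for entry in entries:
        grouped.setdefault(entry.get("trace_id"), []).append(entry)
    return grouped


# Hypothetical log lines, for illustration only.
SAMPLE_LINES = [
    '{"severity": "INFO", "message": "credential issued", '
    '"tenant_id": "acme", "trace_id": "t1", "command_id": "c1"}',
    '{"severity": "WARN", "message": "source deferred", '
    '"tenant_id": "acme", "trace_id": "t1", "command_id": "c2"}',
    '{"severity": "INFO", "message": "presentation verified", '
    '"tenant_id": "globex", "trace_id": "t2"}',
    'not json at all',
]
```

The same filter expressed in a log aggregator's query language is usually a one-liner; the point is that tenant and trace are first-class fields rather than substrings of the message.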
Audit
Audit is a separate stream from the operational log. Every command execution, authorization decision, authentication event, and admin REST mutation emits a structured audit event with the tenant, the principal, the command id, the result, and the relevant business identifiers. The audit subsystem ships with sensitive-data redaction (configurable per command), multiple output formats (JSON, CEF, OCSF), and tamper evidence via hash chaining plus signed checkpoints.
The default audit sink is a Postgres-backed event store (PostgresDatabaseEventStore). A read REST surface (/api/v1/audit/events) lets platform administrators query by tenant, principal, command, time range, and result. For long-term retention, the audit pipeline can replicate events to an external SIEM through the SSF (Shared Signals Framework) module or through a generic event transmitter.
Audit signing is per-tenant and optional. When enabled, each event is signed with a tenant-scoped audit key on the KMS (tenant, audit, audit-checkpoint), and periodic signed checkpoints make tampering with stored events detectable. Signing is off by default; enable it per tenant through audit.events.signing.enabled.
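To make the idea of hash chaining concrete, here is a generic sketch of a tamper-evident chain built with SHA-256. It is not the EDK's audit record format or its verification code: the record layout, the genesis value, and the function names are all hypothetical, and signing is left out entirely.

```python
import hashlib
import json

GENESIS = "0" * 64  # hypothetical starting value for the first record


def _digest(prev_hash, payload):
    """Hash the previous record's hash together with a canonical payload."""
    body = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + body).encode("utf-8")).hexdigest()


def append_event(chain, payload):
    """Append a payload, linking it to the hash of the previous record."""
    prev = chain[-1]["hash"] if chain else GENESIS
    chain.append(
        {"payload": payload, "prev_hash": prev, "hash": _digest(prev, payload)}
    )


def first_broken_index(chain):
    """Return the index of the first record that fails verification, or None."""
    prev = GENESIS
    for index, record in enumerate(chain):
        if record["prev_hash"] != prev:
            return index
        if record["hash"] != _digest(prev, record["payload"]):
            return index
        prev = record["hash"]
    return None
```

A chain alone only detects edits made by someone who cannot recompute every later hash. Signed checkpoints close that gap, because a rewritten chain no longer matches a checkpoint signed with a key the attacker does not hold.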
Backup and Restore
Postgres backup and restore is the principal data-protection story. The EDK does not run its own scheduled backups; the deployment uses whatever backup tooling the operator runs against Postgres in general (pg_basebackup, managed-service snapshots, WAL archiving).
For a clean tenant export (for compliance or for moving a tenant to a different deployment), the tenant export REST emits a self-contained JSON document with all of a tenant's CRUD-managed entities: tenant registration, public-endpoint bindings, integrations, credential designs, attribute supplier registrations, federation providers, DCQL queries, trust source bindings, signing key aliases. Re-importing the document on a target deployment recreates the tenant configuration.
Tenant export does not include credential subject data, presentation records, audit history, or KMS-held key material. Subject data and audit history follow the standard data-retention policy; key material does not move (the new deployment generates new keys for the tenant under its own KMS).
KMS key material is the special case. The provider backends (AWS KMS, Azure Key Vault, HSMs) have their own key backup and recovery stories. The EDK does not export private key material out of the provider backend. For the software keystore provider, the backup is a copy of the keystore file; for everything else, the backup is whatever the backend supports.
Image Publishing and Versioning
Sphereon publishes the five enterprise images to a private commercial registry. Image tags follow the convention :<version>, :latest, and :<git-sha>. Customers pull via standard Docker auth against the registry credentials issued under their commercial agreement.
A typical pull-and-pin pattern in the customer's deployment:
image: registry.sphereon.com/sphereon/enterprise-issuer:0.25.0
Use the version tag in production deployments, not :latest. Verifying the image is straightforward: the build metadata at /version and the SBOM published alongside each image both identify the version, the Git commit, and the included module versions.
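The pinning rule can be enforced mechanically in a deployment pipeline. The sketch below is one hypothetical way to do it: it extracts the tag from an image reference and rejects references that are untagged or use :latest. The function names are illustrative, and digest references are treated as pinned on the assumption that a digest is immutable.

```python
def image_tag(ref):
    """Return the tag part of an image reference, or None if there is none."""
    name = ref.split("@", 1)[0]        # drop a digest suffix if present
    last = name.rsplit("/", 1)[-1]     # a ':' before the last '/' is a registry port
    if ":" not in last:
        return None
    return last.rsplit(":", 1)[1]


def is_pinned(ref):
    """True when the reference names a fixed version rather than a moving tag."""
    if "@" in ref:
        return True  # digest reference
    return image_tag(ref) not in (None, "latest")
```

For example, is_pinned("registry.sphereon.com/sphereon/enterprise-issuer:0.25.0") is true, while the same reference with :latest or with no tag at all is rejected.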
Operator Hardening Checklist
A production-ready deployment of the five EDK containers ticks the following:
- Network isolation. KMS internal-only. Admin REST on every data-plane container behind the internal ingress with bearer-JWT auth. Public ingress carries protocol paths and .well-known URLs only. NetworkPolicies on Kubernetes scope inter-container traffic to the actual call graph.
- TLS. Public ingresses terminate TLS at the gateway. Internal communication uses mTLS (mesh) or service JWT (in-process), per the topology choice.
- Secrets. No secret in YAML, env var, or image. Every secret a ${secret:...} reference resolved through the configured backend.
- JWT validation. Admin REST requires a bearer JWT with the right scopes. JWT issuer URL configured per environment. JWKS refresh schedule sized to expected key rotation cadence.
- Postgres. TLS in transit. Backups verified by periodic restore. Connection pool sized to the container's actual throughput, with the pool max below Postgres's max_connections divided across replicas.
- OpenTelemetry. Wired to the deployment's collector. Sampling configured to the desired retention budget.
- Metrics. Prometheus or compatible scraper subscribed to /metrics on every container. Alerts on the standard SLIs: error rate, latency p99, KMS call failure rate, webhook dispatch backlog, Postgres connection pool exhaustion.
- Logs. Centralised aggregation. Retention sized for the deployment's compliance requirements.
- Audit. Sink configured (Postgres by default, replication to SIEM if applicable). Signing enabled per tenant where required.
- Image hygiene. Image vulnerability scan in the customer's pull pipeline. SBOM cross-checked against the operator's dependency policy.
- Capacity probes. Load test the deployment against expected traffic before going live. The EDK ships a multi-tenant load test harness as part of the Phase 6b hardening deliverables; tailor it to the tenant count and credential volume of the actual workload.
- Failover. Postgres failover tested. KMS failover (where applicable) tested. The deployment template's readiness probes correctly remove a replica from rotation when its dependencies fail.
- Runbook. Document the deployment's tenant onboarding flow, key rotation procedure, federation provider rotation procedure, and incident response for a credential-compromise event (typically: revoke through the audit pipeline, rotate the affected tenant signing key, re-issue affected credentials).
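The Postgres pool-sizing item in the checklist is simple arithmetic, sketched below. The reserved parameter is an assumption added for illustration: it stands for connections held back for administrative sessions, migrations, or monitoring, and the function name is hypothetical. Here replicas means every pool-holding instance that connects to the same Postgres server, across all container types.

```python
def pool_max_per_replica(max_connections: int, replicas: int, reserved: int) -> int:
    """Largest per-replica pool size that keeps the total under max_connections."""
    if replicas < 1:
        raise ValueError("replicas must be at least 1")
    if reserved < 0:
        raise ValueError("reserved must not be negative")
    budget = max_connections - reserved
    if budget < replicas:
        raise ValueError("not enough connections for one per replica")
    return budget // replicas
```

With max_connections at 100, four replicas, and ten connections reserved, each pool may hold at most 22 connections, which leaves the total at 98. The figure needs recomputing whenever the autoscaler's maximum replica count changes, since the bound depends on the peak number of replicas, not the current one.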
When Something Goes Wrong
A few standard diagnostics:
- /credential returning 202 more often than expected on the issuer container points at attribute sources timing out within the syncWaitWindow. Check the issuer's pipeline metrics for per-source latency; raise the syncWaitWindow on the slow source, or move it to a later phase, or accept the deferred flow.
- Wallet metadata fetch returning the wrong host URL points at a missing or stale tenant_public_endpoint binding. Verify the binding through the tenant admin REST; the data plane's URL resolver is fail-closed by default, so an absent binding produces a refusal rather than a silent fallback.
- Webhook deliveries failing on one consumer while the dispatcher stays healthy point at a per-destination circuit breaker opening. Check the webhook delivery status REST for the affected destination; the circuit breaker auto-closes after the cool-down window once the destination recovers.
- Federation provider connectivity errors are surfaced by the TenantIdpConnectivity test endpoint. Re-run the connectivity test after the upstream IdP recovers; the AS keeps the cached JWKS and discovery document until the test passes again.
- Cross-replica config not propagating points at the Postgres LISTEN/NOTIFY bridge being interrupted. The TTL fallback covers this within the configured cache TTL; if a permanent block exists (network policy, Postgres permission), the event subsystem's health metric surfaces it.