Version: v0.25.0 (Latest)

Telemetry & Observability

The EDK's telemetry system gives you visibility into what your application is doing at runtime: which commands are executing, how long they take, where time is spent across service boundaries, and how errors propagate through the system. It covers the three pillars of observability: distributed tracing, metrics, and log correlation.

The system is built as a platform-agnostic abstraction with an OpenTelemetry backend. Application code uses the EDK's TracerProvider and MetricsCollector interfaces; it never imports OpenTelemetry directly. This means telemetry works identically across Kotlin Multiplatform targets (JVM, JS, Node.js), and the backend can be swapped without changing application code.

Why Telemetry Matters for the EDK

In a command-oriented architecture where operations can execute locally or be forwarded to remote microservices via command transport, understanding what happens across service boundaries is essential. A single user action might trigger a chain of commands: authenticate → resolve identity → check policy → generate key → sign credential. Without tracing, debugging a slow response or intermittent failure in that chain means reading logs from multiple services and manually correlating timestamps.

With the telemetry system:

  • A trace follows the entire chain from the initial HTTP request through every command and remote call, with a single trace ID that links everything together
  • Spans within the trace show where time is spent: for example, 200 ms in policy evaluation, 50 ms in KMS, and 10 ms in the database
  • Metrics track command success/failure rates, latency distributions, and concurrent execution counts
  • Log correlation injects the trace ID and span ID into application logs, so you can filter your log aggregator (Elasticsearch, Loki, Datadog) by trace ID and see every log line from every service that participated in that request

Architecture

The telemetry system uses a two-layer architecture:

Public API (lib-telemetry-public) defines the interfaces that all application code uses: TracerProvider, Tracer, Span, SpanBuilder, MetricsCollector, and TraceContext. This module has no OpenTelemetry dependencies and supports Kotlin Multiplatform.

OTel Implementation (lib-telemetry-otel) adapts the OpenTelemetry SDK to the public API. It provides OtelTracerProvider, OtelMetricsCollector, and OtelContextPropagator, which are JVM-only implementations backed by the OTel SDK. When this module is on the classpath, Metro DI automatically replaces the no-op defaults with the OTel implementations.

When lib-telemetry-otel is not on the classpath, all telemetry calls go to zero-overhead no-op implementations. There is no runtime cost: method calls return immediately, no objects are allocated, and no data is collected. This means libraries can instrument their code unconditionally; the cost is zero unless someone explicitly opts in by adding the OTel module.
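The no-op default pattern described above can be sketched as follows. This is a minimal illustration, not the EDK's actual code: the simplified MetricsCollector interface, NoopMetricsCollector, RecordingMetricsCollector, and KeyService are all hypothetical stand-ins, and the real interfaces may have different signatures.

```kotlin
// Hypothetical, simplified stand-in for the EDK interface; the real signature may differ.
interface MetricsCollector {
    fun incrementCounter(name: String, tags: Map<String, String> = emptyMap())
}

// No-op default: a stateless singleton whose methods do nothing.
object NoopMetricsCollector : MetricsCollector {
    override fun incrementCounter(name: String, tags: Map<String, String>) = Unit
}

// A recording implementation standing in for an OTel-backed collector.
class RecordingMetricsCollector : MetricsCollector {
    val counts = mutableMapOf<String, Long>()

    override fun incrementCounter(name: String, tags: Map<String, String>) {
        counts[name] = (counts[name] ?: 0L) + 1
    }
}

// Library code depends only on the interface and instruments unconditionally.
class KeyService(private val metrics: MetricsCollector = NoopMetricsCollector) {
    fun generateKey(): String {
        metrics.incrementCounter("keys.generated")
        return "key-1"
    }
}
```

With this shape, KeyService() runs against the no-op singleton, while KeyService(RecordingMetricsCollector()) collects data without any change to the library code. In the EDK that swap is performed by dependency injection rather than by a constructor default.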

Distributed Tracing

Trace Context

Every trace starts with a TraceContext that carries a W3C-compliant trace ID and span ID:

data class TraceContext(
    val traceId: String,       // 32-char hex (128 bits)
    val spanId: String,        // 16-char hex (64 bits)
    val parentSpanId: String?,
    val traceFlags: Int = 0    // Sampled bit
)

The TraceContext can be serialized to/from the standard traceparent header (00-{traceId}-{spanId}-{flags}), which is how traces propagate across HTTP and gRPC boundaries. When a service receives a request with a traceparent header, it extracts the parent context and creates child spans under it, linking the entire request chain into a single trace.
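To make the header format concrete, here is a minimal sketch of formatting and parsing a version-00 traceparent value. The helper names toTraceparent and parseTraceparent are hypothetical and are not the EDK's API; TraceContext is redeclared (with a default for parentSpanId) only so the sketch is self-contained.

```kotlin
// Mirrors the TraceContext shape shown above; redeclared so the sketch is self-contained.
data class TraceContext(
    val traceId: String,
    val spanId: String,
    val parentSpanId: String? = null,
    val traceFlags: Int = 0
)

// version "00", 32 hex chars of trace ID, 16 hex chars of span ID, 2 hex chars of flags
val TRACEPARENT_PATTERN = Regex("^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")

// Hypothetical helper: format a context as a version-00 traceparent value.
fun toTraceparent(ctx: TraceContext): String {
    val flags = (ctx.traceFlags and 0xFF).toString(16).padStart(2, '0')
    return "00-${ctx.traceId}-${ctx.spanId}-$flags"
}

// Hypothetical helper: parse a header value, returning null when it is malformed.
fun parseTraceparent(header: String): TraceContext? {
    val match = TRACEPARENT_PATTERN.matchEntire(header.trim()) ?: return null
    val (traceId, spanId, flags) = match.destructured
    // W3C Trace Context treats all-zero trace and span IDs as invalid.
    if (traceId.all { it == '0' } || spanId.all { it == '0' }) return null
    return TraceContext(traceId, spanId, parentSpanId = null, traceFlags = flags.toInt(16))
}
```

A header such as 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01 parses into a context with the sampled bit set and round-trips through toTraceparent unchanged.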

Creating Spans

Spans represent units of work. Each span has a name, a parent (if it's not the root), attributes (key-value metadata), events (timestamped log entries), and a status (OK or ERROR):

val tracer = tracerProvider.getTracer("com.example.myservice")

val span = tracer.spanBuilder("process-credential")
    .setAttribute("credential.type", "VerifiableCredential")
    .setAttribute("tenant.id", tenantId)
    .startSpan()

try {
    span.makeCurrent().use {
        // All child operations inherit this span as parent
        val result = credentialService.process(input)
        span.setStatus(SpanStatus.OK, null)
    }
} catch (e: Exception) {
    span.recordException(e)
    span.setStatus(SpanStatus.ERROR, e.message)
    throw e
} finally {
    span.end()
}

The makeCurrent() call pushes the span onto the current context (thread-local on JVM, coroutine context element in suspend functions). Any child spans created within that scope automatically become children of this span. This is how a trace tree forms; you don't need to pass parent references manually.

Coroutine Integration

In Kotlin coroutine-based code (which is most EDK code), spans propagate via a SpanContextElement in the coroutine context:

val span = tracer.spanBuilder("background-job").startSpan()

launch(SpanContextElement(span)) {
    try {
        // This coroutine and all its children carry the span
        val metadata = LogCorrelation.currentTraceMetadata()
        // metadata = { "traceId": "abc...", "spanId": "def..." }
    } finally {
        span.end() // End the span once the job finishes
    }
}

This means suspend functions don't need to accept trace context as a parameter; it flows automatically through the coroutine hierarchy.

Cross-Service Propagation

When the command transport forwards a command to a remote service, the trace context must travel with it. The OtelContextPropagator handles this:

Outbound: before sending an HTTP RPC or gRPC request, the transport injects the current span's context into the request headers as a traceparent header.

Inbound: when receiving a request, the transport extracts the traceparent header and creates a child span under the received parent context.

This happens automatically in the transport layer. Application code doesn't need to manage trace propagation: it creates spans for its own work, and the transport ensures those spans are linked across service boundaries.

The result is an end-to-end trace that shows the full path of a request:

[Figure: Distributed Trace Waterfall]

Each service contributes its spans to the same trace, and tools like Jaeger, Zipkin, or Grafana Tempo render the full tree.
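The outbound and inbound steps can be sketched with a plain header map as the carrier. This is an illustration only, not the OtelContextPropagator implementation: Ctx, newSpanId, inject, and extractChild are hypothetical names, and a real transport would use its own request and response types instead of a map.

```kotlin
import kotlin.random.Random

// Hypothetical minimal context; the EDK's TraceContext carries equivalent fields.
data class Ctx(
    val traceId: String,
    val spanId: String,
    val parentSpanId: String?,
    val sampled: Boolean
)

// Generate a random non-zero 64-bit span ID as 16 lowercase hex characters.
fun newSpanId(random: Random = Random.Default): String {
    var id: Long
    do { id = random.nextLong() } while (id == 0L)
    return id.toULong().toString(16).padStart(16, '0')
}

// Outbound: write the current span's context into the request headers.
fun inject(ctx: Ctx, headers: MutableMap<String, String>) {
    val flags = if (ctx.sampled) "01" else "00"
    headers["traceparent"] = "00-${ctx.traceId}-${ctx.spanId}-$flags"
}

// Inbound: read the caller's context and start a child context under it.
// Returns null when the header is missing or malformed.
fun extractChild(headers: Map<String, String>): Ctx? {
    val parts = headers["traceparent"]?.split("-") ?: return null
    if (parts.size != 4 || parts[1].length != 32 || parts[2].length != 16) return null
    val flags = parts[3].toIntOrNull(16) ?: return null
    return Ctx(
        traceId = parts[1],          // same trace as the caller
        spanId = newSpanId(),        // fresh ID for the local span
        parentSpanId = parts[2],     // the caller's span becomes the parent
        sampled = (flags and 1) == 1
    )
}
```

The key property is that the trace ID is carried through unchanged while each hop gets a fresh span ID whose parent is the caller's span, which is what lets a trace viewer assemble the tree.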

Metrics

The MetricsCollector provides three instrument types:

Counters track how many times something happens, such as command executions, authorization denials, cache hits/misses, or HTTP requests by status code:

metricsCollector.incrementCounter(
    "command.executed",
    mapOf("command" to "kms.keys.generate", "result" to "success")
)

Histograms track the distribution of values, typically latencies such as command execution time, PDP evaluation time, or database query time. Histograms give you percentiles (p50, p95, p99) rather than just averages:

metricsCollector.recordHistogram(
    "command.duration_ms",
    durationMs.toDouble(),
    mapOf("command" to "kms.keys.generate")
)

Gauges track current values that go up and down, such as active connections, queue depth, or cache size:

metricsCollector.setGauge(
    "connections.active",
    connectionPool.activeCount.toDouble(),
    mapOf("database" to "primary")
)

All instruments support tags (key-value pairs) for dimensional filtering. The OTel implementation creates instruments lazily on first use and caches them for thread-safe reuse.
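Lazy creation with thread-safe caching can be sketched on the JVM with ConcurrentHashMap.computeIfAbsent. This is an assumption about how such a cache might look, not the OTel module's actual code; CounterInstrument and InstrumentCache are hypothetical names.

```kotlin
import java.util.concurrent.ConcurrentHashMap
import java.util.concurrent.atomic.AtomicLong

// Hypothetical stand-in for an OTel counter instrument.
class CounterInstrument(val name: String) {
    private val value = AtomicLong()

    fun add(delta: Long) {
        value.addAndGet(delta)
    }

    fun current(): Long = value.get()
}

// Creates each instrument on first use and reuses it afterwards.
class InstrumentCache {
    private val counters = ConcurrentHashMap<String, CounterInstrument>()

    // computeIfAbsent runs the factory at most once per name, even under contention.
    fun counter(name: String): CounterInstrument =
        counters.computeIfAbsent(name) { CounterInstrument(it) }
}
```

Two calls to counter("command.executed") return the same object, so increments from different threads accumulate in one instrument rather than in duplicates.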

Exporting Metrics

The OTel SDK supports multiple metrics exporters:

| Exporter | Protocol | Use case |
|---|---|---|
| OTLP | gRPC or HTTP | OpenTelemetry Collector, Grafana Cloud, Datadog |
| Prometheus | Pull-based scrape | Prometheus server, Grafana |
| Logging | stdout | Development, debugging |

The exporter is configured via environment variables (see Configuration); your application code doesn't know or care where metrics go.

Log Correlation

Traces and logs serve different purposes, but they're most useful when connected. The LogCorrelation utility extracts trace metadata that you can inject into your structured logs:

val metadata = LogCorrelation.currentTraceMetadata()
// Returns: { "traceId": "abc123...", "spanId": "def456...", "parentSpanId": "..." }
// Returns null if no active trace

In practice, this means you can configure your logging framework to include trace IDs in every log line. When a user reports a problem, you find their trace ID in the API response headers, search your log aggregator for that trace ID, and see every log line from every service that handled their request, without needing to know which services were involved.

The mergeTraceMetadata helper merges trace fields into existing metadata maps, which is useful when building audit events or structured log entries that already carry other context.
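As a rough idea of what such a merge involves, the sketch below combines trace fields with an existing metadata map. The function name mergeTraceFields, its signature, and the choice to let existing keys win on conflict are all assumptions for illustration; the EDK helper may behave differently.

```kotlin
// Hypothetical sketch of a metadata merge; not the EDK's mergeTraceMetadata signature.
// `trace` is null when there is no active trace, matching currentTraceMetadata().
fun mergeTraceFields(
    existing: Map<String, String>,
    trace: Map<String, String>?
): Map<String, String> =
    // For Map + Map, entries on the right-hand side override, so existing keys win here.
    if (trace == null) existing else trace + existing
```

For an audit event carrying mapOf("tenant" to "acme") and trace metadata with traceId and spanId, the result contains all three keys; with no active trace, the original map is returned unchanged.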

Command Integration

The telemetry system integrates with the EDK's command lifecycle through the same extension mechanism used by authorization and audit. A telemetry extension can:

  • Before execution: create a span named after the command, set attributes for the command ID, tenant, and principal
  • After execution: set the span status (OK/ERROR), record the duration, increment success/failure counters
  • On error: record the exception on the span for error analysis

This gives you automatic tracing and metrics for every command without instrumenting individual command implementations.
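The three hooks listed above can be sketched as a wrapper around a command body. This is not the EDK's extension interface, which this page does not show: TelemetryRecorder and withCommandTelemetry are hypothetical, and the recorder simply logs event strings where a real extension would call Span and MetricsCollector.

```kotlin
import kotlin.time.TimeSource

// Hypothetical recorder standing in for Span and MetricsCollector calls.
class TelemetryRecorder {
    val events = mutableListOf<String>()
    var lastDurationMs: Long = -1

    fun record(event: String) {
        events += event
    }
}

// Wraps a command body with before, after, and on-error telemetry hooks.
fun <T> withCommandTelemetry(
    commandId: String,
    recorder: TelemetryRecorder,
    body: () -> T
): T {
    recorder.record("span.start:$commandId")              // before execution
    val started = TimeSource.Monotonic.markNow()
    try {
        val result = body()
        recorder.record("status.ok:$commandId")           // after execution, success
        return result
    } catch (e: Exception) {
        recorder.record("exception:${e::class.simpleName}") // on error
        recorder.record("status.error:$commandId")
        throw e                                            // never swallow the failure
    } finally {
        // A real extension would feed this into a histogram such as command.duration_ms.
        recorder.lastDurationMs = started.elapsedNow().inWholeMilliseconds
        recorder.record("span.end:$commandId")
    }
}
```

Because the wrapper rethrows, the command's error handling is unchanged; telemetry only observes the outcome.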

Configuration

The telemetry system is configured through TelemetryConfig and standard OpenTelemetry environment variables.

TelemetryConfig

data class TelemetryConfig(
    val enabled: Boolean = true,
    val serviceName: String = "sphereon-vdx",
    val serviceVersion: String = "0.14.0",
    val tracingEnabled: Boolean = true,
    val metricsEnabled: Boolean = true,
    val samplingRate: Double = 1.0 // 0.0 to 1.0
)

| Property | Default | Description |
|---|---|---|
| enabled | true | Master switch. When false, returns OpenTelemetry.noop() (zero overhead). |
| serviceName | sphereon-vdx | Identifies your service in trace visualizers and metric dashboards |
| serviceVersion | 0.14.0 | Version tag for distinguishing deployments |
| tracingEnabled | true | Enable/disable trace collection independently of metrics |
| metricsEnabled | true | Enable/disable metric collection independently of tracing |
| samplingRate | 1.0 | Probability that a trace is sampled. 1.0 = all traces, 0.1 = 10% of traces. Lower rates reduce overhead in high-throughput services. |
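To illustrate what a sampling probability means in practice, here is a sketch of a deterministic trace-ID-ratio decision. It is modeled loosely on the idea behind ratio-based samplers and is not necessarily how the EDK or the OTel SDK computes the decision; shouldSample is a hypothetical function.

```kotlin
// Sketch of a trace-ID-ratio decision: the same trace ID always yields the same answer,
// so every service that evaluates the trace agrees on whether it is sampled.
fun shouldSample(traceId: String, samplingRate: Double): Boolean {
    require(traceId.length == 32) { "traceId must be 32 hex characters" }
    require(samplingRate in 0.0..1.0) { "samplingRate must be between 0.0 and 1.0" }
    if (samplingRate >= 1.0) return true
    if (samplingRate <= 0.0) return false

    // Interpret the low 64 bits of the trace ID as an unsigned number.
    val low = traceId.substring(16).toULong(16)
    // Drop the lowest bit so the value fits in a non-negative Long.
    val value = (low shr 1).toLong()
    // Trace IDs whose value falls below this threshold are sampled.
    val threshold = (samplingRate * Long.MAX_VALUE.toDouble()).toLong()
    return value < threshold
}
```

Since trace IDs are effectively random, a rate of 0.1 keeps roughly one trace in ten, and the decision does not depend on which service makes it.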

OpenTelemetry Environment Variables

The OTel SDK reads standard environment variables for exporter configuration:

# Service identity
OTEL_SERVICE_NAME=my-service
OTEL_SERVICE_VERSION=1.0.0

# Trace export
OTEL_TRACES_EXPORTER=otlp # otlp, logging, zipkin, none
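OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector:4318
OTEL_EXPORTER_OTLP_PROTOCOL=http/protobuf # 4318 is the OTLP HTTP port; set this if your SDK defaults to gRPC (port 4317)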

# Metrics export
OTEL_METRICS_EXPORTER=otlp # otlp, logging, prometheus, none

# Sampling
OTEL_TRACES_SAMPLER=parentbased_traceidratio
OTEL_TRACES_SAMPLER_ARG=0.1 # 10% sampling

For development, OTEL_TRACES_EXPORTER=logging writes traces to stdout, which is useful for verifying instrumentation without deploying a collector.

Production Setup

A typical production setup routes telemetry to an OpenTelemetry Collector, which then forwards to your observability platform:

# docker-compose
services:
  otel-collector:
    image: otel/opentelemetry-collector:latest
    ports:
      - "4318:4318" # OTLP HTTP
      - "4317:4317" # OTLP gRPC
    volumes:
      # Recent collector images read their config from /etc/otelcol/config.yaml by default
      - ./otel-config.yaml:/etc/otelcol/config.yaml

  app:
    environment:
      - OTEL_SERVICE_NAME=my-service
      - OTEL_TRACES_EXPORTER=otlp
      - OTEL_METRICS_EXPORTER=otlp
      - OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector:4318

The collector can export to Jaeger, Zipkin, Grafana Tempo, Datadog, New Relic, or any OTLP-compatible backend.

Modules

| Module | Platform | Description |
|---|---|---|
| lib-telemetry-public | Kotlin Multiplatform | Platform-agnostic interfaces (TracerProvider, MetricsCollector, TraceContext, LogCorrelation), no-op defaults |
| lib-telemetry-otel | JVM only | OpenTelemetry SDK adapter (OTel API 1.44.1, SDK auto-configuration, OTLP + logging exporters) |