Monitoring and Observability

The eduID Wallet Matching Portal provides several mechanisms for monitoring the health and operational status of the system. Because the system handles sensitive identity data, traditional application performance monitoring (APM) approaches must be adapted to avoid inadvertently logging or exposing personal information. The monitoring strategy relies on aggregate counts, event types, and operational metadata rather than data content.

This page covers the key monitoring areas: audit event analysis, database-level health metrics, background job monitoring, key migration tracking, and service health indicators.

Audit Events

The audit_event table is the primary source of operational visibility into the system. Every significant operation -- identity lookups, binding creation, reconciliation flows, GDPR erasure requests, and key migrations -- generates an audit event. Each event includes a correlation_id that links related events across a single user flow, enabling end-to-end tracing of authentication and reconciliation operations. These events provide a structured, queryable record of system activity without exposing personal data.

For the complete event type reference, structure details, and querying patterns, see the Audit Trail page.

Key Metrics from Audit Events

Audit events can be aggregated to derive several important operational metrics. The following table lists the most useful metrics, how to compute them, and what thresholds should trigger alerts.

| Metric | Query Approach | Normal Range | Alert Threshold |
| --- | --- | --- | --- |
| Reconciliation rate | Count idv_completed events per hour | Varies by deployment | Sudden spike may indicate an automated attack or misconfigured client |
| Reconciliation completion rate | Ratio of idv_completed to idv_initiated | 80-95% | Sustained drop below 70% indicates upstream provider issues or UX problems |
| Authentication failures | Count session_failed events per hour | Near zero | Sustained increase needs investigation; high volume may indicate credential stuffing |
| GDPR erasures | Count gdpr_erasure events per day | Low, sporadic | Unexpected bulk erasures require immediate investigation |
| External API usage | Count external_api_lookup events by client_id per hour | Varies per client | Unusual patterns or unknown client IDs warrant review |
| Binding creation rate | Count binding_created events per day | Proportional to reconciliation rate | Divergence from reconciliation rate suggests errors in binding logic |
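
The completion-rate check from the table can be sketched in a few lines. This is a minimal illustration in Python, assuming the per-window event counts have already been aggregated by the logging pipeline. The function names, the dictionary shape, and the choice of three consecutive windows as the meaning of "sustained" are assumptions made for the example, not part of the portal.

```python
from typing import Mapping, Optional, Sequence

ALERT_THRESHOLD = 0.70  # from the table: sustained drop below 70%


def completion_rate(counts: Mapping[str, int]) -> Optional[float]:
    """Return idv_completed / idv_initiated for one window, or None if idle."""
    initiated = counts.get("idv_initiated", 0)
    if initiated == 0:
        return None  # no flows were started, so the ratio is undefined
    return counts.get("idv_completed", 0) / initiated


def sustained_drop(windows: Sequence[Mapping[str, int]], min_windows: int = 3) -> bool:
    """True when the most recent min_windows windows are all below the threshold."""
    recent = list(windows)[-min_windows:]
    if len(recent) < min_windows:
        return False  # not enough history to call the drop sustained
    rates = [completion_rate(w) for w in recent]
    return all(r is not None and r < ALERT_THRESHOLD for r in rates)
```

Requiring several consecutive windows keeps a single quiet hour from paging anyone; the window length and count should be tuned per deployment.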

Monitoring Approach

Audit events are best consumed through a log aggregation pipeline. The recommended approach is to forward audit events (or the application logs that record them) to a centralized logging system such as Elasticsearch/Kibana, Azure Monitor, or Grafana Loki. This enables:

  • Real-time dashboards showing event volume by type
  • Alerting rules triggered by abnormal event patterns
  • Historical analysis for compliance reporting and incident investigation

Capacity Monitoring via Database Count Queries

The Auth Bridge database includes several named SQLDelight queries designed specifically for monitoring purposes. These queries return aggregate counts rather than individual records, making them safe to expose to monitoring systems without risk of data leakage.

Available Count Queries

| Query | Table | Description | Recommended Check Interval |
| --- | --- | --- | --- |
| countMatches | identity_match | Returns the total number of active (non-deleted) identity match records for a given tenant. | Daily |
| countAllBindings | identity_link_binding | Returns the total number of identity link binding records for a given tenant. | Daily |
| countInactiveBindings | identity_link_binding | Returns the number of bindings where last_used_at is older than the inactivity threshold. | Weekly |
| countAllAux | auxiliary_data | Returns the total number of auxiliary data records for a given tenant. | Daily |

How to Use These Metrics

These counts serve several important operational purposes:

  • Capacity planning: Track the growth rate of each table over time to project storage requirements and database performance characteristics. If the identity match table is growing by 1,000 records per month, you can extrapolate when index sizes will require attention.
  • Cleanup validation: After the inactive binding cleanup job runs, the countInactiveBindings value should decrease (or remain stable if new bindings are aging into the inactive window at the same rate they are being cleaned up). If this number only ever increases, the cleanup job may not be running.
  • Tenant monitoring: Compare counts across tenants to detect disproportionate growth. A tenant with 10x the bindings of other tenants may need investigation (is this legitimate growth, or a misconfigured client creating duplicate bindings?).
  • Ratio analysis: The ratio of countAllBindings to countMatches indicates how many institutional links each identity has on average. A ratio of 1.0 means one binding per identity; a ratio far above the deployment's usual baseline may indicate that some identities have an unusually large number of bindings.
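
The capacity-planning and ratio calculations above reduce to simple arithmetic. The Python sketch below assumes the count values have already been collected by a monitoring job; the function names and any row limit passed in are hypothetical and would need to be chosen per deployment.

```python
import math


def months_until_limit(current_rows: int, rows_per_month: int, limit_rows: int) -> int:
    """Whole months until the table reaches limit_rows at a constant growth rate."""
    if rows_per_month <= 0:
        raise ValueError("rows_per_month must be positive")
    remaining = max(0, limit_rows - current_rows)
    return math.ceil(remaining / rows_per_month)


def bindings_per_identity(all_bindings: int, matches: int) -> float:
    """Average bindings per identity: countAllBindings / countMatches."""
    if matches == 0:
        return 0.0  # no identities yet, so report zero rather than divide by zero
    return all_bindings / matches
```

For example, a table holding 40,000 rows that grows by 1,000 rows per month reaches a hypothetical review limit of 100,000 rows after 60 months. The linear model is a deliberate simplification; real growth is rarely constant.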

Dashboard Recommendations

Recommended dashboard panels for a monitoring system:

  • Total identities (from countMatches): Line chart over time, per tenant.
  • Total bindings (from countAllBindings): Line chart over time, per tenant.
  • Inactive binding ratio (countInactiveBindings / countAllBindings): Gauge showing what percentage of bindings are older than the inactivity threshold.
  • Auxiliary data volume (from countAllAux): Bar chart per tenant.

Background Job Monitoring

The Auth Bridge runs several background jobs that perform essential maintenance tasks. Each job should be monitored to ensure it is running on schedule and completing successfully.

Session Cleanup (Every 5 Minutes)

The session cleanup job removes expired reconciliation sessions from the reconciliation_session table. Reconciliation sessions are inherently ephemeral (typical TTL is 5 minutes), so expired sessions should be cleaned up promptly to prevent the table from growing unboundedly and to minimize the window during which sensitive temporary data (PKCE code verifiers, OIDC nonces) exists in the database.

| Indicator | Normal Range | Concern |
| --- | --- | --- |
| Sessions deleted per run | 0-50 | Hundreds of expired sessions per cleanup cycle may indicate that the session TTL is too short, that users are abandoning sessions frequently, or that an automated process is creating sessions without completing them. |
| Time since last cleanup | 0-10 minutes | If the cleanup job has not run for significantly longer than the configured interval (default 5 minutes), the background scheduler may be stuck, the service may be under memory pressure, or the JVM may be in a prolonged garbage collection pause. |
| Total active sessions | 0-100 | A very large number of active (non-expired) sessions may indicate a session creation leak or an ongoing denial-of-service attack. |
| Accumulated expired sessions | Near zero between runs | If the count of expired sessions grows despite cleanup running, the job may be failing silently or the session creation rate exceeds cleanup capacity. The query findExpiredSessions with the current timestamp should return zero or near-zero results between cleanup runs. |

The session cleanup job logs the number of deleted sessions at the INFO level after each run. Monitor application logs for messages like:

Session cleanup completed: deleted 12 expired sessions
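
A log-based check could extract the deleted count from that message. The Python sketch below assumes the message format matches the example line exactly, which may not hold for every version or logging configuration. The function names are illustrative, and the default upper bound of 50 is taken from the normal range listed for sessions deleted per run.

```python
import re
from typing import Optional

# Pattern assumed from the example log line; adjust if the real format differs.
_CLEANUP_LINE = re.compile(r"Session cleanup completed: deleted (\d+) expired sessions")


def deleted_sessions(log_line: str) -> Optional[int]:
    """Return the deleted-session count from a cleanup log line, or None."""
    match = _CLEANUP_LINE.search(log_line)
    return int(match.group(1)) if match else None


def above_normal_range(deleted: int, upper_bound: int = 50) -> bool:
    """True when a single run deleted more sessions than the normal range allows."""
    return deleted > upper_bound
```

Using `search` rather than `match` lets the pattern tolerate a timestamp or log-level prefix in front of the message.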

Inactive Binding Cleanup (Every 60 Minutes)

The retention cleanup job scans for identity link bindings that have not been accessed within the configured inactivity threshold (default: 730 days / 2 years) and soft-deletes them. It also performs hard deletes on records that have been soft-deleted for longer than the retention period (default: 30 days).

| Indicator | Normal Range | Concern |
| --- | --- | --- |
| Bindings soft-deleted per run | 0-10 | A large number of soft-deletions in a single run may indicate a mass inactivity event (e.g., an institution discontinuing the service) or a recent configuration change that lowered the inactivity threshold. |
| Hard deletions per run | 0-10 | Hard deletions remove data permanently. A sudden increase may indicate that a batch of soft-deleted records from a previous bulk operation has reached the end of the retention period. |
| Soft-delete backlog | Low and stable | If soft-deleted records are accumulating without being hard-deleted, the hard-delete job may not be running or may be encountering errors. |

The cleanup job logs its activity at the INFO level:

Inactive binding cleanup: soft-deleted 3 bindings, hard-deleted 7 past-retention records

Auxiliary Data Expiration

Expired auxiliary data (records where expires_at is in the past) is cleaned up alongside the general cleanup jobs. Monitor the findExpiredAux results to confirm that these records are removed promptly; a result set that keeps growing suggests the cleanup is not running. Auxiliary data with no expires_at value is retained indefinitely (subject to identity-level GDPR erasure).

Key Migration Tracking

Key rotation is a critical security operation, and monitoring its progress is essential to ensure that all records are successfully migrated to the new key version. The key_migration_history table provides detailed status and progress information for each migration.

Migration Status Values

| Status | Description | Expected Duration |
| --- | --- | --- |
| IN_PROGRESS | The migration is currently running. Records are being re-hashed or re-encrypted with the new key version. | Minutes to hours, depending on data volume. |
| COMPLETED | The migration finished successfully. All records have been migrated to the new key version. | Terminal state. |
| FAILED | The migration encountered an error that prevented completion. Check records_failed for the count of unmigrated records. | Terminal state. Requires investigation. |
| ROLLED_BACK | The migration was rolled back to the previous key version, either automatically on failure or manually by an operator. | Terminal state. |

Migration Progress Counters

| Counter | Description | What to Watch |
| --- | --- | --- |
| records_processed | Number of records successfully re-hashed or re-encrypted. | Should increase steadily during an IN_PROGRESS migration. Stalling indicates a problem. |
| records_failed | Number of records that could not be migrated due to errors. | Any non-zero value requires immediate investigation. Failed records remain on the old key version and may become inaccessible if the old key is decommissioned. |
| records_skipped | Number of records skipped because they were already at the target key version. | A high skip count is normal if the migration was restarted after a partial completion. |
| records_purged | Number of inactive records deleted instead of migrated. | These are soft-deleted records past their retention period that were cleaned up opportunistically during migration. A high purge count is acceptable. |
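
One way to act on these counters is to compare two successive readings of the same migration row. The Python sketch below is an illustration under assumptions: the `MigrationSnapshot` type and `migration_findings` function are hypothetical names, and how the rows are read from key_migration_history is left out.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class MigrationSnapshot:
    """One reading of a key_migration_history row (hypothetical shape)."""
    status: str
    records_processed: int
    records_failed: int


def migration_findings(previous: MigrationSnapshot, current: MigrationSnapshot) -> List[str]:
    """Compare two successive snapshots and list anything that needs attention."""
    findings = []
    if current.status == "FAILED":
        findings.append("migration FAILED")
    if current.records_failed > 0:
        findings.append("records_failed > 0")
    stalled = current.records_processed <= previous.records_processed
    if current.status == "IN_PROGRESS" and stalled:
        findings.append("records_processed stalled")
    return findings
```

The interval between snapshots should be generous. A migration restarted after partial completion may spend some time skipping already-migrated records, during which records_processed can legitimately stay flat.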

Alerting Rules for Key Migrations

The following alerting rules should be configured for any environment that performs key rotations:

  • Alert on FAILED status: Any migration that enters the FAILED state should trigger an immediate alert to the operations team. The old key must be retained in the KMS until the failure is resolved and the migration is retried or completed manually.
  • Alert on IN_PROGRESS exceeding expected duration: Set a timeout alert based on the expected migration duration for your data volume. A general guideline is to allow 1 second per 100 records; a migration of 100,000 records should therefore complete within roughly 17 minutes (1,000 seconds). A migration that exceeds 2x the expected duration may need manual intervention.
  • Alert on records_failed greater than zero: Even a single failed record means that some data was not migrated. This record will remain on the old key version, and if that key is decommissioned, the data will be unrecoverable.
  • Alert if migration has not started after key configuration change: If the configuration is updated with a new key version but no migration record appears in key_migration_history, the migration may not have been triggered.

Database Connection Pool Monitoring

The Auth Bridge maintains a connection pool to PostgreSQL (default max-pool-size: 5). Connection pool health is a critical factor in overall service performance, as pool exhaustion is one of the most common causes of service degradation in database-backed applications.

Key Metrics

| Metric | Description | Concern Threshold |
| --- | --- | --- |
| Active connections | Number of connections currently checked out from the pool and in use. | Sustained values near max-pool-size indicate the pool may be undersized for the workload. |
| Idle connections | Number of connections waiting in the pool, available for use. | Zero idle connections under load indicates all connections are in use and new requests must wait. |
| Connection wait time | Time a database request waits for a connection to become available in the pool. | Any non-zero wait time indicates pool contention. Sustained wait times directly degrade response latency. |
| Connection creation failures | Failed attempts to establish new connections to PostgreSQL. | Any failure indicates a PostgreSQL connectivity issue (network, authentication, max_connections exceeded). |
| Slow queries | Queries taking longer than expected to execute. | May indicate missing indexes, lock contention, or resource exhaustion on the PostgreSQL server. Particularly watch for slow queries during key migration. |

Pool Sizing Guidelines

For production deployments, size the connection pool based on:

  • Expected concurrent reconciliation sessions: Each active session may hold a connection during database operations (session creation, status update, binding write).
  • External API request concurrency: Each API request requires a connection for the identity lookup query.
  • Background job connections: Session cleanup, inactive binding cleanup, and key migration each require connections when they run.

A common rule of thumb is to set the pool size to 2-3x the expected peak concurrent request count. However, never set it higher than the PostgreSQL max_connections setting minus a buffer for administrative connections and connections from other services (such as the STS).
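
The sizing rule can be written as a clamp between the desired size and the server-side ceiling. The Python sketch below is only an illustration: the function and parameter names are hypothetical, and `reserved_connections` stands for the buffer kept for administrative connections and other services.

```python
def recommended_pool_size(
    peak_concurrent_requests: int,
    max_connections: int,
    reserved_connections: int,
    multiplier: int = 2,
) -> int:
    """Pool size of multiplier x peak concurrency, capped by what PostgreSQL can spare."""
    ceiling = max_connections - reserved_connections
    if ceiling < 1:
        raise ValueError("reserved_connections leaves no room for the pool")
    desired = peak_concurrent_requests * multiplier
    return max(1, min(desired, ceiling))
```

With a peak of 10 concurrent requests, max_connections of 100, and 20 connections reserved, the result is 20. With a peak of 50 and a multiplier of 3, the desired size of 150 is capped at the ceiling of 70 when 30 connections are reserved.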

Health Check Endpoints

Each service exposes a health check endpoint that can be used for monitoring and orchestration. These endpoints are lightweight and designed to be polled frequently without impacting service performance.

Service Health Summary

| Service | Endpoint | Checks Performed | Healthy Response |
| --- | --- | --- | --- |
| Portal (Frontend) | GET /api/health | Next.js server running, API routes accessible | 200 OK with {"status": "ok"} |
| STS | GET /.well-known/openid-configuration | JVM running, OIDC discovery document can be generated | 200 OK with the OIDC discovery document |
| Auth Bridge | GET /health | JVM running, database accessible, KMS accessible | 200 OK with readiness details |
| PostgreSQL | TCP connection on port 5432 | PostgreSQL accepting connections | Connection accepted |

Auth Bridge Health Details

The Auth Bridge health endpoint provides additional detail about the readiness of its subsystems:

| Subsystem | Healthy When | Unhealthy Indicator |
| --- | --- | --- |
| Database | Connection pool has available connections and a test query succeeds. | Connection failures, query timeouts, pool exhaustion. |
| KMS | The configured KMS provider responds to a test operation. | KMS connectivity failure, authentication failure, key not found. |
| Session Cleanup | The background job has run within the last 2x the configured interval. | Job has not run for an extended period, indicating the scheduler may be stuck. |
| Inactive Binding Cleanup | The background job has run within the last 2x the configured interval. | Job has not run for an extended period. |
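
The staleness rule for the two background jobs (healthy if the job has run within 2x its configured interval) can be expressed as a single comparison. The Python sketch below shows the rule only; the function name is hypothetical and it is not a description of how the Auth Bridge implements the check internally.

```python
from datetime import datetime, timedelta


def job_is_healthy(last_run: datetime, interval: timedelta, now: datetime) -> bool:
    """True when the job last ran within twice its configured interval."""
    return now - last_run <= 2 * interval
```

With the default 5-minute session cleanup interval, a last run 8 minutes ago counts as healthy and a last run 11 minutes ago does not. Passing `now` explicitly keeps the function easy to test and avoids mixing time zones inside the check.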

Verifying Health After Startup

After starting the services, verify that all health checks pass:

# Check portal health
curl -s http://localhost:3000/api/health | jq .

# Check STS health (via OIDC discovery)
curl -s http://localhost:8092/.well-known/openid-configuration | jq .

# Check Auth Bridge health
curl -s http://localhost:8090/health | jq .

# Check PostgreSQL health
pg_isready -h localhost -p 5432 -U portal

Integration with Orchestrators

In Docker Compose, health checks are configured directly in the docker-compose.yml file. Docker uses these to determine when a service is ready and to manage startup ordering of dependent services.

In Kubernetes, configure liveness and readiness probes on each pod:

  • Liveness probe: Determines whether the container should be restarted. Use for unrecoverable failures (e.g., deadlocked threads, corrupted state).
  • Readiness probe: Determines whether the container should receive traffic. Use for temporary conditions (e.g., database connection issues, KMS initialization still in progress).

Operational Runbooks

The following runbooks provide step-by-step guidance for investigating common operational scenarios.

High Reconciliation Volume

If reconciliation events spike unexpectedly:

  1. Check audit events for idv_initiated and idv_completed patterns. Is the spike correlated with a specific tenant or client?
  2. Verify the reconciliation selector rules have not changed recently (a misconfiguration could cause unnecessary re-reconciliation).
  3. Check if a key rotation caused bindings to become unresolvable, triggering mass re-reconciliation.
  4. Review external API usage for unusual patterns that might be driving reconciliation requests.
  5. If the spike appears malicious, consider rate-limiting at the load balancer or temporarily disabling the affected provider.

Expired Session Accumulation

If expired sessions are accumulating in the database:

  1. Verify the cleanup job is running by checking application logs for cleanup execution messages.
  2. Check database connectivity -- the cleanup job requires write access to delete expired sessions.
  3. Verify the session-cleanup.interval-minutes configuration has not been accidentally set to an extremely high value.
  4. Check for database lock contention that might prevent the DELETE operations from completing.
  5. As a last resort, manually delete expired sessions with a direct SQL query, restricting the DELETE to rows whose expiry timestamp is already in the past so that active sessions are not removed.

Key Migration Stalled

If a key migration stops progressing:

  1. Check the key_migration_history table for the current status and records_failed count.
  2. Review application logs for migration-related error messages (search for the migration version number).
  3. Verify that both the old and new keys are accessible in the KMS. A common cause of stalled migrations is that the old key was decommissioned before the migration completed.
  4. Check database connection pool availability. The migration process uses connections from the same pool as normal operations, so a heavily loaded system may starve the migration of connections.
  5. If the migration is stuck on specific records, investigate those records individually. They may have corrupted data or reference a key version that no longer exists in the KMS.