Monitoring and Observability
The eduID Wallet Matching Portal provides several mechanisms for monitoring the health and operational status of the system. Because the system handles sensitive identity data, traditional application performance monitoring (APM) approaches must be adapted to avoid inadvertently logging or exposing personal information. The monitoring strategy relies on aggregate counts, event types, and operational metadata rather than data content.
This page covers the key monitoring areas: audit event analysis, database-level health metrics, background job monitoring, key migration tracking, and service health indicators.
Audit Events
The audit_event table is the primary source of operational visibility into the system. Every significant operation -- identity lookups, binding creation, reconciliation flows, GDPR erasure requests, and key migrations -- generates an audit event. Each event includes a correlation_id that links related events across a single user flow, enabling end-to-end tracing of authentication and reconciliation operations. These events provide a structured, queryable record of system activity without exposing personal data.
For the complete event type reference, structure details, and querying patterns, see the Audit Trail page.
Key Metrics from Audit Events
Audit events can be aggregated to derive several important operational metrics. The following table lists the most useful metrics, how to compute them, and what thresholds should trigger alerts.
| Metric | Query Approach | Normal Range | Alert Threshold |
|---|---|---|---|
| Reconciliation rate | Count idv_completed events per hour | Varies by deployment | Sudden spike may indicate an automated attack or misconfigured client |
| Reconciliation completion rate | Ratio of idv_completed to idv_initiated | 80-95% | Sustained drop below 70% indicates upstream provider issues or UX problems |
| Authentication failures | Count session_failed events per hour | Near zero | Sustained increase needs investigation; high volume may indicate credential stuffing |
| GDPR erasures | Count gdpr_erasure events per day | Low, sporadic | Unexpected bulk erasures require immediate investigation |
| External API usage | Count external_api_lookup events by client_id per hour | Varies per client | Unusual patterns or unknown client IDs warrant review |
| Binding creation rate | Count binding_created events per day | Proportional to reconciliation rate | Divergence from reconciliation rate suggests errors in binding logic |
Monitoring Approach
Audit events are best consumed through a log aggregation pipeline. The recommended approach is to forward audit events (or the application logs that record them) to a centralized logging system such as Elasticsearch/Kibana, Azure Monitor, or Grafana Loki. This enables:
- Real-time dashboards showing event volume by type
- Alerting rules triggered by abnormal event patterns
- Historical analysis for compliance reporting and incident investigation
Capacity Monitoring via Database Count Queries
The Auth Bridge database includes several named SqlDelight queries designed specifically for monitoring purposes. These queries return aggregate counts rather than individual records, making them safe to expose to monitoring systems without risk of data leakage.
Available Count Queries
| Query | Table | Description | Recommended Check Interval |
|---|---|---|---|
| countMatches | identity_match | Returns the total number of active (non-deleted) identity match records for a given tenant. | Daily |
| countAllBindings | identity_link_binding | Returns the total number of identity link binding records for a given tenant. | Daily |
| countInactiveBindings | identity_link_binding | Returns the number of bindings where last_used_at is older than the inactivity threshold. | Weekly |
| countAllAux | auxiliary_data | Returns the total number of auxiliary data records for a given tenant. | Daily |
How to Use These Metrics
These counts serve several important operational purposes:
- Capacity planning: Track the growth rate of each table over time to project storage requirements and database performance characteristics. If the identity match table is growing by 1,000 records per month, you can extrapolate when index sizes will require attention.
- Cleanup validation: After the inactive binding cleanup job runs, the countInactiveBindings value should decrease (or remain stable if new bindings are aging into the inactive window at the same rate they are being cleaned up). If this number only ever increases, the cleanup job may not be running.
- Tenant monitoring: Compare counts across tenants to detect disproportionate growth. A tenant with 10x the bindings of other tenants may need investigation (is this legitimate growth, or a misconfigured client creating duplicate bindings?).
- Ratio analysis: The ratio of countAllBindings to countMatches indicates how many institutional links each identity has on average. A ratio significantly greater than 1.0 may indicate that some identities have an unusually large number of bindings.
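The checks in this list reduce to simple arithmetic on the sampled counts. The following is a minimal sketch, assuming the counts have already been collected into plain integers; all function names, and the notion of a row-count limit at which index sizes are reviewed, are hypothetical.

```python
from typing import Optional, Sequence


def bindings_per_identity(all_bindings: int, matches: int) -> Optional[float]:
    """Average bindings per identity (countAllBindings / countMatches)."""
    if matches <= 0:
        return None
    return all_bindings / matches


def only_increases(samples: Sequence[int]) -> bool:
    """True when every sample is larger than the one before it.

    Applied to successive countInactiveBindings readings, this is the
    "cleanup job may not be running" signal.
    """
    return len(samples) >= 2 and all(b > a for a, b in zip(samples, samples[1:]))


def months_until(current: int, monthly_growth: int, limit: int) -> Optional[float]:
    """Linear extrapolation of when a table reaches `limit` rows."""
    if monthly_growth <= 0:
        return None  # no growth, so the limit is never reached
    if current >= limit:
        return 0.0
    return (limit - current) / monthly_growth
```

The extrapolation is deliberately linear; if growth is seasonal (for example, tied to enrolment periods), a longer sampling history gives a more realistic estimate.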
Dashboard Recommendations
Recommended dashboard panels for a monitoring system:
- Total identities (from countMatches): Line chart over time, per tenant.
- Total bindings (from countAllBindings): Line chart over time, per tenant.
- Inactive binding ratio (countInactiveBindings / countAllBindings): Gauge showing what percentage of bindings are past the inactivity threshold.
- Auxiliary data volume (from countAllAux): Bar chart per tenant.
Background Job Monitoring
The Auth Bridge runs several background jobs that perform essential maintenance tasks. Each job should be monitored to ensure it is running on schedule and completing successfully.
Session Cleanup (Every 5 Minutes)
The session cleanup job removes expired reconciliation sessions from the reconciliation_session table. Reconciliation sessions are inherently ephemeral (typical TTL is 5 minutes), so expired sessions should be cleaned up promptly to prevent the table from growing unboundedly and to minimize the window during which sensitive temporary data (PKCE code verifiers, OIDC nonces) exists in the database.
| Indicator | Normal Range | Concern |
|---|---|---|
| Sessions deleted per run | 0-50 | Hundreds of expired sessions per cleanup cycle may indicate that the session TTL is too short, that users are abandoning sessions frequently, or that an automated process is creating sessions without completing them. |
| Time since last cleanup | 0-10 minutes | If the cleanup job has not run for significantly longer than the configured interval (default 5 minutes), the background scheduler may be stuck, the service may be under memory pressure, or the JVM may be in a prolonged garbage collection pause. |
| Total active sessions | 0-100 | A very large number of active (non-expired) sessions may indicate a session creation leak or an ongoing denial-of-service attack. |
| Accumulated expired sessions | Near zero between runs | If the count of expired sessions grows despite cleanup running, the job may be failing silently or the session creation rate exceeds cleanup capacity. The query findExpiredSessions with the current timestamp should return zero or near-zero results between cleanup runs. |
The session cleanup job logs the number of deleted sessions at the INFO level after each run. Monitor application logs for messages like:
Session cleanup completed: deleted 12 expired sessions
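If the log pipeline does not already extract this number, it can be pulled out with a regular expression. The sketch below assumes the message format matches the example line exactly, which should be verified against real log output; the function names are hypothetical, and the upper bound of 50 is the normal range from the indicator table.

```python
import re
from typing import Optional

# Assumption: the message text matches the documented example line.
_SESSION_CLEANUP = re.compile(
    r"Session cleanup completed: deleted (\d+) expired sessions"
)


def deleted_sessions(log_line: str) -> Optional[int]:
    """Return the deleted-session count from a cleanup log line, if present."""
    match = _SESSION_CLEANUP.search(log_line)
    return int(match.group(1)) if match else None


def exceeds_normal_range(deleted: int, upper: int = 50) -> bool:
    """True when a single run deleted more sessions than the normal range allows."""
    return deleted > upper
```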
Inactive Binding Cleanup (Every 60 Minutes)
The retention cleanup job scans for identity link bindings that have not been accessed within the configured inactivity threshold (default: 730 days / 2 years) and soft-deletes them. It also performs hard deletes on records that have been soft-deleted for longer than the retention period (default: 30 days).
| Indicator | Normal Range | Concern |
|---|---|---|
| Bindings soft-deleted per run | 0-10 | A large number of soft-deletions in a single run may indicate a mass inactivity event (e.g., an institution discontinuing the service) or a recent configuration change that lowered the inactivity threshold. |
| Hard deletions per run | 0-10 | Hard deletions remove data permanently. A sudden increase may indicate that a batch of soft-deleted records from a previous bulk operation has reached the end of the retention period. |
| Soft-delete backlog | Low and stable | If soft-deleted records are accumulating without being hard-deleted, the hard-delete job may not be running or may be encountering errors. |
The cleanup job logs its activity at the INFO level:
Inactive binding cleanup: soft-deleted 3 bindings, hard-deleted 7 past-retention records
Auxiliary Data Expiration
Expired auxiliary data (records whose expires_at is in the past) is cleaned up alongside the general cleanup jobs. Monitor findExpiredAux results to confirm that these records are being removed promptly. Auxiliary data with no expires_at value is retained indefinitely (subject to identity-level GDPR erasure).
Key Migration Tracking
Key rotation is a critical security operation, and monitoring its progress is essential to ensure that all records are successfully migrated to the new key version. The key_migration_history table provides detailed status and progress information for each migration.
Migration Status Values
| Status | Description | Expected Duration |
|---|---|---|
| IN_PROGRESS | The migration is currently running. Records are being re-hashed or re-encrypted with the new key version. | Minutes to hours, depending on data volume. |
| COMPLETED | The migration finished successfully. All records have been migrated to the new key version. | Terminal state. |
| FAILED | The migration encountered an error that prevented completion. Check records_failed for the count of unmigrated records. | Terminal state. Requires investigation. |
| ROLLED_BACK | The migration was rolled back to the previous key version, either automatically on failure or manually by an operator. | Terminal state. |
Migration Progress Counters
| Counter | Description | What to Watch |
|---|---|---|
| records_processed | Number of records successfully re-hashed or re-encrypted. | Should increase steadily during an IN_PROGRESS migration. Stalling indicates a problem. |
| records_failed | Number of records that could not be migrated due to errors. | Any non-zero value requires immediate investigation. Failed records remain on the old key version and may become inaccessible if the old key is decommissioned. |
| records_skipped | Number of records skipped because they were already at the target key version. | A high skip count is normal if the migration was restarted after a partial completion. |
| records_purged | Number of inactive records deleted instead of migrated. | These are soft-deleted records past their retention period that were cleaned up opportunistically during migration. A high purge count is acceptable. |
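One way to detect a stalled migration is to sample records_processed at a fixed interval and check whether it has stopped increasing. This is a sketch only: the function name and the choice of three samples are assumptions, and the sampling interval should be long enough that a healthy migration always makes visible progress between readings.

```python
from typing import Sequence


def is_stalled(processed_samples: Sequence[int], min_samples: int = 3) -> bool:
    """True when the last `min_samples` readings of records_processed show no increase.

    Only meaningful while the migration status is IN_PROGRESS.
    """
    if len(processed_samples) < min_samples:
        return False  # not enough history to judge
    recent = processed_samples[-min_samples:]
    return recent[-1] <= recent[0]
```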
Alerting Rules for Key Migrations
The following alerting rules should be configured for any environment that performs key rotations:
- Alert on FAILED status: Any migration that enters the FAILED state should trigger an immediate alert to the operations team. The old key must be retained in the KMS until the failure is resolved and the migration is retried or completed manually.
- Alert on IN_PROGRESS exceeding expected duration: Set a timeout alert based on the expected migration duration for your data volume. A general guideline is to allow 1 second per 100 records; a migration of 100,000 records should then complete in about 1,000 seconds (roughly 17 minutes). A migration that exceeds 2x the expected duration may need manual intervention.
- Alert on records_failed greater than zero: Even a single failed record means that some data was not migrated. This record will remain on the old key version, and if that key is decommissioned, the data will be unrecoverable.
- Alert if migration has not started after key configuration change: If the configuration is updated with a new key version but no migration record appears in key_migration_history, the migration may not have been triggered.
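The duration guideline above can be turned into a timeout check. The sketch below uses the stated rate of 100 records per second and the 2x factor; the function names are hypothetical, and the rate should be replaced with a figure measured in your own environment.

```python
def expected_duration_seconds(
    record_count: int, records_per_second: float = 100.0
) -> float:
    """Expected migration time under the 1-second-per-100-records guideline."""
    return record_count / records_per_second


def is_overdue(elapsed_seconds: float, record_count: int, factor: float = 2.0) -> bool:
    """True when an IN_PROGRESS migration has run longer than `factor` x expected."""
    return elapsed_seconds > factor * expected_duration_seconds(record_count)
```

For 100,000 records the expected duration is 1,000 seconds, so the alert would fire once the migration has been running for more than 2,000 seconds.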
Database Connection Pool Monitoring
The Auth Bridge maintains a connection pool to PostgreSQL (default: max-pool-size: 5). Connection pool health is a critical factor in overall service performance, as pool exhaustion is one of the most common causes of service degradation in database-backed applications.
Key Metrics
| Metric | Description | Concern Threshold |
|---|---|---|
| Active connections | Number of connections currently checked out from the pool and in use. | Sustained values near max-pool-size indicate the pool may be undersized for the workload. |
| Idle connections | Number of connections waiting in the pool, available for use. | Zero idle connections under load indicates all connections are in use and new requests must wait. |
| Connection wait time | Time a database request waits for a connection to become available in the pool. | Any non-zero wait time indicates pool contention. Sustained wait times directly degrade response latency. |
| Connection creation failures | Failed attempts to establish new connections to PostgreSQL. | Any failure indicates a PostgreSQL connectivity issue (network, authentication, max_connections exceeded). |
| Slow queries | Queries taking longer than expected to execute. | May indicate missing indexes, lock contention, or resource exhaustion on the PostgreSQL server. Particularly watch for slow queries during key migration. |
Pool Sizing Guidelines
For production deployments, size the connection pool based on:
- Expected concurrent reconciliation sessions: Each active session may hold a connection during database operations (session creation, status update, binding write).
- External API request concurrency: Each API request requires a connection for the identity lookup query.
- Background job connections: Session cleanup, inactive binding cleanup, and key migration each require connections when they run.
A common rule of thumb is to set the pool size to 2-3x the expected peak concurrent request count. However, never set it higher than the PostgreSQL max_connections setting minus a buffer for administrative connections and connections from other services (such as the STS).
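The rule of thumb and its upper bound can be combined in a small calculation. This is an illustrative sketch; the function and parameter names are hypothetical, and reserved_connections stands for whatever buffer you keep for administrative sessions and other services.

```python
def recommended_pool_size(
    peak_concurrent_requests: int,
    max_connections: int,
    reserved_connections: int,
    multiplier: int = 2,  # the guideline suggests 2 to 3
) -> int:
    """Pool size from the rule of thumb, capped by what PostgreSQL can spare."""
    ceiling = max_connections - reserved_connections
    if ceiling < 1:
        raise ValueError("no connections left after reserving the buffer")
    desired = peak_concurrent_requests * multiplier
    return max(1, min(desired, ceiling))
```

For example, with a peak of 10 concurrent requests, max_connections of 100, and 20 connections reserved, the result is 20; with a peak of 50 and a multiplier of 3, the cap of 80 applies instead of the desired 150.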
Health Check Endpoints
Each service exposes a health check endpoint that can be used for monitoring and orchestration. These endpoints are lightweight and designed to be polled frequently without impacting service performance.
Service Health Summary
| Service | Endpoint | Checks Performed | Healthy Response |
|---|---|---|---|
| Portal (Frontend) | GET /api/health | Next.js server running, API routes accessible | 200 OK with {"status": "ok"} |
| STS | GET /.well-known/openid-configuration | JVM running, OIDC discovery document can be generated | 200 OK with the OIDC discovery document |
| Auth Bridge | GET /health | JVM running, database accessible, KMS accessible | 200 OK with readiness details |
| PostgreSQL | TCP connection on port 5432 | PostgreSQL accepting connections | Connection accepted |
Auth Bridge Health Details
The Auth Bridge health endpoint provides additional detail about the readiness of its subsystems:
| Subsystem | Healthy When | Unhealthy Indicator |
|---|---|---|
| Database | Connection pool has available connections and a test query succeeds. | Connection failures, query timeouts, pool exhaustion. |
| KMS | The configured KMS provider responds to a test operation. | KMS connectivity failure, authentication failure, key not found. |
| Session Cleanup | The background job has run within the last 2x the configured interval. | Job has not run for an extended period, indicating the scheduler may be stuck. |
| Inactive Binding Cleanup | The background job has run within the last 2x the configured interval. | Job has not run for an extended period. |
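The "within the last 2x the configured interval" rule for the two background jobs can be expressed as a single comparison. How the Auth Bridge implements this internally is not described here; the sketch below only illustrates the rule, with a hypothetical function name, for use in external monitoring.

```python
from datetime import datetime, timedelta


def job_is_fresh(last_run: datetime, now: datetime, interval: timedelta) -> bool:
    """True when the job last ran no more than twice its configured interval ago."""
    return now - last_run <= 2 * interval
```

With the default 5-minute session cleanup interval, a last run 9 minutes ago counts as healthy, while a last run 11 minutes ago does not.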
Verifying Health After Startup
After starting the services, verify that all health checks pass:
# Check portal health
curl -s http://localhost:3000/api/health | jq .
# Check STS health (via OIDC discovery)
curl -s http://localhost:8092/.well-known/openid-configuration | jq .
# Check Auth Bridge health
curl -s http://localhost:8090/health | jq .
# Check PostgreSQL health
pg_isready -h localhost -p 5432 -U portal
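For repeated checks, the same endpoints can be polled from a script. The following is a minimal sketch using only the Python standard library; the URLs are the localhost addresses from the commands above, the function names are hypothetical, and treating anything other than HTTP 200 as unhealthy is an assumption of this example.

```python
import socket
from typing import Callable, Dict
from urllib.error import HTTPError
from urllib.request import urlopen

ENDPOINTS = {
    "portal": "http://localhost:3000/api/health",
    "sts": "http://localhost:8092/.well-known/openid-configuration",
    "auth-bridge": "http://localhost:8090/health",
}


def fetch_status(url: str, timeout: float = 5.0) -> int:
    """Return the HTTP status code, or 0 if no HTTP response was received."""
    try:
        with urlopen(url, timeout=timeout) as response:
            return response.status
    except HTTPError as err:
        return err.code  # the server answered, but with an error status
    except OSError:
        return 0  # connection refused, timeout, DNS failure, etc.


def tcp_accepts(host: str, port: int, timeout: float = 5.0) -> bool:
    """True when a TCP connection can be opened (e.g. PostgreSQL on port 5432)."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def check_all(
    endpoints: Dict[str, str], fetch: Callable[[str], int] = fetch_status
) -> Dict[str, bool]:
    """Map each service name to True when its endpoint returns HTTP 200."""
    return {name: fetch(url) == 200 for name, url in endpoints.items()}
```

Passing the fetch function as a parameter keeps check_all testable without running services. Note that tcp_accepts only confirms that the port is open; pg_isready additionally checks that the server is ready to accept connections.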
Integration with Orchestrators
In Docker Compose, health checks are configured directly in the docker-compose.yml file. Docker uses these to determine when a service is ready and to manage startup ordering of dependent services.
In Kubernetes, configure liveness and readiness probes on each pod:
- Liveness probe: Determines whether the container should be restarted. Use for unrecoverable failures (e.g., deadlocked threads, corrupted state).
- Readiness probe: Determines whether the container should receive traffic. Use for temporary conditions (e.g., database connection issues, KMS initialization still in progress).
Operational Runbooks
The following runbooks provide step-by-step guidance for investigating common operational scenarios.
High Reconciliation Volume
If reconciliation events spike unexpectedly:
- Check audit events for idv_initiated and idv_completed patterns. Is the spike correlated with a specific tenant or client?
- Verify the reconciliation selector rules have not changed recently (a misconfiguration could cause unnecessary re-reconciliation).
- Check if a key rotation caused bindings to become unresolvable, triggering mass re-reconciliation.
- Review external API usage for unusual patterns that might be driving reconciliation requests.
- If the spike appears malicious, consider rate-limiting at the load balancer or temporarily disabling the affected provider.
Expired Session Accumulation
If expired sessions are accumulating in the database:
- Verify the cleanup job is running by checking application logs for cleanup execution messages.
- Check database connectivity -- the cleanup job requires write access to delete expired sessions.
- Verify the session-cleanup.interval-minutes configuration has not been accidentally set to an extremely high value.
- Check for database lock contention that might prevent the DELETE operations from completing.
- As a last resort, manually delete expired sessions with a direct SQL query (after verifying that the query matches only sessions whose TTL has passed).
Key Migration Stalled
If a key migration stops progressing:
- Check the key_migration_history table for the current status and records_failed count.
- Review application logs for migration-related error messages (search for the migration version number).
- Verify that both the old and new keys are accessible in the KMS. A common cause of stalled migrations is that the old key was decommissioned before the migration completed.
- Check database connection pool availability. The migration process uses connections from the same pool as normal operations, so a heavily loaded system may starve the migration of connections.
- If the migration is stuck on specific records, investigate those records individually. They may have corrupted data or reference a key version that no longer exists in the KMS.