Privacy Architecture
The eduID Wallet Matching Portal is designed around a single foundational principle: the database should be useless to an attacker. Even with complete, unrestricted access to every table, every row, and every column in the database, an attacker would find only HMAC hashes, AES-256-GCM ciphertext, and non-identifying operational metadata. No plaintext identifiers, no cleartext names, no recoverable email addresses. This is not an add-on security layer bolted onto a conventional application; it is the fundamental architecture of the system, built in from the first line of schema design.
This page describes the privacy-by-design principles that govern the system, explains how each principle is implemented, and provides a comprehensive threat model analysis.
Core Principle: No Plaintext Identifiers
Every external identifier that enters the system is cryptographically transformed before it touches the persistence layer. The transformation happens in the service layer (Kotlin application code), before data reaches any database query or storage operation.
- Wallet holder keys are HMAC-hashed with Key A before any database operation. The hash is used for lookup and matching; the original key value is never stored.
- Institutional identifiers (eduID, OIDC subject IDs, eduPerson principal names) are HMAC-hashed with Key B before storage. This prevents an attacker from correlating database records with known institutional identities.
- Sensitive attributes (the actual institution ID, canonical identity claims) are encrypted with AES-256-GCM using Key C. Only the encrypted ciphertext is stored; the plaintext is reconstructed only when needed for token issuance, and only in memory.
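As a rough illustration of the hashing step described above, the following Kotlin sketch computes an HMAC-SHA256 over an identifier with the JDK's javax.crypto.Mac. The function name hmacSha256Hex and the hex output encoding are assumptions made for this example; the portal's actual encoding (the sample values in this page begin with zQm) and its key handling through the KMS are not shown here.

```kotlin
import javax.crypto.Mac
import javax.crypto.spec.SecretKeySpec

/**
 * Hypothetical helper: returns HMAC-SHA256(key, identifier) as lowercase hex.
 * Only this value would be handed to the persistence layer; the identifier
 * itself is never written anywhere.
 */
fun hmacSha256Hex(key: ByteArray, identifier: String): String {
    val mac = Mac.getInstance("HmacSHA256")
    mac.init(SecretKeySpec(key, "HmacSHA256"))
    val digest = mac.doFinal(identifier.toByteArray(Charsets.UTF_8))
    // Each byte is rendered as two hex digits, giving 64 characters in total.
    return digest.joinToString("") { "%02x".format(it) }
}
```

Because the function is deterministic for a given key, the same identifier always maps to the same hash, which is what makes lookup and matching possible without storing the original value.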
The following table illustrates what an attacker with full database access would actually see:
| Column | Stored Value | Can Attacker Recover Original? |
|---|---|---|
| identifier_hash | zQmXk7f9a2b... (HMAC-SHA256) | No -- requires Key A or Key B, held in KMS |
| holder_identifier_hash | zQm8hJ3kL9... (HMAC-SHA256) | No -- requires Key A |
| institution_identifier_hash | zQmP4nR7wQ... (HMAC-SHA256) | No -- requires Key B |
| encrypted_institution_id | AES-GCM ciphertext blob | No -- requires Key C |
| persisted_attributes_envelope | AES-GCM ciphertext blob | No -- requires Key C |
| encrypted_payload (auxiliary) | AES-GCM ciphertext blob | No -- requires Key C |
| subject_hash (audit) | zQmY9kM2nP... (HMAC-SHA256) | No -- requires Key A or Key B |
| tenant_id | kw1c (plaintext) | Yes -- but this is not PII |
| status | COMPLETED (plaintext) | Yes -- but this is not PII |
| created_at | 2026-03-27T10:00:00Z (plaintext) | Yes -- but this is not PII |
Non-sensitive metadata -- timestamps, status enumerations, configuration references, tenant identifiers, version numbers -- is stored in plaintext because it does not identify individuals. The system draws a clear line between operational metadata (safe to store in the clear) and identity data (always hashed or encrypted).
Domain-Separated Keys
The system uses three cryptographic keys, each serving a distinct and non-overlapping purpose. This domain separation means that the compromise of any single key does not cascade to data protected by the other keys. An attacker must compromise multiple independent keys simultaneously to achieve meaningful data access.
| Key | Algorithm | Purpose | Data Protected |
|---|---|---|---|
| Key A | HMAC-SHA256 | Hashing wallet/holder identifiers | identity_match.identifier_hash (holder type), identity_link_binding.holder_identifier_hash |
| Key B | HMAC-SHA256 | Hashing institution identifiers | identity_match.identifier_hash (institution type), identity_link_binding.institution_identifier_hash |
| Key C | AES-256-GCM | Encrypting sensitive data at rest | identity_link_binding.encrypted_institution_id, identity_link_binding.persisted_attributes_envelope, reconciliation_session.encrypted_identity, auxiliary_data.encrypted_payload |
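
One way to keep the key purposes from overlapping in application code is to give each key its own type, so that passing the wrong key to a hashing function fails at compile time. The sketch below is illustrative only: the wrapper classes HolderHashKey and InstitutionHashKey and the function names are hypothetical, and real key material would come from the KMS rather than from byte arrays in code.

```kotlin
import javax.crypto.Mac
import javax.crypto.spec.SecretKeySpec

// Hypothetical wrapper types, one per key purpose.
class HolderHashKey(val bytes: ByteArray)       // stands in for Key A
class InstitutionHashKey(val bytes: ByteArray)  // stands in for Key B

// Shared HMAC-SHA256 primitive; hex output is an assumption of this sketch.
fun hmacHex(key: ByteArray, input: String): String {
    val mac = Mac.getInstance("HmacSHA256")
    mac.init(SecretKeySpec(key, "HmacSHA256"))
    return mac.doFinal(input.toByteArray(Charsets.UTF_8))
        .joinToString("") { "%02x".format(it) }
}

// Holder identifiers can only be hashed with a HolderHashKey.
fun hashHolderKey(key: HolderHashKey, holderKey: String): String =
    hmacHex(key.bytes, holderKey)

// Institution identifiers can only be hashed with an InstitutionHashKey.
fun hashInstitutionId(key: InstitutionHashKey, institutionId: String): String =
    hmacHex(key.bytes, institutionId)
```

With this shape, calling hashInstitutionId with a HolderHashKey does not compile, and the same input string hashed under the two keys produces unrelated values.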
The following table analyzes every possible key compromise scenario and its impact:
| Scenario | Impact | What Remains Protected |
|---|---|---|
| Key A compromised alone | Attacker can compute holder key hashes and identify which rows belong to which wallet holder | Cannot reverse institution hashes (Key B). Cannot decrypt any stored data (Key C). Cannot determine which institution a holder is linked to. |
| Key B compromised alone | Attacker can compute institution ID hashes and identify which rows reference which institution | Cannot reverse holder key hashes (Key A). Cannot decrypt any stored data (Key C). Cannot determine which wallet holder is linked to which institution. |
| Key C compromised alone | Attacker can decrypt encrypted attributes and institution IDs | Cannot correlate the decrypted data with any specific wallet holder (Key A) or identify which hash corresponds to which institution (Key B). The decrypted data is orphaned. |
| Key A + Key B compromised | Attacker can correlate holder keys with institution IDs through hash comparison | Still cannot decrypt any data (Key C). Can build a mapping of "wallet X is linked to institution Y" but cannot see what attributes are stored. |
| Key A + Key C compromised | Attacker can identify holders and decrypt their stored data | Cannot identify which institution the data belongs to (Key B). Knows "wallet X has these attributes" but not "at institution Y". |
| Key B + Key C compromised | Attacker can identify institutions and decrypt stored data | Cannot identify which wallet holder the data belongs to (Key A). Knows "institution Y has these attributes" but not "for wallet X". |
| All three keys + database | Complete data access, full correlation | Requires simultaneous compromise of both the database AND the KMS -- two separate systems with independent access controls, authentication, and audit trails. |
| Database breach only (no keys) | Attacker has hashes and ciphertext | All three keys are in the KMS, not in the database. The data is cryptographically inert without the keys. |
The defense-in-depth design means an attacker needs to compromise multiple independent systems to gain meaningful access. Each additional system the attacker must breach dramatically increases the difficulty and the likelihood of detection.
Tenant Isolation
Every table that holds tenant-specific data includes a tenant_id column. Every query that accesses tenant-specific data filters on tenant_id. Every HMAC computation includes the tenant context as part of the hash input. This comprehensive tenant isolation means:
- Hash isolation: A hash computed for Tenant A cannot match a record belonging to Tenant B, even if the underlying identifier is identical. The tenant ID is mixed into the HMAC computation, producing different hashes for the same input across tenants.
- Query isolation: Every named SqlDelight query that operates on tenant data requires a tenant_id parameter. There are no "superuser" queries that bypass tenant filtering. It is architecturally impossible to accidentally query across tenants.
- Correlation prevention: Even with full database access, an attacker cannot determine whether two identifiers from different tenants represent the same person. The hash values will be different due to the tenant-scoped HMAC computation.
- Cryptographic silos: Each tenant effectively operates in its own cryptographic silo. While the same KMS keys are used across tenants (the key material is shared), the tenant-scoped hashing means that cross-tenant correlation is impossible at the data level.
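The tenant-scoped hashing described above can be sketched as follows. How the portal actually combines the tenant ID with the identifier is not specified in this page; the length-prefixed framing used here is an assumption, chosen so that the boundary between tenant ID and identifier is unambiguous. The function name tenantScopedHash is hypothetical.

```kotlin
import java.nio.ByteBuffer
import javax.crypto.Mac
import javax.crypto.spec.SecretKeySpec

/**
 * Hypothetical helper: HMAC-SHA256 over (tenant ID length, tenant ID, identifier).
 * The 4-byte length prefix keeps ("ab", "c") and ("a", "bc") from producing
 * the same MAC input.
 */
fun tenantScopedHash(key: ByteArray, tenantId: String, identifier: String): String {
    val tenantBytes = tenantId.toByteArray(Charsets.UTF_8)
    val mac = Mac.getInstance("HmacSHA256")
    mac.init(SecretKeySpec(key, "HmacSHA256"))
    mac.update(ByteBuffer.allocate(4).putInt(tenantBytes.size).array())
    mac.update(tenantBytes)
    mac.update(identifier.toByteArray(Charsets.UTF_8))
    return mac.doFinal().joinToString("") { "%02x".format(it) }
}
```

Under this scheme the same identifier hashed for two tenants yields two unrelated values even though the key is shared, which matches the hash isolation property listed above.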
Data Minimization
The canonical attribute rules system enforces data minimization at the configuration level, implementing the GDPR principle that personal data processing should be limited to what is necessary for the specified purpose.
Each attribute in the system has a persist flag:
- persist: true: The attribute is encrypted and stored in the binding's persisted_attributes_envelope. These are attributes necessary for ongoing identity resolution and token issuance (e.g., eduid, eduperson_principal_name, schac_home_organization).
- persist: false: The attribute is used transiently during the current session -- it may flow through the system and appear in JWT tokens issued by the STS -- but it is never written to the database. These are attributes that are useful for the user experience but not necessary for the core identity linking function (e.g., given_name, family_name, email).
This means that names and other directly identifying PII are ephemeral. They exist only in memory during an active session and in the JWT token for its lifetime (typically 1 hour). After the session ends and the token expires, these values are gone. They are not stored in the database, not in the encrypted binding envelope, and not recoverable by any means.
The default configuration persists only the minimum necessary for identity resolution: the institutional identifier (eduid), the scoped principal name, affiliation data, and the home organization. Human-readable names are intentionally excluded from persistence, dramatically reducing the value of a database breach to an attacker.
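A minimal sketch of how a persist flag could drive what reaches the encrypted envelope is shown below. The AttributeRule type and selectPersistable function are hypothetical names for this example, and dropping attributes that have no rule at all is an assumption, not documented behaviour of the portal.

```kotlin
// Hypothetical rule shape: one entry per canonical attribute.
data class AttributeRule(val name: String, val persist: Boolean)

/**
 * Returns only the claims whose rule has persist = true.
 * Claims without any rule are dropped as well (deny by default).
 */
fun selectPersistable(
    rules: List<AttributeRule>,
    claims: Map<String, String>,
): Map<String, String> {
    val persistNames = rules.filter { it.persist }.map { it.name }.toSet()
    return claims.filterKeys { it in persistNames }
}
```

The result of selectPersistable is what would be serialized and encrypted; everything else stays in memory for the duration of the session only.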
Encryption at Rest
All sensitive attributes stored in the database are encrypted with AES-256-GCM (Galois/Counter Mode). This encryption algorithm provides two guarantees:
- Confidentiality: The data is unreadable without the encryption key (Key C). The ciphertext reveals nothing about the plaintext.
- Authenticity: Any tampering with the ciphertext is detected on decryption. AES-GCM produces an authentication tag that validates the integrity of both the ciphertext and any associated data. If an attacker modifies even a single bit of the ciphertext, decryption will fail.
Each encryption operation uses a fresh random nonce (number used once). Because the nonce differs every time, encrypting the same plaintext twice yields different ciphertext, so an attacker cannot tell whether two rows hold the same value. Nonce uniqueness is also a hard requirement of GCM: reusing a nonce under the same key undermines both confidentiality and authenticity.
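The following sketch shows AES-256-GCM encryption and decryption with a random nonce using the JDK's javax.crypto.Cipher. The function names, the 12-byte nonce, the 128-bit tag, and the stored layout (nonce followed by ciphertext and tag) are assumptions for illustration; the portal's actual envelope format and any associated data it binds are not described in this page.

```kotlin
import java.security.SecureRandom
import javax.crypto.Cipher
import javax.crypto.spec.GCMParameterSpec
import javax.crypto.spec.SecretKeySpec

val GCM_NONCE_BYTES = 12   // assumed nonce size
val GCM_TAG_BITS = 128     // assumed authentication tag size

/** Encrypts with a fresh random nonce and returns nonce || ciphertext || tag. */
fun encryptGcm(key: ByteArray, plaintext: ByteArray): ByteArray {
    val nonce = ByteArray(GCM_NONCE_BYTES).also { SecureRandom().nextBytes(it) }
    val cipher = Cipher.getInstance("AES/GCM/NoPadding")
    cipher.init(Cipher.ENCRYPT_MODE, SecretKeySpec(key, "AES"), GCMParameterSpec(GCM_TAG_BITS, nonce))
    return nonce + cipher.doFinal(plaintext)
}

/** Splits off the nonce, then decrypts; throws if the blob was modified. */
fun decryptGcm(key: ByteArray, blob: ByteArray): ByteArray {
    val nonce = blob.copyOfRange(0, GCM_NONCE_BYTES)
    val cipher = Cipher.getInstance("AES/GCM/NoPadding")
    cipher.init(Cipher.DECRYPT_MODE, SecretKeySpec(key, "AES"), GCMParameterSpec(GCM_TAG_BITS, nonce))
    return cipher.doFinal(blob, GCM_NONCE_BYTES, blob.size - GCM_NONCE_BYTES)
}
```

Flipping any bit of the stored blob causes doFinal to throw javax.crypto.AEADBadTagException during decryption, which is the tamper detection described under Authenticity.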
The following columns contain AES-256-GCM encrypted data:
| Table | Column | Contents When Decrypted |
|---|---|---|
| identity_link_binding | encrypted_institution_id | The institution identifier (e.g., eduID) in plaintext |
| identity_link_binding | persisted_attributes_envelope | JSON object containing all persist: true canonical attributes |
| reconciliation_session | encrypted_identity | The resolved identity data from the OIDC provider |
| auxiliary_data | encrypted_payload | Arbitrary supplementary data (enrollment info, organizational metadata) |
Crypto-Shredding
Crypto-shredding is the ultimate data destruction mechanism enabled by the encryption architecture. Because all sensitive data is encrypted with Key C, destroying that key in the KMS makes all encrypted data permanently and irrecoverably unreadable. Combined with the fact that HMAC hashes are one-way functions, destroying all three keys renders the entire database contents meaningless.
This property provides several important capabilities:
- O(1) data destruction: Rather than scanning and deleting millions of individual records, a single key deletion operation in the KMS renders all encrypted data unreadable at once, regardless of how many rows exist.
- Provable destruction: The KMS provides audit logs proving that the key was destroyed at a specific time. This is stronger evidence of data destruction than database record deletion, which could potentially be recovered from backups or transaction logs.
- Complete system decommissioning: When a deployment is being permanently retired, crypto-shredding provides a cryptographic guarantee that no residual data can be recovered from the database, backups, or any other storage.
In normal operation, per-record deletion (via the GDPR erasure API or automatic inactivity cleanup) is preferred because it is targeted and preserves other identities. Crypto-shredding is reserved for scenarios where blanket destruction is appropriate.
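The effect of crypto-shredding can be demonstrated with a toy, in-memory stand-in for the KMS. Everything in this sketch is hypothetical: the ToyKms class, its method names, and the blob layout are assumptions for illustration, and a real KMS keeps key material outside the application process and has its own deletion semantics.

```kotlin
import java.security.SecureRandom
import javax.crypto.Cipher
import javax.crypto.spec.GCMParameterSpec
import javax.crypto.spec.SecretKeySpec

/** Hypothetical in-memory stand-in for a KMS holding AES-256 keys by ID. */
class ToyKms {
    private val keys = mutableMapOf<String, ByteArray>()

    fun createKey(id: String) {
        keys[id] = ByteArray(32).also { SecureRandom().nextBytes(it) }
    }

    /** Zeroes and forgets the key: one operation, however many rows it protected. */
    fun destroyKey(id: String) {
        keys.remove(id)?.fill(0)
    }

    /** Returns nonce || ciphertext || tag, or null if the key no longer exists. */
    fun encrypt(id: String, plaintext: ByteArray): ByteArray? {
        val key = keys[id] ?: return null
        val nonce = ByteArray(12).also { SecureRandom().nextBytes(it) }
        val cipher = Cipher.getInstance("AES/GCM/NoPadding")
        cipher.init(Cipher.ENCRYPT_MODE, SecretKeySpec(key, "AES"), GCMParameterSpec(128, nonce))
        return nonce + cipher.doFinal(plaintext)
    }

    /** Returns the plaintext, or null if the key has been destroyed. */
    fun decrypt(id: String, blob: ByteArray): ByteArray? {
        val key = keys[id] ?: return null
        val cipher = Cipher.getInstance("AES/GCM/NoPadding")
        cipher.init(Cipher.DECRYPT_MODE, SecretKeySpec(key, "AES"), GCMParameterSpec(128, blob, 0, 12))
        return cipher.doFinal(blob, 12, blob.size - 12)
    }
}
```

After destroyKey is called, every previously stored blob still exists byte for byte, but decrypt returns null for all of them, because nothing remains that can derive the plaintext.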
Consent by Design
The reconciliation process requires explicit user action at every step, implementing the GDPR principle of consent-based processing:
- Wallet presentation: The holder actively chooses to present their credential by scanning the QR code displayed in the portal. The system cannot initiate a wallet presentation without the holder's participation.
- IDV reconciliation: The user explicitly clicks through the reconciliation flow and authenticates at their institution. The system cannot link a wallet to an institution without the user completing this multi-step process.
- No silent tracking: There is no background correlation, no passive fingerprinting, no automatic identity linking. Every link between a wallet identity and an institutional identity is the result of a deliberate, conscious action by the holder.
An important privacy property of the reconciliation design is that SURF (the federation provider) receives no wallet-related data during the reconciliation flow. From SURF's perspective, the reconciliation is a standard OIDC login. SURF does not know that the login was triggered by a wallet authentication, and it receives no holder key, wallet credential, or binding information. This prevents the federation provider from building a correlation between wallet identities and institutional identities.
Privacy Flow Summary
The following diagram illustrates the privacy transformations at each stage of the system -- from wallet authentication through reconciliation, token issuance, and GDPR erasure:
Threat Model Summary
The following table provides a comprehensive analysis of attack scenarios and the mitigations the privacy architecture provides against each one.
| Attack Scenario | Mitigation |
|---|---|
| Database breach (external attacker) | Only HMAC hashes and AES-256-GCM ciphertext are stored. Keys reside in a separate KMS with independent access controls. The attacker gains no usable personal data. |
| Database breach (insider / DB admin) | Same protection as external breach. A database administrator sees only hashes and ciphertext. No amount of SQL access reveals plaintext identifiers. |
| KMS Key A compromise | Attacker can compute holder key hashes but cannot reverse Key B hashes (institution identifiers) or decrypt Key C data (encrypted attributes). Can identify wallet holders but not their institutional links. |
| KMS Key C compromise | Attacker can decrypt encrypted attributes but cannot correlate them with specific wallet holders (Key A) or specific institutions (Key B). The decrypted data is orphaned without the hash keys. |
| Full database + all 3 keys | Requires simultaneous compromise of the database server AND the KMS -- two separate systems with independent authentication, authorization, network controls, and audit trails. This is the highest-cost attack. |
| Cross-tenant correlation | Tenant ID is mixed into HMAC computation, producing different hashes for the same identifier across tenants. Even with full database access, cross-tenant identity correlation is cryptographically impossible. |
| SURF tracking wallet users | SURF receives no wallet or holder key data during reconciliation. From SURF's perspective, it processes a standard OIDC login. No correlation between wallet identity and institutional identity is possible at the SURF side. |
| Historical traffic analysis / correlation | The one-time nature of the reconciliation flow limits the correlation time window. Timestamps and operational metadata in the database are non-identifying. No persistent session cookies or tracking mechanisms exist. |
| Replay of reconciliation flow | PKCE (S256 challenge method) makes an intercepted authorization code unusable without the code verifier. Single-use OIDC state parameters bind each callback to the session that started it, preventing CSRF and session fixation. Nonces bound to ID tokens prevent token replay. |
| Session hijacking | HttpOnly, Secure, and SameSite cookie attributes prevent client-side script access and cross-site request inclusion. The BFF (Backend for Frontend) pattern ensures no tokens are accessible in the browser. |
| Authorization code interception | PKCE S256 is required on all authorization code flows. An intercepted authorization code is useless without the corresponding code verifier, which is held server-side. |
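
For reference, the PKCE S256 transformation mentioned in the table is defined in RFC 7636 as the unpadded, URL-safe Base64 encoding of the SHA-256 digest of the code verifier. The sketch below shows that computation; the function names and the choice of a 32-byte random verifier are assumptions for this example, not a description of the portal's code.

```kotlin
import java.security.MessageDigest
import java.security.SecureRandom
import java.util.Base64

/** Generates a random code verifier (32 random bytes, 43 URL-safe characters). */
fun newCodeVerifier(): String {
    val bytes = ByteArray(32).also { SecureRandom().nextBytes(it) }
    return Base64.getUrlEncoder().withoutPadding().encodeToString(bytes)
}

/** code_challenge = BASE64URL(SHA-256(ASCII(code_verifier))), without padding. */
fun codeChallengeS256(verifier: String): String {
    val digest = MessageDigest.getInstance("SHA-256")
        .digest(verifier.toByteArray(Charsets.US_ASCII))
    return Base64.getUrlEncoder().withoutPadding().encodeToString(digest)
}
```

Only the challenge travels with the authorization request. The verifier stays server-side and is sent with the token request, so an intercepted authorization code cannot be redeemed on its own.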