Encrypted Storage Patterns

The portal implements a strict zero-plaintext-at-rest policy. Every sensitive field in the database is either HMAC-hashed (irreversible, for lookup) or AES-256-GCM encrypted (reversible, for attribute retrieval). The encryption and decryption happen transparently in the service layer -- HTTP endpoints and reconciliation orchestrators work with plaintext objects, while the persistence layer never sees or stores plaintext.

This page describes the encryption patterns used throughout the portal, the data structures involved, and the guarantees they provide.

Encrypt-on-Write, Decrypt-on-Read

The service layer acts as the encryption boundary. All encryption and decryption operations occur at this layer, creating a clean separation between application logic (which works with plaintext) and storage (which only ever contains ciphertext).

[HTTP Endpoint] → plaintext → [Service Layer] → encrypt(plaintext, Key_C) → [Store/DB]
[HTTP Endpoint] ← plaintext ← [Service Layer] ← decrypt(ciphertext, Key_C) ← [Store/DB]

Application code above the service layer never needs to think about encryption. Controllers, orchestrators, and business logic work exclusively with plaintext domain objects. Application code below the service layer (SQL queries, repository implementations, database drivers) never sees plaintext sensitive data. This clean separation provides several important benefits:

  • Automatic coverage for new features: Any new feature that reads or writes sensitive data through the service layer automatically gets encryption. Developers do not need to remember to add encryption calls -- the service layer handles it transparently.

  • Secure database backups: Database dumps and backups contain only encrypted data. A backup file that falls into the wrong hands reveals nothing without KMS access.

  • DBA isolation: Database administrators can perform their duties (monitoring, query optimization, schema migrations) without ever having access to sensitive content. They see ciphertext in their query results, which is operationally sufficient for performance analysis.

  • Safe read replicas: Read replicas used for analytics or reporting contain only encrypted data, reducing the security surface of the replication infrastructure.
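The boundary described above can be illustrated with a minimal sketch. Everything here is hypothetical: `BindingStore`, `InMemoryBindingStore`, `BindingService`, and the method names are invented for illustration, the key is passed in as a local `SecretKey` rather than fetched from a KMS, and Base64 is assumed as the text encoding. The point is only that callers of the service see plaintext while the store sees ciphertext.

```kotlin
import java.security.SecureRandom
import java.util.Base64
import javax.crypto.Cipher
import javax.crypto.SecretKey
import javax.crypto.spec.GCMParameterSpec

/** Hypothetical persistence port: it only ever receives and returns ciphertext. */
interface BindingStore {
    fun save(bindingId: String, encryptedInstitutionId: String)
    fun load(bindingId: String): String?
}

/** In-memory stand-in for the database table. */
class InMemoryBindingStore : BindingStore {
    val rows = mutableMapOf<String, String>()
    override fun save(bindingId: String, encryptedInstitutionId: String) {
        rows[bindingId] = encryptedInstitutionId
    }
    override fun load(bindingId: String): String? = rows[bindingId]
}

/** Hypothetical service: the only component that touches Key C. */
class BindingService(private val store: BindingStore, private val keyC: SecretKey) {
    private val random = SecureRandom()

    // Write path: callers pass plaintext, the store receives ciphertext.
    fun linkInstitutionId(bindingId: String, institutionId: String) {
        store.save(bindingId, encrypt(institutionId))
    }

    // Read path: the store returns ciphertext, callers receive plaintext.
    fun institutionIdFor(bindingId: String): String? =
        store.load(bindingId)?.let { decrypt(it) }

    private fun encrypt(plaintext: String): String {
        val iv = ByteArray(12).also { random.nextBytes(it) }
        val cipher = Cipher.getInstance("AES/GCM/NoPadding")
        cipher.init(Cipher.ENCRYPT_MODE, keyC, GCMParameterSpec(128, iv))
        val blob = iv + cipher.doFinal(plaintext.toByteArray(Charsets.UTF_8))
        return Base64.getEncoder().encodeToString(blob)
    }

    private fun decrypt(stored: String): String {
        val raw = Base64.getDecoder().decode(stored)
        val cipher = Cipher.getInstance("AES/GCM/NoPadding")
        cipher.init(Cipher.DECRYPT_MODE, keyC, GCMParameterSpec(128, raw, 0, 12))
        return String(cipher.doFinal(raw, 12, raw.size - 12), Charsets.UTF_8)
    }
}
```

A controller would call `linkInstitutionId()` and `institutionIdFor()` with plain strings; inspecting `InMemoryBindingStore.rows` shows only the encoded ciphertext, which is what a DBA, a backup, or a read replica would see.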

Persisted Attributes Envelope

The persisted attributes envelope is the primary encrypted data structure in the portal. It contains the canonical claims selected by attribute rules that have persist: true configured in the tenant's attribute mapping.

What Goes Inside

The envelope contains the subset of user attributes that the portal needs to persist for fast-path token projection. A typical envelope might contain:

{
  "eduid": "urn:mace:surf.nl:eduid:12345",
  "eduperson_principal_name": "student@institution.nl",
  "email": "student@institution.nl",
  "schemaVersion": "2026-03-24"
}

The schemaVersion field tracks the attribute mapping version that was used to produce this envelope. If the tenant's attribute rules change, the portal can detect that the persisted envelope was produced under an older schema and trigger a re-reconciliation to refresh the attributes.
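As a rough sketch of that check: the types and names below (`PersistedEnvelope`, `ProjectionPath`, `choosePath`) are hypothetical, and treating any version mismatch as stale is an assumption, since the text only says that an older schema can be detected.

```kotlin
/** Hypothetical decrypted view of the envelope: claims plus the schemaVersion field. */
data class PersistedEnvelope(val claims: Map<String, String>, val schemaVersion: String)

enum class ProjectionPath { FAST_PATH, RE_RECONCILE }

/**
 * Chooses between projecting from the persisted envelope and refreshing it.
 * Assumption: any mismatch with the tenant's current mapping version counts as stale.
 */
fun choosePath(envelope: PersistedEnvelope, currentSchemaVersion: String): ProjectionPath =
    if (envelope.schemaVersion == currentSchemaVersion) ProjectionPath.FAST_PATH
    else ProjectionPath.RE_RECONCILE
```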

How It Is Stored

The storage process works as follows:

  1. The JSON object is serialized to a UTF-8 string.
  2. The string is encrypted with AES-256-GCM using Key C, producing an output consisting of: IV (12 bytes) + ciphertext + authentication tag (16 bytes).
  3. The encrypted output is encoded and stored as a TEXT column in identity_link_binding.persisted_attributes_envelope.

On read, the process is reversed: the TEXT value is decoded, decrypted with Key C (which also verifies the authentication tag), and deserialized back into a JSON object.
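The write and read steps can be sketched with the JDK's built-in AES-GCM support. The function names `sealEnvelope` and `openEnvelope` are hypothetical, Key C is modeled as a `SecretKey` parameter rather than a KMS call, and Base64 is an assumption, since the text above says only that the output is "encoded".

```kotlin
import java.security.SecureRandom
import java.util.Base64
import javax.crypto.Cipher
import javax.crypto.SecretKey
import javax.crypto.spec.GCMParameterSpec

val IV_BYTES = 12    // 96-bit IV, as in step 2 above
val TAG_BITS = 128   // 16-byte authentication tag

/** Serialized JSON in, Base64(IV + ciphertext + tag) out. The Base64 encoding is an assumption. */
fun sealEnvelope(plaintextJson: String, keyC: SecretKey): String {
    // A fresh random IV for every encryption operation.
    val iv = ByteArray(IV_BYTES).also { SecureRandom().nextBytes(it) }
    val cipher = Cipher.getInstance("AES/GCM/NoPadding")
    cipher.init(Cipher.ENCRYPT_MODE, keyC, GCMParameterSpec(TAG_BITS, iv))
    // The JCE returns the ciphertext with the authentication tag appended.
    val ciphertextAndTag = cipher.doFinal(plaintextJson.toByteArray(Charsets.UTF_8))
    return Base64.getEncoder().encodeToString(iv + ciphertextAndTag)
}

/** Reverses sealEnvelope(); throws AEADBadTagException if the value or the key does not match. */
fun openEnvelope(stored: String, keyC: SecretKey): String {
    val raw = Base64.getDecoder().decode(stored)
    require(raw.size >= IV_BYTES + TAG_BITS / 8) { "stored value is too short" }
    val cipher = Cipher.getInstance("AES/GCM/NoPadding")
    // The first 12 bytes are the IV; the rest is ciphertext followed by the tag.
    cipher.init(Cipher.DECRYPT_MODE, keyC, GCMParameterSpec(TAG_BITS, raw, 0, IV_BYTES))
    val plaintext = cipher.doFinal(raw, IV_BYTES, raw.size - IV_BYTES)
    return String(plaintext, Charsets.UTF_8)
}
```

Because the IV is random, sealing the same JSON twice yields two different stored strings, and both decrypt to the same plaintext. The decoded value is always 28 bytes longer than the UTF-8 plaintext (12 for the IV, 16 for the tag).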

Data Minimization

Attributes with persist: false in the tenant's attribute rules are not included in the envelope. For example, if the default configuration marks given_name and family_name as persist: false, those attributes are projected into the STS token during the current session (because the OIDC provider's response is still in memory) but are never written to the database.

This implements the principle of data minimization: only the minimum necessary data is persisted. Attributes that are needed only during the live session (such as display names for the UI) are never stored, reducing both the security risk and the GDPR compliance burden. If those attributes are needed in a future session, the user must re-authenticate with their institution, which fetches fresh values from the OIDC provider.
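A minimal sketch of that split, assuming a simplified rule shape: `AttributeRule`, `Projection`, and `project` are hypothetical names, and only the `persist` flag comes from the text above. Real attribute rules presumably carry more than a claim name and a flag.

```kotlin
/** Hypothetical shape of one attribute rule; only the persist flag is named in the text. */
data class AttributeRule(val claim: String, val persist: Boolean)

/** Claims for the live token versus claims that go into the persisted envelope. */
data class Projection(
    val tokenClaims: Map<String, String>,
    val envelopeClaims: Map<String, String>,
)

fun project(providerClaims: Map<String, String>, rules: List<AttributeRule>): Projection {
    val tokenClaims = mutableMapOf<String, String>()
    val envelopeClaims = mutableMapOf<String, String>()
    for (rule in rules) {
        // Claims without a matching rule are dropped entirely.
        val value = providerClaims[rule.claim] ?: continue
        tokenClaims[rule.claim] = value
        // Only persist: true claims are ever handed to the encryption and storage path.
        if (rule.persist) envelopeClaims[rule.claim] = value
    }
    return Projection(tokenClaims, envelopeClaims)
}
```

With `given_name` marked `persist = false`, the value appears in `tokenClaims` for the current session but is absent from `envelopeClaims`, so it never reaches the database.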

Auxiliary Data Encryption

The AuxiliaryDataService provides encrypted storage for institution-specific metadata that does not fit into the standard identity link model. This includes information such as enrollment status, academic role, programme of study, or any other institution-specific attributes that need to be stored alongside the identity link.

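/** Encrypted storage for institution-specific metadata attached to an identity link. */
interface AuxiliaryDataService {
    /** Encrypts [data] with Key C and persists it; [expiresAt] optionally sets an expiration time. */
    suspend fun store(
        tenantId: String,
        internalIdentityId: String,
        category: String,
        data: Map<String, JsonElement>,
        expiresAt: Instant? = null,
    ): AuxiliaryDataRecord

    /** Returns decrypted records, restricted to [category] when one is given. */
    suspend fun getDecrypted(
        tenantId: String,
        internalIdentityId: String,
        category: String? = null,
    ): List<DecryptedAuxiliaryData>

    suspend fun deleteAll(tenantId: String, internalIdentityId: String): Int
    suspend fun delete(tenantId: String, id: String): Boolean

    /** Returns records whose expiration time has passed as of [cutoff]. */
    suspend fun findExpired(tenantId: String, cutoff: Instant): List<AuxiliaryDataRecord>
}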

How It Works

The store() method accepts plaintext data as a Map<String, JsonElement>, encrypts it with Key C, and persists the ciphertext to the auxiliary_data table. The caller never needs to handle encryption directly.

The getDecrypted() method retrieves records from the database and decrypts them on the fly. It returns DecryptedAuxiliaryData objects containing the original plaintext map. It never returns or exposes raw ciphertext to callers. If the optional category parameter is provided, only records of that category are returned.

Categories and TTL

Categories allow organizing auxiliary data into logical groups. For example, a tenant might store enrollment data under the "enrollment" category, role information under "role", and programme details under "programme". Each category is stored as a separate record, enabling fine-grained retrieval and deletion.

The optional expiresAt parameter enables automatic expiration of auxiliary data. The findExpired() method returns records that have passed their expiration time, and a background cleanup process periodically removes them. This is useful for data that is inherently time-limited, such as a temporary enrollment status or a session-scoped authorization.
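The category filter and the expiry query can be sketched against an in-memory table. `AuxRow` and `InMemoryAuxTable` are hypothetical stand-ins, the encrypted payload is omitted, and two details are assumptions: a null `expiresAt` means the record never expires, and a record counts as expired when its `expiresAt` is at or before the cutoff.

```kotlin
import java.time.Instant

/** Hypothetical metadata view of an auxiliary_data row (the encrypted payload is omitted). */
data class AuxRow(
    val id: String,
    val tenantId: String,
    val internalIdentityId: String,
    val category: String,
    val expiresAt: Instant?,   // assumption: null means the record never expires
)

class InMemoryAuxTable {
    private val rows = mutableListOf<AuxRow>()

    fun insert(row: AuxRow) { rows += row }

    /** Mirrors the optional category filter of getDecrypted(): null returns every category. */
    fun find(tenantId: String, internalIdentityId: String, category: String? = null): List<AuxRow> =
        rows.filter {
            it.tenantId == tenantId &&
                it.internalIdentityId == internalIdentityId &&
                (category == null || it.category == category)
        }

    /** Records whose expiresAt is at or before the cutoff (boundary handling is an assumption). */
    fun findExpired(tenantId: String, cutoff: Instant): List<AuxRow> =
        rows.filter { row ->
            val expiry = row.expiresAt
            row.tenantId == tenantId && expiry != null && !expiry.isAfter(cutoff)
        }

    /** One pass of a cleanup job: delete whatever findExpired() reports. */
    fun purgeExpired(tenantId: String, cutoff: Instant): Int {
        val expired = findExpired(tenantId, cutoff)
        rows.removeAll(expired)
        return expired.size
    }
}
```

Expiry is evaluated on plaintext metadata only, so a cleanup job of this kind would not need Key C to decide which rows to remove.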

What Gets Encrypted Where

The following table provides a complete inventory of encrypted fields in the database:

| Data | Table.Column | Key | Purpose |
| --- | --- | --- | --- |
| Institution-scoped eduID | identity_link_binding.encrypted_institution_id | Key C | Recoverable institutional identifier for token issuance |
| Canonical claims | identity_link_binding.persisted_attributes_envelope | Key C | Cached attributes for fast-path token projection |
| Resolved identity | reconciliation_session.encrypted_identity | Key C | Temporary session data during IDV flow |
| Auxiliary payload | auxiliary_data.encrypted_payload | Key C | Institution-specific metadata (enrollment, grades, etc.) |

All four fields use the same key (Key C) and the same algorithm (AES-256-GCM), but each encryption operation uses a unique random IV, so identical plaintext values stored in different fields or different rows produce different ciphertext.

Zero-Plaintext Database Guarantee

The combination of HMAC hashing (Keys A and B) and AES-256-GCM encryption (Key C) ensures that the database contains zero plaintext sensitive data. Here is exactly what an attacker would see with full database read access but no access to the KMS:

| Column | What Is Stored | What the Attacker Sees |
| --- | --- | --- |
| identifier_hash | HMAC(identifier, Key_A or Key_B) | Opaque hash string with no discernible pattern |
| holder_identifier_hash | HMAC(holder_key, Key_A) | Opaque hash string |
| institution_identifier_hash | HMAC(institution_id, Key_B) | Opaque hash string |
| encrypted_institution_id | AES-GCM(eduid, Key_C) | Opaque ciphertext blob |
| persisted_attributes_envelope | AES-GCM(claims_json, Key_C) | Opaque ciphertext blob |
| encrypted_payload | AES-GCM(aux_data, Key_C) | Opaque ciphertext blob |
| subject_hash (audit) | HMAC(subject_id, Key_A or Key_B) | Opaque hash string |

Non-sensitive metadata such as timestamps (created_at, updated_at), status enums (ACTIVE, PENDING), configuration IDs, and tenant IDs is stored in plaintext. This metadata is necessary for database operations (indexing, querying, cleanup) and does not identify individuals. The determination of what constitutes "sensitive" vs. "non-sensitive" is made based on whether the data could be used to identify, correlate, or profile an individual.

What the Attacker Cannot Do

Even with full database access, an attacker without KMS access cannot:

  • Determine which wallet holder is linked to which institutional account
  • Recover any institutional identifier (eduid, subject ID, email)
  • Read any cached user attributes (name, email, affiliation)
  • Read any auxiliary data (enrollment status, programme, role)
  • Correlate database records with external systems
  • Determine whether hash values in different tables refer to the same entity when those hashes were computed with different keys (the same input hashed under Key A and under Key B yields unrelated values)
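The keyed-hash behaviour behind these guarantees can be shown in a few lines. This is a sketch under assumptions: the text says only "HMAC", so HMAC-SHA256 and hex encoding are choices made here, and `lookupHash` is a hypothetical name.

```kotlin
import javax.crypto.Mac
import javax.crypto.spec.SecretKeySpec

/** Keyed, deterministic lookup hash. HMAC-SHA256 and hex output are assumptions. */
fun lookupHash(identifier: String, hmacKey: ByteArray): String {
    val mac = Mac.getInstance("HmacSHA256")
    mac.init(SecretKeySpec(hmacKey, "HmacSHA256"))
    val digest = mac.doFinal(identifier.toByteArray(Charsets.UTF_8))
    // Render each byte as two lowercase hex digits.
    return digest.joinToString("") { "%02x".format(it) }
}
```

The same identifier under the same key always produces the same hash, which is what makes equality lookups possible. Under a different key it produces an unrelated value, and without the key an attacker cannot recompute the hash of a known identifier to test for a match.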

What the Attacker Can See

The attacker can observe:

  • How many identity links exist (row count)
  • When they were created and last updated (timestamps)
  • Their status (active, pending, etc.)
  • Which tenant they belong to (tenant ID)
  • The structure of the data (table schema)

This metadata is considered acceptable exposure because it does not reveal personal data or enable identification of individuals.

Crypto-Shredding for GDPR

A useful property of this architecture is that destroying Key C makes all AES-256-GCM encrypted data permanently unrecoverable. Key destruction is a single operation whose cost does not depend on how many records exist, so the encrypted fields are effectively erased in O(1) time, including copies in backups and replicas. There is no need to find and delete individual ciphertext values across multiple tables and backups. The HMAC hashes are not covered by this step: they still have to be deleted from the database to complete the erasure.

In practice, GDPR erasure for a single user uses per-record deletion through the external API's DELETE endpoint, which removes specific rows from the database. This is the normal path for individual right-to-erasure requests.

However, crypto-shredding provides a nuclear option for catastrophic scenarios:

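  • Complete tenant decommissioning: If a tenant (institution) leaves the portal, destroying the encryption key that protects its data immediately renders that data irrecoverable, even in database backups that have not yet been purged. This is selective only if the key is scoped to that tenant; destroying a key shared with other tenants would shred their data as well.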
  • Catastrophic breach response: If the database is known to have been exfiltrated, destroying the encryption key ensures that the stolen data is permanently unreadable.
  • End-of-life data destruction: When the portal is decommissioned, destroying the encryption keys provides a verifiable guarantee that no encrypted data can ever be recovered.

Crypto-shredding only affects data encrypted with Key C. HMAC hashes (produced with Keys A and B) are already irreversible: they cannot be "decrypted" whether or not the HMAC keys still exist. The HMAC keys should nevertheless be destroyed during full decommissioning, so that an attacker cannot compute hashes of known identifiers and match them against a stolen database.
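The effect of destroying Key C can be illustrated with a toy model. `ToyKms` is a hypothetical in-memory stand-in: a real KMS performs key destruction itself and never hands the key material to the application, so this sketch shows only the consequence for existing ciphertext.

```kotlin
import java.security.SecureRandom
import javax.crypto.Cipher
import javax.crypto.KeyGenerator
import javax.crypto.SecretKey
import javax.crypto.spec.GCMParameterSpec

/** Toy stand-in for a KMS that owns Key C. */
class ToyKms {
    private var keyC: SecretKey? =
        KeyGenerator.getInstance("AES").also { it.init(256) }.generateKey()

    /** Models irreversible key destruction by dropping the only copy of the key. */
    fun destroyKeyC() { keyC = null }

    private fun requireKey(): SecretKey =
        keyC ?: throw IllegalStateException("Key C has been destroyed")

    fun encrypt(plaintext: String): ByteArray {
        val iv = ByteArray(12).also { SecureRandom().nextBytes(it) }
        val cipher = Cipher.getInstance("AES/GCM/NoPadding")
        cipher.init(Cipher.ENCRYPT_MODE, requireKey(), GCMParameterSpec(128, iv))
        return iv + cipher.doFinal(plaintext.toByteArray(Charsets.UTF_8))
    }

    fun decrypt(blob: ByteArray): String {
        val cipher = Cipher.getInstance("AES/GCM/NoPadding")
        cipher.init(Cipher.DECRYPT_MODE, requireKey(), GCMParameterSpec(128, blob, 0, 12))
        // Fails with AEADBadTagException when the key does not match the ciphertext.
        return String(cipher.doFinal(blob, 12, blob.size - 12), Charsets.UTF_8)
    }
}
```

After `destroyKeyC()`, every blob produced earlier stays in the database and in backups but can no longer be decrypted. A freshly generated key does not help either, because GCM tag verification rejects the ciphertext.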

Encryption in Reconciliation Sessions

During the identity verification (IDV) reconciliation flow, the user authenticates with their institutional identity provider, and the OIDC provider returns a set of claims (name, email, institutional ID, etc.). These claims need to be temporarily stored while the reconciliation process completes, which may involve additional steps such as user consent or attribute selection.

The resolved identity (the full set of claims from SURFconext) is encrypted with Key C before being stored in the reconciliation_session.encrypted_identity column. This ensures that even temporary session data is protected at rest.

Reconciliation sessions have a short time-to-live (TTL) of 5 minutes. After the TTL expires, a background cleanup process deletes expired sessions. However, the encryption ensures that the data is protected regardless of timing:

  • If the cleanup process runs on schedule, the encrypted data is deleted promptly.
  • If the cleanup process is delayed (due to load, errors, or maintenance), the data remains encrypted and unreadable without KMS access.
  • If the database is backed up during the session's brief lifetime, the backup contains only ciphertext.

This belt-and-suspenders approach (short TTL combined with encryption) reflects the principle that security mechanisms should not depend on a single control working perfectly. The TTL minimizes the window of exposure, and the encryption ensures that exposure during that window is harmless.
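The TTL half of this pairing might look like the following sketch. The row shape, the `createdAt` field, and the function names are hypothetical; only the 5-minute TTL and the existence of a background cleanup process come from the text above.

```kotlin
import java.time.Duration
import java.time.Instant

val SESSION_TTL: Duration = Duration.ofMinutes(5)

/** Hypothetical row shape; encryptedIdentity holds ciphertext only. */
data class ReconciliationSession(
    val id: String,
    val encryptedIdentity: String,
    val createdAt: Instant,
) {
    /** Expired once the full TTL has elapsed since creation. */
    fun isExpired(now: Instant): Boolean = !now.isBefore(createdAt.plus(SESSION_TTL))
}

/** One pass of a cleanup job: drop expired sessions and report how many were removed. */
fun purgeExpired(sessions: MutableList<ReconciliationSession>, now: Instant): Int {
    val before = sessions.size
    sessions.removeAll { it.isExpired(now) }
    return before - sessions.size
}
```

If this job runs late, the rows it would have removed still contain only ciphertext, which is the second control the paragraph above relies on.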