Secret Rotation Without an Outage
- Rotating a single shared key in place almost always causes downtime once more than one service consumes it; the dual-key window is the cheapest fix.
- Secrets managers (Vault, AWS Secrets Manager, GCP Secret Manager) handle the storage; the application has to handle the dual-key acceptance.
- Long-lived static secrets are the technical debt; short-lived dynamic credentials are the eventual answer.
- Audit logs on secret access are non-optional; if you do not know who used the secret, you cannot rotate it confidently.
A team had a database password they had not rotated in three years. They knew they should. The last attempt had taken down their main application for 40 minutes because two services pulled the password at slightly different times: one got the old, one got the new, and the one with the old kept failing authentication until somebody manually restarted it. After that incident, the password got the unwritten label “do not touch”.
We did not solve their rotation problem by being more careful. We solved it by implementing the dual-key pattern: every consumer accepts both the old and new secret for a 24-hour transition window. The next rotation went through cleanly. The team rotated once per quarter from then on without incident.
This piece is about the patterns that make secret rotation a non-event: the dual-key window, the secrets manager, the audit trail, and the discipline that keeps “do not touch” labels from accumulating.
Why single-key rotation breaks
Consider three services (A, B, C) sharing a database password:
- Time T0: All three services are using password V1.
- Time T1: You change the password to V2 in the database.
- Time T2: You update the secret in your secrets store to V2.
- Time T3: Service A pulls the new secret on its scheduled refresh, gets V2. Works.
- Time T4: Services B and C have not yet refreshed their secret cache. They still try V1. Fail.
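The window between T1 and “all services have refreshed” is downtime. Between T1 and T2 even a service that refreshes still receives V1 from the store; after T2, each service stays broken until its own next refresh. The window length therefore depends on how often services refresh their secret cache (typically minutes to hours). For a busy service, that is a real outage.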
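The timeline can be replayed as a small calculation. This is a minimal sketch, not a model of any real system: the `Service` fields, the refresh times, and the `failing_minutes` helper are all hypothetical, and it assumes the database is changed before the store is updated, as in the T1/T2 ordering above.

```python
import math
from dataclasses import dataclass


@dataclass
class Service:
    name: str
    first_refresh: int  # minute of the next scheduled cache refresh
    interval: int       # minutes between refreshes


def failing_minutes(svc, db_changed_at, store_updated_at):
    """Minutes a service spends presenting V1 after the database requires V2."""
    if svc.first_refresh >= store_updated_at:
        recovered_at = svc.first_refresh
    else:
        # Refreshes before the store update still return V1, so skip ahead
        # to the first refresh at or after the update.
        missed = math.ceil((store_updated_at - svc.first_refresh) / svc.interval)
        recovered_at = svc.first_refresh + missed * svc.interval
    return max(0, recovered_at - db_changed_at)


# Illustrative numbers only: T1 = minute 1, T2 = minute 2.
services = [Service("A", 3, 30), Service("B", 20, 30), Service("C", 45, 60)]
outage = {s.name: failing_minutes(s, db_changed_at=1, store_updated_at=2)
          for s in services}
print(outage)
```

With these made-up refresh times, service A recovers within two minutes while service C fails for 44. The outage for each service is set by its own refresh schedule, not by how quickly the operator performs the rotation.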
Mitigations like “restart all services in a coordinated way” work but are operationally expensive and tend to be skipped under time pressure, which is when rotation usually happens.
The dual-key pattern
The fix: the database accepts BOTH the old password and the new password for a transition window. During the window, services that still hold the old password keep working; services that pull the new password also work. After all services have transitioned, the old password is revoked.
Step 1: Add new password V2 to the database (V1 still valid). Both V1 and V2 work.
Step 2: Update secrets store to V2.
Step 3: Wait for all services to refresh (typically 5-30 minutes, longer for safety).
Step 4: Revoke V1 in the database. Only V2 works.
The window in step 3 is the safety margin. If a service is slow to refresh, it still works because V1 is still accepted.
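The four steps can be walked through against an in-memory stand-in. This sketch assumes a hypothetical `FakeDatabase` that keeps a set of valid passwords; a real database would express the same idea through its own user and role mechanisms.

```python
class FakeDatabase:
    """In-memory stand-in for a database that can hold several valid passwords."""

    def __init__(self, password):
        self.valid = {password}

    def add_password(self, password):
        self.valid.add(password)

    def revoke_password(self, password):
        self.valid.discard(password)

    def authenticate(self, password):
        return password in self.valid


db = FakeDatabase("V1")
store = {"current": "V1"}                    # the secrets store
cached = {"A": "V1", "B": "V1", "C": "V1"}   # each service's cached copy

db.add_password("V2")                        # Step 1: V1 and V2 both valid
store["current"] = "V2"                      # Step 2: store now serves V2
cached["A"] = store["current"]               # Step 3 in progress: only A refreshed
mid_window = {name: db.authenticate(pw) for name, pw in cached.items()}

for name in cached:                          # Step 3 complete: everyone refreshed
    cached[name] = store["current"]
db.revoke_password("V1")                     # Step 4: only V2 works
after = {name: db.authenticate(pw) for name, pw in cached.items()}

print(mid_window, after)
```

Every service authenticates successfully both in the middle of the window and after V1 is revoked, which is the property the single-key timeline lacks.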
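Most database systems can keep two credentials valid at once, though the mechanism differs: PostgreSQL by alternating between two roles that hold the same privileges, MySQL 8.0.14 and later through its dual-password feature (ALTER USER ... RETAIN CURRENT PASSWORD), MongoDB with multiple users. Some approaches require the application to support multiple credentials and try them in order; others handle it transparently at the database layer.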
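Where the application has to try credentials in order, the logic can stay small. The following is a sketch under stated assumptions: `connect` is a hypothetical callable standing in for a real driver call, and `AuthError` is a hypothetical exception that this callable raises when a password is rejected.

```python
from typing import Callable, Iterable


class AuthError(Exception):
    """Raised by the hypothetical connect function when a password is rejected."""


def connect_with_fallback(passwords: Iterable[str], connect: Callable[[str], object]):
    """Try each candidate password in order and return (connection, password)."""
    last_error = None
    for password in passwords:
        try:
            return connect(password), password
        except AuthError as exc:
            last_error = exc  # remember the failure and try the next candidate
    raise AuthError("all candidate passwords were rejected") from last_error


def fake_connect(password):
    # Stand-in for a real driver; this database has not yet been given V2.
    if password != "V1":
        raise AuthError("password rejected")
    return "connection"


conn, used = connect_with_fallback(["V2", "V1"], fake_connect)
print(used)
```

Trying the new password first means that consumers move to V2 as soon as the database accepts it, while V1 remains the fallback for the rest of the window.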
The same pattern applies to API keys, signing keys, encryption keys, and JWT secrets. Wherever a secret is shared between a verifier (the system that checks the secret) and one or more consumers, dual acceptance during the transition is the safety mechanism.
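For signing keys, dual acceptance means the verifier checks a signature against every key in the transition set. This sketch uses HMAC-SHA256 from the Python standard library; the key values, the message, and the `sign` and `verify` helper names are illustrative assumptions.

```python
import hashlib
import hmac


def sign(key: bytes, message: bytes) -> str:
    return hmac.new(key, message, hashlib.sha256).hexdigest()


def verify(message: bytes, signature: str, accepted_keys: list) -> bool:
    """Accept the signature if any key in the transition set produced it."""
    matched = False
    for key in accepted_keys:
        expected = sign(key, message)
        # compare_digest performs a constant-time comparison
        if hmac.compare_digest(expected, signature):
            matched = True
    return matched


old_key, new_key = b"key-v1", b"key-v2"   # hypothetical key material
message = b"order:1234"
old_sig = sign(old_key, message)

during_window = verify(message, old_sig, [new_key, old_key])  # both keys accepted
after_window = verify(message, old_sig, [new_key])            # old key dropped
print(during_window, after_window)
```

Ending the window is a one-line change on the verifier: remove the old key from the accepted list.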
What about secrets managers
A secrets manager (HashiCorp Vault, AWS Secrets Manager, GCP Secret Manager, Azure Key Vault) provides:
- Centralised storage with access control
- Audit logging on every access
- Programmatic rotation with hooks
- Versioning so old secrets remain accessible during transition
The secrets manager does not eliminate the dual-key requirement; it makes operating it tractable. The application still needs to handle “I might receive an old or new secret depending on cache state”.
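One way for an application to cope with cache state is to refetch the secret when authentication fails, instead of waiting for the cache to expire. This is a sketch only: `fetch` and `authenticate` are hypothetical callables standing in for a secrets-manager client and a database login, and `CachedSecret` is not part of any library.

```python
import time


class CachedSecret:
    """Cache a secret for ttl seconds; allow a forced refresh on auth failure."""

    def __init__(self, fetch, ttl, clock=time.monotonic):
        self._fetch = fetch        # hypothetical callable returning the current secret
        self._ttl = ttl
        self._clock = clock
        self._value = None
        self._expires_at = None

    def get(self, force=False):
        now = self._clock()
        if force or self._expires_at is None or now >= self._expires_at:
            self._value = self._fetch()
            self._expires_at = now + self._ttl
        return self._value


def call_with_refresh(secret, authenticate):
    """Try the cached secret; on rejection, refetch once and retry."""
    if authenticate(secret.get()):
        return True
    return authenticate(secret.get(force=True))


store = {"current": "V1"}
secret = CachedSecret(lambda: store["current"], ttl=3600)
secret.get()                       # service caches V1
store["current"] = "V2"            # rotation happens while the cache is warm
ok = call_with_refresh(secret, lambda pw: pw == "V2")
print(ok)
```

This does not replace the dual-key window. It only shortens the time a service spends on a stale value, and it should retry once rather than loop, so that a genuinely wrong secret does not turn into a flood of requests against the secrets manager.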
A typical setup:
    [Database / API / signing system]
            ^
            |  reads/validates with V1 OR V2
            |
    [Application services]
            ^
            |  pull current secret on cache refresh
            |
    [Secrets manager]
        - stores current version
        - rotates on schedule
        - logs every access
The rotation flow in this setup:
- Secrets manager generates new secret V2.
- Secrets manager updates the destination system to accept BOTH V1 and V2.
- Secrets manager updates the stored “current” to V2. Services pulling the secret now get V2.
- After the transition window, secrets manager updates destination to drop V1.
- Audit log captures every step with timestamp and identity.
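For supported databases, AWS Secrets Manager has native rotation support; its alternating-users strategy follows this flow by keeping the previous credential valid while the new one takes over. For custom systems, the team writes a small rotation function that the manager calls.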
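A custom rotation function can follow the same flow in a few lines. The sketch below is not the API of any particular secrets manager: `destination` (anything with `add` and `discard`, here a set), `store`, `audit`, and `wait_for_consumers` are hypothetical collaborators chosen to keep the example self-contained.

```python
import secrets
from datetime import datetime, timezone


def rotate(destination, store, audit, actor, wait_for_consumers):
    """Run one dual-key rotation and record every step in the audit log."""

    def record(step):
        audit.append({
            "at": datetime.now(timezone.utc).isoformat(),
            "actor": actor,
            "step": step,
        })

    old = store["current"]
    new = secrets.token_urlsafe(32)    # generate V2
    record("generated")

    destination.add(new)               # destination accepts both V1 and V2
    record("destination accepts both")

    store["current"] = new             # consumers pulling now receive V2
    record("store updated")

    wait_for_consumers()               # the transition window
    destination.discard(old)           # drop V1
    record("old secret revoked")
    return new


destination = {"V1"}
store = {"current": "V1"}
audit = []
new_secret = rotate(destination, store, audit, "rotation-bot", lambda: None)
print([entry["step"] for entry in audit])
```

In a real deployment `wait_for_consumers` would block for the transition window, or the revocation would run as a separate scheduled step; the no-op here only keeps the example fast.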
Short-lived dynamic credentials: the eventual answer
The fully evolved version of secret management does not rotate static secrets at all. Instead, the secrets manager issues short-lived credentials per session.
A service that needs database access requests credentials from the vault. The vault generates a unique username and password, valid for 1 hour. The service uses the credentials. After 1 hour they expire automatically.
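The issue-and-expire cycle can be illustrated with an in-memory issuer. This is a toy stand-in, not how Vault or any other product is implemented: `CredentialIssuer`, `Lease`, and the explicit `now` parameter are assumptions made to keep the example deterministic.

```python
import secrets
from dataclasses import dataclass


@dataclass(frozen=True)
class Lease:
    username: str
    password: str
    expires_at: float


class CredentialIssuer:
    """In-memory stand-in for a vault that issues per-request credentials."""

    def __init__(self, ttl_seconds=3600):
        self.ttl = ttl_seconds
        self.leases = {}

    def issue(self, service, now):
        # A unique username per request makes each lease individually auditable.
        username = f"{service}-{secrets.token_hex(4)}"
        lease = Lease(username, secrets.token_urlsafe(24), now + self.ttl)
        self.leases[username] = lease
        return lease

    def is_valid(self, username, password, now):
        lease = self.leases.get(username)
        return (lease is not None
                and now < lease.expires_at
                and secrets.compare_digest(lease.password, password))


issuer = CredentialIssuer(ttl_seconds=3600)
lease = issuer.issue("billing", now=0)
valid_early = issuer.is_valid(lease.username, lease.password, now=3599)
valid_late = issuer.is_valid(lease.username, lease.password, now=3600)
print(valid_early, valid_late)
```

The credential works one second before the hour is up and is rejected at the hour, without anyone running a rotation.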
Benefits:
- No long-lived secret to leak. Even if a credential is compromised, the exposure window is bounded by its lifetime (one hour in the example above).
- Per-service or per-instance credentials enable precise audit (who exactly accessed what).
- Rotation is not a separate operation; it happens implicitly per session.
Costs:
- Engineering investment to wire credential request into every service.
- Vault becomes a critical dependency; its outage is a service-wide outage.
- Complexity that pays off at scale but rarely at small scale.
Most teams progress in stages: static secrets in env vars (legacy), to static secrets in a manager (modern baseline), to short-lived dynamic credentials (evolved). Each transition should happen only when the team can operate the next stage.
Audit logs as the foundation
You cannot rotate a secret confidently if you do not know what is using it. Without audit logs, “rotate the database password” becomes “rotate the database password and find out which forgotten cron job breaks an hour later”.
The audit log should capture:
- Every secret read: when, by which service, by which identity
- Every secret write: when, by whom, what changed
- Every authentication attempt at the consumer: when, which credential, success or failure
Combined, these enable a confident rotation: “we are about to rotate this secret; the access log shows it is currently used by services A, B, C; we know to expect their auth attempts to migrate to the new secret over the next hour”.
Without the logs, every rotation is a leap of faith. With them, rotation is operational engineering.
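The pre-rotation check described above can be expressed as two small queries over the logs. The log record shapes here (`secret`, `service`, `at`, `version`, `ok`) are hypothetical; real audit logs will need to be mapped into something similar first.

```python
def consumers_of(read_log, secret_name):
    """Return the set of services that read the named secret."""
    return {e["service"] for e in read_log if e["secret"] == secret_name}


def migration_status(auth_log):
    """Map each service to the credential version of its latest successful auth."""
    latest = {}
    for event in sorted(auth_log, key=lambda e: e["at"]):
        if event["ok"]:
            latest[event["service"]] = event["version"]
    return latest


# Illustrative log entries only.
read_log = [
    {"secret": "db-password", "service": "A"},
    {"secret": "db-password", "service": "B"},
    {"secret": "db-password", "service": "C"},
    {"secret": "api-key", "service": "D"},
]
auth_log = [
    {"at": 1, "service": "A", "version": "V1", "ok": True},
    {"at": 2, "service": "B", "version": "V1", "ok": True},
    {"at": 3, "service": "C", "version": "V1", "ok": True},
    {"at": 4, "service": "A", "version": "V2", "ok": True},
    {"at": 5, "service": "C", "version": "V2", "ok": True},
]

expected = consumers_of(read_log, "db-password")
status = migration_status(auth_log)
still_on_old = sorted(s for s in expected if status.get(s) != "V2")
print(still_on_old)
```

An empty `still_on_old` list is the signal that the old secret can be revoked; a non-empty one names the services to chase before closing the window.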
Patterns that delay safe rotation
Secrets in source code, even encrypted ones (sealed secrets, sops). The secret is now in version control history; rotating means either rewriting history or accepting that the leaked version stays in the repo forever. Treat any committed secret as compromised: rotate it AND stop future secrets from being written through source control.
Shared static API keys across teams. Each team uses the same key. Rotation requires coordinating across teams. Solution: separate keys per team or per service so rotations are independent.
Long secret refresh intervals on the application side. If a service caches a secret for 24 hours, the rotation window has to be at least 24 hours. Either shorten the cache or accept the long window.
Manual secret distribution. Engineers copy secrets from notes apps, so the secret exists in unknown locations. Rotation does not remove those copies: the engineer’s notes still hold the old value, and someone may try to use it. Solution: a secrets manager with no manual distribution.
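The refresh-interval constraint above reduces to simple arithmetic: the old secret has to outlive the slowest consumer's cache. In this sketch the per-service cache lifetimes and the 30-minute safety margin are illustrative assumptions, not recommendations from the text.

```python
from datetime import timedelta


def minimum_transition_window(cache_ttls, safety_margin=timedelta(minutes=30)):
    """The old secret must stay valid for the longest cache TTL plus a margin."""
    return max(cache_ttls.values()) + safety_margin


cache_ttls = {
    "A": timedelta(minutes=5),
    "B": timedelta(hours=1),
    "C": timedelta(hours=24),   # the slow consumer sets the window for everyone
}
window = minimum_transition_window(cache_ttls)
print(window)
```

One service with a 24-hour cache forces a window of more than a day for every consumer of the secret, which is why shortening the longest cache usually helps more than tuning the others.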
What we install on engagements
For a team without a working secrets practice:
- Inventory secrets: every secret in active use, where it lives, who uses it, when last rotated.
- Pick a secrets manager: AWS / GCP / Azure native if you’re cloud-native, Vault for hybrid or larger deployments.
- Migrate secrets from env vars / source code into the manager.
- Add dual-acceptance to consumers: every service can authenticate with old or new secret during a transition window.
- Automate rotation for the most-touched secrets (database passwords, internal service tokens).
- Audit logging: every secret access logged, retained 90 days minimum.
- Policy: documented rotation cadence per secret class (monthly for high-sensitivity, quarterly default).
Total: 4-8 engineer-weeks for the first install, depending on service count. Pays back the first time a secret needs to be rotated under pressure (a leak, a departing engineer, a compliance requirement).
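The inventory and the cadence policy from the list above fit in one small data structure. This is a sketch: the field names, the sample records, and the mapping of "monthly" to 30 days and "quarterly" to 90 days are assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import date, timedelta

# Assumed translation of the policy: monthly = 30 days, quarterly = 90 days.
CADENCE = {"high": timedelta(days=30), "default": timedelta(days=90)}


@dataclass
class SecretRecord:
    name: str
    location: str        # where the secret lives
    consumers: list      # who uses it
    sensitivity: str     # key into CADENCE
    last_rotated: date


def overdue(inventory, today):
    """Return the names of secrets whose age exceeds their class cadence."""
    return [s.name for s in inventory
            if today - s.last_rotated > CADENCE[s.sensitivity]]


inventory = [
    SecretRecord("db-password", "secrets manager", ["A", "B", "C"],
                 "high", date(2024, 1, 1)),
    SecretRecord("internal-token", "env var", ["A"],
                 "default", date(2024, 1, 20)),
]
late = overdue(inventory, today=date(2024, 3, 1))
print(late)
```

Run on a schedule, a report like this surfaces secrets drifting past their cadence before they become “do not touch” entries.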
The teams that get this right rotate secrets as routine. The teams that get it wrong accumulate a “do not touch” list of secrets that grows quietly until the day rotation becomes mandatory at the worst possible moment.
Questions teams ask
How often should we rotate secrets?
Long-lived static secrets: at least quarterly, sometimes monthly for high-sensitivity. Dynamic credentials issued by a vault: hourly or per-session is reasonable. The right cadence is the shortest one the team can sustain without producing outages.
What about API keys to third parties?
Same pattern but harder because you do not control the third party. Dual-key works only if the third party allows multiple active keys. If they do not, schedule rotation for low-traffic windows and accept the risk.
Can secret rotation be fully automated?
Yes for some classes (database passwords, internal service-to-service tokens). Partially for others (cloud provider keys with provisioning chains). Full automation requires the dual-key pattern in every consumer; partial automation is a stepping stone.