zero-trust, retrofitted: how to add real identity to a legacy estate
the practical path from flat-network banking-style infrastructure to identity-aware service-to-service calls, without a two-year rewrite.
pramod
co-founder
Zero-trust is a decade old as a term and middle-aged as an architecture, and most enterprises I walk into still have networks whose trust model can be summarised as "if the packet reached you, it is friendly." The aspiration to zero-trust is universal; the path there, for anyone with a ten-year-old estate, is uncharted enough that most teams simply postpone it forever. This essay is the path we have taken at arkavix four times in the last two years. It is unglamorous and it works.
start from identity, not from network
The single most common mistake in a zero-trust retrofit is to start from the network. Teams reach for a service mesh, a new VPC topology, and a list of CIDR blocks, and six months later they have a nicer network carrying the same implicit trust.
The correct starting point is identity. Before you can say "this service can talk to that one", you need a working notion of what a service is. Concretely: every workload needs a cryptographic identity, issued at provisioning time, short-lived and automatically rotated, and attestable by anything that receives a request from it. If your cluster issues a long-lived API key to every workload at deploy time (which most do), you have the network of a zero-trust system but the identity of a 2014 system, and the identity will decide the outcome.
The workload identity layer is the unglamorous one. Nobody wants to spend a quarter on it because nobody demos it. Spend the quarter.
the four-phase plan that has worked
phase 1: inventory
You cannot secure what you cannot name. The first two weeks of every retrofit are spent producing a service inventory with the following fields: name, owner, language runtime, deployment mechanism, current credentials in use, data it reads, data it writes, and the last time its dependencies were updated.
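One way to keep the inventory machine-readable rather than a spreadsheet that rots: a sketch of the record as a data structure. The fields mirror the list above; the example service and its values are invented.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class ServiceRecord:
    """One row of the phase-1 service inventory."""
    name: str
    owner: str                 # owning team, not an individual
    runtime: str               # e.g. "jvm17", "python3.12"
    deployment: str            # e.g. "k8s", "vm-ansible"
    credentials: List[str]     # credentials currently in use -- the scary column
    reads: List[str]           # data it reads
    writes: List[str]          # data it writes
    deps_last_updated: str     # ISO date of the last dependency update

# Hypothetical example entry.
payments = ServiceRecord(
    name="payments-api",
    owner="team-payments",
    runtime="jvm17",
    deployment="k8s",
    credentials=["static-api-key:ledger", "db-password:payments-rw"],
    reads=["ledger-db"],
    writes=["payments-db", "audit-topic"],
    deps_last_updated="2024-01-15",
)
```

The `credentials` column is the one that pays for the whole exercise: it is the list phase 2 will empty out.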
Most estates cannot produce this document. That is the finding. You produce it.
phase 2: identity
Roll out a workload identity layer. In practice: SPIRE, or your cloud provider's workload identity equivalent, with short-lived tokens — minutes, not days — signed by a CA your team controls. Make this identity layer available to every workload without requiring code changes; a sidecar or an init container is fine as long as it is universal.
The acceptance criterion for phase 2 is that every workload in your estate can, on demand, produce a verifiable identity. You are not yet using it to authorise anything. You are just ensuring the substrate exists.
phase 3: service-to-service mTLS
Now you begin using the identity. Turn on mutual TLS between services, starting with the highest-value ones. A service mesh can do this for you; so can an Envoy sidecar; so can a library if your language population is small enough.
Critically: run in permissive mode first. Every request is allowed; every request is logged with the client and server identities. For four to six weeks, you are producing a connection graph of your estate. Most teams are surprised by what they find — abandoned services making calls, services calling each other through a load balancer when they could call directly, services that claim to be decommissioned but are receiving traffic.
Once the graph stabilises, move to enforcing mode: requests without a valid client identity are rejected. You have now made impersonation architecturally difficult rather than conventionally unlikely.
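As a sketch of what the permissive-mode weeks actually produce, assume a hypothetical log format of `client -> server` identity pairs; folding the logs into a connection graph also gives you the enforcing-mode decision for free (identities and log lines are invented for illustration):

```python
from collections import defaultdict

# Hypothetical permissive-mode log lines: "client_identity -> server_identity".
LOG_LINES = [
    "spiffe://example.org/payments-api -> spiffe://example.org/ledger",
    "spiffe://example.org/payments-api -> spiffe://example.org/ledger",
    "spiffe://example.org/reports-batch -> spiffe://example.org/ledger",
    "UNAUTHENTICATED -> spiffe://example.org/ledger",  # no client identity presented
]

def build_graph(lines):
    """Aggregate observed client -> server edges with call counts."""
    graph = defaultdict(lambda: defaultdict(int))
    for line in lines:
        client, server = (part.strip() for part in line.split("->"))
        graph[server][client] += 1
    return graph

def surviving_callers(graph, server):
    """Enforcing mode: only callers with a valid identity keep access."""
    return sorted(c for c in graph[server] if c.startswith("spiffe://"))

graph = build_graph(LOG_LINES)
```

In the sketch, flipping to enforcing mode is the moment the `UNAUTHENTICATED` edge stops being a log line and starts being a rejected request.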
phase 4: authorisation policy
The final phase is authorisation — the policy that says "identity A is allowed to call identity B for purpose C." This is where most implementations get elaborate. They should not.
The simplest authorisation policy that covers 90% of the blast-radius reduction is: every service declares, in its manifest, the set of services it is allowed to call. The service mesh enforces that set. That is it. No RBAC DSL, no policy language, no centralised authorisation service. The policy is declarative, version-controlled, and reviewed in the same pull request as the code that introduces a new dependency.
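A minimal sketch of that policy, assuming the per-service manifests have already been parsed into allow-lists (service names are hypothetical):

```python
# Hypothetical parsed manifests: each service declares who it may call.
MANIFESTS = {
    "payments-api":  {"allowed_callees": ["ledger", "audit-log"]},
    "reports-batch": {"allowed_callees": ["ledger"]},
}

def is_call_allowed(client: str, server: str) -> bool:
    """Mesh-side check: identity A may call identity B only if A declared B."""
    manifest = MANIFESTS.get(client)
    return manifest is not None and server in manifest["allowed_callees"]
```

Adding a new dependency means adding one line to `allowed_callees`, which lands in the same pull request as the code that introduces the call, which is the entire review mechanism.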
The remaining 10% — data-level authorisation, user-context propagation, delegation — is real work and should be done, but only after the easy 90% is in place. Do not let the hard part block the easy part.
what this buys you
At the end of four phases, typically six to nine months of focused work, you have:
- every workload identifiable at the network layer
- every service-to-service call carrying mTLS with short-lived credentials
- a declarative, reviewed allow-list of who can call whom
- an audit trail of every rejected call
- the ability to revoke a workload's access in seconds by revoking its identity
You do not have zero-trust end-to-end. You have the scaffolding on which every further control — data classification, user context, delegated authorisation — can be hung. The usual mistake is to try to hang all of them at once. The less usual but more fatal mistake is to try to hang any of them without the scaffolding.
the two things teams get wrong
The first is treating this as a security team project. A security-team-only zero-trust rollout becomes a parallel organisation inside engineering, and parallel organisations lose. The project has to be owned by platform engineering with security as a demanding customer. The deliverable is a paved road every product team would rather use than not.
The second is trying to finish. Zero-trust is not a state; it is a property of the system that has to be defended against every new shortcut. A retrofit that ends with "we are zero-trust now" has just begun; a retrofit that ends with "here is how the next ten services will inherit the property automatically" has actually delivered.