System Design

Designing Distributed Systems for Teams, Not Just Traffic

Most distributed systems writing focuses on scale: throughput, latency, fault tolerance. These matter. But the systems that actually fail in production usually fail because they were too complex for the team operating them. Here is how we think about this.

2025-04-02
16 min
SenForge Engineering

There is a version of distributed systems thinking that treats human beings as a constraint to be optimised around. More services means more independent deployments. More independent deployments means more team autonomy. Therefore: more services.

The logic is not wrong. It is incomplete. The part it skips is the operational complexity that lands on the team when 23 services each have their own deployment pipeline, their own logging format, their own alerting thresholds, and their own database migration strategy.

The Cognitive Load Budget

Every service in a distributed system has an operational surface area: deployment, configuration, observability, failure modes, and dependencies. A team of 8 engineers can fully understand and operate perhaps 4-6 production services before cognitive load starts degrading incident response time and increasing change failure rate.

The right number of services for a system is not determined by the traffic it handles. It is determined by the team that has to wake up at 3am when one of them fails.

Service Boundaries Should Follow Conway's Law Deliberately

Conway's Law states that systems tend to mirror the communication structure of the organisations that build them. Most teams treat this as a warning. We treat it as a design tool.

If your team has a clear ownership boundary between two domains — say, billing and fulfilment — a service boundary there is defensible. If your team has no such boundary, a service boundary there creates coordination overhead with no compensating autonomy benefit. You have distributed the system without distributing the team.

Operational Standardisation as a Force Multiplier

The teams that operate distributed systems well have invested heavily in standardisation. Not in the services themselves — each service is different by necessity — but in the operational layer:

- A single structured logging format across all services
- Standardised health check and readiness endpoints
- A single observability platform with consistent metric naming conventions
- Runbooks co-located with the service code, updated with every incident
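A shared logging format can be as small as one formatter class that every service imports. Here is a minimal sketch in Python; the field set (`ts`, `level`, `service`, `msg`) is illustrative, not a standard — what matters is that every service emits the same one:

```python
import json
import logging

class StructuredFormatter(logging.Formatter):
    """One JSON object per log line, identical fields in every service."""

    def __init__(self, service: str):
        super().__init__()
        self.service = service

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "ts": round(record.created, 3),  # epoch seconds
            "level": record.levelname,
            "service": self.service,
            "msg": record.getMessage(),
        }
        # Keep the stack trace in the same line, so 2am debugging
        # never involves correlating multi-line output across services.
        if record.exc_info:
            payload["exc"] = self.formatException(record.exc_info)
        return json.dumps(payload)

# Each service configures logging identically, varying only its name.
handler = logging.StreamHandler()
handler.setFormatter(StructuredFormatter(service="billing"))
logger = logging.getLogger("billing")
logger.addHandler(handler)
logger.setLevel(logging.INFO)
logger.info("invoice created")
```

Because every service's output parses the same way, dashboards, alerts, and log queries written for one service work for all of them.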

When an engineer unfamiliar with a service has to diagnose it at 2am, standardisation is the difference between a 20-minute resolution and a 4-hour incident.
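Health and readiness endpoints are equally cheap to standardise. A sketch of one such convention, using only the standard library — the paths (`/healthz`, `/readyz`) and payload shape are our illustrative choices, not a universal standard:

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

def readiness_checks() -> dict[str, bool]:
    """Each service supplies its own dependency checks in this shape."""
    return {"database": True, "message_queue": True}  # stubbed for illustration

class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/healthz":
            # Liveness: the process is up and serving requests.
            body, status = {"status": "ok"}, 200
        elif self.path == "/readyz":
            # Readiness: every dependency check must pass.
            checks = readiness_checks()
            ready = all(checks.values())
            body = {"status": "ok" if ready else "unavailable", "checks": checks}
            status = 200 if ready else 503
        else:
            body, status = {"status": "not found"}, 404
        data = json.dumps(body).encode()
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(data)))
        self.end_headers()
        self.wfile.write(data)

# To serve: HTTPServer(("", 8080), HealthHandler).serve_forever()
```

With every service answering the same two paths in the same shape, load balancers, orchestrators, and the engineer paged at 2am all probe unfamiliar services the same way.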

The Modular Monolith as an Intermediate Step

For teams of under 15 engineers building a greenfield system, we frequently recommend starting with a well-structured modular monolith. Clear module boundaries with enforced package visibility, explicit interfaces between domains, a single deployment unit. When a genuine scalability or team-autonomy bottleneck appears, extracting a service from a well-bounded module is straightforward. Extracting a service from a tangled monolith is a multi-month project.
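The "explicit interfaces between domains" point can be made concrete even in languages without enforced package visibility. A minimal sketch, using hypothetical billing and fulfilment modules: fulfilment depends on a narrow interface rather than on billing's internals, so billing can later be extracted into a service behind the same interface:

```python
from dataclasses import dataclass
from typing import Protocol

class BillingPort(Protocol):
    """The only surface of billing that fulfilment is allowed to see."""
    def is_paid(self, order_id: str) -> bool: ...

# billing module: free to change internals without touching fulfilment.
class BillingService:
    def __init__(self) -> None:
        self._paid_orders: set[str] = set()

    def record_payment(self, order_id: str) -> None:
        self._paid_orders.add(order_id)

    def is_paid(self, order_id: str) -> bool:
        return order_id in self._paid_orders

# fulfilment module: depends only on the interface, never the class above.
@dataclass
class FulfilmentService:
    billing: BillingPort

    def ship(self, order_id: str) -> str:
        if not self.billing.is_paid(order_id):
            return "held: awaiting payment"
        return "shipped"
```

Extracting billing later means swapping `BillingService` for an HTTP client that satisfies the same `BillingPort`; fulfilment's code does not change. That is the "straightforward" extraction described above, and it is exactly what a tangled monolith makes impossible.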

Design the system for the team you have. Build in the ability to evolve it for the team you will become.