75ms or Bust: Accelerating Development Velocity Through Hard Performance Limits
By Graham Goudeau, CTO — September 5, 2025
# BLUF
At aethren, all web apps must produce a response in 75 milliseconds, or we intentionally fail the request with an HTTP 500. This entirely self-imposed restriction has had the surprising side effect of boosting development velocity: with only 75ms to respond, there simply isn't time on the request path to do anything complicated.
In this post we will:
- Describe the technical details of our approach,
- Illustrate the second-order effect it has had on development velocity, and then
- Show how other engineering orgs can protect user experiences by adopting this practice
# Imposing global response timeouts
## Why 75 milliseconds?
Our choice of 75 milliseconds is more vibes-based than science-based. We wanted an ambitious target (though we recognize that 75ms is not ambitious in every programming context) while leaving enough headroom to absorb the occasional random slowdown of a few tens of milliseconds from our underlying infra.
With our hard limit of 75 milliseconds spent in the server, we can ensure high-quality interaction time for users across the United States even before engaging in difficult multi-region deployment work.
## Are there any exceptions?
In general, no. We have a firm expectation that any web endpoint handler will complete its work within 75 milliseconds.
We do recognize, however, that exceptional circumstances will arise that necessitate choosing a degraded user experience over no user experience at all. For example, imagine a disaster scenario where our persistence layer suddenly takes 100ms to respond due to a regression in an upstream provider. In incident scenarios, we are able to turn off the 75ms limit at runtime with a feature flag.
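As a minimal sketch of how such a kill switch can hang together (our real feature-flag client is internal; the `enforceDeadline` toggle below is a stand-in for it):

```go
package middleware

import (
	"net/http"
	"sync/atomic"
)

// enforceDeadline stands in for our real feature-flag client (which is
// internal); flipping it to false disables the 75ms limit at runtime,
// no deploy required.
var enforceDeadline atomic.Bool

func withKillSwitch(limited, unlimited http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		// Checked on every request so an incident-time flag flip takes
		// effect immediately.
		if enforceDeadline.Load() {
			limited.ServeHTTP(w, r)
			return
		}
		unlimited.ServeHTTP(w, r)
	})
}
```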
## How do you “shift left” on catching timing violations?
Outside of production, our persistence layer is:
- In local dev: a JSON file in `/tmp`
- In unit tests, which exercise our HTTP timeout middleware: an in-memory fake
In either case, we artificially inject many-millisecond delays when reading or writing. This ensures that we're simulating network hiccups as often as possible instead of getting surprised at the last minute in production.
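As a concrete sketch of what such a fake can look like (the `Store` type and the delay range here are illustrative, not our real internals):

```go
package fakestore

import (
	"math/rand"
	"sync"
	"time"
)

// Store is an in-memory fake of our persistence layer; the real
// interface is internal, so these names are illustrative.
type Store struct {
	mu   sync.RWMutex
	data map[string][]byte
}

func NewStore() *Store {
	return &Store{data: make(map[string][]byte)}
}

// injectDelay simulates a network hiccup of several milliseconds so
// that slow call chains blow the 75ms budget in dev and CI, not in
// production.
func injectDelay() {
	time.Sleep(time.Duration(5+rand.Intn(20)) * time.Millisecond)
}

func (s *Store) Get(key string) ([]byte, bool) {
	injectDelay()
	s.mu.RLock()
	defer s.mu.RUnlock()
	v, ok := s.data[key]
	return v, ok
}

func (s *Store) Put(key string, value []byte) {
	injectDelay()
	s.mu.Lock()
	defer s.mu.Unlock()
	s.data[key] = value
}
```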
## How does the code work?
Our tech stack at aethren is dead simple; as of writing, GitHub's language breakdown of our monorepo shows:
- 95.5% Go — produces server-side-rendered HTMX
- 1.8% HTML — a small number of hard-coded pages
- 1.3% JavaScript — extremely limited client-side interactivity not achievable with plain HTMX
Our web apps exclusively register their endpoint handlers through Go's excellent `net/http` from the standard library:
```go
package someroute

import (
	"context"
	"net/http"

	"aethren.com/monorepo/lang/aethrenhttp"
	"aethren.com/monorepo/lang/middleware"
)

func SetupHandlers() {
	http.HandleFunc("/some-unauthenticated-route", middleware.HandlerPlainHttpUnauthenticated(func(ctx context.Context, r *http.Request) aethrenhttp.Response {
		// do some work and return a response
	}))
}
```
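For context, here's one way an adapter like `HandlerPlainHttpUnauthenticated` can bridge that typed signature to plain `net/http`. The real adapter and `aethrenhttp.Response`'s fields are internal, so the `Status` and `Body` fields below are assumptions for the sketch:

```go
package middleware

import (
	"context"
	"net/http"
)

// Response stands in for aethrenhttp.Response; its real fields are
// internal, so Status and Body here are illustrative assumptions.
type Response struct {
	Status int
	Body   []byte
}

// HandlerPlainHttpUnauthenticated adapts our typed handler signature
// to a plain http.HandlerFunc, so every endpoint flows through one
// shared middleware stack (a sketch; the real adapter is internal).
func HandlerPlainHttpUnauthenticated(h func(context.Context, *http.Request) Response) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		resp := h(r.Context(), r)
		w.WriteHeader(resp.Status)
		w.Write(resp.Body)
	}
}
```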
Our middleware stack lets us inject a global response timeout across the entire web app. The handler runs in its own goroutine and writes into a buffered recorder, so a handler that blows the deadline can never race with the 500 we send:
```go
package middleware

import (
	"context"
	"net/http"
	"net/http/httptest"
	"time"
)

const timeout = 75 * time.Millisecond

func timeoutMiddleware(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		ctx, cancel := context.WithTimeout(r.Context(), timeout)
		defer cancel()
		r = r.WithContext(ctx)

		// The handler writes into a buffer so that, if it loses the
		// race below, it can never write to w concurrently with our
		// 500. (httptest.ResponseRecorder is a convenient stdlib
		// buffer; the stdlib's http.TimeoutHandler works similarly
		// but returns a 503 instead.)
		rec := httptest.NewRecorder()
		done := make(chan struct{})
		go func() {
			next.ServeHTTP(rec, r)
			close(done)
		}()

		select {
		case <-ctx.Done():
			// Deadline blown: fail the request on purpose. A
			// well-behaved handler observes ctx cancellation and bails
			// out; anything it writes after this point lands in the
			// discarded buffer.
			http.Error(w, "request exceeded 75ms deadline", http.StatusInternalServerError)
		case <-done:
			// Finished in time: replay the buffered response.
			for k, vs := range rec.Header() {
				w.Header()[k] = vs
			}
			w.WriteHeader(rec.Code)
			w.Write(rec.Body.Bytes())
		}
	})
}
```
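Applying the middleware is then a one-time wiring step at server startup. A sketch, assuming the middleware is exported from its package as `middleware.Timeout` (that exported name is hypothetical):

```go
package main

import (
	"log"
	"net/http"

	"aethren.com/monorepo/lang/middleware"
)

func main() {
	// Endpoints registered via http.HandleFunc land on the default
	// mux; wrapping that mux once puts every endpoint behind the
	// 75ms deadline.
	log.Fatal(http.ListenAndServe(":8080", middleware.Timeout(http.DefaultServeMux)))
}
```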
# Second-order positive impacts on developer velocity
With this global response time limit in place, all web development at aethren begins with the assumption that we WILL return a response in 75ms, or the feature won't work. All design work starts from that constraint and works backwards.
With only 75ms, complexity becomes self-limiting. Tall call stacks, heavy abstractions, or slow libraries simply don't fit, so they fall away by design. We must design our persistence strategy with the goal of data retrieval being possible well within 75ms. We have natural incentives to avoid doing heavy and complicated work on synchronous hot paths, and to instead smartly defer that work to “offline” asynchronous jobs.
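To illustrate that hot-path/offline split, here's a minimal sketch; the `jobs` package is hypothetical, and our real primitive is considerably more featureful:

```go
package jobs

// Job is a hypothetical offline-work unit; our real primitive is
// richer, but the shape of the idea is the same.
type Job struct {
	Kind    string
	Payload []byte
}

var queue = make(chan Job, 1024)

// Enqueue is all the hot path ever pays for: a non-blocking channel
// send. It reports false when the queue is full so callers can degrade
// gracefully instead of stalling the request.
func Enqueue(j Job) bool {
	select {
	case queue <- j:
		return true
	default:
		return false
	}
}

// Run drains the queue off the request path, where slow work (emails,
// report generation, third-party calls) can't blow a 75ms budget.
func Run(process func(Job)) {
	for j := range queue {
		process(j)
	}
}
```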
Despite this sounding complicated, it's been quick to get into this mindset and get comfortable living there. We've been forced to build ourselves high-quality persistence and offline-job primitives that reliably accomplish our performance goals, and then reuse those primitives everywhere. The end result has been a development process that ships quickly and doesn't spend lots of time building complicated and expensive towers of abstractions.
This policy reflects a broader mentality that has proven successful for us:
> Explicitly encode your assumptions, and then fail loudly and quickly if you stray from the golden path.
At our early stage, this mentality is critical for us to move fast while not drowning in ambiguity and complexity.
# Call to action: Introduce your own per-endpoint response time ratchets
aethren was able to adopt this practice while still at an early stage; we recognize that most teams are managing established applications that have grown organically over many years. Despite this difference in context, you can begin protecting your user experiences TODAY through a very similar methodology.
Our recommended approach is:
- Identify one endpoint where the end-user latency is critical
- Introduce an extremely generous response time limit; e.g. if your page typically loads in 1.5s, set a 3s timeout
- Slowly ratchet that limit down over time; e.g. a 6-month plan to reduce from 3s to 1s (see the sketch after this list)
- Treat timeout violations as symptoms, not root causes. Don't increase your budget; fix the regression
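Here's a minimal sketch of what such a per-endpoint ratchet could look like; the paths and budgets are illustrative:

```go
package middleware

import (
	"context"
	"net/http"
	"time"
)

// perEndpointBudget holds the current ratchet values; tighten them as
// your plan progresses. Paths and numbers here are illustrative.
var perEndpointBudget = map[string]time.Duration{
	"/checkout": 3 * time.Second, // start generous relative to today's latency, then ratchet toward 1s
}

func ratchet(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		budget, ok := perEndpointBudget[r.URL.Path]
		if !ok {
			// Endpoints not yet enrolled in the ratchet are untouched.
			next.ServeHTTP(w, r)
			return
		}
		// Handlers and downstream calls that honor ctx fail fast once
		// the budget is exceeded.
		ctx, cancel := context.WithTimeout(r.Context(), budget)
		defer cancel()
		next.ServeHTTP(w, r.WithContext(ctx))
	})
}
```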
While this process can be difficult, the end result is meaningful improvements in latency and a hard defensive line against performance regressions.