Performance engineering as a habit, not a project
Most performance work is reactive: the bill goes up, the dashboard turns red, somebody opens an investigation ticket. Here's how to make it boring instead.
Most performance work in the industry happens in one of two modes.
The first is heroics. Somebody — usually a tenured engineer with a
reputation for caring about this stuff — disappears for three weeks
with a stack of profilers, comes back with a deck full of flamegraphs,
deletes a toString() somewhere, drops the cloud bill by 18%, and is
celebrated at the next all-hands. Within four months the bill is back
where it was. The engineer is now annoyed. Nobody can pinpoint exactly
when the regression came back. It just sort of… did.
The second is reactive. A latency dashboard turns red. PagerDuty
goes off. A small task force is convened in a war room called
something like #perf-tigers-q3. The fires get put out. The task
force dissolves. Within four months the dashboard turns red again,
sometimes for the same reason.
Both modes work, sort of. They also share a fatal property: they treat performance as a project. Projects end. The regression doesn’t care that the project ended. It just patiently waits for the next quarter and starts climbing again.
The teams I’ve seen ship genuinely fast systems do something else. They don’t have a “performance initiative” with a project lead and an OKR. They have a habit. The habit is boring. The boringness is the point.
The habit, in three rules
If I had to compress eight years of doing this into a single page:
- Measure on every PR. No exceptions, no opt-outs, no “we’ll add benchmarks once the feature stabilises.”
- Compare against yesterday, not against an SLO. SLOs catch fires. Diffs catch the people lighting the matches.
- Make regression a build failure, not a Slack ping somebody gets around to reading on Friday.
That’s it. That’s most of the trick. Everything below is plumbing, diff-formatting, and politics. Particularly the politics.
Why measuring every PR matters more than you think
Almost no production performance regressions come from a single bad commit. They come from twenty 0.6% regressions, none of which is big enough to be worth arguing about in code review, stacked over a quarter. By the time anyone notices, the original cause is buried in a git log nobody is going to bisect.
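To put a number on that stacking, here is a minimal sketch of the arithmetic. The function name is illustrative, and it assumes each regression multiplies the cost left by the previous one rather than adding to it.

```python
def compounded_pct(per_change_pct: float, n_changes: int) -> float:
    """Total percentage slowdown after n_changes multiplicative regressions."""
    factor = (1 + per_change_pct / 100) ** n_changes
    return (factor - 1) * 100


# Twenty regressions of 0.6% each, none worth arguing about on its own.
print(f"{compounded_pct(0.6, 20):.1f}%")  # 12.7%
```

Under that assumption, twenty changes that each look like noise add up to something a reviewer would have argued about immediately had it landed in one commit.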
The defence is a benchmark suite that runs on every PR. Same shape as your test suite. Not a separate thing. Not a “performance team” thing. Just part of CI.
It doesn’t have to be elaborate. It has to be:
- Reliable. Hermetic runner, pinned hardware, no noisy neighbours. If your CI runs on shared compute in a hyperscaler, do the perf runs on a dedicated machine somewhere quiet. I’ve used a literal Mac Mini under a desk for this. It worked great.
- Statistically honest. One run is not a benchmark, it’s a coin flip. I aim for at least eight runs, with benchstat (Go), criterion (Rust), or pytest-benchmark (Python) doing the statistical heavy lifting.
- Boringly visible. The diff has to show up in the PR. Slack doesn’t count. Email doesn’t count. Anything that requires a human to go look at it will, over a long enough timeline, be forgotten.
For a Go service, the bones of it look like:
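package bench

import (
	"context"
	"testing"
)

// newTestServer and validOrder are assumed to be test helpers
// defined elsewhere in this package.
func BenchmarkCreateOrder(b *testing.B) {
	srv := newTestServer(b)
	b.ReportAllocs() // report allocations per operation alongside timings
	b.ResetTimer()   // exclude server setup from the measurement
	for i := 0; i < b.N; i++ {
		if _, err := srv.CreateOrder(context.Background(), validOrder()); err != nil {
			b.Fatal(err)
		}
	}
}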
Five lines, plus setup. Run it on every PR. Store the results. That’s it.
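On the "statistically honest" point: the tools named above ship their own significance tests, and those are what you should use in CI. Purely to illustrate why eight runs per side is enough to separate signal from noise, here is a sketch that swaps in an exact permutation test on the difference in means. The function name and the sample timings are made up for the example.

```python
from itertools import combinations
from statistics import mean


def permutation_p_value(base, head):
    """Two-sided exact permutation test on the difference in mean run time."""
    pooled = list(base) + list(head)
    n_base = len(base)
    total = sum(pooled)
    observed = abs(mean(head) - mean(base))
    at_least_as_extreme = 0
    splits = 0
    # Try every way of relabelling the pooled runs as "base" and "head".
    for indices in combinations(range(len(pooled)), n_base):
        base_sum = sum(pooled[i] for i in indices)
        base_mean = base_sum / n_base
        head_mean = (total - base_sum) / (len(pooled) - n_base)
        # Small tolerance so floating-point noise does not hide exact ties.
        if abs(head_mean - base_mean) >= observed - 1e-9:
            at_least_as_extreme += 1
        splits += 1
    return at_least_as_extreme / splits


base = [410, 412, 411, 413, 414, 409, 412, 415]  # microseconds, made-up runs
head = [436, 438, 437, 439, 440, 435, 438, 441]
print(permutation_p_value(base, head) < 0.01)  # True
```

With eight runs per side there are 12,870 possible relabellings, so the test is cheap to run exhaustively. With a single run per side there are only two, which is the coin flip.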
Compare against yesterday, not against an SLO
SLOs are great. SLOs catch fires, page on-call, force prioritization, all the things they’re supposed to do.
SLOs are also completely useless at catching the kind of regression I’m describing. A 4% bump in p99 latency does not violate any reasonable SLO. Compounded across fifty deploys, it multiplies your tail latency roughly sevenfold (1.04^50 ≈ 7.1). You will not catch this with thresholds. You will always catch it with diffs.
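A small simulation makes the gap between the two kinds of check visible. Every number here is hypothetical: a service starting at a 100 ms p99, an SLO of 500 ms, a 4% regression on every deploy, and a diff check that flags anything above 3%.

```python
def first_alerts(start_ms, slo_ms, growth_pct, deploys, diff_threshold_pct):
    """Return (deploy at which the SLO first trips, deploy at which the diff first trips)."""
    p99 = start_ms
    slo_deploy = None
    diff_deploy = None
    for deploy in range(1, deploys + 1):
        previous = p99
        p99 *= 1 + growth_pct / 100
        change_pct = (p99 - previous) / previous * 100
        if diff_deploy is None and change_pct > diff_threshold_pct:
            diff_deploy = deploy
        if slo_deploy is None and p99 > slo_ms:
            slo_deploy = deploy
    return slo_deploy, diff_deploy


print(first_alerts(100, 500, 4, 50, 3))  # (42, 1)
```

In this made-up scenario the threshold stays quiet for forty-one deploys, and the diff complains on the first one.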
The single most useful artefact in a perf-focused team’s day is not a Grafana dashboard. It’s a comment in a PR that looks like this:
benchmark                  before     after      delta
---------------------------------------------------------
BenchmarkCreateOrder       412 µs     438 µs     +6.3% *
BenchmarkListOrders         91 µs      93 µs     +2.2%
BenchmarkAuth               18 µs      19 µs     +5.6% *
BenchmarkSerializeOrder    7.2 µs     7.3 µs     +1.4%

* indicates statistically significant (n=8, p<0.01)
Posted by a bot. Before human review. Every PR.
The point of this is not really the numbers. The point is social. Once “+6.3%” is sitting in the review thread, the conversation changes. The author justifies it (“we added a required signature check, this is expected”), or they back it out, or they file a follow-up. Either way: the regression is seen. Seen regressions get fixed. Invisible ones don’t.
This is, deeply, a sociotechnical fix wearing a technical hat. The tooling is the easy part.
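To show how little tooling the comment itself needs, here is a sketch of a formatter for it. The function name and the input shape (name, before, after, and a significance flag computed elsewhere) are assumptions for the example; posting the result to the PR is left out.

```python
def format_comment(rows):
    """rows: (name, before_us, after_us, significant) tuples, times in microseconds."""
    lines = [f"{'benchmark':<26}{'before':>10}{'after':>10}{'delta':>9}"]
    lines.append("-" * 57)
    for name, before, after, significant in rows:
        delta = (after - before) / before * 100
        flag = " *" if significant else ""
        lines.append(
            f"{name:<26}{before:>7g} µs{after:>7g} µs{delta:>+8.1f}%{flag}"
        )
    return "\n".join(lines)


print(format_comment([
    ("BenchmarkCreateOrder", 412, 438, True),
    ("BenchmarkSerializeOrder", 7.2, 7.3, False),
]))
```

The signed delta is the part worth keeping even if everything else about the layout changes, because it is the number the author has to answer for.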
Make regression a build failure
The first time you turn this on, CI will be on fire for a week.
This is good. You are paying down debt you’d otherwise pay in production, in a smaller and more controlled way. The week sucks. The quarters that follow are noticeably calmer.
A reasonable opening policy (yours should get stricter over time):
- Block PRs with a >10% regression on hot-path benchmarks.
- Warn but don’t block for 3–10%.
- Allow opt-out — with a written justification in the PR description.
That last bit is the important one. “Migrating to a new crypto library, +14% but mandated for FIPS, follow-up tracked at JIRA-1234” is a perfectly fine justification. “Refactor, will fix later” is exactly how regressions get accepted, and it’s the thing the policy exists to stop. The justification doesn’t have to be approved by anyone, it just has to exist. The act of writing it forces the author to think about whether the regression is actually fine.
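The three tiers can be sketched as a small decision function. The thresholds come from the policy above; the `perf-justification:` marker in the PR description is a hypothetical convention invented for this example, as are the function names and verdict strings.

```python
import re

BLOCK_ABOVE_PCT = 10.0  # block PRs regressing more than this
WARN_ABOVE_PCT = 3.0    # warn from here up to the blocking threshold


def has_justification(pr_description: str) -> bool:
    # Hypothetical convention: a line "perf-justification: <some text>".
    pattern = r"^perf-justification:\s*\S+"
    return re.search(pattern, pr_description, re.MULTILINE | re.IGNORECASE) is not None


def verdict(regression_pct: float, pr_description: str = "") -> str:
    if regression_pct > BLOCK_ABOVE_PCT:
        # Nobody approves the justification; it only has to exist.
        if has_justification(pr_description):
            return "allow-with-justification"
        return "block"
    if regression_pct >= WARN_ABOVE_PCT:
        return "warn"
    return "pass"


print(verdict(14.0))                                           # block
print(verdict(14.0, "perf-justification: FIPS crypto, +14%"))  # allow-with-justification
print(verdict(6.3))                                            # warn
```

Note that this check only confirms that a justification was written. Judging whether it is a good one stays with the reviewers.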
A skeleton GitHub Actions job:
name: perf
on: [pull_request]
jobs:
  bench:
    runs-on: self-hosted-perf  # pinned hardware, not a hyperscaler runner
    steps:
      - uses: actions/checkout@v4
        with: { fetch-depth: 2 }
      - name: Bench base
        run: |
          git checkout $
          go test -bench=. -benchmem -count=8 ./... | tee /tmp/base.txt
      - name: Bench head
        run: |
          git checkout $
          go test -bench=. -benchmem -count=8 ./... | tee /tmp/head.txt
      - name: Compare and enforce
        run: |
          benchstat /tmp/base.txt /tmp/head.txt | tee /tmp/diff.txt
          ./scripts/enforce-budget.py /tmp/diff.txt --max-regression 0.10
You can buy fancier versions of this off the shelf. They are not, in my experience, better. The simple version above has shipped to production three times for me and worked all three times.
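The workflow calls ./scripts/enforce-budget.py, which isn't shown here. As a rough sketch of what such a script might do, assuming each row of the diff starts with the benchmark name and carries a signed percentage when the change is significant: the exact benchstat layout varies between versions, so the regex and the sample rows in the usage note are illustrative rather than a parser for any specific release. It also treats time, bytes, and allocation deltas alike, which a real script would probably want to separate.

```python
import re

# Matches a signed percentage such as "+6.31%" or "-2.10%".
# Noise figures like "± 1%" carry no sign and are ignored.
DELTA_RE = re.compile(r"([+-]\d+(?:\.\d+)?)%")


def find_violations(diff_text, max_regression):
    """Return (benchmark, fraction) pairs whose slowdown exceeds max_regression."""
    violations = []
    for line in diff_text.splitlines():
        match = DELTA_RE.search(line)
        if not line.strip() or match is None:
            continue  # headers, blank lines, rows with no significant delta
        name = line.split()[0]
        fraction = float(match.group(1)) / 100
        if fraction > max_regression:
            violations.append((name, fraction))
    return violations


def main(argv):
    # Mirrors the workflow step: enforce-budget.py DIFF --max-regression 0.10
    path = argv[1]
    limit = float(argv[argv.index("--max-regression") + 1])
    with open(path) as fh:
        violations = find_violations(fh.read(), limit)
    for name, fraction in violations:
        print(f"{name}: {fraction:+.1%} exceeds budget of {limit:.0%}")
    return 1 if violations else 0  # non-zero exit code fails the CI step
```

In a real script the last line would be `sys.exit(main(sys.argv))`, so that a violation turns into a failed build rather than a log line. Given a row like `LegacyExport-8  100µs ± 1%  114µs ± 1%  +14.00%  (p=0.000 n=8+8)` and a limit of 0.10, `find_violations` returns that benchmark with a fraction of 0.14.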
The part that’s actually hard
Tools are 20% of this. The other 80% is the team agreeing, out loud, that performance is a feature. Not a nice-to-have. Not a “non-functional requirement”, which is a phrase invented by people who didn’t want to do the work. A feature. Same priority as “can the user log in”.
I will spare you the inspirational paragraph. Here are the small rituals that, in practice, calcify the habit:
- A 30-minute weekly perf review. Fixed dashboard. What got faster this week, what got slower, what’s the spend trend. Cancel it exactly zero times in the first three months even if there’s “nothing to discuss”.
- Performance budgets in design docs. “This endpoint must serve p99 < 80ms at 1000 RPS on a standard pod” should be a sentence in the doc before any code gets written. If it isn’t, the doc isn’t done.
- Profile every release. Save the flamegraph. Diff against last release. You’d be amazed what shows up.
The last one is my favourite, and the one that gets dropped first. Don’t drop it.
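Diffing profiles doesn't need special tooling to get started. Here is a minimal sketch, assuming both releases' profiles have been exported in the collapsed-stack text format that flamegraph tools consume (one `frame;frame;frame count` line per stack). The function names and sample stacks are invented for the example.

```python
from collections import Counter


def parse_folded(text):
    """Parse 'frame;frame;frame count' lines into a stack -> samples counter."""
    counts = Counter()
    for line in text.strip().splitlines():
        stack, _, samples = line.rpartition(" ")
        counts[stack] += int(samples)
    return counts


def share_diff(old, new):
    """Change in each stack's share of total samples, biggest growth first."""
    old_total = sum(old.values()) or 1
    new_total = sum(new.values()) or 1
    diffs = {
        stack: new[stack] / new_total - old[stack] / old_total
        for stack in old.keys() | new.keys()
    }
    return sorted(diffs.items(), key=lambda item: item[1], reverse=True)


last_release = parse_folded("main;handle;db 60\nmain;handle;json 40")
this_release = parse_folded(
    "main;handle;db 60\nmain;handle;json 40\nmain;handle;toString 25"
)
stack, growth = share_diff(last_release, this_release)[0]
print(stack, f"{growth:+.0%}")  # main;handle;toString +20%
```

Comparing shares rather than raw sample counts keeps the diff meaningful when the two profiles were captured over different durations.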
When you know it worked
The signal that the habit has taken hold is that nobody mentions performance anymore. Bench diffs are part of every review the way green checkmarks are. Regressions get reverted the same day. The CFO stops being surprised by the cloud bill, which means they stop asking about it, which means you get to spend your one-on-ones on something else.
Performance, done well, is invisible. Performance, done badly, is a permanent low-grade crisis with a different name every quarter. Pick one.