Testing Philosophy
February 5, 2026 · 9 min read

Testing in Production: It's Not as Crazy as It Sounds

Everyone tests in production. The question is whether you do it on purpose, with a plan, or by accident, with your users as the test suite.

Say “testing in production” at a software conference and watch the room divide. Half the audience nods knowingly. The other half looks at you like you just suggested performing surgery without washing your hands. It's one of the most polarizing phrases in our industry, and one of the most misunderstood.

The truth is, every company that ships software tests in production. The only difference is whether they do it intentionally, with safeguards and observability in place, or whether they do it accidentally, discovering bugs when a user files a support ticket at 2 AM. This post is an argument for doing it on purpose.

The Dirty Secret Everyone Knows

Here's the uncomfortable reality: your staging environment is a lie. It's a well-intentioned lie, a useful lie, but a lie nonetheless. Staging has a fraction of your production data. It runs on smaller machines. Nobody is hammering it with ten thousand concurrent requests. The third-party APIs it calls are sandboxed versions that behave differently from the real ones. The database has been seeded with test data that a developer created six months ago and hasn't been updated since.

The bugs that take down production almost never show up in staging. They're the bugs that emerge from scale: the race condition that only triggers when two users hit the same endpoint within three milliseconds. The memory leak that only manifests after processing ten million records. The edge case in your date parsing that only fires on the last day of February in a leap year, for a user in a timezone that observes daylight saving time.

We've all seen it. The feature that worked perfectly in staging, passed every test in CI, got a thumbs-up in code review, and then melted down within an hour of reaching real users. Not because anyone was careless, but because production is a fundamentally different environment. Pretending otherwise doesn't make your software more reliable. It just means you're surprised more often.

What “Testing in Production” Actually Means

Let's be crystal clear about what this is not. Testing in production is not deploying untested code and crossing your fingers. It's not skipping your test suite. It's not replacing QA with a prayer. If that's your mental model, I understand the skepticism.

Real testing in production is a disciplined engineering practice. It means deploying code that has already passed your pre-production gauntlet (unit tests, integration tests, staging validation) and then using a carefully controlled process to verify it performs correctly under real-world conditions. It's the final layer of validation, not the only layer.

The toolkit looks like this:

  • Feature flags to control who sees new code and when, letting you decouple deployment from release.
  • Canary deployments to route a small percentage of traffic to the new version while monitoring for anomalies.
  • Observability infrastructure (structured logging, distributed tracing, and real-time dashboards) so you can see what's happening the moment it happens.
  • Automated rollbacks that revert changes faster than any human could react when metrics cross predefined thresholds.

Done right, this approach is actually safer than the traditional “test everything in staging, then deploy to 100% of users at once” strategy. Because with the traditional approach, when something does go wrong, it goes wrong for everyone simultaneously.

Feature Flags: Your Production Safety Net

Feature flags are perhaps the most powerful tool in the production-testing arsenal. The concept is simple: deploy your new code to production, but wrap it behind a conditional. The code is there, running on your production servers, but only activated for a subset of users you choose.

Start with 1% of traffic. Watch your error rates. Monitor response times. Check that the new code path isn't consuming more memory or making extra database queries. Look at user behavior metrics: are people completing the flow, or dropping off at a new step? If everything looks healthy, bump it to 5%. Then 20%. Then 50%. Then everyone.

If something goes wrong at any point, you don't need a rollback. You don't need a new deployment. You flip the flag. The old code path is still there, still warm, still serving the other 99% of users. Recovery takes seconds, not minutes.
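As a sketch, a percentage-based flag check often looks something like this. (The function names and hashing scheme here are illustrative, not any particular vendor's API.) Hashing the user ID into a stable bucket means each user gets a consistent experience, and ramping from 1% to 50% only ever adds users to the rollout, never shuffles them:

```python
import hashlib

def is_enabled(flag_name: str, user_id: str, rollout_percent: float) -> bool:
    """Deterministically bucket a user into [0, 100) and compare to the rollout."""
    key = f"{flag_name}:{user_id}".encode()
    bucket = int(hashlib.sha256(key).hexdigest(), 16) % 100
    return bucket < rollout_percent

# The same user stays in the rollout as the percentage grows,
# because bucket < 1 implies bucket < 50.
if is_enabled("new-checkout", "user-42", 1):
    print("user-42 sees the new checkout at 1% and beyond")
```

Flipping the flag off is just setting `rollout_percent` to zero, which is why recovery takes seconds rather than a redeploy.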

This isn't theoretical. Netflix runs hundreds of feature flags simultaneously. Google uses flags for virtually every change that reaches users. Facebook's Gatekeeper system manages thousands of concurrent experiments. These companies don't test in production because they're reckless; they do it because they're operating at a scale where it's the only way to be confident that changes work. And they've built the tooling to do it safely.

Canary Deployments: Controlled Exposure

Where feature flags operate at the application layer, canary deployments work at the infrastructure layer. The idea: instead of deploying your new version to all servers at once, you deploy it to a small subset, maybe one server out of fifty, or 2% of your pods in a Kubernetes cluster.

Real traffic flows to both the canary (new version) and the baseline (current version). Automated systems continuously compare key metrics between the two: error rate, latency at p50/p95/p99, CPU and memory usage, downstream dependency health. If the canary's metrics deviate beyond acceptable thresholds, the system automatically rolls back before most users are affected.

The beauty of this approach is its statistical rigor. You're not relying on someone eyeballing a dashboard. The comparison is automated, the thresholds are predefined, and the rollback is mechanical. Human judgment decides what “healthy” looks like. Machines enforce it at the speed of software.
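A minimal sketch of that automated comparison might look like the following. The metric names and thresholds are illustrative assumptions, not taken from any specific tool, but the shape is the same: compare canary against baseline, and roll back the moment either check fails:

```python
def canary_healthy(baseline: dict, canary: dict,
                   max_error_delta: float = 0.005,
                   max_latency_ratio: float = 1.2) -> bool:
    """Return True if the canary's metrics stay within predefined bounds of the baseline."""
    error_ok = canary["error_rate"] - baseline["error_rate"] <= max_error_delta
    latency_ok = canary["p99_ms"] <= baseline["p99_ms"] * max_latency_ratio
    return error_ok and latency_ok

baseline = {"error_rate": 0.001, "p99_ms": 180}
canary = {"error_rate": 0.030, "p99_ms": 450}

if not canary_healthy(baseline, canary):
    print("rolling back canary")  # mechanical: no human in the loop
```

In a real pipeline this check runs continuously against live metrics, and a single failed comparison triggers the rollback before most users ever hit the new version.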

Canary deployments also catch an entire category of bugs that are invisible in pre-production: infrastructure-sensitive issues. Maybe the new version doesn't play well with your current kernel version. Maybe it triggers a subtle incompatibility with your load balancer's connection pooling. Maybe it increases garbage collection pressure in a way that only manifests under sustained load. These are real bugs, and the only environment that reliably surfaces them is production itself.

Observability Is the New Testing

There's a profound shift happening in how we think about software correctness. Traditional testing asks a question before deployment: “Will this work?” Observability asks a question during and after deployment: “Is this working, right now, for real users?”

Modern observability goes far beyond checking if a server is up. Structured logging turns every request into a searchable, filterable event with rich context: who made the request, what code path executed, how long each step took, what data was involved. Distributed tracing follows a single user action across dozens of microservices, showing you exactly where time was spent and where failures originated. Real-time dashboards aggregate these signals into a living picture of your system's health at millisecond granularity.
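A structured log event is just that context serialized into a machine-readable record. Here's a rough sketch; the field names are illustrative (real systems agree on a shared schema), but the key idea is one JSON object per request, carrying a trace ID you can follow across services:

```python
import json
import time
import uuid

def log_event(route, user_id, status, duration_ms, trace_id=None):
    """Serialize one request into a single searchable, filterable JSON line."""
    event = {
        "ts": time.time(),                         # when it happened
        "trace_id": trace_id or uuid.uuid4().hex,  # follow this request across services
        "route": route,                            # what code path executed
        "user_id": user_id,                        # who made the request
        "status": status,
        "duration_ms": duration_ms,                # how long it took
    }
    return json.dumps(event)

print(log_event("/api/upload", "user-42", 200, 37.5))
```

Because every event shares the same keys, questions like “show me slow requests from this one user” become a query instead of an archaeology project.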

When your observability is strong enough, you can catch bugs that no pre-production test suite could ever find. The endpoint that works fine with small payloads but degrades with the 50MB files that a handful of enterprise customers upload. The authentication flow that breaks for users whose email addresses contain a plus sign. The search feature that returns nonsensical results for queries in languages your test suite doesn't cover.

This doesn't mean you replace unit tests with Grafana dashboards. It means you recognize that testing is a spectrum. Pre-production tests verify your assumptions. Production observability validates reality. Monitoring isn't a substitute for testing; it's a different kind of testing, one that operates in the only environment that truly matters.

When NOT to Test in Production

If I've been making this sound like a silver bullet, let me correct that now. There are domains where testing in production is genuinely irresponsible, and knowing the boundary is part of being a thoughtful engineer.

  • Financial transactions: If a bug could charge someone's credit card twice, or transfer money to the wrong account, or miscalculate interest on a loan, that code needs to be airtight before it touches real money. The cost of a production bug isn't a bad user experience; it's regulatory liability and broken trust.
  • Healthcare systems: When software controls medication dosages, diagnostic results, or patient records, a production bug can cause physical harm. The risk calculus is fundamentally different from a content recommendation engine.
  • Security-critical paths: Authentication, authorization, encryption, access control. A feature flag that accidentally exposes admin functionality to all users isn't a learning opportunity, it's an incident. These code paths demand exhaustive pre-production verification.
  • Irreversible operations: Anything that deletes data, sends notifications to users, or triggers external processes that can't be undone. You can roll back a deployment, but you can't unsend a million emails.

The principle is straightforward: the higher the cost of failure, the more you should invest in pre-production testing. Know the difference between “this button color might be wrong” and “this payment might charge twice.” Production testing is a powerful strategy for the former. For the latter, be exhaustive before you ship, and then still monitor obsessively once you do.

The Mindset Shift

The traditional testing mindset asks: “Does the code work?” It's a binary question with a binary answer. The tests pass or they don't. The build is green or red. And when everything is green, you ship with confidence.

Production testing asks a richer question: “Does the code work here, with these users, at this scale, with this data, under these network conditions, alongside these other services?” That's not a binary question. It's a continuous assessment, and the answer can change minute by minute as conditions shift.

This shift requires a different relationship with certainty. Instead of demanding proof that the code is correct before deployment, you accept that correctness is an ongoing verification. You build the infrastructure to detect problems quickly and recover from them automatically. You optimize for mean time to recovery, not just mean time between failures.

Neither pre-production testing nor production testing is sufficient on its own. The teams that ship the most reliable software at the highest velocity do both, rigorously, intentionally, and without pretending that either one replaces the other. They test before production to catch the obvious problems. They test in production to catch the subtle ones. And they've made peace with the fact that software in the real world is messier, more surprising, and more interesting than any test environment will ever be.

Ship It, Watch It, Learn From It

The best engineers I've worked with share a common trait: they're not afraid of production. They respect it. They've built the tooling, the processes, and the habits to deploy with confidence, not because they've eliminated risk, but because they've made risk manageable. Feature flags, canary deployments, observability, automated rollbacks. These aren't shortcuts. They're engineering discipline applied to the hardest environment there is.

So the next time someone tells you they “never test in production,” smile politely. What they mean is they haven't built the infrastructure to do it safely yet. Because if their software is running in production and serving real users, they're testing in production whether they know it or not. The only question is whether they're learning from it.