
Testing in Production: Canary Deployments, Feature Flags, and Chaos Engineering

By abemon · 11 min read

Staging lies

Staging is an environment that looks like production but isn’t production. The data is different (less volume, less variety, fewer concurrent users). Third-party integrations are in sandbox mode. Traffic is synthetic. And therefore, the problems staging catches are a subset of what production will reveal.

This isn’t an argument against staging. Staging is necessary. It’s an argument in favor of complementing staging with testing in production: techniques designed to validate code with real traffic, real users, and the real complexity of the system in operation.

Testing in production doesn’t mean “deploy and pray.” It means deploying in a controlled manner, observing metrics in real time, and having the ability to revert in seconds if something goes wrong. It’s engineering, not recklessness.

Canary deployments: the thermometer before the plunge

A canary deployment deploys the new version to a small subset of users (typically 1-5%) while the rest continues with the previous version. If the canary metrics are good, it progresses incrementally. If they’re bad, it reverts.

Practical implementation

In Kubernetes: the most common approach uses two Deployments (stable and canary) with an Ingress that distributes traffic by weight. Istio and Linkerd support native traffic splitting. Argo Rollouts automates the entire process: deploys the canary, monitors metrics, advances or reverts automatically.

# Argo Rollouts - canary strategy
spec:
  strategy:
    canary:
      steps:
      - setWeight: 5
      - pause: {duration: 10m}
      - setWeight: 20
      - pause: {duration: 10m}
      - setWeight: 50
      - pause: {duration: 10m}
      - setWeight: 100
      analysis:
        templates:
        - templateName: error-rate
        args:
        - name: service-name
          value: api-server

In serverless (AWS Lambda): Lambda aliases with traffic shifting. You configure an alias that distributes traffic between two versions. CodeDeploy automates the canary progression based on CloudWatch alarms.

On traditional servers: a load balancer redirecting a percentage of traffic to the server with the new version. Nginx with split_clients, HAProxy with weight, or your cloud’s load balancer (ALB weighted target groups).
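As a rough sketch of the Nginx approach, `split_clients` hashes each client IP into a stable bucket, so a given user consistently lands on the same version instead of flip-flopping between them. Upstream names and addresses here are placeholders:

```nginx
# Hash each client IP into a bucket: ~5% of clients go to the canary pool.
split_clients "${remote_addr}" $backend_pool {
    5%      canary;
    *       stable;
}

upstream stable {
    server 10.0.0.10:8080;  # placeholder: current version
}

upstream canary {
    server 10.0.0.20:8080;  # placeholder: new version
}

server {
    listen 80;
    location / {
        proxy_pass http://$backend_pool;
    }
}
```

Hashing on the client (rather than per-request randomness) also makes canary metrics cleaner: each user's whole session hits one version.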

The metrics that determine canary success

The canary passes if the new version’s metrics are equivalent to (or better than) the stable version. Critical metrics:

  • Error rate: compare the 5xx error percentage between canary and stable. If the canary has a 2x higher error rate, something is wrong.
  • Latency p95/p99: mean latency is misleading. Five percent of requests being extremely slow can hide behind an acceptable average. The p99 doesn’t lie.
  • Business metrics: conversion rate, completed orders, processed payments. A canary that generates no technical errors but reduces conversions by 20% is a failed canary.

The observation interval matters. 10 minutes at 5% with 1,000 requests/hour is only 8 requests. Statistically insignificant. To have confidence, you need a minimum volume per canary step. The practical rule: at least 100 requests on the canary before advancing to the next step.
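The decision logic above (minimum sample size, then error-rate comparison) can be sketched in a few lines of Python. The function name and thresholds are illustrative, not taken from any specific tool:

```python
def canary_verdict(canary_requests, canary_errors,
                   stable_requests, stable_errors,
                   min_samples=100, max_ratio=2.0):
    """Compare canary vs. stable error rates.

    Returns "wait" until the canary has enough traffic to judge,
    "revert" if its error rate is more than `max_ratio` times the
    stable rate, and "advance" otherwise.
    """
    if canary_requests < min_samples:
        return "wait"  # statistically insignificant: keep observing
    canary_rate = canary_errors / canary_requests
    stable_rate = stable_errors / stable_requests
    if stable_rate == 0:
        # Any canary errors on top of a clean stable baseline are suspicious.
        return "revert" if canary_rate > 0 else "advance"
    return "revert" if canary_rate / stable_rate > max_ratio else "advance"
```

A real controller would add latency percentiles and business metrics as further gates, but the shape is the same: every signal must pass before the weight increases.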

When not to use canary

Canary doesn’t work well for database changes. If the new version requires a schema migration, the canary and stable version read the same database. The new schema must be compatible with both versions (expand and contract pattern), which limits the types of changes you can make incrementally.
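The expand and contract pattern can be illustrated with a toy SQLite schema (table and column names are invented for the example). The key property: after the expand step, both the old version (which only knows `status`) and the new version (which writes `status_code`) can operate against the same table:

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()

# Original schema, read and written by the stable version.
cur.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT)")
cur.execute("INSERT INTO orders (status) VALUES ('PAID')")

# EXPAND: add the new column as nullable, so the old version keeps working.
cur.execute("ALTER TABLE orders ADD COLUMN status_code INTEGER")

# Dual-write phase: the new (canary) version writes both columns.
cur.execute("INSERT INTO orders (status, status_code) VALUES ('PAID', 2)")

# Backfill historical rows so reads of the new column are complete.
cur.execute("UPDATE orders SET status_code = 2 "
            "WHERE status = 'PAID' AND status_code IS NULL")

# CONTRACT (dropping the old `status` column) happens only after every
# running version has switched its reads to `status_code`.
rows = cur.execute("SELECT status, status_code FROM orders ORDER BY id").fetchall()
print(rows)  # [('PAID', 2), ('PAID', 2)]
```

The constraint the article describes falls out of this: each deploy can only take one small, backward-compatible step, never a destructive schema change in the same release as the code that depends on it.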

It also doesn’t work when changes affect shared state (caches, queues) in incompatible ways. If the canary version produces messages in a format the stable version can’t consume, the 5% canary traffic can corrupt data for the other 95%.

Feature flags: the granular switch

Feature flags decouple deployment from release. You deploy new code to production but keep it deactivated. You activate it for specific users, a percentage of traffic, or nobody (dark launch). This allows deploying continuously without risk and releasing when the business decides.

Types of feature flags

Release flags: activate new features. The most common case. “Show the new checkout to beta segment users.” Short-lived: activated, validated, and removed (along with the old path code).

Experiment flags: for A/B testing. “50% of users see the blue button, 50% the green.” Maintained until statistically significant results are available.

Operational flags: manual circuit breakers. “If the payment service has issues, show the maintenance message.” No removal date. They’re permanent infrastructure.

Permission flags: enable features by user or segment. “Premium users see the advanced dashboard.” Close to permission management, but implemented as flags for greater flexibility.
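A minimal sketch of how a flag service evaluates these rules: an explicit allow-list for targeting, then a deterministic percentage rollout. Function and flag names are invented; real products layer streaming updates and richer targeting on top of the same idea:

```python
import hashlib

def flag_enabled(flag_name, user_id, rollout_percent, allow_list=()):
    """Evaluate a feature flag for one user.

    Hashing flag + user gives every user a stable bucket in [0, 100),
    so the same user always sees the same variant, and raising
    `rollout_percent` only adds users, never swaps them.
    """
    if user_id in allow_list:
        return True  # targeted segment (e.g. internal team, beta users)
    digest = hashlib.sha256(f"{flag_name}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100
    return bucket < rollout_percent

# Dark launch: 0% rollout, visible only to the allow-listed team.
print(flag_enabled("new-checkout", "alice", 0, allow_list={"alice"}))  # True
```

Including the flag name in the hash matters: it decorrelates rollouts, so the same 1% of users are not the guinea pigs for every experiment.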

Tools

LaunchDarkly is the market leader (valued at $3 billion). SDKs for all languages, advanced targeting, server-side evaluation with streaming. From $8.33/month per seat (Starter plan). For teams of 10+, cost scales quickly.

Unleash is the most mature open-source alternative. Self-hosted or cloud. Supports targeting, gradual rollouts, and variants. For teams that don’t want dependency on a SaaS for something as critical as feature flags.

Flagsmith and AWS AppConfig feature flags sit between those two options. OpenFeature is not a provider but a vendor-neutral SDK standard: code against it and you can switch providers without rewriting application code.

Our recommendation: for teams of up to 20 developers, Unleash self-hosted. Zero license cost, full control, and sufficient functionality for 90% of cases. LaunchDarkly for large organizations where support and enterprise tool integration justify the cost.

Feature flag technical debt

A feature flag that stays in the code 6 months after launch is technical debt. Each flag adds a conditional path that must be maintained and tested. With 50 active flags, the possible combinations are astronomical.

Strict rule: every release feature flag has a removal date assigned at the moment of creation. In our projects, we use a bot that alerts when a flag exceeds 30 days without being removed. Without this discipline, we’ve seen codebases with over 200 flags, where nobody knew which were active and which were vestiges of features launched two years ago.
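The alerting bot reduces to a small check. The flag-registry format here is invented for illustration; the point is that only release flags count, since operational flags are permanent infrastructure:

```python
from datetime import date, timedelta

def stale_flags(flags, today, max_age_days=30):
    """Return release flags older than `max_age_days`.

    `flags` maps flag name -> (kind, creation date).  Operational and
    permission flags are excluded: they have no removal date by design.
    """
    cutoff = today - timedelta(days=max_age_days)
    return sorted(
        name for name, (kind, created) in flags.items()
        if kind == "release" and created < cutoff
    )

registry = {
    "new-checkout": ("release", date(2024, 1, 2)),
    "payments-kill-switch": ("operational", date(2023, 6, 1)),
}
print(stale_flags(registry, today=date(2024, 3, 1)))  # ['new-checkout']
```

Wire something like this into CI or a daily cron and the 30-day rule enforces itself instead of relying on memory.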

Progressive rollouts: the complete model

Canary and feature flags combine in progressive rollouts: a deployment model where code is exposed to real users gradually and in a controlled manner.

The typical flow:

  1. Deploy with feature flag deactivated (dark launch)
  2. Activate for the internal team (dogfooding)
  3. Expand to 1% of real users
  4. Observe metrics for 24 hours
  5. Expand to 10%, observe 24 hours
  6. Expand to 50%, observe 24 hours
  7. Full rollout (100%)
  8. Cleanup: remove the feature flag and old path code

Each step has advancement criteria (metrics within thresholds) and reversion criteria (metrics outside thresholds). Full automation is possible with tools like Argo Rollouts + Prometheus + feature flag service, but even manual, the model drastically reduces risk.
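The advance/revert decision at each step can be sketched as a tiny state machine. Percentages mirror the flow above; the function name and the health signal are invented placeholders for your real metric checks:

```python
STEPS = [0, 1, 10, 50, 100]  # exposure levels: dark launch -> full rollout

def next_exposure(current_percent, metrics_healthy):
    """Decide the next exposure level in a progressive rollout.

    Healthy metrics advance one step.  Unhealthy metrics revert
    straight to 0% (flag off) rather than stepping down gradually,
    because a bad version should stop serving users immediately.
    """
    if not metrics_healthy:
        return 0
    i = STEPS.index(current_percent)
    return STEPS[min(i + 1, len(STEPS) - 1)]
```

The asymmetry is deliberate and worth preserving in any implementation: expansion is slow and gated, reversion is instant and total.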

Sounds slow? It is. A feature that used to deploy in a day now takes a week to reach 100%. But the number of production incidents decreases proportionally. For a client with daily deployments, switching from “deploy and pray” to progressive rollouts reduced post-deploy incidents by 85%.

Chaos engineering: break to strengthen

Chaos engineering is the practice of deliberately introducing failures in production to verify the system tolerates them. It’s not random destructive testing. It’s controlled experimentation with clear hypotheses.

The scientific model

A chaos engineering experiment follows the scientific method:

  1. Hypothesis: “If one instance of the payment service fails, the load balancer redirects traffic to healthy instances and users don’t experience errors.”
  2. Variables: what failure? (process kill, injected latency, network partition). What do we measure? (error rate, latency, availability).
  3. Blast radius: how many users could be affected? Start with minimum blast radius.
  4. Execute: introduce the failure.
  5. Observe: does the hypothesis hold?
  6. Learn: if it doesn’t, why? What needs improving?

Tools

Chaos Monkey from Netflix is the original tool: randomly kills VM instances in production. Simple but effective for validating that services tolerate instance loss.

Litmus Chaos is the CNCF project for chaos engineering in Kubernetes. Supports experiments like pod kill, network delay, disk fill, and CPU stress. Defined as Kubernetes resources (ChaosEngine, ChaosExperiment), which facilitates CI/CD integration.
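As a sketch, a pod-kill experiment expressed as a ChaosEngine resource looks roughly like this. Namespaces, labels, and durations are placeholders; check the Litmus documentation for the exact schema of the version you run:

```yaml
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: payment-pod-delete
  namespace: default
spec:
  appinfo:
    appns: default
    applabel: app=payment        # target pods by label
    appkind: deployment
  engineState: active
  chaosServiceAccount: litmus-admin
  experiments:
  - name: pod-delete
    spec:
      components:
        env:
        - name: TOTAL_CHAOS_DURATION
          value: "60"            # seconds of injected chaos
        - name: PODS_AFFECTED_PERC
          value: "25"            # keep the blast radius small
```

Because it is a plain Kubernetes resource, the experiment can be versioned in Git and applied from a pipeline like any other manifest.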

Gremlin is the most complete SaaS option: web interface, precise targeting, halt button to stop the experiment immediately. From $2,000/year for small teams.

AWS Fault Injection Simulator (FIS) allows injecting failures in AWS services (stop EC2 instances, throttle DynamoDB, disconnect subnets). Native to AWS and easy to configure for teams already on the platform.

Where to start

Don’t start killing instances in production on day one. The incremental path:

Level 1 - Game days: manual simulations in a non-production environment. “What happens if service X stops responding?” The team investigates and documents the impact. Zero automation, maximum learning.

Level 2 - Automated chaos in staging: run Litmus or FIS experiments in staging automatically. Validate that response runbooks work.

Level 3 - Chaos in production with minimum blast radius: kill one pod of a non-critical service in production during business hours, with the team observing. Verify that auto-scaling and health checks work.

Level 4 - Continuous chaos in production: run automated experiments periodically (weekly, for example) against critical services. The system should tolerate these failures without human intervention.

Most teams never get past level 2. And that’s fine. Manual game days already generate enormous value by exposing incorrect assumptions about system resilience.

The culture behind the tools

Testing in production requires a specific organizational culture. Teams that do it well share three characteristics:

Trust in observability: if you can’t see the impact of a change in real time, you can’t test in production safely. Dashboards, alerts, and traces are prerequisites, not extras.

Fast reversion as a priority: the time to revert a deploy should be under 5 minutes. If reverting requires a 30-minute manual process, the risk of testing in production is unacceptable. Automate reversion before automating deployment.

Blameless culture: when a canary fails and needs reverting, that’s not an error. It’s the system working as designed. If the team fears deploying because a rollback is perceived as failure, nobody will deploy incrementally.

Testing in production doesn’t replace prior testing. It complements a robust CI/CD pipeline with unit tests, integration tests, and staging testing. It’s the last layer of validation, the one that verifies the system works with the only variable you can’t simulate: reality.

Our cloud and DevOps team implements deployment pipelines with canary, feature flags, and progressive rollouts. If your organization wants to reduce post-deploy incidents, we can design the right testing-in-production strategy for your stack and team. Also check our article on microservices observability for the monitoring prerequisites.

About the author


abemon engineering

Engineering team

Multidisciplinary engineering, data and AI team headquartered in the Canary Islands. We build, deploy and operate custom software solutions for companies at any scale.