
Apache Kafka for the Mid-Market: Practical Implementation

abemon | 12 min read

Kafka is not just for giants

There is a widespread perception that Apache Kafka is a tool for companies processing millions of events per second. LinkedIn, Netflix, Uber. And yes, that is where it was born and where it shines. But Kafka solves problems that companies of all sizes have: decoupling data producers and consumers, guaranteeing event ordering, and letting multiple systems read the same data without interfering with each other.

The question is not whether your company is big enough for Kafka. The question is whether you have a problem Kafka solves and whether the operational complexity is justified. For a mid-sized company processing between 10,000 and 1,000,000 events daily — e-commerce orders, warehouse movements, financial transactions, application logs — Kafka is a viable option and, in many cases, the best one.

When Kafka makes sense (and when it does not)

Kafka makes sense when:

  • Multiple systems need to react to the same event. A new order must update stock, notify the warehouse, send customer confirmation, and feed the operations dashboard. Without Kafka, this becomes a tangle of point-to-point HTTP calls that breaks every time a system is down.
  • You need ordering guarantees. Events must be processed in sequence: payment before shipment, reservation before confirmation. Traditional queues lose ordering as soon as multiple consumers compete for messages; Kafka guarantees order within each partition.
  • You need replayability. You want to reprocess past events when you fix a bug or add a new consumer. Kafka retains messages (configurable, from hours to indefinitely), allowing you to “rewind” and reprocess.

Kafka does not make sense when:

  • You have low event volume and few consumers. Under 1,000 events daily with 2-3 consumers, a simple queue (Redis Streams, SQS, RabbitMQ) solves the problem with a fraction of the operational complexity.
  • You need RPC or request-reply. Kafka is an asynchronous messaging system. If you need to send a request and wait for a response, it is the wrong tool.
  • You lack operational capacity. A Kafka cluster requires attention: monitoring, partition rebalancing, disk management. If your team cannot dedicate time to this, consider managed services (Confluent Cloud, Amazon MSK, Upstash Kafka) or a simpler alternative.

Sizing for the mid-market

Over-provisioning is the most common mistake. A team reads that LinkedIn runs hundreds of brokers and assumes they need something similar. For a mid-sized company, the numbers are far more modest.

Volume estimation: Calculate your peak throughput, not average. If you process 100,000 events daily distributed evenly, that is about 70 events per minute. But if 60% of traffic occurs in 4 hours, the peak rises to 250 events per minute. Add a 3x factor for unexpected spikes. For 100,000 daily events, size for 750 events per minute.
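As a sanity check, that arithmetic fits in a few lines. The peak fraction, peak window, and 3x safety factor below are the assumptions from this example, not universal constants:

```python
def peak_events_per_minute(daily_events, peak_fraction=0.6,
                           peak_hours=4, safety_factor=3):
    """Throughput to size for, assuming `peak_fraction` of daily
    traffic arrives within `peak_hours`, with a spike safety factor."""
    peak_rate = (daily_events * peak_fraction) / (peak_hours * 60)
    return peak_rate * safety_factor

print(peak_events_per_minute(100_000))  # 750.0 events per minute
```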

Message size: Average size matters for storage and network throughput calculations. A typical order event in JSON with all business fields runs 1KB to 5KB. A log event might be 500 bytes. An event with attached documents (not recommended, but it happens) can exceed 50KB. For a typical mid-market company, assume 2KB average.

Minimum viable infrastructure:

For 10K-100K events/day:

  • 3 brokers (the minimum for fault tolerance) with 2 vCPUs and 4GB RAM each.
  • SSD storage: 50GB per broker (with 7-day retention and 2KB average message size).
  • ZooKeeper: 3 nodes with 1 vCPU and 2GB RAM (or use KRaft, the ZooKeeper-free mode available since Kafka 3.3, which eliminates this dependency).

For 100K-1M events/day:

  • 3-5 brokers with 4 vCPUs and 8GB RAM.
  • SSD storage: 100-200GB per broker.
  • Seriously consider a managed service at this volume. Confluent Cloud or Amazon MSK costs EUR 200-500 monthly for this throughput and saves you the operations overhead.

Real cost: A minimal self-managed cluster on AWS (3 t3.medium instances for brokers, 3 t3.small for ZooKeeper) costs approximately EUR 250 per month. Managed services start at EUR 150 monthly for the basic tier and scale with throughput. The decision between self-managed and managed depends on your team’s time cost, not just infrastructure cost.
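To check whether those disk numbers are sensible, here is a rough footprint estimate. Parameter names are ours, and the calculation ignores compression, index, and log-segment overhead:

```python
def storage_per_broker_gb(events_per_day, avg_msg_kb, retention_days,
                          replication_factor=3, brokers=3):
    """Approximate disk used per broker for one workload, before
    compression and segment/index overhead."""
    total_kb = events_per_day * avg_msg_kb * retention_days * replication_factor
    return total_kb / brokers / (1024 * 1024)  # KB -> GB

# 100K events/day at 2KB average, 7-day retention, replication factor 3:
print(round(storage_per_broker_gb(100_000, 2, 7), 2))  # 1.34 GB per broker
```

The 50GB per broker recommended above is therefore mostly headroom for growth, larger messages, longer retention, and more topics.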

Topic design

Topic design is where most mistakes happen and where getting it right has the most impact.

One entity, one topic. The most robust pattern: one topic per business entity type. orders, payments, shipments, inventory-updates. Each topic contains all events related to that entity: creation, update, cancellation. The event type goes in a message field, not in the topic name.

Resist the temptation to create topics per event type (order-created, order-updated, order-cancelled). This multiplies the number of topics, complicates consumption (a consumer that needs the full order state must read three topics), and makes it impossible to guarantee ordering between events for the same order.
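The one-entity-one-topic pattern looks like this in practice: a minimal event envelope for the orders topic, with field names that are illustrative rather than prescribed:

```python
import json
import uuid
from datetime import datetime, timezone

def order_event(event_type, order_id, payload):
    """Build an event envelope for the `orders` topic. The event type
    lives in the message body, not the topic name, so one consumer
    sees the full lifecycle of each order, in order."""
    return {
        "event_id": str(uuid.uuid4()),
        "event_type": event_type,     # "created" | "updated" | "cancelled"
        "order_id": order_id,         # also used as the partition key
        "occurred_at": datetime.now(timezone.utc).isoformat(),
        "payload": payload,
    }

evt = order_event("created", "ord-1042", {"total_eur": 59.90})
print(json.dumps(evt, indent=2))
```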

Partitions: The partition count determines maximum consumption parallelism. Each partition can have a single active consumer within a consumer group. Rule of thumb: start with the number of parallel consumers you need, multiplied by 2 (for growth). For mid-market companies, 6-12 partitions per topic is a reasonable starting point.

Important: once a topic is created with N partitions, you can add partitions later but never reduce them. Adding partitions also changes which partition each key maps to, so per-key ordering breaks at the transition. Start conservative.

Partition key: The key determines which partition receives each message. Messages with the same key always go to the same partition, guaranteeing order. Use the entity ID: order_id for the orders topic, customer_id if you need all customer events in order. Without a key, the producer spreads messages across partitions (round-robin, or sticky batching in newer clients) and you lose ordering guarantees.
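The guarantee is easy to see in a simplified partitioner. Kafka's default partitioner actually hashes the serialized key with murmur2; the crc32 below is a stand-in that illustrates the property that matters:

```python
import zlib

def partition_for(key: str, num_partitions: int) -> int:
    """Simplified stand-in for Kafka's default partitioner. The point:
    the same key always maps to the same partition, so all events for
    one order land in one partition and keep their order."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions

p1 = partition_for("ord-1042", 12)
p2 = partition_for("ord-1042", 12)
assert p1 == p2  # same key, same partition, guaranteed order
```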

Retention: Kafka defaults to 7-day message retention. For mid-market deployments, we recommend 14 days minimum (covers two weekends, short holidays, and most reprocessing scenarios). For critical topics like payments and orders, 30 days or unlimited with compaction (Kafka keeps only the last message per key — ideal for state).
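Compaction is easy to reason about as a sketch: replaying a compacted topic yields the last value seen for each key, which is exactly the state a new consumer needs to rebuild.

```python
def compact(log):
    """Model of what log compaction retains: only the latest record
    per key. `log` is an ordered list of (key, value) pairs."""
    state = {}
    for key, value in log:
        state[key] = value  # later records overwrite earlier ones
    return state

log = [("ord-1", "created"), ("ord-2", "created"),
       ("ord-1", "paid"), ("ord-1", "shipped")]
print(compact(log))  # {'ord-1': 'shipped', 'ord-2': 'created'}
```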

Schema: Register each topic’s schema in a Schema Registry (Confluent Schema Registry or Apicurio). Use Avro or Protobuf, not raw JSON. JSON is tempting for its simplicity, but without schema enforcement nothing prevents a producer from sending a message with a renamed field or changed type, breaking downstream consumers. With Avro and Schema Registry, incompatible changes are rejected before publishing.
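As an illustration, an Avro schema for the orders topic might look like the following (field names are examples, not a prescribed model). With the registry's compatibility checks enabled, an incompatible change such as removing a required field or changing a field's type is rejected at registration time:

```python
# Illustrative Avro schema, expressed as the JSON structure Avro uses.
order_schema = {
    "type": "record",
    "name": "OrderEvent",
    "namespace": "com.example.orders",  # example namespace
    "fields": [
        {"name": "event_type", "type": "string"},
        {"name": "order_id", "type": "string"},
        {"name": "occurred_at", "type": "string"},
        {"name": "total_eur", "type": "double"},
        # Optional field with a default: a backward-compatible addition.
        {"name": "customer_id", "type": ["null", "string"], "default": None},
    ],
}
```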

Consumer groups: the pattern you must master

Consumer groups are Kafka’s mechanism for distributing work among consumers. Each group receives a copy of every message. Within a group, each message goes to exactly one consumer.

Practical example: the orders topic has 3 consumer groups.

  • Consumer group order-processing: processes the order, updates stock, generates delivery note. 3 instances for parallelism.
  • Consumer group analytics: feeds the real-time operations dashboard. 1 instance sufficient.
  • Consumer group notification-service: sends confirmation email and SMS. 2 instances.

Each group reads all messages independently. Within order-processing, the 3 consumers divide the partitions. If the topic has 12 partitions, each consumer processes 4.
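That division of work can be sketched as follows. Kafka's real assignors (range, round-robin, sticky) differ in detail, but all preserve the invariant that each partition belongs to exactly one consumer in the group:

```python
def assign_partitions(partitions, consumers):
    """Round-robin sketch of how a consumer group divides partitions.
    Each partition goes to exactly one consumer in the group."""
    assignment = {c: [] for c in consumers}
    for i, p in enumerate(partitions):
        assignment[consumers[i % len(consumers)]].append(p)
    return assignment

consumers = ["proc-1", "proc-2", "proc-3"]
assignment = assign_partitions(list(range(12)), consumers)
print({c: len(ps) for c, ps in assignment.items()})  # 4 partitions each
```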

Consumer lag is the critical metric. It measures how many messages are pending processing in a consumer group. Growing lag means consumers are not keeping pace with the producer. Common causes: slow per-consumer processing (optimize the logic or add instances), frequent rebalancing (tune session.timeout.ms and heartbeat.interval.ms), or a stuck consumer (timeout on a downstream HTTP call).
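Conceptually, lag is a simple difference. The offsets below are made-up numbers; in production you read them from the broker (via Kafka Exporter or kafka-consumer-groups.sh) rather than by hand:

```python
def consumer_lag(end_offsets, committed_offsets):
    """Total lag for a consumer group: messages written but not yet
    processed, summed across partitions."""
    return sum(end_offsets[p] - committed_offsets.get(p, 0)
               for p in end_offsets)

end = {0: 1500, 1: 1498, 2: 1510}        # latest offset per partition
committed = {0: 1500, 1: 1490, 2: 1200}  # group's committed offsets
print(consumer_lag(end, committed))      # 318: partition 2 is falling behind
```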

Monitor consumer lag with Kafka Exporter + Prometheus + Grafana. It is the most important metric in your cluster after disk space.

Operations: the runbook you need

Kafka in production requires regular attention. These are the procedures we document for every deployment.

Continuous monitoring:

  • Disk space per broker (alert at 70%, critical at 85%).
  • Consumer lag per group (alert if growing for 5 consecutive minutes).
  • Under-replicated partitions (should be 0; any value indicates a struggling broker).
  • Produce and fetch request latency (p99 below 100ms for mid-market).
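The lag alert from the checklist reduces to a simple condition over recent samples. This is a sketch; in practice the rule would live in Prometheus alerting, not application code:

```python
def lag_growing(samples, window=5):
    """True if lag increased across the last `window` one-minute samples."""
    recent = samples[-window:]
    return len(recent) == window and all(
        a < b for a, b in zip(recent, recent[1:]))

print(lag_growing([10, 40, 90, 160, 250]))   # True: alert
print(lag_growing([10, 40, 90, 160, 120]))   # False: lag is recovering
```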

Periodic maintenance:

  • Review and adjust retention monthly (data grows; disks do not).
  • Verify downstream data pipelines process at expected rates.
  • Update brokers quarterly (Kafka releases security fixes frequently).
  • Test recovery: simulate a broker failure and verify the cluster recovers without data loss.

Failure scenarios and response:

Broker down: With 3 brokers and one down, the cluster continues operating on replicas. Verify partitions have been reassigned. Restart the broker or replace it. If it does not recover within 30 minutes, force partition reassignment.

Stuck consumer: Identify which consumer group has growing lag. Check consumer logs. Common causes: uncaught exception, timeout on external dependency, thread deadlock. Restart the consumer. If the problem persists, pause the consumer group and investigate.

Disk full: Emergency. Temporarily reduce retention (retention.ms), delete non-critical topics, or add disk. This should never happen if the 70% alert works.

Alternatives to consider first

Before installing Kafka, evaluate whether a simpler solution solves your problem.

Redis Streams: If your volume is low (<10K events/day), Redis Streams offers semantics similar to Kafka (consumer groups, persistence) with much simpler operations. Many companies already run Redis for caching, so it is not a new dependency.

Amazon SQS/SNS: If you are on AWS and need basic queuing and pub-sub without global ordering guarantees, SQS+SNS is serverless, requires no operations, and costs pennies up to considerable volumes.

RabbitMQ: If you need complex routing (exchanges, bindings, dead letter queues) and your volume is moderate, RabbitMQ is more mature in that space and simpler to operate than Kafka.

The decision is always the same: use the simplest tool that solves your current problem with room to grow. If that tool is Kafka, implement it well. If it is not, do not force it.

Starting without over-engineering

If you decide Kafka is the right tool, the implementation plan for a mid-sized company:

  1. Week 1-2: Set up the cluster (3 brokers, preferably with KRaft). Configure basic monitoring. Create the initial 3-5 topics with registered schemas.
  2. Week 3-4: Migrate the first use case. Pick the simplest one (usually logs or audit events). Verify that producing, consuming, and monitoring all work.
  3. Week 5-8: Migrate business use cases. One at a time. Orders, payments, inventory. Each migration includes: producer, consumer(s), lag monitoring, and a rollback playbook.
  4. Month 3+: Stabilize. Adjust partitions and retention based on real data. Document runbooks. Train the team on operations.

If you need help designing and deploying your Kafka cluster, our data engineering team has implemented this architecture for mid-sized companies across logistics and retail. The most common mistake is not choosing the wrong tool. It is implementing everything at once instead of incrementally. Start with one topic. Learn. Add the next.

About the author

abemon engineering

Engineering team

Multidisciplinary engineering, data and AI team headquartered in the Canary Islands. We build, deploy and operate custom software solutions for companies at any scale.