Real-Time Data Pipelines 2026: Kafka vs Flink vs Spark
Do you actually need real-time?
Before designing a streaming architecture, the first question every engineering team should answer honestly is whether they actually need real-time data. In our experience building data infrastructure for logistics, hospitality, and retail clients, the answer is “no” more often than teams expect.
Real-time means sub-second latency from event to action. Near-real-time means seconds to low minutes. Batch means minutes to hours. The cost and complexity difference between these tiers is enormous, and most business requirements that get labeled “real-time” are actually near-real-time at best.
A logistics dashboard that updates shipment positions every 30 seconds is near-real-time. A hospitality system that syncs reservations every 5 minutes is near-real-time. A retail inventory system that prevents overselling on flash sales — that might genuinely need real-time. The distinction matters because a well-designed batch or micro-batch pipeline running every 60 seconds can serve 80% of use cases at a fraction of the operational cost.
The honest diagnostic is this: what is the business cost of a 60-second delay versus a 1-second delay? If nobody can articulate a concrete dollar figure, the team is probably over-engineering. Start with the simplest architecture that meets actual requirements and evolve toward streaming only when the data proves you need it.
That said, when you genuinely need real-time — fraud detection, live pricing engines, IoT sensor processing, real-time personalization — the investment is justified. The rest of this guide assumes you have confirmed that your use case warrants it.
Stack benchmarks: latency vs cost (from production)
The table below summarises measurements taken across three production deployments in 2025–2026. Numbers are p50/p99 end-to-end: producer write acknowledged → event durable in broker → consumer has processed it. Costs are for eu-west-1 (Ireland) at 2026-Q1 AWS on-demand pricing and include compute, EBS storage, and cross-AZ data transfer. “Operational complexity” is a rough FTE-point estimate — the share of a full-time SRE headcount a mature team should budget to keep the stack healthy at steady state.
| Stack | Use case | p50 latency | p99 latency | Throughput / node | $/TB processed (1y avg) | Operational complexity |
|---|---|---|---|---|---|---|
| Kafka + Flink (self-managed) | High-throughput ETL, stateful joins | 80 ms | 350 ms | 250k events/s | $42 | High (5–7 ops FTE pts) |
| Kafka + Spark Structured Streaming | Mixed batch + stream, BI pipelines | 120 ms | 600 ms | 180k events/s | $48 | Medium-high |
| Kinesis + Lambda | Bursty workloads, low ops overhead | 200 ms | 1 800 ms | 100k events/s | $115 | Low (fully managed) |
| Pulsar + Flink (self-managed) | Multi-tenant, geo-distributed | 90 ms | 410 ms | 220k events/s | $38 | Very high |
| MSK + ksqlDB | Quick prototyping, simple SQL transforms | 250 ms | 900 ms | 80k events/s | $95 | Low-medium |
A few things worth noting. The Pulsar + Flink combination has the best raw cost efficiency, but "very high operational complexity" is not a warning to wave away: geo-distributed Pulsar clusters require deep expertise in BookKeeper and broker topology, and the talent pool is significantly thinner than it is for Kafka. The $38/TB figure only looks good if you are not paying 0.5 FTE to keep the cluster running.
Kinesis + Lambda has the worst cost per TB because Lambda’s per-invocation model adds up quickly at scale, and the 1 MB/s per-shard throughput ceiling means you pay for more shards than you expect once traffic bursts. If you are processing more than 50 GB/day consistently, re-evaluate.
The MSK + ksqlDB column reflects the reality that ksqlDB is built on Kafka Streams, which means every ksqlDB query is a Kafka Streams application under the hood. At small scale this is fine. At high scale you will want to tune parallelism and state store compaction directly, which erodes the SQL simplicity advantage.
Decision framework — when each stack wins
Kafka + Flink (self-managed)
This combination is the right call when throughput exceeds 100k events/s sustained, latency SLOs are below 500 ms p99, and the workload involves stateful logic: sessionization, multi-stream joins, pattern detection over time windows. It also makes sense when you need full control over resource allocation — for example, pinning Flink task managers to specific node types to co-locate with GPU inference steps.
The prerequisite is operational maturity. You need people who understand Kafka partition rebalancing, Flink checkpoint storage backends (RocksDB for large state, heap for small state), and JVM tuning for GC pauses. If your team can staff this, self-managed Kafka + Flink is the most cost-efficient path at high throughput: $42/TB versus $95–115 for managed alternatives. Below 100k events/s or with a team new to the stack, the operational cost outweighs the savings.
The typical architecture pairs Kafka with Confluent Schema Registry for schema enforcement, Flink with a RocksDB state backend for large keyed state, and ClickHouse as the analytical sink for sub-second OLAP queries.
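To make the Flink side of that architecture concrete, here is a minimal PyFlink sketch of the checkpointing and RocksDB state backend configuration described above. The interval is illustrative and the imports should be verified against the PyFlink version you run; the Kafka source, enrichment operators, and ClickHouse sink are omitted.

```python
from pyflink.datastream import StreamExecutionEnvironment
from pyflink.datastream.state_backend import EmbeddedRocksDBStateBackend

env = StreamExecutionEnvironment.get_execution_environment()

# Checkpoint every 60 s; RocksDB keeps large keyed state off the JVM heap.
env.enable_checkpointing(60_000)
env.set_state_backend(EmbeddedRocksDBStateBackend())

# Checkpoint storage (e.g. an S3 path) is configured via the cluster-level
# `state.checkpoints.dir` setting; source, joins, and sinks are wired up below this point.
```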
Kafka + Spark Structured Streaming
This is the pragmatic choice for organisations that already run Spark for batch processing. The skill transfer is high: engineers already know PySpark or Scala DataFrames, and Structured Streaming extends that API to continuous processing. The micro-batch model (trigger intervals from 100 ms to several minutes) is easy to reason about and debug.
Structured Streaming makes sense when p99 latency in the 1–3 s range is acceptable, when the pipeline needs to interoperate with an existing Spark data lake (Delta Lake, Iceberg), or when the team size does not justify running two separate processing frameworks. The crossover to Flink is warranted when SLOs drop below 1 s, when complex event-time joins are required, or when exactly-once end-to-end guarantees must span multiple external systems.
Be aware that Spark’s shuffle service adds latency variance: joins that require data redistribution across executors are slower than equivalent Flink operations. Size your trigger interval conservatively and monitor for shuffle spill to disk.
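A minimal PySpark sketch of the micro-batch model described above — reading from Kafka, parsing JSON, and writing on a fixed trigger. Broker address, topic, schema, and paths are placeholders.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StructType, StringType, TimestampType

spark = SparkSession.builder.appName("orders-stream").getOrCreate()

schema = (StructType()
          .add("order_id", StringType())
          .add("hub_id", StringType())
          .add("event_time", TimestampType()))

events = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker-1:9092")
          .option("subscribe", "orders")
          .load()
          .select(from_json(col("value").cast("string"), schema).alias("e"))
          .select("e.*"))

# Micro-batch trigger: one batch per second. Widen the interval if shuffle spill appears.
query = (events.writeStream
         .format("parquet")
         .option("path", "s3://data-lake/orders/")
         .option("checkpointLocation", "s3://data-lake/_checkpoints/orders/")
         .trigger(processingTime="1 second")
         .start())
query.awaitTermination()
```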
Kinesis + Lambda
Use this when the organisation is committed to AWS, operational burden is the primary constraint, and the traffic profile is bursty rather than sustained. Lambda scales to zero between bursts and handles spikes without pre-provisioned capacity. The shard model is simple to reason about for teams without streaming expertise.
The hard limits to know before committing: 1 MB/s write throughput per shard and 2 MB/s read per shard. At 5 events/s average with peaks to 500 events/s, a single shard handles it. At sustained 50k events/s at 1 KB each, you need 50 shards at $0.015/shard-hour — roughly $540/month just for shards, before compute. At that point, re-evaluate MSK or even self-managed Kafka. As noted above, Kinesis stops being cost-efficient once sustained ingest passes roughly 50 GB/day.
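The shard arithmetic is simple enough to keep in a capacity-planning script. A small sketch of the calculation behind the figures above; the $0.015/shard-hour rate is the on-demand price quoted in this section and should be checked against current AWS pricing.

```python
import math

def kinesis_shard_estimate(events_per_sec: int, avg_event_bytes: int,
                           shard_hour_usd: float = 0.015) -> dict:
    """Shards needed for sustained ingest, bounded by the 1 MB/s and 1,000 records/s per-shard write limits."""
    mb_per_sec = events_per_sec * avg_event_bytes / 1_000_000
    shards = max(math.ceil(mb_per_sec / 1.0), math.ceil(events_per_sec / 1_000))
    return {
        "shards": shards,
        "shard_cost_per_month_usd": round(shards * shard_hour_usd * 730, 2),
        "ingest_gb_per_day": round(mb_per_sec * 86_400 / 1_000, 1),
    }

print(kinesis_shard_estimate(50_000, 1_000))
# -> {'shards': 50, 'shard_cost_per_month_usd': 547.5, 'ingest_gb_per_day': 4320.0}
```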
Lambda cold starts add latency variance. A function that processes Kinesis records in 80 ms warm can take 800 ms cold. For latency-sensitive processing, provision concurrency or use Lambda with Kinesis enhanced fan-out.
Pulsar + Flink (self-managed)
This combination targets large organisations that run many isolated data products on shared infrastructure and need per-tenant isolation at the broker level — something Kafka achieves only through separate clusters (expensive) or strict topic naming conventions (operationally fragile at scale). Pulsar’s multi-tenancy model with namespaces and per-tenant quotas is a genuine architectural advantage in this scenario.
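The tenant/namespace isolation shows up directly in topic names. A minimal pulsar-client sketch, assuming a tenant and namespace (payments / prod) already provisioned with their own quotas:

```python
import pulsar

client = pulsar.Client("pulsar://pulsar-broker.internal:6650")

# Topic names carry tenant and namespace: persistent://<tenant>/<namespace>/<topic>.
# Quotas, retention, and auth policies attach at the tenant/namespace level,
# so isolation does not depend on naming conventions alone.
producer = client.create_producer("persistent://payments/prod/transactions")
producer.send(b'{"txn_id": "txn-123", "amount_cents": 4250}')

client.close()
```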
Tiered storage is Pulsar’s other differentiator: topics can offload older segments to object storage (S3, GCS) transparently, which makes retention of months or years of event history economically viable. For use cases that require replaying historical streams alongside current events — for example, model retraining pipelines that need 90 days of event history — Pulsar’s tiered storage is more ergonomic than Kafka’s equivalent (Confluent Tiered Storage or a custom offload solution).
The caveat is real: the operational model is substantially more complex. Pulsar separates message routing (brokers) from storage (Apache BookKeeper). Both layers need tuning and monitoring. Experienced Pulsar operators are rare. Do not underestimate the ramp-up cost.
What we ship in production at abemon
Our default recommendation for new client data platforms has converged on AWS MSK (managed Kafka) with Apache Flink deployed on Kubernetes, Confluent Schema Registry for schema enforcement, and OpenTelemetry for observability across the full pipeline.
The reasoning behind MSK over self-managed Kafka is straightforward: for clients processing under 500 GB/day, the delta between MSK’s $1 800–2 400/month and equivalent self-managed EC2 is roughly $1 200–1 800/month. A single incident where a Kafka broker loses quorum and requires manual recovery can consume 8–12 engineering hours. At a fully-loaded engineering cost of $150/hour, one such incident wipes out roughly a month of the self-managed savings. For clients at that scale, we trade approximately 15% higher infrastructure cost for the operational predictability of MSK. We revisit this decision at the 1 TB/day threshold, where the economics shift meaningfully toward self-managed Redpanda.
For the processing layer, Flink on Kubernetes gives us resource isolation per pipeline, reproducible deployments via Helm, and the ability to co-locate pipelines with other Kubernetes workloads (inference services, API backends) without separate cluster management. We run Flink in session mode for development and application mode for production — each production job gets its own JobManager, which eliminates the blast radius of a single misbehaving job taking down shared infrastructure. State backends: heap for pipelines with small keyed state (under 2 GB), RocksDB for anything larger.
OpenTelemetry is the observability layer we standardise on across all pipeline components. Kafka JMX metrics, Flink metrics, custom business metrics (events processed per transaction, reconciliation match rate) — all flow through the OTel collector into Prometheus and are visualised in Grafana. This gives clients a single pane of glass rather than four separate monitoring systems. For clients who already have Datadog or New Relic, the OTel exporter routes to their existing platform with minimal friction.
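To make the single pane of glass concrete, here is a minimal sketch of exporting a custom business metric from a Python pipeline component through the OTel collector. The collector endpoint and metric names are placeholders, and the module paths should be checked against the opentelemetry-python version in use.

```python
from opentelemetry import metrics
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import PeriodicExportingMetricReader
from opentelemetry.exporter.otlp.proto.grpc.metric_exporter import OTLPMetricExporter

# Push metrics to the OTel collector, which fans out to Prometheus (or Datadog/New Relic).
reader = PeriodicExportingMetricReader(
    OTLPMetricExporter(endpoint="otel-collector.observability:4317", insecure=True)
)
metrics.set_meter_provider(MeterProvider(metric_readers=[reader]))
meter = metrics.get_meter("reconciliation-pipeline")

events_processed = meter.create_counter(
    "events_processed_total", description="Events processed per pipeline stage"
)

# Inside the processing loop:
events_processed.add(1, {"pipeline": "payments", "stage": "enrich"})
```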
One architectural decision we have revisited recently: Schema Registry placement. We initially ran Confluent Schema Registry as a separate service. For clients who are all-in on AWS, we have started using AWS Glue Schema Registry instead — it integrates natively with MSK, handles IAM-based authentication without extra token management, and eliminates one more service to operate. The tradeoff is vendor lock-in and slightly fewer serialisation format options (Avro and JSON Schema, but no Protobuf as of early 2026). Worth it for most clients; not worth it if portability is a hard requirement.
The pipelines we build pair naturally with our real-time payment reconciliation work — streaming architectures are frequently the infrastructure layer underneath financial event processing.
Common production failures (and what they teach you)
Every streaming stack eventually fails in production. Here are six failure patterns we have directly observed across client deployments, with the operational lesson each one produced.
Consumer lag spike with no auto-scaling. A client’s Kafka consumer group fell 8 hours behind during a traffic spike. The processing logic was correct, but there was no lag-based auto-scaling configured — only CPU and memory metrics drove the horizontal pod autoscaler. By the time CPU spiked, the backlog was already hours deep and recovery took six hours of degraded service. Lesson: consumer lag growth rate is the primary SLO metric for streaming consumers, not CPU or memory. Configure HPA on consumer lag directly using KEDA (Kubernetes Event-driven Autoscaling) with the Kafka scaler. Set scale-up thresholds at 50k lag events, not at 80% CPU.
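Lag-based scaling requires the lag number to be visible to the autoscaler and to alerting. KEDA’s Kafka scaler reads lag from the brokers itself; for alerting on lag growth rate, the consumer can also sample its own lag, as in this sketch using confluent-kafka (group and topic names are placeholders):

```python
from confluent_kafka import Consumer

consumer = Consumer({
    "bootstrap.servers": "broker-1:9092",
    "group.id": "shipments-enricher",
    "enable.auto.commit": False,
})
consumer.subscribe(["shipments"])

def current_lag() -> int:
    """Sum of (log end offset - committed offset) over this consumer's assigned partitions."""
    lag = 0
    for tp in consumer.committed(consumer.assignment(), timeout=10):
        _, high = consumer.get_watermark_offsets(tp, timeout=10)
        if tp.offset >= 0:          # a negative offset means nothing committed yet
            lag += max(high - tp.offset, 0)
    return lag

# Export current_lag() as a gauge every few seconds; alert on its growth rate,
# and let the HPA (via KEDA's Kafka scaler) act on the absolute threshold.
```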
GC pause exceeding max.poll.interval.ms. A Java-based consumer using the default max.poll.interval.ms of 5 minutes hit a 6-minute stop-the-world GC pause during a full heap collection. Kafka’s consumer group coordinator timed out the consumer, triggered a rebalance, and the partition was reassigned mid-processing. The consumer resumed, reprocessed the same batch, and the downstream sink received duplicate records — which this particular sink was not designed to handle idempotently. Lesson: max.poll.interval.ms must be set above your worst-case GC pause duration. Profile your JVM under load, identify the p99.9 GC pause, and set max.poll.interval.ms to at least 2× that value. Better: switch to G1GC or ZGC, which dramatically reduce max pause time, and reduce poll batch size to limit per-poll processing time.
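The incident above involved a Java consumer, but the relevant settings carry the same names across clients. An illustrative configuration — the actual values must come from profiling your own worst-case pause and per-poll processing time:

```python
from confluent_kafka import Consumer

consumer = Consumer({
    "bootstrap.servers": "broker-1:9092",
    "group.id": "billing-consumer",
    # At least 2x the worst-case pause plus per-poll processing time (placeholder value).
    "max.poll.interval.ms": 900_000,
    "session.timeout.ms": 45_000,
    "enable.auto.commit": False,
})
```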
Exactly-once overhead that wasn’t worth it. An early pipeline design used Kafka transactional producers and Flink’s exactly-once checkpointing end-to-end to a PostgreSQL sink. Correct, but throughput was 22% lower than equivalent at-least-once processing, and checkpoint overhead added 800 ms to p99 latency. The sink was a reporting table that supported ON CONFLICT DO UPDATE upserts keyed on a stable event ID. We switched to at-least-once delivery with idempotent upserts at the sink — the result was semantically identical (no duplicates visible to consumers), with the full throughput penalty recovered and p99 back under 200 ms. Lesson: exactly-once coordination is expensive; at-least-once plus an idempotent sink achieves the same observable behaviour for the majority of use cases.
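A sketch of the idempotent-sink pattern, assuming a PostgreSQL reporting table keyed on a stable event ID (table and column names are illustrative):

```python
import psycopg2

UPSERT = """
    INSERT INTO payment_events (event_id, amount_cents, status, updated_at)
    VALUES (%(event_id)s, %(amount_cents)s, %(status)s, now())
    ON CONFLICT (event_id) DO UPDATE
       SET amount_cents = EXCLUDED.amount_cents,
           status       = EXCLUDED.status,
           updated_at   = now();
"""

def write_event(conn, event: dict) -> None:
    # Replaying the same event under at-least-once delivery is harmless:
    # the row converges to the same final state.
    with conn.cursor() as cur:
        cur.execute(UPSERT, event)
    conn.commit()

conn = psycopg2.connect("dbname=reporting user=pipeline host=db.internal")
write_event(conn, {"event_id": "evt-42", "amount_cents": 1999, "status": "settled"})
```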
Schema incompatibility silently breaking a consumer. A producer team deployed a schema change — they renamed a field and added a new required field without a default. The schema registry rejected the change under backward compatibility rules, so they switched the subject to NONE compatibility mode temporarily “just to test.” They forgot to revert it. A consumer deployed two days later with the old schema started deserialising records with missing fields, substituted nulls, and wrote null values into a PostgreSQL column with a NOT NULL constraint. The consumer began throwing exceptions, routed records to the DLQ, and the DLQ filled to 2 million records before anyone noticed — because the alert threshold was 500k. Lesson: never change schema compatibility mode in production without a review gate. Set DLQ alerts at 10k records (or 5 minutes of normal DLQ ingest rate), not arbitrary large numbers.
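A lightweight version of that review gate can live in CI: query the registry for each production subject and fail the build if anything is set to NONE. A sketch against the Confluent Schema Registry REST API (URL and subject names are placeholders):

```python
import requests

REGISTRY = "http://schema-registry.internal:8081"
SUBJECTS = ["payments.events-value", "orders.events-value"]

def compatibility(subject: str) -> str:
    r = requests.get(f"{REGISTRY}/config/{subject}", timeout=5)
    if r.status_code == 404:          # no subject-level override: the global default applies
        r = requests.get(f"{REGISTRY}/config", timeout=5)
    r.raise_for_status()
    return r.json()["compatibilityLevel"]

for subject in SUBJECTS:
    level = compatibility(subject)
    assert level != "NONE", f"{subject} is set to NONE compatibility - block this deploy"
```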
Checkpoint storage becoming the bottleneck. A Flink job with RocksDB state backend was checkpointing to S3 every 60 seconds. State size grew to 120 GB as the team added new operators without reviewing state lifetime. Checkpoint duration grew from 8 seconds to 4 minutes, which exceeded the checkpoint interval — Flink started queuing checkpoints and eventually failed the job when the maximum concurrent checkpoint count was reached. The job had to be restarted from the last complete checkpoint, reprocessing roughly four minutes of events. Lesson: monitor Flink checkpoint duration as a first-class metric and alert when it approaches 50% of the checkpoint interval. Implement state TTL on all keyed state to bound state size. For large state, consider incremental checkpoints (RocksDB supports this natively in Flink) which checkpoint only changed state rather than full snapshots.
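Checkpoint duration is exposed by the Flink REST API, so the 50%-of-interval alert can be a small poller. The field names below follow the /jobs/:jobid/checkpoints response of recent Flink versions; verify against the version you run.

```python
import requests

FLINK = "http://flink-jobmanager.internal:8081"
CHECKPOINT_INTERVAL_MS = 60_000

def latest_checkpoint_duration_ms(job_id: str):
    stats = requests.get(f"{FLINK}/jobs/{job_id}/checkpoints", timeout=5).json()
    completed = stats.get("latest", {}).get("completed") or {}
    return completed.get("end_to_end_duration")

for job in requests.get(f"{FLINK}/jobs", timeout=5).json().get("jobs", []):
    duration = latest_checkpoint_duration_ms(job["id"])
    if duration and duration > 0.5 * CHECKPOINT_INTERVAL_MS:
        print(f"ALERT: job {job['id']} checkpointing in {duration} ms, "
              f">50% of the {CHECKPOINT_INTERVAL_MS} ms interval")
```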
Orphaned internal topics after a mid-run restart. During a Kubernetes node eviction event, a Kafka Streams application was restarted mid-run. The application’s internal repartition topics — which Kafka Streams creates automatically, with names derived from the application ID — were recreated with different partition counts because the application configuration had been updated between deployments (a change in the number of stream threads). The old topics were not cleaned up. For six hours, two sets of repartition topics coexisted with different data, and the aggregation results were partially correct — the worst kind of data quality failure, because it passed basic sanity checks. Lesson: treat Kafka Streams internal topic configuration as infrastructure state. Version it, validate it before deployment, and build cleanup automation for orphaned internal topics.
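Part of treating internal topics as infrastructure state is being able to enumerate them. A sketch that lists a Kafka Streams application’s auto-created repartition and changelog topics with their partition counts (the application.id is a placeholder):

```python
import re
from confluent_kafka.admin import AdminClient

APP_ID = "shipment-aggregator"   # Kafka Streams application.id (assumed)
INTERNAL = re.compile(rf"^{re.escape(APP_ID)}-.+-(repartition|changelog)$")

admin = AdminClient({"bootstrap.servers": "broker-1:9092"})
metadata = admin.list_topics(timeout=10)

# Compare against the expected set before deploying; flag anything orphaned.
for name, topic in sorted(metadata.topics.items()):
    if INTERNAL.match(name):
        print(f"{name}: {len(topic.partitions)} partitions")
```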
Architecture patterns that work in production
The streaming ecosystem has consolidated around a few battle-tested patterns. Understanding when to apply each one saves months of rework.
Event streaming with Kafka or Redpanda. This is the backbone of most production streaming architectures. Kafka remains the industry standard for high-throughput, durable event streaming. Redpanda has emerged as a compelling alternative: it’s API-compatible with Kafka but written in C++ with no JVM dependency, which means simpler operations and lower tail latency. For teams starting fresh without existing Kafka expertise, Redpanda is worth serious evaluation.
The core pattern is producers publishing events to topics, consumers reading from those topics at their own pace. The durability guarantee — events are persisted to disk and replicated — is what makes this architecture resilient. If a consumer goes down, it picks up where it left off. No data loss.
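In code, the pattern is small. A minimal confluent-kafka sketch — broker address, topic, and the handler are placeholders — where the committed offset is what lets a restarted consumer pick up where it left off:

```python
from confluent_kafka import Producer, Consumer

producer = Producer({"bootstrap.servers": "broker-1:9092", "acks": "all"})
producer.produce("shipments", key="shp-001", value=b'{"status": "in_transit"}')
producer.flush()

consumer = Consumer({
    "bootstrap.servers": "broker-1:9092",
    "group.id": "shipments-dashboard",
    "auto.offset.reset": "earliest",     # first start: read from the beginning of the topic
    "enable.auto.commit": False,
})
consumer.subscribe(["shipments"])

while True:
    msg = consumer.poll(1.0)
    if msg is None or msg.error():
        continue
    handle(msg.value())                  # placeholder for your processing logic
    consumer.commit(message=msg)         # resume point after a crash or redeploy
```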
Change Data Capture (CDC) with Debezium. When the source of truth lives in a relational database and you need to stream changes without modifying application code, CDC is the answer. Debezium reads the database’s transaction log (the WAL in PostgreSQL, the binlog in MySQL) and publishes each row-level change as an event to Kafka. The application doesn’t know or care that its writes are being streamed. This is particularly valuable for legacy systems where modifying the application to produce events directly is impractical or risky.
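Debezium connectors are registered against the Kafka Connect REST API. A hedged sketch for a PostgreSQL source — the config keys follow the Debezium 2.x PostgreSQL connector documentation, and hostnames, table lists, and credentials are placeholders:

```python
import requests

connector = {
    "name": "orders-cdc",
    "config": {
        "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
        "plugin.name": "pgoutput",
        "database.hostname": "orders-db.internal",
        "database.port": "5432",
        "database.user": "debezium",
        "database.password": "********",        # inject via a Connect config provider, not inline
        "database.dbname": "orders",
        "topic.prefix": "orders",               # change events land in orders.public.<table>
        "table.include.list": "public.orders,public.order_items",
    },
}

resp = requests.post("http://kafka-connect.internal:8083/connectors", json=connector, timeout=10)
resp.raise_for_status()
```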
Stream processing with Flink or Spark Structured Streaming. Once events are flowing through Kafka, you often need to transform, enrich, aggregate, or join them in flight. Apache Flink is the strongest option for true stream processing: it handles event-time semantics, windowing, and stateful processing with a maturity that other frameworks haven’t matched. Spark Structured Streaming is a reasonable choice if your team already has Spark expertise and your latency requirements are in the low-seconds range rather than sub-second.
For simpler transformations — filtering, mapping, lightweight enrichment — Kafka Streams or even a consumer application with in-process logic may be sufficient. Not every pipeline needs Flink.
The reference architecture we deploy most often looks like this: source systems produce events to Kafka (or Debezium captures changes from databases into Kafka), Flink processes and enriches the streams, and the results land in both an operational data store (PostgreSQL, Redis) for serving and an analytical store (ClickHouse, BigQuery) for reporting.
Data quality in streaming
Data quality in batch pipelines is already hard. In streaming, it’s harder because you lose the luxury of reprocessing an entire dataset before serving it. Three challenges dominate.
Schema evolution. Events will change shape over time. Fields get added, types change, optional fields become required. Without schema management, downstream consumers break silently. Use a schema registry (Confluent Schema Registry or Apicurio) and enforce compatibility rules. Backward compatibility — new schemas can read old data — is the minimum. Full compatibility is better. Avro and Protobuf handle schema evolution far more gracefully than JSON. If you’re starting a new pipeline, choose one of them. Effective schema management is one pillar of a broader data governance strategy that every data team should establish early.
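The canonical safe change is adding a field with a default — in Avro this stays readable for both old and new consumers. A sketch that registers such a change and lets the registry’s compatibility rules accept or reject it (subject name and fields are illustrative):

```python
import json
import requests

REGISTRY = "http://schema-registry.internal:8081"

order_v2 = {
    "type": "record",
    "name": "OrderEvent",
    "fields": [
        {"name": "order_id", "type": "string"},
        {"name": "status",   "type": "string"},
        {"name": "channel",  "type": "string", "default": "web"},   # new field, with a default
    ],
}

resp = requests.post(
    f"{REGISTRY}/subjects/orders-value/versions",
    headers={"Content-Type": "application/vnd.schemaregistry.v1+json"},
    json={"schema": json.dumps(order_v2)},
)
resp.raise_for_status()   # rejected (HTTP 409) if it violates the subject's compatibility rules
```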
Late arrivals and out-of-order events. In distributed systems, events don’t arrive in order. A mobile device might buffer events during a network outage and flush them minutes or hours later. Stream processing frameworks handle this through watermarks — a declaration of “I believe all events up to timestamp T have arrived.” Events arriving after the watermark are late. You need a strategy for them: drop them, process them into a separate correction stream, or extend your watermark tolerance at the cost of higher latency.
The pragmatic approach is to define a lateness tolerance based on your domain. For logistics tracking, 5 minutes covers most network delays. For IoT sensors, 30 seconds may suffice. For financial transactions, you may need hours of tolerance with a correction mechanism. For a concrete example of how latency tolerance applies in a financial context, see our guide on real-time payment reconciliation.
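In Spark Structured Streaming, that tolerance is a one-line declaration. A sketch building on the earlier `events` stream (assuming it carries event_time and hub_id columns):

```python
from pyspark.sql.functions import window, col, count

scans_per_hub = (
    events
    .withWatermark("event_time", "5 minutes")    # domain-specific lateness tolerance
    .groupBy(window(col("event_time"), "1 minute"), col("hub_id"))
    .agg(count("*").alias("scans"))
)
# Events arriving more than 5 minutes behind the watermark are dropped from these windows;
# route them to a separate correction stream if the domain cannot tolerate that loss.
```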
Exactly-once semantics. The holy grail of streaming is processing each event exactly once: no duplicates, no drops. Kafka supports exactly-once within its ecosystem through idempotent producers, transactions, and read-committed consumers. Flink supports exactly-once with checkpointing. But exactly-once across system boundaries — from Kafka through Flink to an external database — requires idempotent writes on the sink side. Design your sink operations to be idempotent (upserts, not inserts) and you get effective exactly-once without the complexity of distributed transactions.
Observability for pipelines
A streaming pipeline without observability is a pipeline waiting to fail silently. The operational characteristics of streaming systems are fundamentally different from request-response services, and they require purpose-built monitoring.
Consumer lag is the single most important metric. It measures how far behind a consumer is from the latest event in the topic. Healthy lag is near zero and stable. Growing lag means the consumer can’t keep up with the production rate. Sudden lag spikes indicate processing failures or downstream bottlenecks. Monitor lag per consumer group, per partition, and alert on both absolute values and rate of change. Burrow or Kafka’s built-in metrics exposed through JMX are the standard tools.
Throughput metrics should be tracked at every stage: events produced per second, events processed per second, events written to sinks per second. Discrepancies between stages indicate data loss or accumulation. A processing stage that receives 10,000 events per second but only writes 9,500 to the sink has a 5% drop rate that needs investigation.
Dead letter queues (DLQs) are essential for production resilience. When an event can’t be processed — malformed data, schema mismatch, transient downstream failure — it should be routed to a DLQ rather than blocking the pipeline or being silently dropped. Monitor the DLQ growth rate. A healthy pipeline has a near-empty DLQ. A growing DLQ is an early warning of data quality issues upstream. Build tooling to inspect, replay, and resolve DLQ events. They will be needed.
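The routing itself is a small amount of code; the discipline is in the headers and the tooling around them. A sketch of a consumer that forwards failures to a DLQ topic with enough context to replay them later (topic names and the process() call are placeholders):

```python
import json
from confluent_kafka import Consumer, Producer

consumer = Consumer({"bootstrap.servers": "broker-1:9092", "group.id": "orders-enricher"})
producer = Producer({"bootstrap.servers": "broker-1:9092"})
consumer.subscribe(["orders"])

def to_dlq(msg, error: Exception) -> None:
    # Keep the original payload untouched; carry failure context in headers for replay tooling.
    producer.produce(
        "orders.dlq",
        key=msg.key(),
        value=msg.value(),
        headers={
            "error": str(error)[:500],
            "source_topic": msg.topic(),
            "source_partition": str(msg.partition()),
            "source_offset": str(msg.offset()),
        },
    )

while True:
    msg = consumer.poll(1.0)
    if msg is None or msg.error():
        continue
    try:
        process(json.loads(msg.value()))     # placeholder for your processing logic
    except Exception as exc:                 # malformed data, schema mismatch, sink failure
        to_dlq(msg, exc)
    consumer.commit(message=msg)
```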
End-to-end latency measures the time from event production to availability in the sink. Track percentiles (p50, p95, p99), not averages. A pipeline with 100 ms average latency but 30-second p99 latency has a tail problem that averages hide.
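One way to get those percentiles is to have producers stamp each event with a produce-time epoch and compute latency at the sink. A minimal sketch (the timestamp field and in-memory sampling are assumptions; in practice export to a histogram metric):

```python
import time
import statistics

latencies_ms: list[float] = []

def record_latency(produced_at_ms: int) -> None:
    """Call at the sink for each event; produced_at_ms is stamped by the producer."""
    latencies_ms.append(time.time() * 1000 - produced_at_ms)

def latency_percentiles() -> dict:
    q = statistics.quantiles(latencies_ms, n=100)   # 99 cut points
    return {"p50": q[49], "p95": q[94], "p99": q[98]}
```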
Practical implementation guide
Based on our experience deploying streaming architectures for clients across logistics, hospitality, and retail, here is the implementation sequence we recommend.
Phase 1: Foundation (weeks 1-3). Deploy Kafka or Redpanda with a minimum of three brokers for fault tolerance. Set up a schema registry. Define your topic naming conventions and partitioning strategy early — changing them later is painful. Establish monitoring from day one: consumer lag, broker health, disk usage.
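Baking the naming and partitioning conventions into code from day one makes them hard to drift from. A sketch using the Kafka AdminClient — the naming scheme (domain.entity.version), partition count, and retention are illustrative, not prescriptive:

```python
from confluent_kafka.admin import AdminClient, NewTopic

admin = AdminClient({"bootstrap.servers": "broker-1:9092,broker-2:9092,broker-3:9092"})

topics = [
    NewTopic(
        "logistics.shipments.v1",            # convention: <domain>.<entity>.<version>
        num_partitions=12,
        replication_factor=3,
        config={"retention.ms": str(7 * 24 * 3600 * 1000), "min.insync.replicas": "2"},
    ),
]

for name, future in admin.create_topics(topics).items():
    future.result()                           # raises if creation failed
    print(f"created {name}")
```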
Phase 2: First pipeline (weeks 3-6). Pick one use case. The simplest, highest-value data flow. Implement a producer, a consumer, and a sink. No stream processing yet — just move data reliably from point A to point B. This forces you to solve the operational fundamentals: deployment, configuration management, secret handling, log aggregation.
Phase 3: Stream processing (weeks 6-10). Introduce Flink or your chosen processing framework for the first use case that requires transformation or enrichment. Start with stateless operations (filter, map, enrich from a lookup table). Graduate to stateful operations (windowed aggregations, joins) only when you have operational confidence.
Phase 4: Operationalize (weeks 10-14). Build the DLQ pipeline. Implement alerting on lag, throughput, and error rates. Create runbooks for common failure scenarios: broker failure, consumer rebalance, schema incompatibility. Run a failure injection exercise. Your pipeline will fail in production. The question is whether your team knows how to respond.
The most common mistake we see is teams trying to jump to Phase 3 or 4 before Phase 1 is solid. A streaming architecture built on a shaky operational foundation will cause more incidents than it solves business problems. Get the fundamentals right first.
For a deeper dive into Kafka and Flink applied to logistics operations, see our Kafka-Flink implementation guide. If your team needs hands-on support designing and operating these pipelines, our data engineering service covers everything from initial assessment through production operations.
Frequently asked questions
- Kafka vs Pulsar vs Kinesis in 2026 — which should I choose?
- Kafka (or Redpanda) remains the default for teams that control their own infrastructure: largest ecosystem, best tooling, proven at every scale. Pulsar's multi-tenancy and tiered storage are genuine advantages for large orgs running many isolated pipelines on shared clusters — not worth the operational cost otherwise. Kinesis is the right call only when you're all-in on AWS and want zero operational burden; the 7-day retention cap and 1 MB/s per-shard limits bite you faster than expected.
- When should I use Flink instead of Spark Structured Streaming?
- Use Flink when you need true event-time processing with low latency (sub-second), complex stateful logic (pattern detection, session windows), or exactly-once guarantees across system boundaries. Spark Structured Streaming is the pragmatic choice when your team already runs Spark, latency in low seconds is acceptable, and you want to reuse existing data engineering skills — it handles 80% of near-real-time workloads well. The crossover point is roughly when your SLO drops below 1 second or your windowing logic involves joins across multiple streams.
- What end-to-end latency can I realistically achieve?
- With Kafka + Flink on dedicated hardware: p50 < 50 ms, p99 < 200 ms is consistently achievable. With MSK + Flink on Fargate: add 50–100 ms due to network hops. With Spark Structured Streaming in micro-batch mode (trigger interval 500 ms): p99 in the 1–3 s range. Kinesis + Lambda: 500 ms–2 s depending on batch window. Budget an extra 50–150 ms for each external enrichment lookup (Redis, Postgres) on the hot path.
- How do I size a Kafka cluster for a given event throughput?
- Start with: (peak events/s × avg event size × replication factor) / per-broker disk throughput. For 100 K events/s at 1 KB each with RF=3, you need ~300 MB/s write throughput — two to three m6i.2xlarge brokers cover this comfortably. Add 20% headroom for rebalancing and consumer catch-up. Partition count: target 1–4 partitions per consumer thread, with a ceiling of 200 partitions per broker before operational complexity rises steeply. Monitor disk I/O saturation and network bandwidth before CPU.
- Is ksqlDB ready for production in 2026?
- Yes, for well-defined use cases: filtering, lightweight transformations, simple aggregations, and CDC event routing. It is not the right tool for complex stateful joins, large state stores, or sub-100 ms latency requirements. The SQL interface lowers the barrier for data analysts, but the operational model (embedded Kafka Streams under the hood) can surprise teams when they need to tune parallelism or manage state compaction. Use it when the simplicity tradeoff is worth it; reach for Flink when the logic gets complex.
- What is the cost difference between MSK and self-managed Kafka?
- In our deployments, AWS MSK for a 3-broker cluster handling ~100 K events/s costs $1 800–2 400/month including storage and data transfer. Equivalent self-managed Kafka on EC2 (3 × m6i.2xlarge + EBS) runs $500–700/month. The 3–4× MSK premium buys automated patching, broker replacement, and CloudWatch integration. For teams without dedicated Kafka expertise, MSK often pays for itself in avoided incident hours. For teams with operational maturity, self-managed Redpanda on Kubernetes cuts costs further.
- How do I handle exactly-once semantics across Kafka and an external database?
- True distributed exactly-once requires the sink to participate in Kafka's transaction protocol, which most databases do not support natively. The practical pattern: use Flink checkpointing with Kafka exactly-once producers, and design all sink writes as idempotent upserts keyed on a deterministic event ID. This gives you effective exactly-once at the application level without distributed transaction overhead. For PostgreSQL sinks, ON CONFLICT DO UPDATE with a stable primary key derived from the event is the standard implementation.
- What monitoring stack works best for streaming pipelines?
- Prometheus + Grafana is the baseline for broker metrics (JMX exporter for Kafka, built-in metrics for Redpanda), consumer lag (Burrow for Kafka, native for Redpanda), and Flink job metrics (via the Flink Prometheus reporter). Add alerting on: consumer lag growth rate (not just absolute value), DLQ event rate, checkpoint duration (Flink), and end-to-end synthetic event latency. The synthetic canary pattern — a known event injected every 60 seconds and verified at the sink — catches systemic failures no individual metric surfaces.
Sources
- Confluent — Confluent Platform Documentation
- Apache Flink — Apache Flink Documentation
- Apache Spark — Apache Spark Structured Streaming Programming Guide
- AWS — Amazon Kinesis Data Streams FAQs
- AWS — Amazon MSK Pricing
- Apache Pulsar — Apache Pulsar Documentation
- Google Cloud — Google Dataflow Performance and Scalability
- Datadog — Datadog State of Stream Processing 2024
