Kafka in Production: Operational Lessons From a 200-Broker Fleet
Kafka has the reputation of being either rock-solid infrastructure that runs for years untouched or a fragile beast that consumes engineering attention disproportionate to its visible role. The truth, as usual, is more nuanced. Properly operated, Kafka is one of the most dependable components in our stack. Improperly operated, it produces incidents that are hard to diagnose and cascade across consuming services in surprising ways. This article distills four years of lessons from running a fleet that has grown from twelve brokers to two hundred across multiple regions.
The KRaft Migration Was Worth the Pain
Until Kafka 3.3, every cluster required ZooKeeper. ZooKeeper was operationally painful: a separate distributed system with its own consensus protocol, scaling characteristics, and failure modes. Worse, ZooKeeper outages caused Kafka cluster instability even when the brokers themselves were healthy.
KRaft (Kafka Raft) eliminates ZooKeeper by moving metadata management into the brokers themselves using a Raft consensus protocol. We completed our migration to KRaft mode in late 2025 across all production clusters. The benefits have been substantial:
- Cluster startup time dropped from approximately 15 minutes to under 4 minutes for our largest cluster.
- Controller failover, previously a 30-second blip, now completes in under 5 seconds.
- Operational complexity decreased: one system to monitor and upgrade instead of two.
- Per-partition metadata overhead is significantly lower, allowing higher partition counts per cluster.
The migration itself was not trivial. The dual-write phase, where metadata is written to both ZooKeeper and KRaft simultaneously, requires careful coordination. We staged the migration over six weeks per cluster, with explicit rollback checkpoints. Anyone planning this migration should read the upstream documentation thoroughly and test against a representative staging environment.
Partition Count Is Architecture, Not Configuration
The most common Kafka mistake we see is treating partition count as a knob to tune later. It is not. Partition count is a structural decision with consequences that are expensive to reverse.
Too few partitions and your consumer parallelism is capped. Too many partitions and the overhead of metadata, leader election, and replication starts to dominate. Specifically:
- Each partition consumes broker memory for replication state.
- Each partition adds metadata that must be replicated across the cluster.
- Each partition is an additional file handle and seek point on disk.
- Each partition is a unit of leadership that must be balanced.
Our guidance: aim for 1000-3000 partitions per broker as a soft cap. For a 20-broker cluster, that means roughly 30,000-60,000 total partitions across all topics. Plan partition counts at topic creation based on expected peak throughput, not current load. Increasing partition count later is possible but rebalances consumer groups and can break ordering guarantees within keys.
Replication and ISR: Where the Subtleties Live
Replication factor 3 is the standard recommendation, but the details matter more than the number. The key configuration is min.insync.replicas, which determines how many replicas must acknowledge a write before it is considered durable.
For critical data (financial events, audit logs), we use:
replication.factor=3
min.insync.replicas=2
unclean.leader.election.enable=false
The combination means: writes succeed only if 2 of 3 replicas acknowledge, and leadership cannot transfer to an out-of-sync replica during failure. This trades availability for durability. If two brokers fail simultaneously, writes block until at least one returns.
For high-volume telemetry where occasional data loss is tolerable, we use a more permissive configuration:
replication.factor=3
min.insync.replicas=1
unclean.leader.election.enable=true
This trades durability for availability. Writes can continue with only one in-sync replica, and unclean leader election allows recovery from situations where all in-sync replicas fail.
The mistake is using one configuration uniformly. Different data has different durability requirements, and Kafka is flexible enough to express them.
Consumer Lag Is Not One Metric
Monitoring consumer lag is essential, but the default lag metric is misleading at scale. Total lag across a topic does not tell you whether one partition is stalling and the rest are healthy. Average lag is similarly misleading.
Our lag monitoring includes:
- Per-partition lag -- alerts trigger when any single partition's lag exceeds threshold, not when total lag does.
- Lag time, not lag count -- a partition with 100,000 messages backlog might be 30 seconds or 30 minutes behind depending on message size and consumer throughput. We compute lag time by joining offset commits with message production timestamps.
- Consumer group rebalance frequency -- frequent rebalances cause spurious lag and indicate underlying issues with consumer health.
- Lag derivative -- a stable lag of 100,000 messages might be acceptable. A lag growing at 10,000 messages per minute is an incident regardless of absolute value.
Burrow remains our preferred lag-monitoring tool, with custom alerting on top to capture these dimensions.
Disk Layout and IO Patterns
Kafka's performance characteristics are dominated by disk IO patterns. Producers write sequentially to active segments; consumers read sequentially through committed offsets. The kernel page cache handles most reads when consumers are caught up.
This pattern breaks when consumers fall behind. Reads from cold segments require disk IO and evict hot data from page cache. A single misbehaving consumer can degrade performance for all consumers on the same brokers.
Mitigations we have found useful:
Dedicated disks for log segments. Mixing Kafka logs with OS or other application data fragments the page cache and produces unpredictable IO patterns. Each broker has dedicated NVMe storage for log directories.
Multiple log directories per broker. Kafka supports striping partitions across multiple log directories. We use four NVMe devices per broker, with Kafka managing partition assignment. This parallelizes IO and reduces hot-spot impact.
Filesystem choice. XFS consistently outperforms ext4 for Kafka workloads in our testing, particularly for delete operations during log segment cleanup.
Tiered storage for cold data. Kafka 3.6 introduced tiered storage, which offloads older log segments to object storage. For topics with long retention requirements (regulatory data, audit logs), this dramatically reduces local disk requirements. We tier segments older than 7 days to S3.
The Compaction Trap
Log compaction is a useful feature for topics that represent state (current value per key) rather than events. It can also be a source of subtle bugs.
The trap: compaction processes segments in the background, but only after they roll over. A topic with low write volume and large segment size may have records that never compact because the active segment never rolls. We have seen compacted topics grow to surprising sizes because segment.bytes was set too high relative to the write rate.
Our guidance: for compacted topics, set segment.bytes and segment.ms such that segments roll over at least daily under expected load. Monitor the ratio of total log size to unique keys -- if it diverges significantly, compaction is not keeping up.
Producer Configurations That Matter
Producer configurations have outsized impact on both performance and reliability. The defaults are conservative, and most workloads benefit from tuning:
acks=all # Durability
enable.idempotence=true # Exactly-once semantics
max.in.flight.requests.per.connection=5
compression.type=zstd # Best compression ratio
batch.size=131072 # 128KB batches
linger.ms=10 # Wait briefly for batching
retries=2147483647 # Effectively infinite
delivery.timeout.ms=120000 # Bound total retry time
The combination of enable.idempotence=true and acks=all provides exactly-once semantics within a single producer session. The high retries with bounded delivery.timeout.ms handles transient broker issues without requiring application-level retry logic.
Compression is worth special attention. ZSTD provides significantly better compression than gzip or snappy at comparable CPU cost. Network and disk savings are substantial for text-heavy payloads (JSON, logs).
Multi-Region: Don't Unless You Must
Multi-region Kafka deployments are operationally expensive and architecturally complex. MirrorMaker 2 is the standard tool for cross-region replication, but it introduces lag, consumer offset translation challenges, and operational overhead.
Our position: keep Kafka clusters single-region whenever possible. Use application-level patterns (write to local region, read from local region) for geographic distribution. Reserve cross-region Kafka replication for genuine requirements like disaster recovery or data localization compliance.
When multi-region is required, treat MirrorMaker 2 as a first-class service with its own SLOs, alerting, and operational ownership. It is not a "set and forget" component.
The Operational Discipline
Kafka rewards operational discipline. Clusters that are carefully sized, with thoughtful topic design and monitoring, run for years with minimal intervention. Clusters that grow organically without governance accumulate problems that are expensive to fix. The discipline is not glamorous: capacity planning, topic naming conventions, retention policy enforcement, regular partition rebalancing. But it is what separates Kafka from being either reliable infrastructure or a source of recurring incidents.