Stateful stream processing for Kafka, built to scale up — not out.
Cut the distribution tax: fewer moving parts, simpler operations, more throughput.
StoatFlow is a JVM stream processing library with the Kafka Streams DSL, deployed as a single replica. Built on modern Java 25 — virtual threads, structured concurrency, FFM. Lane parallelism no longer bounded by partition count. Exactly-once semantics, no rebalancing, less serialization, no state migration. Trades horizontal scaling for functional and operational simplicity — better performance at lower cost.
What we set out to fix
Today's choices for stream processing on the JVM are Kafka Streams and Apache Flink — innovative projects that shaped a decade of real-time systems. Both are built to scale out, distributing work across a cluster of instances — an architecture that carries a heavy, permanent complexity. We call this the distribution tax: the cost for scale-out you don't need — paid in infrastructure, staff time, and performance.
The problem
Hard to build. Harder to run.
Stateful joins, exactly-once, watermarks on out-of-order streams, sub-second latency — each a deep practice. Production adds rebalance storms, restart loops, checkpoint failures, state migrations that miss SLAs.
Every layer is a decision.
Plain Java wiring or a micro-framework? Which Kafka client knobs to tune — and what about RocksDB? Deploy on Kubernetes without downtime? Avoid or mitigate rebalances? StatefulSets, PVs, static group membership and/or standby replica?
Most workloads don't need to scale out.
Kafka Streams and Flink scale horizontally — that's where their architectural and operational complexity comes from. Workloads that fit on one modern machine pay that tax for capacity they'll never use.
Our response
Exceptional DX, modern stack, rich tooling
Kotlin and Java APIs on JDK 25 — virtual threads, structured concurrency, FFM. Gradle/Maven plugin, TopologyTestDriver, batteries-included runtime.
Fast & efficient stream processing
Single-replica model, virtual-thread parallelism, in-memory repartitioning — measurably lower CPU, memory, and latency on the same hardware.
Simple & reliable operations
No rebalancing, no state migration. Health checks, metrics, debug endpoints out of the box. 400 ms / 800–1200 ms restarts on benchmarked hardware.
New capabilities
Flink-style watermarks and timers, scheduled sources, side outputs — with more on the roadmap.
Single-instance architecture
A new approach to stream processing
StoatFlow is a single-instance stream-processing runtime. Parallel processing on modern JVM primitives, designed to scale vertically.

Architectural mechanics
Distinctive capabilities under the hood — what makes the single-replica model work in production.
Key-Affinity Lanes
Records with the same key always route to the same virtual-thread lane via consistent hashing. Per-key ordering is guaranteed. Lane count is decoupled from Kafka partition count — parallelism scales with cores, not partitions.
Commit Barriers
Periodic barriers (lightweight markers) sweep through every lane. When all lanes align on a barrier, changelogs and sink records commit atomically with consumer offsets in one Kafka transaction for exactly-once semantics.
Global State
Backed by RocksDB, or in-memory stores, all state is accessible from any virtual thread, by any key. Layered with epoch buffers acting as caches, it allows thread-safe access throughout.
Virtual-Thread Parallelism
Project Loom virtual threads on JDK 25. Hundreds to thousands of concurrent lanes on a single JVM. External I/O (REST calls, database queries, AI inference) runs naturally on virtual threads, no async-framework ceremony.
In-Memory Repartitioning
Key-changing operations (groupBy, fk joins, ...) route records through in-memory queues to new lanes. No repartition topics, no broker round-trips, no extra serialization, no network IO.
Thread-safe by design
Atomic read-modify-write on every store, mostly lock-free inter-thread message passing, plus a KeyLockManager utility for multi-key / multi-store atomic operations in custom Processors.
The full stream processing surface
Compute on continuous event streams — typically Kafka topics — split between stateless transformations (map, filter, route), stateful operations (joins, aggregations, sessions) over time, and async IO (enrichment, RPC). StoatFlow is built on these primitives.
Primitives
Streams / records
Continuous sequences of events flowing through the application.
Sources & sinks
Connectors that bring events in from and push results out to external systems (Kafka topics, etc.).
Operators / processors
Functions that transform, filter, enrich, aggregate, join, or route events.
Keys / partitioning
Groups related events together, ensures key-based ordering, determines where state lives.
State stores
Durable local or managed state used for counts, joins, aggregations, deduplication, session tracking.
Time / windows / timers
Primitives for reasoning about event time, late data, periodic actions, and bounded state.
What it solves
Stateful processing
Keeping, mutating, and reasoning about state across an unbounded stream — counts, joins, sessions, materialised views.
Fault tolerance & recovery
Surviving crashes, broker restarts, and node failures without data loss. State rebuilds correctly from durable logs.
Consistency & exactly-once semantics
Producing each output exactly once across crashes and retries. Atomic commit of state, sinks, and offsets.
Time handling & out-of-order data
Reasoning about event time vs. processing time. Watermarks, windows, and policies for late-arriving records.
Scaling & parallel execution
Spreading work across many threads, cores, or machines while preserving per-key ordering and correctness guarantees.
Operational model & runtime coordination
Deployment, restarts, rebalancing, health, metrics — the moving parts required to run a stream processing app in production.
Faster, leaner, on the same machine.
StoatFlow vs. Kafka Streams across CPU, memory, latency, and state restoration —
single 8-vCPU machine, EOS¹ and ALO² modes.
¹ EOS — Exactly-once semantics·
² ALO — At-least-once semantics
Hetzner ccx33 · 8 vCPU · 32 GB · 2026-04-22
CPU utilisation
Stateless (simple)
parity
Word count
27% less
Stateless (advanced)
2.3× less
Stateful joins
1.5× less
Container memory
Stateless (simple)
17% more
Word count
31% less
Stateless (advanced)
2.4× less
Stateful joins
7.8× less
True E2E P95 latency
Stateless (simple)
17% higher
Word count
3% lower
Stateless (advanced)
1.9× lower
Stateful joins
13.6× lower
State restoration
Restoration time
1.45× faster
On-disk state
35% less
Throughput parity throughout. Lower is better.
Curious what's behind each scenario? The full report walks every benchmark — topology, infrastructure stack, serdes (String / Avro / Protobuf), load rates, and event size distributions — so you can compare against the workloads you actually run.
100% Kafka Streams DSL and beyond
Drop into the DSL you already use — with the primitives Kafka Streams doesn't ship.
Same DSL, drop in
- KStream, KTable, joins (PK, FK, window)
- count, reduce, aggregate, cogroup
- Tumbling, hopping, session windows + grace + suppression
- Versioned state stores
- Processor API
- Interactive queries
Existing topologies port with a dependency swap and a config cleanup.
Beyond Kafka Streams
- Timers (event-time + processing-time)
- Watermarks + idleness alignment
- Scheduled sources (interval, cron)
- Async I/O for external calls (REST, DB, AI)
- Atomic store operations (compute, merge)
- KeyLockManager — multi-key atomic sections
Side outputs + hot-standby HA on the roadmap.
Features beyond Kafka Streams
On top of the DSL — Flink-inspired primitives and StoatFlow-native utilities.
Event-time & wall-clock timers in any Processor
Per-key event-time or processing-time timers from any Processor. Each timer fires in the same lane as the key's records, serialised with process() — so onTimer has full read/write state access without locks.
Common use: retry a failed enrichment after a delay, expire idle keys, time out incomplete sessions.
class DelayedEnrich implements Processor<String, Order, String, EnrichedOrder> {
private static final long RETRY_MS = 10_000;
private ProcessorContext<String, EnrichedOrder> ctx;
private TimerService<String> timers;
private KeyValueStore<String, Order> pending;
private ReadOnlyKeyValueStore<String, Customer> customers;
@Override public void init(ProcessorContext<String, EnrichedOrder> ctx) {
this.ctx = ctx;
this.timers = ctx.timerService();
this.pending = ctx.getStateStore("pending");
this.customers = ctx.getStateStore("customers");
}
@Override public void process(Record<String, Order> record) {
Customer c = customers.get(record.value().customerId());
if (c != null) {
ctx.forward(new Record<>(record.key(), enrich(record.value(), c), record.timestamp()));
return;
}
// Cache miss — stash the order and retry 10s later
pending.put(record.key(), record.value());
timers.registerProcessingTimeTimer(record.key(), timers.currentProcessingTime() + RETRY_MS);
}
@Override public void onTimer(long ts, String key, TimerContext<String, EnrichedOrder> tctx) {
Order order = pending.delete(key);
Customer c = customers.get(order.customerId());
tctx.forward(key, c != null ? enrich(order, c) : fallback(order));
}
}
// Wire into the topology — pass the store names the Processor declares
orders.process(DelayedEnrich::new, "pending", "customers");Runtime, plugins & testing
Most stream-processing libraries hand you an engine and stop. StoatFlow goes further — the scaffolding around it ships as part of the product, not left as homework for every team to re-do.
Batteries-included micro-runtime
Health checks, metrics, debug endpoints, and config defaults distilled from years of running Kafka apps in production — already wired in.
Gradle & Maven plugins
The Gradle convention plugin ships application + shadow + JDK 25 toolchain. Optional Docker build via Jib; optional GraalVM native image — opt in with a flag.
TopologyTestDriver
In-memory, broker-free test driver — familiar to Kafka Streams users — extended for timers, scheduled sources, and watermark control.
Why StoatFlow?
Engineering choices grounded in years of running stream processing applications in production — and the practitioner background behind every architectural decision.
Better performance at lower costs
Benchmarked stateful workloads vs Kafka Streams — same hardware, throughput parity:
- 4.4–7.8× less container memory
- 1.5–3.4× less CPU
One single-instance deployment per app — no cluster, no requested by unterutilised k8s resources, no standby replicas. The cloud bill follows the resource curve.
Ship without specialist expertise
Production defaults baked in. Health checks, metrics, debug endpoints, Kubernetes probes — out of the box.
Consumer-group sizing, rebalance tuning, standby-replica strategy, checkpoint configuration — decisions you don't have to make.
Existing JVM teams productionize stream processing without first becoming Kafka, Kafka Streams, or Flink specialists.

Built by Kafka practitioners
— Hartmut Armbruster
Speaker
3× Current — the leading Apache Kafka conference (Austin 2024 · London 2025 + 2026). Also Berlin Buzzwords (2026), WeAreDevelopers World Congress (Berlin 2024 + 2026), Big Data Conference Europe (Vilnius 2024).
Author
Kafka Streams Topology Design — an open standard for visualising and documenting Kafka Streams application architectures.
Industry
18 years engineering, with Kafka in production at HSBC, NEX (CME Group), Raiffeisen Switzerland, and Deutsche Bahn.
Frequently asked questions
Your topology code ports over — it's a dependency swap and a config cleanup, not a rewrite. What changes is the runtime: single replica per app instead of multiple instances. From that one architectural change, you get measurably lower latency on stateful workloads, better resource utilisation (one machine often does the work of several KS instances — directly reducing infrastructure costs), and simpler operations (no rebalancing during deploys, no state migration, no standby-replica sizing). Plus extensions KS doesn't ship: Flink-style watermarks and timers, scheduled sources, side outputs, with more on the roadmap.
StoatFlow lets you ship without first becoming a Kafka expert. Production-grade config defaults are baked into the runtime; consumer-group sizing, partition counting, standby-replica strategy, and rebalancing tuning are decisions you don't have to make. The runtime ships with health checks, metrics, and debug endpoints out of the box — you spend your engineering time writing topology code, not building scaffolding.
Flink is the primary choice for non-Kafka sources and sinks, analytics, ML, or unified streaming and batch workloads. For Kafka-native stream processing applications, Flink's cluster overhead (JobManager, TaskManagers, checkpoint tuning) or a managed service's pricing usually dominates the cost structure — if your sustained throughput lives under the StoatFlow ceiling, Flink is usually the wrong trade.
Source escrow. Third-party escrow service holds the source code; it's released to active licensees under a limited maintenance license if StoatFlow ceases operations. Contractually guaranteed in EULA §11. Enterprise contracts can negotiate broader escrow terms.
Running Kafka Streams with multiple replicas doesn't actually make your app highly available — it just spreads the failure modes. StoatFlow takes a different path. Strong fault tolerance: error-handling primitives so individual record failures don't take the app down; state stores with fast incremental restore after crash. Fast in-place upgrades: start-to-process is 400–1000 ms, with zero rebalancing, session timeouts, or state migration. Restoration when needed: 1.45–1.65× faster than Kafka Streams on the same hardware. On the roadmap: hot-standby mode for blue-green deployments — even shorter downtime, seamless failover.
StoatFlow scales — vertically, not horizontally. On a modern single machine, network bandwidth is usually the practical ceiling, not CPU or memory: benchmarks on a Hetzner 8-core VM saw a 200-300 MB/s uncompressed throughput limit before network saturated — in events, ~124K/sec on a 1KB stateless transform, up to ~2.1M/sec output on word-count-style aggregation. High-end infrastructure (96+ cores, faster NICs) hasn't been benchmarked yet — actual headroom there is open. Lane parallelism is decoupled from Kafka partition count, so adding cores translates directly to throughput — and the more complex your stateful topology, the more resources you save vs Kafka Streams (no repartition topics, less serialization, less network I/O).
StoatFlow implements the Kafka Streams DSL. Your StreamsBuilder, your KStream/KTable operations, your Materialized stores all work — what changes is the runtime. Code migration is a dependency swap, an import change, and removing multi-instance configuration (standby replicas, thread counts). For state, the recommended safe path is to reprocess your input topics — direct restoration from existing Kafka Streams changelog topics isn't supported. If you have state that can no longer be reprocessed (source data aged out by retention, lossy transformations upstream), reach out — we'll work through the migration with you.
Stateful stream processing for Kafka, built to scale up — not out.
Start your 30-day free trial after a quick intro call. Full features, no credit card, no commitment. Migrate your first topology in an afternoon.