Benchmarks

StoatFlow vs Kafka Streams 4.1 — four representative workloads, EOS and ALO modes, plus state restoration. Run 2026-04-22 on Hetzner ccx33 (8 vCPU / 32 GB RAM).

These numbers compare a StoatFlow build (EOS Phase 6b, ALO Phase 8) against a stable Apache Kafka Streams 4.1.1 baseline on identical hardware, identical broker, identical workload. The intent is to be useful, not promotional — full topology, load profile, methodology, and the cases where StoatFlow is only at parity or measurably behind Kafka Streams are all in scope below.

All runs originate from the internal benchmark report dated 2026-04-22. The numbers below transcribe that report verbatim.

Setup

Hardware

ComponentValue
Benchmark app nodeHetzner ccx33 — 8 dedicated vCPU, 32 GB RAM
Container limits (app)8 CPU, 16 GB memory
Kafka brokers3× Hetzner cpx32 — 4 shared vCPU, 8 GB RAM, 160 GB SSD each — co-located with the k3s control plane and the monitoring stack (Prometheus, Grafana)
Regionnbg1 (Nuremberg)
Orchestrationk3s on Hetzner Cloud

Kafka & runtime

ComponentValue
Kafka broker4.1.1 (Strimzi 0.50.0 operator)
JDK25 (preview features enabled; required for StoatFlow's virtual-thread runtime)
StoatFlow EOS commit barriercommit-barrier.min-interval-ms=130, adaptive scheduling (post-commit feedback clamps barrier-to-barrier interval between configured min and max)
Kafka Streams EOS commit interval100 ms (default commit.interval.ms)

What's controlled, what's varied

Topology, input data, load profile, partition count, broker, hardware, and run duration are identical across the two runtimes for any given comparison. The only variable is the runtime itself (StoatFlow vs Kafka Streams) and the processing mode (EOS vs ALO).

Each run executes for 20 minutes total: a 10-minute warmup (JIT, GC stabilisation, lag catch-up) followed by a 10-minute measurement window at steady state. Topics are recreated between runs; state is empty at start.

Methodology

Latency

True E2E P50 / P95 / P99 is measured from the downstream-consumer's point of view — the latency a real service reading the output topic would actually observe. A dedicated latency-consumer pod simulates that downstream application: it subscribes to the output topic with isolation.level=read_committed (so a record only becomes visible after the producing transaction commits) and, for every consumed record, computes consumption time − record timestamp. Percentiles are derived over the entire measurement window, not from averages.

The read_committed isolation is deliberate. Under EOS, downstream consumers can't see records before the producing transaction commits — which means the runtime's commit cadence is part of the user-observable latency (StoatFlow's 130 ms minimum barrier interval — adaptively scheduled — and Kafka Streams' default 100 ms commit.interval.ms). An internal produce-time measurement would hide that cost.

The latency-consumer adds ~10–15 % measurement overhead through the extra consume-and-timestamp cycle. Percentile relative comparisons between runtimes hold; absolute latency on a runtime without a latency-consumer would be slightly lower.

Resource utilisation

CPU, JVM heap, container working set, GC pause durations, and thread counts are sampled from Prometheus at 10-second resolution and averaged over the measurement window. GC pause is the per-event JVM Stop-the-World duration, not the cumulative GC time.

Repeatability

Throughput is defined and controlled by the load generators — produced rec/s mirrors the configured input rate, predictable and repeatable across runs. Latency, CPU, memory, and GC numbers in the result tables are sampled from each run and should be read as point estimates rather than statistical means.

Scenario A — Stateless (simple)

Topology

A minimal stateless pipeline: one source topic → mapValues (uppercase) → filter (length > 5) → one sink topic. No state, no repartitioning, no joins. The simplest possible topology — used to measure pure framework overhead.

Data & load profile

PropertyValue
SerdesString / String
Record size~1,024 bytes (fixed)
Partitions12
Distinct keys10,000
Load20,000 msg/s
Total records / run~12.3M

Results — EOS

MetricKSStoatFlowΔNotes
True E2E P50 (ms)100.2124.0+24%KS wins
True E2E P95 (ms)163.7191.8+17%KS wins
True E2E P99 (ms)180.0201.9+12%KS wins
Avg CPU (cores)1.31.30%parity
Avg heap (MB)9362−33%
Avg container mem (MB)432504+17%KS uses less
Avg GC pause (ms)4.33.1−28%
KPI snapshot — Stateless (simple) · EOS · KS = baseline

True E2E P95 latency

17% higher

KS
SF

CPU utilisation

parity

KS
SF

Container memory

17% more

KS
SF

This is the scenario where StoatFlow's EOS commit cadence is most exposed: with no state store cost and trivial per-record work, StoatFlow's 130 ms minimum barrier interval runs a touch slower than KS's 100 ms commit.interval.ms. The ~30 ms cadence delta accounts for essentially all of the observed ~24 ms True E2E P50 gap. ALO drops the barrier cadence entirely and the gap inverts (see below).

Results — ALO

MetricKSStoatFlowΔNotes
True E2E P50 (ms)45.341.9−8%
True E2E P95 (ms)65.361.5−6%
True E2E P99 (ms)75.071.6−5%
Avg CPU (cores)1.21.20%parity
Avg heap (MB)97197+103%KS uses less
Avg container mem (MB)430657+53%KS uses less
Avg GC pause (ms)3.81.3−66%
KPI snapshot — Stateless (simple) · ALO · KS = baseline

True E2E P95 latency

6% lower

KS
SF

CPU utilisation

parity

KS
SF

Container memory

53% more

KS
SF

In ALO, StoatFlow recovers a small latency lead but uses substantially more container memory on this scenario — the in-memory buffer and virtual-thread stack overhead dominate when there's no state store to amortise over. The GC pause delta (−66 %) goes the other way.

Scenario B — Word count

Topology

Canonical stateful aggregation: source → flatMapValues (tokenise on whitespace) → groupBy(word)count (RocksDB) → toStream → output. Implicit repartitioning between groupBy and count: Kafka Streams writes to a Kafka-hosted repartition topic; StoatFlow re-keys in memory.

Data & load profile

PropertyValue
SerdesString / Long
Record size~200 bytes avg (range 50 – 1,000)
Partitions12
Distinct word keys~1,000
Load7,500 msg/s
Total records / run~4.6M
State1 RocksDB store (word counts)

Results — EOS

MetricKSStoatFlowΔNotes
True E2E P50 (ms)155.3120.7−22%
True E2E P95 (ms)197.4191.9−3%parity
True E2E P99 (ms)215.0201.7−6%
Avg CPU (cores)3.02.2−27%
Avg heap (MB)724406−44%
Avg container mem (MB)1,7691,219−31%
Avg GC pause (ms)6.41.1−83%
KPI snapshot — Word count · EOS · KS = baseline

True E2E P95 latency

3% lower

KS
SF

CPU utilisation

27% less

KS
SF

Container memory

31% less

KS
SF

Results — ALO

MetricKSStoatFlowΔNotes
True E2E P50 (ms)39.839.4−1%parity
True E2E P95 (ms)65.054.3−16%
True E2E P99 (ms)72.462.3−14%
Avg CPU (cores)3.51.8−49%
Avg heap (MB)759771+2%parity
Avg container mem (MB)2,2782,323+2%parity
Avg GC pause (ms)3.10.7−77%
KPI snapshot — Word count · ALO · KS = baseline

True E2E P95 latency

1.2× lower

KS
SF

CPU utilisation

1.9× less

KS
SF

Container memory

parity

KS
SF

Scenario C — Stateless (advanced)

Topology

Two independent processing paths over four Protobuf input streams:

  • Path A: user-events + product-updates → merge → flatMap (tag expansion, 1 → N fan-out) → repartition → split by priority → 3 output topics.
  • Path B: orders + click-streams → filter/transform → merge → repartition by customerId → 1 output topic.

Two explicit repartitions, four inputs, four outputs, no state.

Data & load profile

PropertyValue
SerdesProtobuf (all topics)
Record sizeuser-events / product-updates 2 – 5 KB · orders 8 – 15 KB · click-streams 15 – 30 KB
Partitions12 per topic
Load2,100 + 1,400 + 700 + 2,800 = 7,000 msg/s total
Total records / run~4.3M

Results — EOS

MetricKSStoatFlowΔNotes
True E2E P50 (ms)217.6116.4−47%
True E2E P95 (ms)338.4181.0−47%
True E2E P99 (ms)413.9196.5−53%
Avg CPU (cores)2.31.0−57%
Avg heap (MB)480141−71%
Avg container mem (MB)1,252518−59%
Avg GC pause (ms)1.72.9+71%KS wins
KPI snapshot — Stateless (advanced) · EOS · KS = baseline

True E2E P95 latency

1.9× lower

KS
SF

CPU utilisation

2.3× less

KS
SF

Container memory

2.4× less

KS
SF

Results — ALO

MetricKSStoatFlowΔNotes
True E2E P50 (ms)50.337.3−26%
True E2E P95 (ms)79.052.0−34%
True E2E P99 (ms)91.860.8−34%
Avg CPU (cores)2.61.1−58%
Avg heap (MB)810128−84%
Avg container mem (MB)1,654506−69%
Avg GC pause (ms)1.12.6+136%KS wins
KPI snapshot — Stateless (advanced) · ALO · KS = baseline

True E2E P95 latency

1.5× lower

KS
SF

CPU utilisation

2.4× less

KS
SF

Container memory

3.3× less

KS
SF

Scenario D — Stateful joins

Topology

Five-table enrichment chain: customers + products + orders + shipments + reviews. One primary-key join (ordersshipments) plus four foreign-key joins. Output: fully enriched record per order. Ten-plus RocksDB state stores (one per input table plus join intermediates). Avro with Schema Registry.

Data & load profile

PropertyValue
SerdesAvro + Confluent Schema Registry
Record sizecustomers / products 2 – 5 KB · orders 10 – 50 KB · shipments 5.5 – 27 KB · reviews 5.5 – 20 KB
Partitions12 per input topic
Load12 + 18 + 120 + 90 + 60 = 300 msg/s total
Total records / run~184 K
State10+ RocksDB stores

Results — EOS

MetricKSStoatFlowΔNotes
True E2E P50 (ms)1,594.5144.4−91%
True E2E P95 (ms)2,914.9214.2−93%
True E2E P99 (ms)3,297.9242.8−93%
Avg CPU (cores)4.32.9−33%
Avg heap (MB)1,775194−89%
Avg container mem (MB)8,6671,105−87%
Avg GC pause (ms)13.78.2−40%
KPI snapshot — Stateful joins · EOS · KS = baseline

True E2E P95 latency

13.6× lower

KS
SF

CPU utilisation

1.5× less

KS
SF

Container memory

7.8× less

KS
SF

Results — ALO

MetricKSStoatFlowΔNotes
True E2E P50 (ms)267.939.8−85%
True E2E P95 (ms)562.955.0−90%
True E2E P99 (ms)703.863.0−91%
Avg CPU (cores)6.82.0−71%
Avg heap (MB)1,301485−63%
Avg container mem (MB)7,9321,802−77%
Avg GC pause (ms)14.68.0−45%
KPI snapshot — Stateful joins · ALO · KS = baseline

True E2E P95 latency

10.2× lower

KS
SF

CPU utilisation

3.4× less

KS
SF

Container memory

4.4× less

KS
SF

State restoration

Separate benchmark — cold-start recovery from pre-populated changelog topics, no live processing. Measures the wall-clock time and resource cost of materialising a full state set.

Setup

PropertyValue
Stores16 KTable source stores
Partitions per store6
Total records~155.8M
Raw data~16.8 GB
SerdesMixed Avro / String

StoatFlow offers two restoration modes:

  • WriteBatch HIGH_PERF — RocksDB WriteBatch ingestion. Lower memory overhead, recommended default.
  • SST HP buf128 — direct SST file ingestion with a 128 MB per-worker sort buffer. Lower wall-clock time, substantially higher memory ceiling.

Results

MetricKSStoatFlow · WriteBatchStoatFlow · SST
Total time (s)183127 (1.45× faster)111 (1.65× faster)
Aggregate throughput (M rec/s)0.871.231.41
Avg CPU (cores)6.886.177.79
Avg container memory (MB)5,6164,69018,105
On-disk RocksDB (GB)16.510.810.2
State restoration — KS · StoatFlow WriteBatch · StoatFlow SST · 16 stores, 155.8M records, 16.8 GB raw

Total restoration time

KS
183 s
SF · WriteBatch
127 s1.45× faster
SF · SST ingestion
111 s1.65× faster

On-disk RocksDB footprint

KS
16.5 GB
SF · WriteBatch
10.8 GB35% less
SF · SST ingestion
10.2 GB38% less

Avg container memory

KS
5.6 GB
SF · WriteBatch
4.7 GB16% less
SF · SST ingestion
18.1 GB3.2× more

When to use which mode

  • WriteBatch (default) — broadly applicable. Wall-clock 1.45× faster than KS, lower memory than KS, smaller on-disk footprint. The right starting point unless restoration time is the binding production constraint.
  • SST ingestion — when first-byte-after-restart latency matters and the deployment has memory headroom. ~3.2× the container memory of WriteBatch during the restoration window; the same memory profile during steady-state.

Where StoatFlow is at parity or behind

A short list of cases where the numbers don't favour StoatFlow — included so the page reads as data, not as advocacy.

EOS latency on the simplest topology

On Stateless (simple) in EOS mode, Kafka Streams beats StoatFlow on every latency percentile: P50 by 24 %, P95 by 17 %, P99 by 12 %. Root cause: StoatFlow's commit pipeline runs at a 130 ms minimum barrier interval (commit-barrier.min-interval-ms; the adaptive scheduler clamps barrier-to-barrier between the configured min and max via post-commit feedback). KS's default commit.interval.ms=100 produces a slightly tighter cadence on workloads where the commit itself dominates the latency budget. The ~30 ms cadence delta accounts for essentially the entire ~24 ms P50 gap. In ALO mode (no transactions, no barrier cadence) StoatFlow recovers and shows a small latency lead, so the gap is specific to EOS-on-simple-topologies. The minimum barrier interval is a tunable; lowering it further narrows the EOS gap at the cost of commit-pipeline throughput.

Container memory on Stateless (simple) ALO

In ALO mode on the simplest topology, StoatFlow uses +53 % container memory (657 MB vs 430 MB). GC pause time is 2.9× lower in the same comparison — a different point on the same trade-off — but the working-set figure is larger, and a memory-constrained deployment should know that. The delta does not appear on more substantial workloads, where state-store memory dominates.

GC pause on Stateless (advanced)

GC pause averages run +71 % in EOS and +136 % in ALO vs Kafka Streams on this scenario (2.9 ms vs 1.7 ms; 2.6 ms vs 1.1 ms). Absolute pause times remain under 3 ms in both runtimes, so the relative figure overstates the practical impact, but it's a genuine regression on this workload shape (high-fanout flatMap on relatively large records).

Not yet benchmarked

The following are explicitly out of the current measurement set. Numbers here are not just unfavourable — they are absent.

  • Scaling beyond 8 vCPU. All numbers above are on a single Hetzner ccx33. Larger boxes (16+ vCPU, 64+ GB) are extrapolated from microbenchmarks of the lane dispatcher and have not been validated end-to-end at scale.
  • Custom processor DAGs with deep fan-out. Only the four scenarios above were measured. Topologies with many sub-topologies (10+) may behave differently.

StoatFlow at the time of these runs is in active development (EOS Phase 6b, ALO Phase 8); Kafka Streams 4.1.1 is a stable production release. Later StoatFlow phases may improve or regress against these numbers.

What's next

Two benchmark expansions are planned for upcoming runs.

Higher-load runs on AWS EKS

16-core compute-optimized nodes for both the application tier and the Kafka broker tier, with substantially higher network and disk-IO bandwidth than the current Hetzner ccx33 baseline. Goal: stress sustained throughput beyond the single-machine measurements above, validate the vertical-scale assumption past 8 vCPU end-to-end (currently extrapolated from microbenchmarks), and push past the ~380 MB/s broker-fetch ceiling that bounds State restoration today.

Additional benchmark apps

Coverage of all window types (tumbling, hopping, sliding, session) and stream-stream joins — neither is exercised today; the current set covers stateless transforms, key changes, aggregations, and primary-key + foreign-key table joins. Plus more complex topologies modelling real-world use cases, driven by production-like test-data workloads — record-size distributions, key cardinality, arrival patterns, and foreign-key relationships that actually resolve at join time (not random keys that mostly miss) — instead of the synthetic load profiles used in the current runs.