How four everyday Mac Mini M4 machines reveal a clear latency-and-throughput gap
Modern AI teams need solutions that keep data on-prem, cut cloud spend, and still deliver accurate, low-latency responses. No matter your industry or use case—whether you’re powering a private patient-data copilot for a hospital or running a maintenance-manual assistant on edge devices—the core question is the same:
“Which platform actually delivers real-time performance on hardware I can buy (or already own) today?”
To answer that, we’re launching a public benchmark series. Each instalment measures webFrame against another framework on identical Apple-silicon hardware, using publicly available models and repeatable prompts.
For this first post we looked at a widely used open-source project that, like webFrame, distributes LLM inference across local Apple-silicon devices. (We’re keeping names out of the blog to stay focused on the data, but we’ll happily share full details with anyone who wants to reproduce the tests.)
webFrame is the webAI tool that lets you deploy large AI models across multiple machines on your own network.
Both frameworks were tested in a distributed setting on the same four-node Apple-silicon cluster, using the same prompts and quantised models. We captured two real-world metrics: Time to First Token (TTFT), how long the model takes to start responding, and tokens per second (tok/s), the sustained generation throughput.
Figures below use the fastest configuration the comparison project supports (Thunderbolt bridge).
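For readers who want to run a comparable measurement themselves, the sketch below shows one way to capture TTFT and tok/s against any OpenAI-compatible streaming endpoint. The endpoint URL, model name, and prompt are illustrative placeholders rather than webFrame's actual API, and treating each streamed chunk as one token is only an approximation.

```python
# Minimal sketch: measure TTFT and tok/s from an OpenAI-compatible streaming endpoint.
# The URL, model name, and prompt below are hypothetical placeholders.
import json
import time

import requests

ENDPOINT = "http://localhost:8080/v1/chat/completions"  # placeholder local endpoint
PAYLOAD = {
    "model": "example-70b-q4",  # placeholder quantised model identifier
    "messages": [{"role": "user", "content": "Summarise the maintenance procedure for unit 7."}],
    "stream": True,
    "max_tokens": 256,
}

start = time.perf_counter()
first_token_at = None
completion_tokens = 0

with requests.post(ENDPOINT, json=PAYLOAD, stream=True, timeout=300) as resp:
    resp.raise_for_status()
    for line in resp.iter_lines():
        # Server-sent events arrive as lines prefixed with "data: "
        if not line or not line.startswith(b"data: "):
            continue
        data = line[len(b"data: "):]
        if data == b"[DONE]":
            break
        chunk = json.loads(data)
        delta = chunk["choices"][0]["delta"].get("content")
        if delta:
            if first_token_at is None:
                first_token_at = time.perf_counter()  # Time to First Token
            completion_tokens += 1  # rough count: one streamed chunk ~ one token

end = time.perf_counter()
if first_token_at is None:
    raise RuntimeError("no tokens received from the endpoint")

ttft = first_token_at - start
tok_per_s = completion_tokens / (end - first_token_at)  # decode-phase throughput
print(f"TTFT: {ttft:.2f} s   throughput: {tok_per_s:.1f} tok/s")
```

Averaging several runs per prompt, and using the same prompts and quantisation on both frameworks, keeps the comparison repeatable.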
webFrame generates tokens significantly faster than the open-source baseline and, for the larger model, also starts responding sooner.
These results position webFrame as the clear leader in distributed LLM inference: we’re not aware of any framework, inside or outside the Apple ecosystem, that matches its TTFT or tok/s on an identical four-node Apple-silicon cluster.
Distributed inference on commodity Macs isn’t just possible; it’s fast. In this first public head-to-head, webFrame achieves lower latency and dramatically higher throughput than a leading open-source alternative on the same four-node cluster. If you’re evaluating on-prem LLM options, the data are clear: sharding with webFrame lets you keep data local without sacrificing speed or responsiveness.
Benchmark Series #2 is already in the works—stay tuned!