Benchmark Series #1 — webFrame vs. a popular open source project

Key Takeaways

  • webFrame leads on speed: In identical four-node tests, it generated roughly 2.5× more tokens per second on Llama-3 70B (and about 3.5× more on DeepSeek-Coder V2 Lite) while cutting first-token latency on the 70B model by about a third.
  • On-prem without trade-offs: Sub-2-second responses show you can keep sensitive data local, avoid cloud costs, and still deliver real-time user experiences.
  • Flexible networking: webFrame runs on Ethernet, Thunderbolt bridge, or ring setups, while the baseline project requires every node to see every other—limiting deployment options.
  • Broader benchmarks ahead: Next instalments will cover multimodal models, single-node versus cluster comparisons, and batch-inference scaling—suggestions welcome.
  • Commodity hardware is enough: Four everyday Mac Mini M4 machines are all it takes to reveal a clear latency-and-throughput gap.

    Why we’re publishing benchmarks

    Modern AI teams need solutions that keep data on-prem, cut cloud spend, and still deliver accurate, instantaneous responses. No matter your industry or use case—whether you’re powering a private patient-data copilot for a hospital or running a maintenance-manual assistant on edge devices—the core question is the same:

    “Which platform actually delivers real-time performance on hardware I can buy (or already own) today?”

    To answer that, we’re launching a public benchmark series. Each instalment measures webFrame against another framework on identical Apple-silicon hardware, using publicly available models and repeatable prompts.

    For this first post we looked at a widely used open-source project that, like webFrame, distributes LLM inference across local Apple-silicon devices. (We’re keeping names out of the blog to stay focused on the data, but we’ll happily share full details with anyone who wants to reproduce the tests.)

    What is webFrame?

    webFrame is the webAI tool that lets you deploy large AI models across multiple machines on your own network.

    • Runs big models on small clusters – webFrame breaks large models into shards that run across several local machines, which then cooperate as one (a generic sketch of this kind of layer sharding follows this list).

    • Navigator does the wiring – it connects your devices together and optimises the cluster for you.

    • Why it matters – you get sub-2-second first-token latency on state-of-the-art models without sending tokens—or data—to the cloud.
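
    To make the sharding idea concrete, here is a minimal, generic sketch of layer-wise (pipeline-parallel) partitioning across a few machines. It illustrates the general technique only (not webFrame's actual implementation or API); the node names and the shard_layers helper are hypothetical.

```python
# Generic illustration of layer-wise (pipeline) model sharding across local nodes.
# NOTE: hypothetical sketch; this is not webFrame's actual API or implementation.
from dataclasses import dataclass


@dataclass
class Shard:
    node: str       # address of the machine hosting this block of layers
    layers: range   # contiguous transformer layers assigned to that machine


def shard_layers(num_layers: int, nodes: list[str]) -> list[Shard]:
    """Split num_layers as evenly as possible across the given nodes."""
    per_node, remainder = divmod(num_layers, len(nodes))
    shards, start = [], 0
    for i, node in enumerate(nodes):
        count = per_node + (1 if i < remainder else 0)
        shards.append(Shard(node=node, layers=range(start, start + count)))
        start += count
    return shards


# Example: an 80-layer model (Llama-3 70B has 80 transformer layers)
# spread over four Mac Minis on the local network.
cluster = ["mini-1.local", "mini-2.local", "mini-3.local", "mini-4.local"]
for shard in shard_layers(80, cluster):
    print(f"{shard.node}: layers {shard.layers.start}-{shard.layers.stop - 1}")
```

    At inference time each machine would run only its block of layers and stream activations to the next one, which is why the link between nodes (Ethernet versus Thunderbolt) matters so much in distributed setups like these.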

    Test setup

    Both frameworks were tested in a distributed setting on the same four-node Apple-silicon cluster, using the same prompts and quantised models. We captured two real-world metrics: Time to First Token (TTFT) and tokens per second (tok/s).

    • Hardware – 4 × Mac Mini M4 Pro, 64 GB RAM, macOS 15.4.1
    • Network topologies – Ethernet mesh and Thunderbolt-bridge mesh (the open-source baseline can't form a Thunderbolt ring)
    • Models – Llama-3 70B Instruct (4-bit) and DeepSeek-Coder V2 Lite (4-bit)
    • Procedure – Warm-up prompt "What is the capital of France?" followed by the long prompt "Write an essay about cats." Metrics captured: Time to First Token (TTFT) and tokens per second (tok/s).
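
    For anyone reproducing the tests, both metrics can be captured with a few lines of timing code around any streaming client. The sketch below shows one way TTFT and tok/s can be computed from a token stream; the fake_stream generator stands in for whatever client yields tokens as they arrive from the inference server, and this is not the exact harness used for the numbers in this post.

```python
# Minimal sketch: computing TTFT and tok/s from a token stream.
# NOTE: fake_stream() is a stand-in for a real streaming client; this is not the
# exact benchmark harness used for the numbers in this post.
import time
from typing import Iterable, Tuple


def measure(stream: Iterable[str]) -> Tuple[float, float]:
    """Return (time to first token in seconds, tokens per second)."""
    start = time.perf_counter()
    first_token_at = None
    count = 0
    for _token in stream:
        if first_token_at is None:
            first_token_at = time.perf_counter()
        count += 1
    end = time.perf_counter()
    ttft = (first_token_at if first_token_at is not None else end) - start
    # Throughput is reported over the generation phase, i.e. after the first token.
    gen_time = end - first_token_at if first_token_at is not None else 0.0
    tok_per_s = (count - 1) / gen_time if count > 1 and gen_time > 0 else 0.0
    return ttft, tok_per_s


def fake_stream():
    """Simulate a model: 0.5 s of prompt processing, then 20 tokens at 20 tok/s."""
    time.sleep(0.5)
    for _ in range(20):
        time.sleep(0.05)
        yield "tok"


ttft, tps = measure(fake_stream())
print(f"TTFT: {ttft:.3f} s, throughput: {tps:.1f} tok/s")
```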

    Figures below use the fastest configuration the comparison project supports (Thunderbolt bridge).

    Results at a glance

    webFrame turns out tokens significantly faster—and, for the larger model, starts responding sooner—than the open-source baseline.

    Model                    TTFT (s), open-source   TTFT (s), webFrame   tok/s, open-source   tok/s, webFrame
    Llama-3 70B              2.732                   1.7868               2.342                5.8055
    DeepSeek-Coder V2 Lite   0.258                   0.2537               9.284                32.4747

    What the numbers mean

    • Higher throughput, lower wait time – On identical Macs, webFrame delivers ≈ 2.5 × more tok/s on Llama-3 70B and ≈ 3.5 × more on DeepSeek-Coder V2 Lite, while trimming ~35 % off first-token latency for the larger model (the short check after this list shows how those ratios follow from the table above).

    • Topology flexibility – The comparison project requires every node to see every other, so Thunderbolt ring mode is out; webFrame isn’t constrained by that networking rule.

    • Why it matters – Faster throughput means fewer machines (and less power) for the same workload, or more concurrent users per cluster. Lower TTFT keeps interactive apps feeling instant.
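
    As a sanity check, the ratios quoted above follow directly from the raw values in the results table; the snippet below simply recomputes them, so anyone can rerun it against their own measurements.

```python
# Recompute the speedup ratios quoted above from the values in the results table.
results = {
    # model: (TTFT open-source, TTFT webFrame, tok/s open-source, tok/s webFrame)
    "Llama-3 70B":            (2.732, 1.7868, 2.342, 5.8055),
    "DeepSeek-Coder V2 Lite": (0.258, 0.2537, 9.284, 32.4747),
}

for model, (ttft_os, ttft_wf, tps_os, tps_wf) in results.items():
    throughput_gain = tps_wf / tps_os               # ≈ 2.5x (70B), ≈ 3.5x (DeepSeek)
    ttft_reduction = (1 - ttft_wf / ttft_os) * 100  # ≈ 35 % (70B), ≈ 2 % (DeepSeek)
    print(f"{model}: {throughput_gain:.1f}x tok/s, {ttft_reduction:.0f}% lower TTFT")
```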

    These results position webFrame as the clear leader in distributed LLM inference—we're not aware of any framework, from Apple or elsewhere, that matches its TTFT or tok/s on an identical four-node Apple-silicon cluster.

    Where we’re headed next

    • More models – Multimodal and batch-inference tests are on the way so results mirror real production mixes.

    • Monolith and distributed views – Future posts will show how the same models perform on a single Mac versus clusters of various sizes, so you can weigh simplicity against raw speed.

    • Streamlined tooling – We’re tightening our internal benchmark process to add new models and frameworks faster.

    • Your input welcome – Have a framework or model you’d like to see in a head-to-head? Hit us up on X, LinkedIn, or email and let us know.

    Takeaway for builders

    Distributed inference on commodity Macs isn’t just possible—it’s fast. In this first public head-to-head, webFrame achieves lower latency and dramatically higher throughput than a leading open-source alternative on the same four-node cluster. If you’re evaluating on-prem LLM options, the data are clear: sharding with webFrame lets you keep data local without sacrificing speed and responsiveness.

    Benchmark Series #2 is already in the works—stay tuned!
