Cerebras timing model

Inference Speed Lab

A practical timing model anchored to Cerebras' private enterprise-trial result of 981 tokens per second (t/s) on Kimi K2.6. OpenAI has separately announced GPT-5.6 Sol on Cerebras at up to 750 t/s for selected customers.

Output size 800 tokens

Comparison speed 55 t/s

Demo speed 1x

Cerebras Kimi K2.6

981 t/s fast lane

0.8s


Adjustable baseline

Frontier reasoning lane

14.5s

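The two lane timings above follow from dividing output size by token rate. A minimal sketch of that arithmetic, assuming generation time is the only cost (no time to first token, network latency, or queueing); the function name generation_seconds is illustrative and not part of the page:

```python
def generation_seconds(output_tokens: int, tokens_per_second: float) -> float:
    """Time to emit output_tokens at a steady rate (assumes no startup latency)."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return output_tokens / tokens_per_second


# Settings taken from the page: 800 output tokens, 981 t/s versus 55 t/s.
fast = generation_seconds(800, 981)
baseline = generation_seconds(800, 55)

print(f"fast lane: {fast:.1f}s")      # fast lane: 0.8s
print(f"baseline:  {baseline:.1f}s")  # baseline:  14.5s
```

Changing the comparison speed or the output size only changes the inputs; the model itself stays a single division.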

Agent loop

The bigger unlock is repeated steps.

Eight model calls of 420 output tokens each, plus 0.35 seconds of non-model overhead per call. The task shape is the same in both lanes; what changes is how much thinking fits in before the user's attention breaks.

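Design read: At the current settings (981 t/s against a 55 t/s baseline), the loop finishes in about 6 seconds in the fast lane and about 64 seconds in the slower lane. The fast lane keeps the loop in the same interaction, while the slower lane feels like a background job.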
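The loop timing can be reproduced with a short calculation. This is a sketch under the assumption that calls run strictly one after another and that each call costs generation time plus a fixed overhead; loop_seconds is a hypothetical helper name:

```python
def loop_seconds(
    calls: int,
    tokens_per_call: int,
    tokens_per_second: float,
    overhead_per_call_s: float,
) -> float:
    """Total wall time for sequential model calls with fixed per-call overhead."""
    per_call = tokens_per_call / tokens_per_second + overhead_per_call_s
    return calls * per_call


# Eight calls, 420 output tokens each, 0.35 s of non-model overhead per call.
fast = loop_seconds(8, 420, 981, 0.35)
slow = loop_seconds(8, 420, 55, 0.35)

print(f"fast lane: {fast:.1f}s")  # fast lane: 6.2s
print(f"baseline:  {slow:.1f}s")  # baseline:  63.9s
```

In the fast lane the fixed overhead (8 x 0.35 s = 2.8 s) is already close to half of the total, so beyond this point trimming non-model work matters about as much as raising the token rate.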

Architecture

Fast inference changes what you bother building.

Buffer the UI

At very high token rates, rendering every chunk can lag behind the model. Batch display updates rather than re-rendering on every stream event, and do not assume that each event carries exactly one token.
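One way to batch display updates is to collect incoming chunks and render them at most once per interval. The sketch below is illustrative only: RenderBatcher, its render callback, and the 50 ms default interval are assumptions, not something the page specifies.

```python
import time
from typing import Callable, List


class RenderBatcher:
    """Collects streamed text chunks and renders them at most once per interval."""

    def __init__(
        self,
        render: Callable[[str], None],
        min_interval_s: float = 0.05,  # assumed value; tune for the target UI
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._render = render
        self._min_interval_s = min_interval_s
        self._clock = clock
        self._pending: List[str] = []
        self._last_flush = clock()

    def push(self, chunk: str) -> None:
        """Queue a chunk (any number of tokens); render only if the interval elapsed."""
        self._pending.append(chunk)
        if self._clock() - self._last_flush >= self._min_interval_s:
            self.flush()

    def flush(self) -> None:
        """Render everything queued so far as a single update."""
        if self._pending:
            self._render("".join(self._pending))
            self._pending.clear()
        self._last_flush = self._clock()
```

Call push for each stream event and flush once when the stream ends, so the final partial batch is not lost. The clock parameter exists only to make the timing testable.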

Stream selectively

Short answers can return synchronously. Streaming still helps long outputs, but it stops being the default answer to every latency problem.
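The choice between streaming and a synchronous reply can be reduced to comparing the estimated generation time with how long a user will wait for a complete answer. A minimal sketch; should_stream and the 1.0 second default threshold are assumptions for illustration:

```python
def should_stream(
    expected_output_tokens: int,
    tokens_per_second: float,
    max_wait_s: float = 1.0,  # assumed tolerance for a blocking reply
) -> bool:
    """Stream only when the full answer would take longer than max_wait_s."""
    estimated_s = expected_output_tokens / tokens_per_second
    return estimated_s > max_wait_s


print(should_stream(800, 981))   # False: about 0.8 s, return synchronously
print(should_stream(800, 55))    # True: about 14.5 s, stream it
print(should_stream(4000, 981))  # True: long outputs still benefit
```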

Keep loops close

Read, plan, edit, test, and repair can fit into one request path for more tasks. That changes the product from "come back later" to "stay with it".

Route by job

Use the lowest-cost lane that is fast enough to keep the work moving, then escalate to high-effort GPT-5.6 Sol or another frontier model when judgement, reliability, or final review is worth the extra compute.
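The routing rule can be sketched as: pick the cheapest lane that meets the latency budget, unless the step is flagged as needing judgement, in which case pick from the high-effort lanes. Everything here is hypothetical, including the Lane type, the pick_lane function, and the relative_cost values, which are placeholders rather than real prices.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Lane:
    name: str
    tokens_per_second: float
    relative_cost: float       # placeholder unit, not a real price
    high_effort: bool = False  # marks lanes reserved for judgement or review


def pick_lane(
    lanes: Sequence[Lane],
    output_tokens: int,
    latency_budget_s: float,
    needs_judgement: bool = False,
) -> Lane:
    """Return the cheapest lane that fits the job, escalating when flagged."""
    if needs_judgement:
        candidates = [lane for lane in lanes if lane.high_effort]
    else:
        candidates = [
            lane for lane in lanes
            if output_tokens / lane.tokens_per_second <= latency_budget_s
        ]
    if not candidates:
        raise LookupError("no lane satisfies the request")
    return min(candidates, key=lambda lane: lane.relative_cost)


# Speeds reuse the page's 981 t/s and 55 t/s settings; costs are made up.
lanes = [
    Lane("fast", 981, relative_cost=1.0),
    Lane("frontier", 55, relative_cost=5.0, high_effort=True),
]

print(pick_lane(lanes, 420, latency_budget_s=2.0).name)                        # fast
print(pick_lane(lanes, 420, latency_budget_s=2.0, needs_judgement=True).name)  # frontier
```

Escalation here deliberately ignores the latency budget, on the assumption that a final review step is allowed to take longer than the interactive steps before it.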

Source frame