Inference Speed Lab

Cerebras speed demo

A practical timing model for how near-1,000-token-per-second inference changes the amount of drafting, checking, and repairing that can fit inside one human attention window.

Settings: 800 tokens, 55 t/s baseline, 1x

Cerebras class

981 t/s fast lane: 0.8s for 800 tokens

Adjustable baseline

Frontier reasoning lane, 55 t/s: 14.5s for 800 tokens
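
The two lane timings follow from dividing output tokens by throughput. A minimal Python sketch of that arithmetic, assuming steady decode rate is the only cost (time to first token, network latency, and queueing are ignored). The function name `generation_seconds` is illustrative and not part of the demo.

```python
def generation_seconds(output_tokens: int, tokens_per_second: float) -> float:
    """Seconds to emit output_tokens at a steady decode rate.

    Assumption: no time-to-first-token, network, or queueing delay.
    """
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return output_tokens / tokens_per_second


fast = generation_seconds(800, 981)     # fast lane from the demo
baseline = generation_seconds(800, 55)  # adjustable baseline from the demo

print(f"fast lane: {fast:.1f}s")             # fast lane: 0.8s
print(f"baseline:  {baseline:.1f}s")         # baseline:  14.5s
print(f"ratio:     {baseline / fast:.1f}x")  # ratio:     17.8x
```

Because the token count cancels out, the ratio depends only on the two throughputs, so moving the baseline slider changes the gap regardless of output length.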

      

Agent loop

The bigger unlock is repeated steps.

Eight model calls, 420 output tokens each, plus 0.35 seconds of non-model overhead per call. Same task shape, different amount of thinking before attention breaks.

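Design read: At these settings (981 t/s versus 55 t/s, calls run one after another), the loop takes about 6 seconds in the fast lane and about 64 seconds in the slower lane. The fast lane keeps the loop in the same interaction while the slower lane feels like a background job.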
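
The loop arithmetic can be sketched in a few lines of Python. This assumes the calls run sequentially and that each call costs its decode time plus the fixed overhead. `loop_seconds` is a hypothetical helper name, not something taken from the demo.

```python
def loop_seconds(
    calls: int,
    tokens_per_call: int,
    tokens_per_second: float,
    overhead_s: float,
) -> float:
    """Total wall time for a sequential agent loop.

    Assumption: every call emits the same number of tokens and pays the
    same non-model overhead (tool calls, network, parsing).
    """
    per_call = tokens_per_call / tokens_per_second + overhead_s
    return calls * per_call


fast = loop_seconds(calls=8, tokens_per_call=420, tokens_per_second=981, overhead_s=0.35)
slow = loop_seconds(calls=8, tokens_per_call=420, tokens_per_second=55, overhead_s=0.35)

print(f"fast lane loop: {fast:.1f}s")  # fast lane loop: 6.2s
print(f"slow lane loop: {slow:.1f}s")  # slow lane loop: 63.9s
```

In the fast lane the 0.35 seconds of overhead is nearly half of each call, so past this point trimming non-model overhead matters about as much as raising token throughput.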

Architecture

Fast inference changes what you bother building.

01

Buffer the UI

At very high token rates, rendering every chunk can lag behind the model. Batch display updates instead of treating every event as one token.
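
One way to batch display updates is to coalesce incoming chunks and hand them to the renderer at most once per interval. The sketch below is an assumption about how that could look, not the demo's implementation. `batch_chunks`, `min_interval_s`, and the injectable `clock` are hypothetical names.

```python
import time
from typing import Callable, Iterable, Iterator


def batch_chunks(
    chunks: Iterable[str],
    min_interval_s: float = 0.05,
    clock: Callable[[], float] = time.monotonic,
) -> Iterator[str]:
    """Coalesce streamed chunks into fewer, larger display updates.

    A batch is emitted when at least min_interval_s has passed since the
    previous emit. Whatever is left in the buffer is flushed at the end.
    """
    buffer: list[str] = []
    last_flush = clock()
    for chunk in chunks:
        buffer.append(chunk)
        now = clock()
        if now - last_flush >= min_interval_s:
            yield "".join(buffer)
            buffer.clear()
            last_flush = now
    if buffer:
        # Final flush so the tail of the response is never dropped.
        yield "".join(buffer)
```

A caller would loop over `batch_chunks(stream)` and render each yielded string once. The 50 ms default (roughly 20 updates per second) is an illustrative starting point to tune against the actual UI.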

02

Stream selectively

Short answers can return synchronously. Streaming still helps long outputs, but it stops being the default answer to every latency problem.
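
A hedged sketch of that decision in Python: stream only when the estimated generation time exceeds a synchronous budget. The one-second budget and the name `should_stream` are assumptions chosen for illustration. The token counts and rates reuse the demo's settings.

```python
def should_stream(
    expected_tokens: int,
    tokens_per_second: float,
    sync_budget_s: float = 1.0,
) -> bool:
    """Return True when the estimated decode time exceeds the budget.

    Assumption: expected_tokens is a rough upper estimate (for example the
    request's max output tokens) and decode rate dominates latency.
    """
    estimated_s = expected_tokens / tokens_per_second
    return estimated_s > sync_budget_s


print(should_stream(800, 981))   # False: about 0.8s, return synchronously
print(should_stream(800, 55))    # True: about 14.5s, stream
print(should_stream(4000, 981))  # True: about 4.1s, long outputs still stream
```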

03

Keep loops close

Read, plan, edit, test, and repair can fit into one request path for more tasks. That changes the product from "come back later" to "stay with it".
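
To see how many loop steps fit in one request path, the per-call cost can be divided into a time budget. The sketch below reuses the agent-loop settings (420 tokens and 0.35 seconds of overhead per call). The 10-second budget is a hypothetical stand-in for an attention window, not a figure from the demo.

```python
import math


def calls_within_budget(
    budget_s: float,
    tokens_per_call: int,
    tokens_per_second: float,
    overhead_s: float,
) -> int:
    """Number of whole sequential model calls that fit inside budget_s."""
    per_call = tokens_per_call / tokens_per_second + overhead_s
    return math.floor(budget_s / per_call)


BUDGET_S = 10.0  # assumption: illustrative attention window

print(calls_within_budget(BUDGET_S, 420, 981, 0.35))  # 12
print(calls_within_budget(BUDGET_S, 420, 55, 0.35))   # 1
```

Under these assumptions the fast lane has room for a full read, plan, edit, test, and repair cycle with retries, while the slower lane fits a single call.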

04

Route by job

Use the fast model for motion, then escalate to Claude or GPT-5.5 when judgement, reliability, or final review is worth the slower turn.
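
A minimal routing sketch, assuming jobs can be labelled up front. The job categories and the lane names `fast-lane` and `frontier-lane` are hypothetical placeholders, not real model identifiers or any provider's API.

```python
from enum import Enum


class Job(Enum):
    DRAFT = "draft"
    EDIT = "edit"
    TEST_REPAIR = "test_repair"
    JUDGEMENT = "judgement"
    FINAL_REVIEW = "final_review"


# Assumption: these job types are worth the slower turn.
ESCALATE = {Job.JUDGEMENT, Job.FINAL_REVIEW}


def route(job: Job) -> str:
    """Pick a lane: fast model for motion, frontier model for escalation."""
    return "frontier-lane" if job in ESCALATE else "fast-lane"


for job in Job:
    print(f"{job.value}: {route(job)}")
```

Keeping the escalation set as data rather than branching logic makes it easy to move a job type between lanes as reliability requirements change.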

Source frame

What the demo is grounded in.