You Shipped an AI Feature. You Forgot to Ship the Speed.

The feature works. The LLM returns the right answer. The prompt is well-engineered. You deploy it, share the link with your team, and someone opens it on their phone. They type a question. They wait. A spinner rotates for eleven seconds. Then the full response appears at once.

They close the tab.

Not because the answer was wrong. Because the experience felt broken. In 2025, any interface that makes a user wait more than two seconds for visible feedback has already failed — regardless of how good the eventual output is.

This is the gap between shipping an AI feature and shipping an AI experience. The feature is the model. The experience is the architecture around it.


The Latency Problem Is Not Where You Think It Is

The instinct when an AI feature feels slow is to blame the model. Switch to a faster model, reduce max tokens, optimise the prompt. These help at the margins. They do not address the core problem.

In a typical production GenAI application, the latency a user experiences is composed of four distinct segments — and most developers optimise for the wrong one.

Time to First Token (TTFT) is the time from request submission to the first token appearing on screen. This is what the user feels as "waiting." Everything before the first character appears is dead time.

Token generation rate is how fast tokens arrive after the first one — the typewriter speed. Determined primarily by model speed and response length.

Total generation time is the time from first to last token. This is the metric most developers optimise, but once the response is streamed it barely affects perceived responsiveness.

Render time is how long the browser takes to display each incoming token. On complex markdown rendering setups, this becomes the hidden bottleneck.

In a non-streaming implementation nothing appears until the last token has been generated, so the effective TTFT is the total generation time plus the request overhead. For a 400-token response at 50 tokens/second, that is at least 8 seconds of a spinning loader. The TTFT for a streaming implementation can be as low as 150–300 milliseconds.
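The arithmetic can be sketched in a few lines. This is an illustration only: `perceivedWaitMs` is a hypothetical helper, and the 250 ms overhead (network plus prefill) is an assumed figure chosen from inside the 150–300 ms range quoted above.

```javascript
// Time before anything appears on screen, in milliseconds.
// overheadMs (network + prefill) is an assumed figure for illustration.
function perceivedWaitMs({ tokens, tokensPerSecond, overheadMs, streaming }) {
  const generationMs = (tokens / tokensPerSecond) * 1000;
  // Streaming shows the first token as soon as it exists; non-streaming
  // holds everything back until the last token has been generated.
  return streaming ? overheadMs : overheadMs + generationMs;
}

const base = { tokens: 400, tokensPerSecond: 50, overheadMs: 250 };
console.log(perceivedWaitMs({ ...base, streaming: false })); // 8250
console.log(perceivedWaitMs({ ...base, streaming: true }));  // 250
```

The total work is identical in both cases; only the moment of first feedback moves.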

Streaming does not make the model faster. It makes the wait invisible.

Users perceive systems as responsive when they receive feedback within 100–300 milliseconds. They begin to question whether the system is working after 1–2 seconds of silence. Streaming addresses this by giving the first token — even just the word "The" — immediately, shifting users from waiting mode to reading mode.

The Five Layers That Must All Be Right

Most developers build their first GenAI feature the same way: call the LLM API, wait for the complete response, return it, render it. That request-response pattern is the wrong fit for LLM output, which is produced one token at a time.

A streaming implementation requires changes at every layer of the stack:

Layer 1 — The LLM API call must use streaming mode. The API returns immediately; tokens are delivered incrementally as they are generated.

Layer 2 — The backend server must forward the stream to the client as it arrives, not buffer it and send it in bulk. Specific HTTP response headers are required, and middleware (such as compression middleware in an Express app) can silently collapse the stream into a single buffered response.

Layer 3 — The transport protocol must support server-to-client push. Server-Sent Events (SSE) is the correct choice for most GenAI applications. WebSockets introduce unnecessary complexity for what is fundamentally a one-directional token stream.

Layer 4 — The frontend must consume the stream incrementally and update state on each chunk without re-rendering the entire component on every token. A naive implementation that triggers full re-renders on each incoming token produces visible stutter on responses longer than ~200 tokens.

Layer 5 — The infrastructure (load balancers, reverse proxies, and CDN layers) must pass streaming responses through without buffering them or timing out. This is the layer most developers forget. It fails silently in production while everything looks correct in development.
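For Layer 2, one way to keep compression middleware away from the stream is to exempt SSE responses by content type. The sketch below assumes the `compression` package for Express, whose `filter` option accepts a function of `(req, res)`. The name `shouldCompress` is hypothetical, and the check assumes the `Content-Type` header is set before the first write.

```javascript
// Decide whether a response may be compressed. SSE responses are excluded
// so that each token is flushed to the client as soon as it is written.
function shouldCompress(req, res) {
  const type = String(res.getHeader('Content-Type') || '');
  return !type.startsWith('text/event-stream');
}

// Assumed wiring in an Express app that uses the `compression` package:
//   app.use(compression({ filter: shouldCompress }));

// Quick check with stand-in response objects.
const sseRes = { getHeader: () => 'text/event-stream' };
const jsonRes = { getHeader: () => 'application/json' };
console.log(shouldCompress({}, sseRes));  // false
console.log(shouldCompress({}, jsonRes)); // true
```

Sending `Cache-Control: no-transform` on the streaming response is a complementary signal that compression middleware commonly honours, but it is worth verifying against the version you run.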


Implementing SSE Correctly in Node.js

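Server-Sent Events is the right transport for LLM token streaming. It is built on standard HTTP, is supported natively by all modern browsers, and keeps no connection state on the server beyond the open response, so it scales horizontally without session affinity. The browser's EventSource also reconnects automatically on connection loss. For an LLM stream a reconnect re-issues the request, so the client should decide explicitly whether a retry is wanted.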

The SSE wire format is simple:

data: {"token": "Hello"}\n\n
data: {"token": " world"}\n\n
data: {"done": true}\n\n
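The browser's EventSource parses this format for you, but seeing it spelled out makes clear why the blank line matters. The sketch below is a simplified illustration, not a full implementation of the SSE specification: `formatSseEvent` and `createSseParser` are hypothetical names, and the parser handles only `data:` fields with `\n` line endings.

```javascript
// Encode one JSON payload as an SSE frame: a "data:" line plus a blank line.
function formatSseEvent(payload) {
  return `data: ${JSON.stringify(payload)}\n\n`;
}

// Incremental parser: network chunks can split a frame anywhere, so keep
// the unfinished tail in a buffer until its terminating blank line arrives.
function createSseParser(onEvent) {
  let buffer = '';
  return function feed(chunk) {
    buffer += chunk;
    let end;
    while ((end = buffer.indexOf('\n\n')) !== -1) {
      const frame = buffer.slice(0, end);
      buffer = buffer.slice(end + 2);
      const data = frame
        .split('\n')
        .filter((line) => line.startsWith('data:'))
        .map((line) => line.slice(5).replace(/^ /, ''))
        .join('\n');
      if (data) onEvent(JSON.parse(data));
    }
  };
}

const events = [];
const feed = createSseParser((event) => events.push(event));
const wire = formatSseEvent({ token: 'Hello' }) + formatSseEvent({ done: true });
feed(wire.slice(0, 10)); // first frame arrives cut in half
feed(wire.slice(10));    // the rest arrives in a second chunk
console.log(events);     // two events: { token: 'Hello' } and { done: true }
```

Because `JSON.stringify` never emits a raw newline, each payload fits on a single `data:` line, which is what keeps the framing this simple.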

The complete Express implementation requires three critical response headers:

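import express from 'express';
import OpenAI from 'openai';

const app = express();
const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment

// GET, because the browser's EventSource only issues GET requests.
// Assumes the prompt arrives as a query parameter.
app.get('/api/chat', async (req, res) => {
  res.setHeader('Content-Type', 'text/event-stream');
  res.setHeader('Cache-Control', 'no-cache, no-transform');
  res.setHeader('X-Accel-Buffering', 'no'); // critical for Nginx
  res.flushHeaders(); // send headers now so the client sees the stream open

  try {
    const stream = await openai.chat.completions.create({
      model: 'gpt-4o',
      messages: [{ role: 'user', content: String(req.query.prompt ?? '') }],
      stream: true,
    });

    for await (const chunk of stream) {
      const token = chunk.choices[0]?.delta?.content || '';
      if (token) {
        res.write(`data: ${JSON.stringify({ token })}\n\n`);
      }
    }
    res.write('data: {"done": true}\n\n');
  } catch (err) {
    // Headers are already sent, so report the failure in-band and finish.
    console.error(err);
    res.write(`data: ${JSON.stringify({ done: true, error: 'generation failed' })}\n\n`);
  } finally {
    res.end();
  }
});

app.listen(3000);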

The X-Accel-Buffering: no header is the one developers miss. It instructs Nginx to disable proxy buffering for this response specifically, even when proxy_buffering on is set globally.

What Happens in Those First 300 Milliseconds

Understanding TTFT at a granular level reveals where it can be reduced and where attempting to reduce it further yields diminishing returns.

Component	Typical Range	Reducible?
Client request construction	1–5ms	No (browser)
Network transit to server	10–200ms	Yes — Edge deployment
Backend processing	5–50ms	Yes — middleware optimisation
LLM API request transmission	10–30ms	Partial
LLM prefill (irreducible)	50–200ms	No — transformer architecture
Network transit first token back	10–200ms	Yes — Edge deployment
SSE parse and React render	1–10ms	Yes — render optimisation

The actionable insight: the two network legs (to the server, and back with the first token) and backend processing are the most commonly optimisable. Edge deployment shortens both network legs. Backend middleware optimisation shortens backend processing. LLM prefill is the irreducible floor.
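Adding up the rows shows how much of the budget is actually in your hands. The snippet below is a sketch that copies the ranges from the table; treating the "Partial" row as not reducible is a simplifying assumption.

```javascript
// Rows from the table: [component, minMs, maxMs, reducible].
const budget = [
  ['Client request construction', 1, 5, false],
  ['Network transit to server', 10, 200, true],
  ['Backend processing', 5, 50, true],
  ['LLM API request transmission', 10, 30, false], // "Partial" counted as fixed here
  ['LLM prefill', 50, 200, false],
  ['Network transit first token back', 10, 200, true],
  ['SSE parse and React render', 1, 10, true],
];

const sum = (rows, col) => rows.reduce((total, row) => total + row[col], 0);
const reducible = budget.filter((row) => row[3]);

console.log(`TTFT range: ${sum(budget, 1)}-${sum(budget, 2)} ms`);
// TTFT range: 87-695 ms
console.log(`Reducible in the worst case: ${sum(reducible, 2)} of ${sum(budget, 2)} ms`);
// Reducible in the worst case: 460 of 695 ms
```

Stacking every worst case lands well above 300 ms, and most of that excess sits in the two network legs, which is why edge deployment tends to pay off first.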

React Client Implementation

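import { useCallback, useEffect, useRef, useState } from 'react';

function useLLMStream(prompt) {
  const [content, setContent] = useState('');
  const [isStreaming, setIsStreaming] = useState(false);
  const sourceRef = useRef(null);

  // Close the current connection, if any, and leave streaming mode.
  const stop = useCallback(() => {
    sourceRef.current?.close();
    sourceRef.current = null;
    setIsStreaming(false);
  }, []);

  const startStream = () => {
    stop(); // never run two streams at once
    setIsStreaming(true);
    setContent('');

    const eventSource = new EventSource(`/api/chat?prompt=${encodeURIComponent(prompt)}`);
    sourceRef.current = eventSource;

    eventSource.onmessage = (e) => {
      const data = JSON.parse(e.data);
      if (data.done) {
        stop();
      } else if (data.token) {
        setContent(prev => prev + data.token);
      }
    };

    // EventSource reconnects automatically after an error, which would
    // resubmit the prompt and restart generation. Close it instead.
    eventSource.onerror = () => stop();
  };

  // Close the connection if the component unmounts mid-stream.
  useEffect(() => stop, [stop]);

  return { content, isStreaming, startStream };
}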

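Note the state update pattern: setContent(prev => prev + data.token) is a functional update, so each token is appended to the latest state rather than to a stale value captured when the handler was created. It does not stop re-renders: React still re-renders the component on every state update. To keep that cheap, isolate the streamed text in a small component of its own and avoid re-parsing the full markdown on every token.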
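If per-token state updates still cause stutter on long responses, one common mitigation is to batch tokens and update state at most once per animation frame. The sketch below is one way to do that, not the only one. `createTokenBatcher` is a hypothetical helper, and the scheduler is passed in so the same code runs in a browser (with `requestAnimationFrame`) and in a plain Node script (with a stand-in).

```javascript
// Collect tokens and hand them to onFlush at most once per scheduled tick.
// `schedule` is injectable: requestAnimationFrame in a browser, a fake in tests.
function createTokenBatcher(onFlush, schedule) {
  let pending = '';
  let scheduled = false;
  return function push(token) {
    pending += token;
    if (scheduled) return; // a flush is already queued for this tick
    scheduled = true;
    schedule(() => {
      const text = pending;
      pending = '';
      scheduled = false;
      onFlush(text);
    });
  };
}

// Demo with a stand-in scheduler that queues callbacks instead of running them.
const frameQueue = [];
const flushed = [];
const push = createTokenBatcher((text) => flushed.push(text), (cb) => frameQueue.push(cb));
['The', ' wait', ' became', ' invisible'].forEach((token) => push(token));
frameQueue.shift()(); // simulate one animation frame
console.log(flushed.length, flushed[0]); // 1 'The wait became invisible'
```

Inside the hook, the assumed wiring would be `createTokenBatcher((text) => setContent(prev => prev + text), (cb) => requestAnimationFrame(cb))`, with `push(data.token)` called from `onmessage`. Four tokens then cost one render instead of four.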


The Infrastructure Layer That Breaks Everything Silently

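AWS Application Load Balancer has a default idle timeout of 60 seconds. The timer counts time during which no data crosses the connection, so a stream that is steadily delivering tokens is safe. A request that stays silent for 60 seconds (a non-streaming call, a long tool call before the first token, or a stalled generation) is cut off. The user sees a truncated response or a failed request, often with no useful error.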
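Because an idle timeout measures silence on the connection, a periodic heartbeat keeps a quiet stream alive. In the SSE format, a line that starts with a colon is a comment that EventSource ignores, which makes it a convenient no-op. The sketch below assumes a Node response object with a `write` method. `startHeartbeat` is a hypothetical helper, and the 15-second interval is an arbitrary value chosen to sit well under 60 seconds.

```javascript
// Write an SSE comment line every intervalMs so the connection is never idle.
// `timers` is injectable so the behaviour can be checked without real waiting.
function startHeartbeat(res, intervalMs = 15000, timers = globalThis) {
  const id = timers.setInterval(() => res.write(': ping\n\n'), intervalMs);
  return () => timers.clearInterval(id); // call this when the stream ends
}

// Demo with a stand-in response and a stand-in timer that we tick by hand.
const written = [];
const fakeRes = { write: (text) => written.push(text) };
let tick = null;
const fakeTimers = {
  setInterval: (fn) => { tick = fn; return 1; },
  clearInterval: () => { tick = null; },
};
const stopHeartbeat = startHeartbeat(fakeRes, 15000, fakeTimers);
tick();
tick();
stopHeartbeat();
console.log(written.length); // 2
```

In a real handler the returned function would be called in the same place as `res.end()`, and again if the client disconnects, so the interval does not outlive the response.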

Nginx's proxy_buffering directive defaults to on. For a streaming endpoint this means every token gets buffered until the buffer is full before being forwarded. The LLM is streaming. The proxy is buffering. The user experiences buffering.

The fixes for each:

AWS ALB: Increase the idle timeout beyond the longest silent gap you expect on the connection. If a request can go up to 60 seconds without sending a byte, set the timeout to 90 seconds.

Nginx: Add proxy_buffering off; to your location block for the streaming endpoint, or use the X-Accel-Buffering: no response header to override per-request.

Vercel: Edge Functions inherit streaming behaviour automatically. Serverless Functions require the response to be returned as a ReadableStream, not awaited.

Key Takeaways

The difference between a broken-feeling AI feature and a responsive one is not model speed — it is whether streaming is implemented correctly at all five layers of the stack.

Implement streaming at the LLM API call level
Forward streams immediately in your backend without buffering middleware
Use SSE for unidirectional token delivery
Accumulate tokens in React state without triggering full re-renders
Configure your infrastructure layer — load balancer timeouts, Nginx proxy buffering — before deployment

The model did not get faster. The wait became invisible. That is the entire difference.