Choosing the Right Protocol When Your App Talks to an AI: A Practical Framework

You integrate an LLM into your application. You test it locally, it responds in 3 seconds, and you think it looks acceptable. Then you deploy to production.

Real users start complaining the app feels broken. They stare at a blank text field for 4–8 seconds with no feedback, assume the request failed, click again, and now you have duplicate requests flooding your LLM API. Your costs double. Your error rate climbs. Your UX team gets blamed for engineering decisions.

None of this is caused by the LLM. It's caused by choosing the wrong communication protocol for how LLMs actually work.

LLMs generate output sequentially, one token at a time (a token is roughly a word fragment), at about 30–80 tokens per second depending on the model. At 50 tokens/second, a 400-token reply takes 8 seconds to generate. If you wait for the entire response before rendering anything, your users stare at nothing for those 8 seconds on every single interaction.

Why Protocol Choice Matters More With LLMs

With a typical REST API — a database query, a payment processor, a weather service — the response time is 50–500 milliseconds. Users don't notice a 300ms wait. The request-response model feels instantaneous.

LLM inference is categorically different. The response isn't computed and returned — it's generated incrementally. The first token might arrive in 200–400 milliseconds (time to first token), but the full response takes 3–15 seconds depending on length and model.
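The arithmetic behind these numbers can be sketched in a few lines. This is only an illustrative model: `estimateLatency` and its parameter names are hypothetical, and the inputs are the example figures used in this article, not measurements.

```javascript
// Rough latency model for a generated reply (illustrative only; the
// function and parameter names are hypothetical, not from any SDK).
function estimateLatency({ tokens, tokensPerSecond, timeToFirstTokenMs }) {
  const generationMs = (tokens / tokensPerSecond) * 1000;
  return {
    // Wait before the user sees anything when the reply is buffered.
    bufferedWaitMs: timeToFirstTokenMs + generationMs,
    // Wait before the user sees the first text when the reply is streamed.
    streamedWaitMs: timeToFirstTokenMs,
  };
}

const estimate = estimateLatency({
  tokens: 400,
  tokensPerSecond: 50,
  timeToFirstTokenMs: 300,
});
console.log(estimate); // { bufferedWaitMs: 8300, streamedWaitMs: 300 }
```

Total generation time is the same either way. What streaming changes is how long the user waits before the first visible output.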

This changes the entire user experience calculus. In LLM contexts, the protocol choice directly determines:

Whether users know the system is working (or assume it failed)
Whether users can start reading useful content early (or must wait for completion)
Whether you can support multi-turn conversations efficiently
How much infrastructure overhead you're paying per interaction

The Three Protocols and When Each Is Right

REST: Request-Response

REST makes sense for LLM integration when the user doesn't need to watch generation happen. Background classification, document analysis, batch processing, feature extraction — all of these are better served by REST. You fire a request, wait for completion, use the result.

REST also makes sense for short responses (under ~100 tokens) where generation time is under 2 seconds. At that length, the difference between REST and streaming is barely perceptible to the user.

Where REST breaks down for LLMs: interactive chat, code generation with live preview, and any UI where the user is watching and waiting. The 8-second blank screen is not a failure of user patience; it is a failure of protocol choice.

Server-Sent Events (SSE): The Right Default for LLM Streaming

SSE is unidirectional: the server pushes events to the client over a persistent HTTP connection, and the client receives them as they arrive. For LLM token streaming, this is the correct default.

Why SSE over WebSockets for most LLM applications:

Unidirectional by design — you're pushing tokens to the client, not exchanging messages bidirectionally
Built on standard HTTP — no protocol upgrade, better compatibility with proxies and CDNs
Automatic reconnect built into the browser's EventSource API
No shared connection state: each stream is an ordinary HTTP request that any instance can serve, so it scales horizontally without session affinity

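// Backend SSE endpoint (Express with the official openai Node SDK)
import express from 'express';
import OpenAI from 'openai';

const app = express();
const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment

app.get('/api/chat/stream', async (req, res) => {
  // Reject malformed input before committing to a streaming response
  let messages;
  try {
    messages = JSON.parse(req.query.messages);
  } catch {
    return res.status(400).json({ error: 'messages must be valid JSON' });
  }

  res.setHeader('Content-Type', 'text/event-stream');
  res.setHeader('Cache-Control', 'no-cache');
  res.setHeader('X-Accel-Buffering', 'no'); // asks Nginx not to buffer this response
  try {
    const stream = await openai.chat.completions.create({
      model: 'gpt-4o',
      messages,
      stream: true,
    });
    // Stop generating if the client disconnects mid-stream
    res.on('close', () => stream.controller.abort());
    for await (const chunk of stream) {
      const token = chunk.choices[0]?.delta?.content || '';
      if (token) res.write(`data: ${JSON.stringify({ token })}\n\n`);
    }
    res.write('data: {"done": true}\n\n');
  } catch (err) {
    // Headers are already sent, so report the failure in-band
    if (!res.destroyed) res.write('data: {"error": "generation failed"}\n\n');
  } finally {
    res.end();
  }
});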
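
On the browser side, the endpoint above could be consumed roughly as follows. This is a minimal sketch that assumes the `{ token }` and `{ done: true }` payloads shown in the backend code; `applyStreamEvent`, `streamChat`, and the `onUpdate` callback are hypothetical names introduced for illustration.

```javascript
// Pure reducer: folds one SSE `data:` payload into the accumulated reply.
function applyStreamEvent(state, rawData) {
  const event = JSON.parse(rawData);
  if (event.done) return { ...state, done: true };
  return { ...state, text: state.text + (event.token ?? '') };
}

// Browser-only wiring (EventSource is a browser API, not a Node global
// in most versions). `onUpdate` is a hypothetical render callback.
function streamChat(messages, onUpdate) {
  const url =
    '/api/chat/stream?messages=' + encodeURIComponent(JSON.stringify(messages));
  const source = new EventSource(url);
  let state = { text: '', done: false };

  source.onmessage = (e) => {
    state = applyStreamEvent(state, e.data);
    onUpdate(state);
    // Close explicitly on completion. Otherwise EventSource treats the
    // server ending the response as a dropped connection and reconnects,
    // which would re-send the prompt and start a second generation.
    if (state.done) source.close();
  };
  // For the same reason, give up on errors instead of auto-retrying.
  source.onerror = () => source.close();

  return () => source.close(); // lets the caller cancel the stream
}
```

The explicit `close()` calls matter here: EventSource's automatic reconnect is useful for long-lived feeds, but for a one-shot LLM reply each reconnect is a new, billable generation.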

Limitation of SSE: The browser's EventSource API only supports GET requests. For chat interfaces where you need to POST the conversation history (which can be large), you either pass it in query parameters (not ideal for large histories, since URLs have practical length limits and tend to end up in access logs) or use fetch() with a ReadableStream instead.
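
The fetch() route might look like the sketch below. It assumes a POST variant of `/api/chat/stream` that emits the same `data: {...}` events as the GET example (the article's backend code only defines the GET handler), and the parser is deliberately simplified: it handles only `data: ` lines with `\n` line endings. `createSseParser` and `streamChatViaPost` are hypothetical names.

```javascript
// Incremental parser for the `data: ...\n\n` framing used by SSE.
// Network chunks can split an event anywhere, so unparsed text is buffered.
function createSseParser(onData) {
  let buffer = '';
  return function feed(chunkText) {
    buffer += chunkText;
    let boundary;
    while ((boundary = buffer.indexOf('\n\n')) !== -1) {
      const rawEvent = buffer.slice(0, boundary);
      buffer = buffer.slice(boundary + 2);
      for (const line of rawEvent.split('\n')) {
        // Lines without the data prefix (such as comments) are skipped.
        if (line.startsWith('data: ')) onData(line.slice(6));
      }
    }
  };
}

// POSTs the conversation and reads the streamed reply token by token.
// `signal` is an optional AbortSignal for cancelling the request.
async function streamChatViaPost(messages, onToken, signal) {
  const response = await fetch('/api/chat/stream', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ messages }),
    signal,
  });
  if (!response.ok || !response.body) {
    throw new Error(`Stream request failed: ${response.status}`);
  }

  const feed = createSseParser((data) => {
    const event = JSON.parse(data);
    if (event.token) onToken(event.token);
  });
  const reader = response.body.getReader();
  const decoder = new TextDecoder();
  while (true) {
    const { value, done } = await reader.read();
    if (done) break;
    // `stream: true` keeps multi-byte characters intact across chunks.
    feed(decoder.decode(value, { stream: true }));
  }
}
```

Unlike EventSource, fetch() does not reconnect on its own, so retry behavior (if any) has to be added deliberately.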

WebSockets: For Genuine Bidirectionality

WebSockets are the right choice when your application needs genuine bidirectional real-time communication — not just token streaming.

Use WebSockets when:

Users need to interrupt or redirect generation mid-stream
You're building a collaborative feature where multiple users interact simultaneously
You need low-latency bidirectional messaging alongside LLM responses (audio, presence, live collaboration)

WebSockets are over-engineered for most LLM chat applications. They introduce connection state management, reconnection complexity, and infrastructure requirements (WebSocket-aware load balancers) that SSE avoids entirely.
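
To make the interrupt case concrete, here is a transport-agnostic sketch of a session that can cancel generation mid-stream. Everything in it is an assumption for illustration: the message shapes (`prompt`, `cancel`, `token`, `done`, `cancelled`, `error`), the `send` hook, and the `generate` hook, which stands in for any async iterable of tokens that accepts an AbortSignal.

```javascript
// `send` delivers a JSON-serializable message to the client; `generate`
// yields tokens for a prompt. Both are hypothetical hooks from the caller.
function createChatSession({ send, generate }) {
  let controller = null;

  async function start(prompt) {
    controller?.abort(); // a new prompt supersedes any running generation
    const current = new AbortController();
    controller = current;
    try {
      for await (const token of generate(prompt, current.signal)) {
        if (current.signal.aborted) break; // stop forwarding after a cancel
        send({ type: 'token', token });
      }
      send({ type: current.signal.aborted ? 'cancelled' : 'done' });
    } catch {
      send({ type: current.signal.aborted ? 'cancelled' : 'error' });
    }
  }

  // Handles one raw client message (a JSON string).
  return function onMessage(raw) {
    const msg = JSON.parse(raw);
    if (msg.type === 'prompt') return start(msg.prompt);
    if (msg.type === 'cancel') controller?.abort();
  };
}
```

With a WebSocket server such as the `ws` package, the wiring would be along the lines of passing `(m) => socket.send(JSON.stringify(m))` as `send` and calling `onMessage(data.toString())` from the socket's `message` event. The cancel message arriving on the same connection while tokens are still flowing is the part SSE cannot express.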

The Failure Patterns by Protocol

| Failure | Protocol mismatch | Fix |
| --- | --- | --- |
| 8-second blank screen, users click twice | REST for interactive chat | Switch to SSE streaming |
| Stream works locally, breaks in production | SSE behind Nginx with default buffering | Add `proxy_buffering off` to the streaming location |
| Streaming works, conversation history lost on refresh | WebSockets without persistence layer | Add Redis/database session storage |
| Can't POST large conversation history | EventSource with GET only | Use `fetch()` + ReadableStream |
| WebSocket connections drop after 60s of inactivity | AWS ALB default idle timeout | Increase idle timeout or add keepalive pings |

Choosing for Your Specific Use Case

Interactive chat assistant → SSE streaming. Users type, response streams token by token. Lowest implementation complexity, best user experience.

Document analysis with progress indicator → REST + polling. User uploads document, gets request ID, polls /api/status/{id} every 2 seconds. Generation runs asynchronously.

Live code generation → SSE streaming. Developer watches code appear as it's generated.

Collaborative AI document editing → WebSockets. Multiple users see each other's cursors and AI contributions in real time.

Email classification pipeline → REST. Background processing, no user watching.

Voice conversation with AI → WebSockets. Audio input/output requires true bidirectionality.
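
For the REST + polling case in the list above, the client loop can be sketched as follows. The `/api/status/{id}` route comes from the article, but the `{ status, result, error }` response shape, the status values, and the names `fetchJobStatus` and `pollUntilDone` are assumptions made for this example.

```javascript
const defaultSleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Default status fetcher. The response shape is assumed, not specified
// anywhere in the article.
async function fetchJobStatus(jobId) {
  const response = await fetch(`/api/status/${encodeURIComponent(jobId)}`);
  if (!response.ok) throw new Error(`Status check failed: ${response.status}`);
  return response.json();
}

// Polls until the job reports "done" or "failed", or attempts run out.
async function pollUntilDone(jobId, options = {}) {
  const {
    fetchStatus = fetchJobStatus,
    intervalMs = 2000, // the 2-second interval suggested above
    maxAttempts = 150, // hypothetical cap, about 5 minutes at 2s
    sleep = defaultSleep,
  } = options;

  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const job = await fetchStatus(jobId);
    if (job.status === 'done') return job.result;
    if (job.status === 'failed') throw new Error(job.error ?? 'Job failed');
    await sleep(intervalMs);
  }
  throw new Error(`Job ${jobId} did not finish after ${maxAttempts} attempts`);
}
```

Capping the number of attempts keeps a stuck job from polling forever; a progress indicator can be driven from any extra fields the status endpoint returns.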

The Infrastructure Layer That Breaks Everything

The protocol you choose in your application code is not the protocol your users experience if there's infrastructure between your backend and the browser.

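AWS ALB: Default 60-second idle timeout. The timer counts time with no data flowing, so a connection that goes quiet for more than 60 seconds (a slow first token, a long pause mid-generation, or an idle WebSocket) gets cut off. Fix: increase the idle timeout in the load balancer's attributes, or send periodic keepalive data.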

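Nginx: proxy_buffering is on by default, so Nginx accumulates streamed tokens and releases them only when its buffer fills or the response ends. Fix: set proxy_buffering off in the streaming location block, or send the X-Accel-Buffering: no response header from the backend.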

Cloudflare: Streaming typically passes through, but caching rules can buffer responses. Fix: exclude streaming endpoints from cache rules.

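Vercel Serverless Functions: Older Node.js serverless runtimes could not stream responses, and support depends on the runtime and framework version you deploy. Fix: confirm that your function runtime supports streaming, or use Edge Functions (runtime: 'edge') for streaming endpoints.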
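
For idle timeouts such as the ALB one, a common complement to raising the limit is a heartbeat on the SSE stream. The sketch below relies on the SSE rule that lines starting with `:` are comments that clients ignore; `startHeartbeat` is a hypothetical helper, and the 15-second default is an arbitrary value chosen to sit well under a 60-second timeout.

```javascript
// Writes an SSE comment at a fixed interval so proxies with idle
// timeouts keep seeing traffic while the model is quiet.
// `res` is anything with a write(string) method, such as an Express response.
function startHeartbeat(res, intervalMs = 15000) {
  const timer = setInterval(() => res.write(': keepalive\n\n'), intervalMs);
  return () => clearInterval(timer); // call this when the stream ends
}
```

In a streaming handler, this would be started after the headers are set and stopped in the same place the response is ended, so the timer never outlives the connection.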

The application-level protocol choice is step one. The infrastructure configuration that passes that protocol through correctly is step two. Teams that skip step two discover the gap in production.

The Summary Decision

The question is not "REST or streaming?" The question is "what experience does my user need?"

If the user needs to see output as it arrives: SSE streaming. If the user needs bidirectional interaction mid-generation: WebSockets. If the user doesn't need to watch the generation: REST.

Everything else is implementation detail. Get the user experience model right first, then choose the protocol that delivers it, then configure the infrastructure to not interfere with it.