Choosing the Right Protocol When Your App Talks to an AI: A Practical Framework

REST, Server-Sent Events, or WebSockets? Choosing without understanding the tradeoffs produces systems that feel broken to users. Here's the decision framework for AI-integrated apps, with the specific failure patterns that emerge when you choose wrong.

Meritshot Team | 6 min read
Tags: REST API, Streaming, WebSockets, Backend Development, AI, Protocol Design, Architecture

You integrate an LLM into your application. You test it locally, it responds in 3 seconds, and you think it looks acceptable. Then you deploy to production.

Real users start complaining the app feels broken. They stare at a blank text field for 4–8 seconds with no feedback, assume the request failed, click again, and now you have duplicate requests flooding your LLM API. Your costs double. Your error rate climbs. Your UX team gets blamed for engineering decisions.

None of this is caused by the LLM. It's caused by choosing the wrong communication protocol for how LLMs actually work.

LLMs generate output sequentially, one token at a time, at roughly 30–80 tokens per second depending on the model. A 400-token reply at 50 tokens/second therefore takes 8 seconds to generate. If you wait for the entire response before rendering anything, you're asking your users to stare at nothing for 8 seconds on every single interaction.
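
The arithmetic above is easy to rerun for other reply lengths and speeds. This is a minimal sketch; the function name `generationSeconds` is made up for illustration, and it deliberately ignores time to first token and network latency:

```javascript
// Time to generate a full reply at a steady decoding speed.
// Ignores time to first token and network latency.
function generationSeconds(tokens, tokensPerSecond) {
  if (tokensPerSecond <= 0) throw new RangeError('tokensPerSecond must be positive');
  return tokens / tokensPerSecond;
}

console.log(generationSeconds(400, 50)); // 8
console.log(generationSeconds(400, 80)); // 5
```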

Why Protocol Choice Matters More With LLMs

With a typical REST API — a database query, a payment processor, a weather service — the response time is 50–500 milliseconds. Users don't notice a 300ms wait. The request-response model feels instantaneous.

LLM inference is categorically different. The response isn't computed and returned — it's generated incrementally. The first token might arrive in 200–400 milliseconds (time to first token), but the full response takes 3–15 seconds depending on length and model.
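
To see the gap between time to first token and total time in your own stack, you can time any token stream. The sketch below assumes the stream is exposed as an async iterable; `measureStream` and `fakeModel` are hypothetical names, and the fake model's 10 ms cadence is an arbitrary stand-in for a real API:

```javascript
// Consume any async iterable of tokens and record when the first one arrived.
async function measureStream(tokens, now = () => Date.now()) {
  const start = now();
  let firstTokenAt = null;
  let count = 0;
  for await (const _token of tokens) {
    if (firstTokenAt === null) firstTokenAt = now();
    count += 1;
  }
  return {
    tokens: count,
    ttftMs: firstTokenAt === null ? null : firstTokenAt - start,
    totalMs: now() - start,
  };
}

// Stand-in for a model stream: 20 tokens, one every 10 ms (arbitrary numbers).
async function* fakeModel() {
  for (let i = 0; i < 20; i++) {
    await new Promise((resolve) => setTimeout(resolve, 10));
    yield `tok${i}`;
  }
}

// ttftMs comes out far smaller than totalMs; exact values vary by machine.
measureStream(fakeModel()).then((metrics) => console.log(metrics));
```

A streaming UI makes the user wait roughly `ttftMs`; a request-response UI makes them wait `totalMs`.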

This changes the entire user experience calculus. In LLM contexts, the protocol choice directly determines:

  • Whether users know the system is working (or assume it failed)
  • Whether users can start reading useful content early (or must wait for completion)
  • Whether you can support multi-turn conversations efficiently
  • How much infrastructure overhead you're paying per interaction

The Three Protocols and When Each Is Right

REST: Request-Response

REST makes sense for LLM integration when the user doesn't need to watch generation happen. Background classification, document analysis, batch processing, feature extraction — all of these are better served by REST. You fire a request, wait for completion, use the result.
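
For background work like this, a plain request with a timeout is usually enough. The sketch below assumes a hypothetical `/api/classify` endpoint that returns `{ label }`; the endpoint, the response shape, and the `classifyEmail` name are illustrative, not from any particular API:

```javascript
// Hypothetical endpoint and response shape: POST /api/classify -> { label }.
// fetchImpl is injectable so the function can be tested without a network.
async function classifyEmail(text, { fetchImpl = fetch, timeoutMs = 30000 } = {}) {
  const res = await fetchImpl('/api/classify', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ text }),
    signal: AbortSignal.timeout(timeoutMs), // give up instead of hanging forever
  });
  if (!res.ok) throw new Error(`classification failed: ${res.status}`);
  const { label } = await res.json();
  return label;
}
```

The relative URL works in a browser; from a Node.js worker you would pass an absolute URL instead.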

REST also makes sense for short responses (under ~100 tokens), where generation time is under 2 seconds. At that length, streaming makes little perceptible difference to the user.

When REST breaks for LLMs: Interactive chat. Code generation with live preview. Any UI where the user is watching and waiting. The 8-second blank screen is not a failure of user patience; it's a failure of the protocol choice.

Server-Sent Events (SSE): The Right Default for LLM Streaming

SSE is unidirectional: the server pushes events to the client over a persistent HTTP connection, and the client handles each event as it arrives. For LLM token streaming, this is the correct default.

Why SSE over WebSockets for most LLM applications:

  • Unidirectional by design: you're pushing tokens to the client, not exchanging messages bidirectionally
  • Built on standard HTTP: no protocol upgrade, better compatibility with proxies and CDNs
  • Automatic reconnect built into the browser's EventSource API
  • No sticky sessions: each stream is an ordinary HTTP request that any instance can serve, so it scales horizontally without session affinity
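
// Backend SSE endpoint (Express and the official openai Node SDK)
import express from 'express';
import OpenAI from 'openai';

const app = express();
const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment

app.get('/api/chat/stream', async (req, res) => {
  let messages;
  try {
    messages = JSON.parse(req.query.messages); // reject bad input before streaming
  } catch {
    return res.status(400).json({ error: 'messages must be valid JSON' });
  }
  res.setHeader('Content-Type', 'text/event-stream');
  res.setHeader('Cache-Control', 'no-cache');
  res.setHeader('X-Accel-Buffering', 'no'); // ask Nginx not to buffer this response

  try {
    const stream = await openai.chat.completions.create({
      model: 'gpt-4o',
      messages,
      stream: true,
    });
    // Stop generating (and paying) if the client disconnects
    res.on('close', () => stream.controller.abort());
    for await (const chunk of stream) {
      const token = chunk.choices[0]?.delta?.content || '';
      if (token) res.write(`data: ${JSON.stringify({ token })}\n\n`);
    }
    res.write('data: {"done": true}\n\n');
  } catch {
    // Report the failure as an event the client can handle
    if (!res.destroyed) res.write(`data: ${JSON.stringify({ error: 'stream failed' })}\n\n`);
  } finally {
    res.end();
  }
});

app.listen(3000);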
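
On the browser side, consuming that endpoint with EventSource might look like the sketch below. It assumes the event shape used above (`{"token": ...}` events followed by `{"done": true}`); the `streamChat` name and its callback parameters are illustrative, and the constructor is injectable only so the function can be exercised outside a browser:

```javascript
// Assumes the server emits {"token": "..."} events and a final {"done": true}.
function streamChat(messages, onToken, onDone, EventSourceImpl = globalThis.EventSource) {
  const url = '/api/chat/stream?messages=' + encodeURIComponent(JSON.stringify(messages));
  const source = new EventSourceImpl(url);

  source.onmessage = (event) => {
    const payload = JSON.parse(event.data);
    if (payload.done) {
      // Close explicitly: otherwise EventSource treats the ended response
      // as a dropped connection and reconnects, re-running the request.
      source.close();
      onDone();
    } else if (payload.token) {
      onToken(payload.token);
    }
  };
  source.onerror = () => source.close(); // sketch only: no retry policy

  return () => source.close(); // call this to cancel from the UI
}
```

The explicit `close()` matters for LLM streams in particular: an automatic reconnect would trigger a second, billable generation.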

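Limitation of SSE: The browser's EventSource API only supports GET requests and cannot send a request body. For chat interfaces where you need to send the conversation history (which can be large), you either pass it in query parameters, which runs into URL length limits on long histories, or use fetch() with a ReadableStream instead. The fetch() route lets you POST, but you give up EventSource's automatic reconnect and have to parse the event stream yourself.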
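
The fetch() route can be sketched as follows. This assumes a POST endpoint that replies with the same `data: {...}` event format; the URL you pass in, the `extractEvents` and `streamChatPost` names, and the `{ messages }` body shape are all assumptions for illustration:

```javascript
// Split a text buffer into complete SSE events; returns [events, remainder].
function extractEvents(buffer) {
  const parts = buffer.split('\n\n');
  const remainder = parts.pop(); // last piece may be an incomplete event
  const events = [];
  for (const part of parts) {
    const data = part
      .split('\n')
      .filter((line) => line.startsWith('data:'))
      .map((line) => line.slice(5).replace(/^ /, ''))
      .join('\n');
    if (data) events.push(JSON.parse(data));
  }
  return [events, remainder];
}

async function streamChatPost(url, messages, onToken, fetchImpl = fetch) {
  const res = await fetchImpl(url, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json', Accept: 'text/event-stream' },
    body: JSON.stringify({ messages }),
  });
  if (!res.ok || !res.body) throw new Error(`stream request failed: ${res.status}`);

  const reader = res.body.getReader();
  const decoder = new TextDecoder();
  let buffer = '';
  while (true) {
    const { value, done } = await reader.read();
    if (done) break;
    // Network chunks do not align with event boundaries, so buffer and re-split.
    buffer += decoder.decode(value, { stream: true });
    const [events, remainder] = extractEvents(buffer);
    buffer = remainder;
    for (const event of events) {
      if (event.done) return;
      if (event.token) onToken(event.token);
    }
  }
}
```

The buffering step is the part that is easy to get wrong: a single token event can arrive split across two network chunks.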

WebSockets: For Genuine Bidirectionality

WebSockets are the right choice when your application needs genuine bidirectional real-time communication — not just token streaming.

Use WebSockets when:

  • Users need to interrupt or redirect generation mid-stream
  • You're building a collaborative feature where multiple users interact simultaneously
  • You need low-latency bidirectional messaging alongside LLM responses (audio, presence, live collaboration)

WebSockets are over-engineered for most LLM chat applications. They introduce connection state management, reconnection complexity, and infrastructure requirements (WebSocket-aware load balancers) that SSE avoids entirely.
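
If you do need mid-stream interruption, the core of it is a small message protocol plus an abort signal. The sketch below is transport-agnostic and entirely hypothetical: the message types (`prompt`, `cancel`, `token`, `done`, `cancelled`), the `createChatSession` name, and the injected `send` and `generate` functions are assumptions, not part of any WebSocket library:

```javascript
// Hypothetical protocol: {type:'prompt', text} starts a generation,
// {type:'cancel'} interrupts it. `send` writes to the socket and
// `generate(text, signal)` returns an async iterable of tokens.
function createChatSession(send, generate) {
  let current = null; // AbortController for the in-flight generation

  async function run(text) {
    const controller = new AbortController();
    current = controller;
    try {
      for await (const token of generate(text, controller.signal)) {
        if (controller.signal.aborted) break;
        send({ type: 'token', token });
      }
      send({ type: controller.signal.aborted ? 'cancelled' : 'done' });
    } finally {
      if (current === controller) current = null;
    }
  }

  return {
    // Wire this to the socket's incoming-message handler.
    onMessage(raw) {
      const msg = JSON.parse(raw);
      if (msg.type === 'cancel') {
        current?.abort();
        return undefined;
      }
      if (msg.type === 'prompt') {
        current?.abort(); // a new prompt redirects the old generation
        return run(msg.text);
      }
      return undefined;
    },
  };
}
```

This is exactly the connection state that SSE lets you avoid: the server now has to track which generation belongs to which socket.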

The Failure Patterns by Protocol

Failure | Protocol Mismatch | Fix
8-second blank screen, users click twice | REST for interactive chat | Switch to SSE streaming
Stream works locally, breaks in production | SSE behind Nginx with default buffering | Add proxy_buffering off to streaming location
Streaming works, conversation history lost on refresh | WebSockets without persistence layer | Add Redis/database session storage
Can't POST large conversation history | EventSource with GET only | Use fetch() + ReadableStream
WebSocket connections drop every 60s | AWS ALB default timeout | Increase idle timeout or add keepalive pings

Choosing for Your Specific Use Case

Interactive chat assistant → SSE streaming. Users type, response streams token by token. Lowest implementation complexity, best user experience.

Document analysis with progress indicator → REST + polling. User uploads document, gets request ID, polls /api/status/{id} every 2 seconds. Generation runs asynchronously.
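
The polling half of that flow can be sketched like this. The status response shape (`{ status, result }`), the `pollUntilDone` name, and the attempt cap are assumptions; only the `/api/status/{id}` path and the 2-second interval come from the description above:

```javascript
// Hypothetical status shape: { status: 'pending' | 'done' | 'failed', result? }.
async function pollUntilDone(id, { fetchImpl = fetch, intervalMs = 2000, maxAttempts = 150 } = {}) {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const res = await fetchImpl(`/api/status/${encodeURIComponent(id)}`);
    if (!res.ok) throw new Error(`status check failed: ${res.status}`);
    const job = await res.json();
    if (job.status === 'done') return job.result;
    if (job.status === 'failed') throw new Error('analysis failed');
    // Still pending: wait before asking again.
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
  throw new Error('timed out waiting for analysis');
}
```

The attempt cap keeps a stuck job from polling forever; with the defaults it gives up after about five minutes.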

Live code generation → SSE streaming. Developer watches code appear as it's generated.

Collaborative AI document editing → WebSockets. Multiple users see each other's cursors and AI contributions in real time.

Email classification pipeline → REST. Background processing, no user watching.

Voice conversation with AI → WebSockets. Audio input/output requires true bidirectionality.

The Infrastructure Layer That Breaks Everything

The protocol you choose in your application code is not the protocol your users experience if there's infrastructure between your backend and the browser.

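AWS ALB: Default 60-second idle timeout. Any connection that carries no data for 60 seconds gets closed, which cuts off long non-streaming requests, stalled streams, and quiet WebSockets. Fix: increase the idle timeout in the load balancer's attributes, or send periodic keepalive data.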
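
Idle timeouts like this one close connections that go quiet, so an SSE stream that can pause benefits from a heartbeat. One common approach is to write SSE comment lines on a timer. In this sketch, `startHeartbeat` and the 15-second default are illustrative choices, and `res` is assumed to be a Node.js HTTP response:

```javascript
// Writes an SSE comment line on a timer. Lines starting with ":" are
// ignored by EventSource clients but still count as traffic for proxies.
function startHeartbeat(res, intervalMs = 15000) {
  const timer = setInterval(() => res.write(': ping\n\n'), intervalMs);
  const stop = () => clearInterval(timer);
  res.on('close', stop); // stop when the client disconnects
  return stop;
}
```

Call `startHeartbeat(res)` after setting the streaming headers, and call the returned function before `res.end()`.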

Nginx: proxy_buffering on by default. Accumulates streaming tokens until buffer is full. Fix: proxy_buffering off on the streaming location block.
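
A streaming location block might look like the fragment below. The upstream name `backend` and the 300-second read timeout are placeholders to adjust for your deployment; the path matches the example endpoint used earlier:

```nginx
# Hypothetical upstream name and timeout; adjust to your deployment.
location /api/chat/stream {
    proxy_pass http://backend;
    proxy_http_version 1.1;
    proxy_set_header Connection "";
    proxy_buffering off;       # forward each chunk as it arrives
    proxy_cache off;
    proxy_read_timeout 300s;   # tolerate gaps longer than the 60s default
}
```

Scoping this to the streaming location keeps buffering enabled for the rest of the site, where it is usually beneficial.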

Cloudflare: Streaming typically passes through, but caching rules can buffer responses. Fix: exclude streaming endpoints from cache rules.

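Vercel Serverless Functions: Streaming support depends on the runtime and framework version. Older Node.js serverless runtimes buffered the entire response, and function duration limits can still cut a long stream short. Fix: confirm streaming support for your runtime, or use Edge Functions (runtime: 'edge') for streaming endpoints.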

The application-level protocol choice is step one. The infrastructure configuration that passes that protocol through correctly is step two. Teams that skip step two discover the gap in production.

The Summary Decision

The question is not "REST or streaming?" The question is "what experience does my user need?"

If the user needs to see output as it arrives: SSE streaming. If the user needs bidirectional interaction mid-generation: WebSockets. If the user doesn't need to watch the generation: REST.
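
That rule is small enough to write down as code. The sketch below is only a restatement of the paragraph above; the function and field names are made up for illustration:

```javascript
// Encodes the three-way decision: interaction needs first, then visibility.
function chooseProtocol({ userWatchesOutput, needsMidStreamInput }) {
  if (needsMidStreamInput) return 'websocket';
  if (userWatchesOutput) return 'sse';
  return 'rest';
}

console.log(chooseProtocol({ userWatchesOutput: true, needsMidStreamInput: false })); // sse
console.log(chooseProtocol({ userWatchesOutput: true, needsMidStreamInput: true })); // websocket
console.log(chooseProtocol({ userWatchesOutput: false, needsMidStreamInput: false })); // rest
```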

Everything else is implementation detail. Get the user experience model right first, then choose the protocol that delivers it, then configure the infrastructure to not interfere with it.
