REST vs Streaming vs WebSockets: Which One Do You Actually Need When Your App Talks to an LLM?
You integrate GPT-4o into your application using a standard REST call. You test it locally, get a response in 3 seconds, and think it looks fine. Then you deploy.
Real users start complaining the app feels broken. They stare at a blank text field for 4–8 seconds with no feedback, assume the request failed, click again, and now you have duplicate requests flooding your LLM API. Your costs double. Your error rate climbs.
None of this is caused by the LLM. It's caused by choosing the wrong communication protocol for how LLMs actually work.
LLMs generate tokens sequentially, one after another, at roughly 30–80 tokens per second. A 400-token reply at 50 tokens/second takes 8 seconds in total. If you wait for the entire response before rendering anything, you're asking your users to stare at nothing for 8 seconds on every interaction.
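To make the arithmetic concrete, here is a minimal sketch. The function names are hypothetical, and the model deliberately ignores network latency and the delay before the first token is produced, so treat the numbers as a lower bound.

```javascript
// Seconds needed to generate `tokens` tokens at `tokensPerSecond`.
function generationSeconds(tokens, tokensPerSecond) {
  if (tokensPerSecond <= 0) {
    throw new RangeError('tokensPerSecond must be positive');
  }
  return tokens / tokensPerSecond;
}

// How long the user sees nothing at all.
// Blocking: the whole reply must finish first.
// Streaming: only the first token must be produced.
function blankScreenSeconds({ tokens, tokensPerSecond, streaming }) {
  return streaming
    ? generationSeconds(1, tokensPerSecond)
    : generationSeconds(tokens, tokensPerSecond);
}

console.log(generationSeconds(400, 50)); // 8
```

The total generation time is the same either way. Streaming only changes how much of it the user spends looking at an empty screen.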

Protocol 1: REST
REST is synchronous request-response over HTTP. Client sends a request, server processes it completely, server returns a response.
For LLM integration:
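// REST implementation: simple, and correct for the right use case
import express from 'express';
import OpenAI from 'openai';

const app = express();
app.use(express.json()); // parse JSON request bodies
const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment

app.post('/api/classify', async (req, res) => {
  const { text } = req.body;
  try {
    const response = await openai.chat.completions.create({
      model: 'gpt-4o',
      messages: [
        { role: 'system', content: 'Classify this text. Return: positive, negative, or neutral.' },
        { role: 'user', content: text },
      ],
      max_tokens: 10, // a one-word label needs very few tokens
    });
    res.json({ sentiment: response.choices[0].message.content.trim() });
  } catch (err) {
    // Express 4 does not catch rejections from async handlers, so handle them here
    res.status(502).json({ error: 'Classification failed' });
  }
});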
REST is correct when: The LLM operation runs in the background, the user doesn't watch generation happen, or the response is short enough that generation completes in under 2 seconds. Sentiment analysis, document classification, feature extraction, email categorisation — all correct REST use cases.
REST is wrong when: Users are watching a text field waiting for words to appear. At 8 seconds of blank screen, users assume failure and retry.
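The retry problem described above can be softened on the client even when you stay with REST. The following is a minimal sketch of a "single flight" wrapper: while one request is pending, repeated calls reuse the same promise instead of sending a duplicate. `singleFlight` is a hypothetical helper, not part of any library, and this version ignores the arguments of the repeated call, which suits a repeated click on the same button but not independent requests.

```javascript
// Wrap an async function so that overlapping calls share one in-flight request.
function singleFlight(fn) {
  let inFlight = null; // the pending promise, or null when idle
  return function (...args) {
    if (inFlight) return inFlight; // a request is already running: reuse it
    inFlight = Promise.resolve()
      .then(() => fn(...args))
      .finally(() => {
        inFlight = null; // allow a fresh request once this one settles
      });
    return inFlight;
  };
}

// Usage sketch (the endpoint matches the REST example; not executed here):
// const classify = singleFlight((text) =>
//   fetch('/api/classify', {
//     method: 'POST',
//     headers: { 'Content-Type': 'application/json' },
//     body: JSON.stringify({ text }),
//   }).then((r) => r.json())
// );
```

This caps the cost of impatient clicking but does not fix the underlying blank-screen wait.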
Protocol 2: Server-Sent Events (SSE Streaming)
SSE is a unidirectional server-to-client push mechanism over standard HTTP. The server opens a persistent connection and pushes events to the client as they become available. The browser's native EventSource API handles reconnection automatically.
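On the wire, an SSE stream is plain text: each event is one or more `field: value` lines followed by a blank line. As an illustration, here is a small formatter for such frames. `formatSseEvent` is a hypothetical helper name; the `data`, `event`, and `id` fields are the standard ones.

```javascript
// Build one Server-Sent Event frame as a string.
function formatSseEvent(data, { event, id } = {}) {
  const lines = [];
  if (event) lines.push(`event: ${event}`);
  if (id !== undefined) lines.push(`id: ${id}`);
  // A payload containing newlines must be split across several data: lines
  for (const line of String(data).split('\n')) {
    lines.push(`data: ${line}`);
  }
  // The blank line (double newline) terminates the event
  return lines.join('\n') + '\n\n';
}

console.log(JSON.stringify(formatSseEvent(JSON.stringify({ token: 'Hi' }))));
// "data: {\"token\":\"Hi\"}\n\n"
```

Serialising each payload with `JSON.stringify` keeps it on a single line, which is why a one-line `data:` write is enough for token streaming.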
Backend implementation:
app.post('/api/chat/stream', async (req, res) => {
  const { messages } = req.body;
  // Required headers for SSE
  res.setHeader('Content-Type', 'text/event-stream');
  res.setHeader('Cache-Control', 'no-cache, no-transform');
  res.setHeader('X-Accel-Buffering', 'no'); // Nginx: disable proxy buffering
  res.setHeader('Connection', 'keep-alive');
  try {
    const stream = await openai.chat.completions.create({
      model: 'gpt-4o',
      messages,
      max_tokens: 1000,
      stream: true,
      stream_options: { include_usage: true }, // final chunk carries token usage
    });
    let completionTokens = null;
    for await (const chunk of stream) {
      // The usage-only final chunk has an empty choices array
      if (chunk.usage) completionTokens = chunk.usage.completion_tokens;
      const token = chunk.choices[0]?.delta?.content || '';
      if (token) {
        res.write(`data: ${JSON.stringify({ token })}\n\n`);
      }
    }
    // Report the real token count rather than a hard-coded number
    res.write(`data: ${JSON.stringify({ done: true, tokens: completionTokens })}\n\n`);
  } catch (err) {
    // Headers are already sent, so errors travel as a normal SSE event
    res.write(`data: ${JSON.stringify({ error: err.message })}\n\n`);
  } finally {
    res.end();
  }
});
Frontend consumer using fetch + ReadableStream (for POST requests):
async function streamLLMResponse(messages, onToken, onComplete) {
  const response = await fetch('/api/chat/stream', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ messages }),
  });
  // Fail fast on HTTP errors instead of parsing an error page as a stream
  if (!response.ok) throw new Error(`Stream request failed: ${response.status}`);
  const reader = response.body.getReader();
  const decoder = new TextDecoder();
  let buffer = '';
  while (true) {
    const { done, value } = await reader.read();
    if (done) break;
    buffer += decoder.decode(value, { stream: true });
    const lines = buffer.split('\n');
    buffer = lines.pop(); // Keep incomplete line in buffer
    for (const line of lines) {
      if (!line.startsWith('data: ')) continue; // skip blank separator lines
      const data = JSON.parse(line.slice(6));
      if (data.done) { onComplete(data); return; }
      if (data.error) throw new Error(data.error);
      if (data.token) onToken(data.token);
    }
  }
}
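The buffering step in the consumer matters because network chunks do not line up with event boundaries: a single `data:` line can arrive split across two reads. The sketch below isolates that logic in a hypothetical `createSseParser` helper so it can be exercised without a network. It assumes the same framing as the backend above, with one JSON payload per `data: ` line.

```javascript
// Incremental parser for `data: <json>` lines arriving in arbitrary chunks.
function createSseParser(onEvent) {
  let buffer = '';
  return function feed(textChunk) {
    buffer += textChunk;
    const lines = buffer.split('\n');
    buffer = lines.pop(); // the last piece is an incomplete line (or '')
    for (const line of lines) {
      if (!line.startsWith('data: ')) continue; // blank separators, other fields
      onEvent(JSON.parse(line.slice(6)));
    }
  };
}

// A payload split mid-JSON across two chunks is still parsed once, intact.
const events = [];
const feed = createSseParser((e) => events.push(e));
feed('data: {"tok');
feed('en":"Hel"}\n\ndata: {"token":"lo"}\n');
feed('\ndata: {"done":true}\n\n');
console.log(events.length); // 3
```

Parsing each chunk on its own, without the carry-over buffer, would throw a JSON syntax error on the first split payload.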
SSE is correct when: Users watch text appear in real time. Chat interfaces, code generation, long-form content creation. This is the right default for all interactive LLM features.
SSE limitations:
- Browser's EventSource only supports GET; use fetch() + ReadableStream for POST (as above)
- Nginx, AWS ALB, and Vercel Serverless Functions each have specific configuration requirements to not buffer the stream
Protocol 3: WebSockets
WebSockets provide a persistent bidirectional connection between client and server. Both sides can send messages at any time without the overhead of establishing a new HTTP connection.
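import { WebSocketServer } from 'ws';

// Assumes an `openai` client instance (new OpenAI()) is in scope
const wss = new WebSocketServer({ port: 8080 });
wss.on('connection', (ws) => {
  let conversationHistory = []; // per-connection state
  ws.on('message', async (data) => {
    try {
      const { content } = JSON.parse(data.toString());
      conversationHistory.push({ role: 'user', content });
      ws.send(JSON.stringify({ type: 'start' }));
      const stream = await openai.chat.completions.create({
        model: 'gpt-4o',
        messages: conversationHistory,
        stream: true,
      });
      let fullResponse = '';
      for await (const chunk of stream) {
        const token = chunk.choices[0]?.delta?.content || '';
        if (token) {
          fullResponse += token;
          ws.send(JSON.stringify({ type: 'token', token }));
        }
      }
      // Store the assistant turn so the next message has full context
      conversationHistory.push({ role: 'assistant', content: fullResponse });
      ws.send(JSON.stringify({ type: 'done' }));
    } catch (err) {
      // Malformed JSON or an upstream API failure: report it instead of crashing
      ws.send(JSON.stringify({ type: 'error', error: err.message }));
    }
  });
});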
WebSockets are correct when:
- Users need to interrupt or redirect generation mid-stream
- Real-time bidirectional events alongside LLM responses (audio, presence indicators, live collaboration)
- Multiple users share a conversation and need to see each other's messages and the AI response in real time
WebSockets are over-engineered for: Most LLM chat applications. If your use case is user sends message → AI responds, SSE handles that perfectly. WebSockets add connection state management, keepalive complexity, and infrastructure requirements (load balancers must support WebSocket upgrade) with no user experience benefit.
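Interruption is the case where the bidirectional channel earns its complexity, so here is a minimal sketch of it. The session handler accepts a `cancel` message while tokens are still flowing and stops the generation with an `AbortController`. Everything here is an assumption for illustration: the message types, `createInterruptibleSession`, and `fakeTokenStream`, which stands in for a real LLM stream. In a real server, `send` would wrap `ws.send` and the abort signal would also be handed to the LLM client so the upstream request is cancelled.

```javascript
// Hypothetical stand-in for an LLM stream: yields tokens until aborted.
async function* fakeTokenStream(tokens, signal) {
  for (const token of tokens) {
    if (signal.aborted) return;
    await new Promise((resolve) => setTimeout(resolve, 1)); // simulated latency
    yield token;
  }
}

// Returns a per-connection message handler.
// `send` delivers a message to the client; `startStream(content, signal)`
// returns an async iterable of tokens.
function createInterruptibleSession(send, startStream) {
  let current = null; // AbortController of the generation in progress

  return async function handleMessage(message) {
    if (message.type === 'cancel') {
      current?.abort();
      return;
    }
    current?.abort(); // a new prompt supersedes any running generation
    const controller = new AbortController();
    current = controller;
    send({ type: 'start' });
    for await (const token of startStream(message.content, controller.signal)) {
      if (controller.signal.aborted) break;
      send({ type: 'token', token });
    }
    send({ type: controller.signal.aborted ? 'cancelled' : 'done' });
    if (current === controller) current = null;
  };
}
```

With SSE the same effect needs a second HTTP request to a cancel endpoint plus server-side bookkeeping to find the right stream, which is the point at which a WebSocket starts to look simpler.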

The Comparison Matrix
| Dimension | REST | SSE Streaming | WebSockets |
|---|---|---|---|
| Direction | Client → Server → Client | Server → Client | Bidirectional |
| Complexity | Low | Medium | High |
| User experience (long LLM response) | 8s blank screen | First token at 300ms | First token at 300ms |
| Infrastructure requirements | Standard HTTP | proxy_buffering off in Nginx | WS-aware load balancer |
| State management | Stateless | Stateless | Connection state required |
| Reconnection | Client retries | Automatic with EventSource; manual with fetch() | Manual implementation |
| Right for: | Background processing, short responses | Interactive chat, live generation | Voice, collaboration, interruption |
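The matrix can be condensed into a few lines of code. This is a sketch of the decision rule as this article states it, with hypothetical parameter names, and the 2-second threshold is taken from the REST section above. It is a starting point for discussion, not a substitute for looking at your own feature.

```javascript
// Pick a protocol from three questions about the feature.
function chooseProtocol({
  userWatchesOutput,      // does someone watch the text appear?
  needsMidStreamMessages, // interruption, collaboration, audio, presence
  expectedSeconds,        // expected full generation time
}) {
  if (needsMidStreamMessages) return 'websocket';
  if (userWatchesOutput && expectedSeconds >= 2) return 'sse';
  return 'rest'; // background work, or replies short enough to feel instant
}

console.log(chooseProtocol({ userWatchesOutput: true, needsMidStreamMessages: false, expectedSeconds: 8 }));
// sse
```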
The Infrastructure Checklist
Regardless of which protocol you choose, confirm these before deploying:
For SSE (most common failure source):
- proxy_buffering off in Nginx for the streaming location
- AWS ALB idle timeout increased above max expected generation time
- Vercel: Edge Function (not Serverless Function) for streaming endpoint
- X-Accel-Buffering: no response header as defence-in-depth
For WebSockets:
- Load balancer configured to pass WebSocket upgrade headers (Upgrade: websocket)
- Sticky sessions if running multiple backend instances (WebSocket connections are stateful)
- Keepalive pings to prevent idle timeout disconnection
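For the keepalive item, a common pattern is a periodic sweep: ping every socket, and terminate any socket that did not answer the previous ping. The sketch below is written against the socket shape the `ws` package exposes (`ping()`, `terminate()`, and a `pong` event). `createHeartbeat` itself is a hypothetical helper, and the 30-second interval in the usage comment is an assumed value to tune against your load balancer's idle timeout.

```javascript
// Track liveness per socket and drop sockets that miss a ping.
function createHeartbeat() {
  const alive = new WeakMap(); // socket -> answered the last ping?
  return {
    track(ws) {
      alive.set(ws, true);
      ws.on('pong', () => alive.set(ws, true));
    },
    sweep(clients) {
      for (const ws of clients) {
        if (alive.get(ws) === false) {
          ws.terminate(); // no pong since the last sweep: treat as dead
          continue;
        }
        alive.set(ws, false); // must be flipped back by a pong
        ws.ping();
      }
    },
  };
}

// Usage with the `ws` package (not executed here):
// const heartbeat = createHeartbeat();
// wss.on('connection', (ws) => heartbeat.track(ws));
// const timer = setInterval(() => heartbeat.sweep(wss.clients), 30000);
// wss.on('close', () => clearInterval(timer));
```

Besides freeing dead connections, the ping traffic keeps intermediaries from closing a connection that is merely idle.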
For REST:
- Explicit max_tokens to prevent unbounded response sizes
- Request timeout configured (30s recommended)
- Retry logic with exponential backoff for 429 rate limit errors
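The retry item can be sketched as a small wrapper. This is a minimal version under stated assumptions: the thrown error exposes the HTTP status as `err.status`, only 429 is retried, and the defaults (3 retries, 500 ms base delay) are illustrative rather than recommended values. `withRetry` is a hypothetical helper, so check whether your client library already retries on its own before adding it.

```javascript
const defaultSleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Retry `fn` on HTTP 429, doubling the delay after each failed attempt.
async function withRetry(fn, { retries = 3, baseDelayMs = 500, sleep = defaultSleep } = {}) {
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn();
    } catch (err) {
      const rateLimited = err && err.status === 429;
      if (!rateLimited || attempt >= retries) throw err; // give up or not retryable
      await sleep(baseDelayMs * 2 ** attempt); // 500, 1000, 2000, ...
    }
  }
}

// Usage sketch (not executed here):
// const response = await withRetry(() => openai.chat.completions.create({ ... }));
```

In production it is common to add random jitter to each delay so that many clients hitting the limit together do not all retry at the same moment.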
The wrong protocol choice is a common source of AI feature complaints in production. The right choice takes 30 seconds to identify with the comparison matrix above, and it saves hours of debugging after deployment.





