The Proxy Buffer Nobody Configures Until Streaming Silently Breaks in Production

The team had spent six weeks building an LLM-powered customer support assistant. The core feature — the one that made it feel alive — was streaming. Tokens appeared on screen as they were generated. The interface felt responsive and instantaneous. Users in the demo loved it.

Staging confirmed it worked. The load test passed. The deployment went smoothly on a Wednesday afternoon.

By Thursday morning, the support channel had four tickets describing the same behaviour: the assistant would appear to think for several seconds, then dump the entire response at once. No streaming. Just a long pause followed by a wall of text.

The LLM API was streaming correctly. The application server was receiving and forwarding the stream. The browser was capable of rendering incremental updates.

The problem was in a layer none of the tickets mentioned and none of the engineers had configured: the proxy buffer.

Why Streaming and Buffering Are Architecturally Opposed

Buffering exists because network I/O is expensive and bursty. Accumulating data before forwarding reduces the number of write operations and increases throughput for large payloads. For traditional web applications — serving HTML pages, JSON responses, image files — buffering is almost always the right default.

Streaming exists because latency-to-first-token is the defining quality metric for LLM applications. A user reading generated text token by token experiences the response as immediate and fluid. The same response delivered in a single payload after a 4-second generation wait feels slow — even if the generation time was identical.

These two objectives are directly opposed at the proxy layer:

Buffering holds data until it has enough to forward efficiently
Streaming requires forwarding every token the moment it arrives

When a proxy with default buffer settings sits in front of an LLM streaming endpoint, it does exactly what it was designed to do: it holds the incoming stream in its buffer, waits for a meaningful chunk, and then forwards. The LLM is streaming. The proxy is buffering. The user experiences buffering.
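
The difference between the two policies can be sketched as a toy model. This is an illustration only, not how Nginx or any real proxy is implemented; the function names and the sample token list are made up for the example.

```javascript
// Toy model of the two forwarding policies (hypothetical names).

// Pass-through: every upstream chunk becomes one downstream write.
function forwardImmediately(chunks) {
  return chunks.map((chunk) => chunk);
}

// Buffered: hold chunks until `capacityBytes` is reached, then flush them
// as a single write. Whatever is left is flushed when the upstream closes.
function forwardWhenFull(chunks, capacityBytes) {
  const writes = [];
  let pending = '';
  for (const chunk of chunks) {
    pending += chunk;
    if (Buffer.byteLength(pending) >= capacityBytes) {
      writes.push(pending);
      pending = '';
    }
  }
  if (pending.length > 0) {
    writes.push(pending); // upstream closed: flush the remainder
  }
  return writes;
}

const tokens = ['The', ' proxy', ' is', ' buffering', '.'];
console.log(forwardImmediately(tokens).length);    // 5 writes, one per token
console.log(forwardWhenFull(tokens, 4096).length); // 1 write, after the stream ends
```

With a short response and a 4 KB capacity, the buffered policy never reaches its threshold, so the only write happens when the upstream finishes. That matches the "long pause, then a wall of text" symptom.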

The Nginx Default That Nobody Reads Until It Breaks Something

Nginx's proxy_buffering directive defaults to on. This is documented. It is not hidden. The decision to make it the default is architecturally sound for traditional use cases.

What's non-obvious is what "on" means in practice for a streaming LLM endpoint.

When proxy_buffering on is active, Nginx allocates a set of buffers for each proxied response. The defaults depend on the platform's memory page size; with a 4k page they are equivalent to:

proxy_buffer_size 4k;
proxy_buffers 8 4k;        # 8 buffers × 4k = 32k total
proxy_busy_buffers_size 8k;

For a streaming LLM response, here's what happens: tokens arrive at Nginx from the upstream server. Nginx writes them to its buffers rather than forwarding immediately. The buffer fills incrementally as tokens arrive. Nginx only forwards data downstream when either the buffer is full or the connection from the upstream closes.

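LLM tokens arrive at roughly 10–50 bytes each, so if the full 32k has to fill before anything is forwarded, the buffers can hold roughly 650–3,300 tokens. At 50 tokens per second, that is about 13–65 seconds of output held back. Most responses finish well before then, so in practice the data is released only when the upstream closes the connection. The user sees nothing, then a wall of text.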
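
The estimate can be reproduced with a few lines of arithmetic. This is a worst-case sketch that assumes nothing is forwarded until the whole buffer pool is full; the function name is hypothetical, and the figures come out slightly different from the rounded ones in the text because the sketch uses 32 × 1024 bytes.

```javascript
// Worst-case estimate: how much output is held back if nothing is forwarded
// until `bufferBytes` has accumulated (assumption, see lead-in).
function firstFlushDelaySeconds(bufferBytes, bytesPerToken, tokensPerSecond) {
  const tokensHeld = Math.floor(bufferBytes / bytesPerToken);
  return { tokensHeld, seconds: tokensHeld / tokensPerSecond };
}

const bufferBytes = 8 * 4 * 1024; // proxy_buffers 8 4k
const largeTokens = firstFlushDelaySeconds(bufferBytes, 50, 50); // 50-byte tokens
const smallTokens = firstFlushDelaySeconds(bufferBytes, 10, 50); // 10-byte tokens
console.log(largeTokens, smallTokens);
```

Changing the token size or generation rate shows how quickly the delay grows for small events, which is typical of SSE payloads.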

The Fix: Three Approaches

Approach 1: Disable proxy buffering for the streaming location

location /api/chat {
    proxy_pass http://backend;
    proxy_buffering off;
    proxy_cache off;
    
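    # Recommended for SSE: talk HTTP/1.1 to the upstream and clear the Connection header
    proxy_set_header Connection '';
    proxy_http_version 1.1;
    chunked_transfer_encoding on;    # already the default; stated explicitly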
}

proxy_buffering off tells Nginx to forward data to the client as it arrives from the upstream, without buffering. This is the correct configuration for any SSE endpoint.

Approach 2: Per-request override via response header

If you cannot modify your Nginx configuration (managed hosting, shared infrastructure), you can disable buffering per-response from your application server:

// In your Express route handler
res.setHeader('X-Accel-Buffering', 'no');

Nginx honours the X-Accel-Buffering: no response header and disables buffering for that specific response, even when proxy_buffering on is configured globally. This is the escape hatch when you don't control the Nginx config. It stops working only if the operator has listed X-Accel-Buffering in proxy_ignore_headers.
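
For context, here is a minimal sketch of a complete SSE handler that sets this header. It uses Node's built-in http module rather than Express so that it has no dependencies; the function names, the event shape, the fixed token list, and the 100 ms interval are all assumptions for illustration, not part of any real API.

```javascript
const http = require('node:http');

// Headers for an SSE response that Nginx should not buffer.
function sseHeaders() {
  return {
    'Content-Type': 'text/event-stream',
    'Cache-Control': 'no-cache',
    'X-Accel-Buffering': 'no',
  };
}

// Encode one token as an SSE event: "data: <json>\n\n".
function formatSseEvent(token) {
  return `data: ${JSON.stringify({ token })}\n\n`;
}

// Hypothetical handler: streams a fixed list of tokens, one every 100 ms.
function handleChat(req, res, tokens = ['Hello', ',', ' world']) {
  res.writeHead(200, sseHeaders());
  let i = 0;
  const timer = setInterval(() => {
    if (i < tokens.length) {
      res.write(formatSseEvent(tokens[i]));
      i += 1;
    } else {
      clearInterval(timer);
      res.end();
    }
  }, 100);
  // Stop writing if the client disconnects before the stream ends.
  res.on('close', () => clearInterval(timer));
}

// Not started here; call createServer().listen(3000) to try it locally.
function createServer() {
  return http.createServer(handleChat);
}
```

In an Express route the same three headers would be set with res.setHeader before the first res.write.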

Approach 3: Complete Nginx configuration for production LLM streaming

upstream backend {
    server 127.0.0.1:3000;
    keepalive 32;
}

server {
    listen 443 ssl http2;
    server_name api.example.com;
    # ssl_certificate and ssl_certificate_key are required here; omitted from this example

    location /api/chat {
        proxy_pass http://backend;
        
        # Disable all buffering for streaming
        proxy_buffering off;
        proxy_cache off;
        proxy_read_timeout 120s;    # Max gap between two reads from the upstream, not total response time
        proxy_connect_timeout 10s;
        proxy_send_timeout 120s;
        
        # SSE requirements
        proxy_set_header Connection '';
        proxy_http_version 1.1;
        
        # Standard proxy headers
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
    }

    location / {
        proxy_pass http://backend;
        # Standard buffering for non-streaming endpoints
        proxy_buffering on;
        proxy_buffer_size 4k;
        proxy_buffers 8 4k;
    }
}

Note the critical separation: buffering is disabled only for the streaming endpoint. All other routes retain the performance benefits of standard buffering.

Verifying Your Configuration Works

After deploying the Nginx configuration, verify streaming is working correctly:

curl -N -H "Accept: text/event-stream" -H "Content-Type: application/json" \
  -d '{"message": "Count to 10 slowly"}' https://api.example.com/api/chat

The -N flag disables curl's own buffering. If streaming is working correctly, you will see tokens appear incrementally in the terminal. If you see a pause followed by all tokens at once, the proxy is still buffering.
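
The same check can be automated by recording when each chunk arrives. The sketch below is a rough heuristic, not a definitive test: the function names and the 10% threshold are assumptions, chunk boundaries seen by the client do not map one-to-one to tokens, and the global fetch requires Node 18 or later.

```javascript
// Heuristic: a buffered stream delivers nearly all chunks in one burst, so the
// gap between the first and last chunk is small relative to the total time.
function looksBuffered(arrivalTimesMs, totalMs, burstFraction = 0.1) {
  if (arrivalTimesMs.length < 2) return true; // a single chunk: nothing arrived incrementally
  const spread = arrivalTimesMs[arrivalTimesMs.length - 1] - arrivalTimesMs[0];
  return spread < totalMs * burstFraction;
}

// Read a streaming response and record the arrival time of each chunk.
// The URL and request body are placeholders supplied by the caller.
async function measureStream(url, body) {
  const start = Date.now();
  const res = await fetch(url, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json', Accept: 'text/event-stream' },
    body: JSON.stringify(body),
  });
  const arrivals = [];
  const reader = res.body.getReader();
  for (;;) {
    const { done } = await reader.read();
    if (done) break;
    arrivals.push(Date.now() - start);
  }
  return { arrivals, totalMs: Date.now() - start };
}
```

A call such as measureStream('https://api.example.com/api/chat', { message: 'Count to 10 slowly' }) followed by looksBuffered(arrivals, totalMs) could run as a post-deploy smoke test.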

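You can also check the response headers. Do this against the application server directly (127.0.0.1:3000 in the configuration above), because Nginx consumes X-Accel-* headers and does not pass them on to the client by default. Use a real POST rather than curl -I, which sends a HEAD request that a chat route may not answer:

curl -s -o /dev/null -D - -H "Content-Type: application/json" \
  -d '{"message": "ping"}' http://127.0.0.1:3000/api/chat
# Should include X-Accel-Buffering: no
# Should include Content-Type: text/event-stream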

Key Takeaways

Issue	Cause	Fix
Stream works in dev, buffers in production	Nginx `proxy_buffering on` default	`proxy_buffering off` on streaming location
Stream works locally, times out at 60s in production	AWS ALB default idle timeout	Increase ALB idle timeout
Stream works on server, not on Vercel	Serverless Function buffering	Use Edge Function for streaming endpoint
Stream works on desktop, fails on mobile	CDN layer buffering	Bypass CDN for streaming endpoint or configure pass-through

The LLM is streaming correctly. The proxy is doing its job correctly. The conflict between those two correct behaviours is the configuration gap that nobody reads about until it breaks something in production.

Add proxy_buffering off to your streaming location block. Add the X-Accel-Buffering: no header from your application server as a backup. Test with curl -N before deploying. These three steps prevent the Thursday morning support ticket.