How to Connect Any LLM to Your Web App Without Breaking Everything Else

You've picked your model. You have an API key. You write the first fetch call, get a response back, and think: this is straightforward.

Three weeks later, your app is timing out under load, your users are seeing blank responses when the API rate-limits, your token costs are 4× what you projected, and the LLM is occasionally returning malformed JSON that crashes your parser and takes down the feature for everyone.

None of this is caused by bad code. It's caused by a set of integration decisions that look fine in development and only reveal their failure modes under real conditions — latency spikes, concurrent users, prompt edge cases, and the unpredictable behaviour of probabilistic systems embedded in deterministic applications.

The Mental Model Most Developers Get Wrong

Most developers treat LLM integration like a standard API call: send a request, receive a response, render it. This model works in development. It fails in production because it treats the LLM as a deterministic service with predictable latency, structured output, and reliable availability.

LLMs are none of those things.

They are probabilistic. The same prompt returns different outputs on different calls. Your application needs to handle this at the architecture level, with validation, retries, and fallbacks built in rather than patched on after the first incident.

They are slow. Responses from a model such as GPT-4o typically take several seconds (roughly 2–8 seconds, depending on output length), because tokens are generated one at a time. This cannot be fully optimised away; it is a fundamental characteristic of autoregressive generation. Your UI, timeout settings, and user experience must be designed around it.

They are expensive in ways that scale unexpectedly. Token costs seem trivial at low volume. At 10,000 users per day making three LLM calls each, small prompt inefficiencies compound into significant monthly bills.
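To make the compounding concrete, here is a back-of-the-envelope estimator. The token counts and per-token prices are placeholders chosen for easy arithmetic, not any provider's real rates; substitute the figures from your own logs and your provider's pricing page.

```javascript
// Rough monthly cost estimate. Every number passed in below is an assumption.
function estimateMonthlyCost({
  usersPerDay,
  callsPerUser,
  inputTokensPerCall,
  outputTokensPerCall,
  inputPricePerMillion,  // USD per 1M input tokens (hypothetical)
  outputPricePerMillion, // USD per 1M output tokens (hypothetical)
  daysPerMonth = 30,
}) {
  const callsPerMonth = usersPerDay * callsPerUser * daysPerMonth;
  const inputCost = (callsPerMonth * inputTokensPerCall / 1e6) * inputPricePerMillion;
  const outputCost = (callsPerMonth * outputTokensPerCall / 1e6) * outputPricePerMillion;
  return { callsPerMonth, inputCost, outputCost, total: inputCost + outputCost };
}

const baseline = {
  usersPerDay: 10000,
  callsPerUser: 3,
  inputTokensPerCall: 600,
  outputTokensPerCall: 300,
  inputPricePerMillion: 2,
  outputPricePerMillion: 8,
};

const lean = estimateMonthlyCost(baseline);
// Same traffic, but with 1,000 extra prompt tokens on every call.
const bloated = estimateMonthlyCost({ ...baseline, inputTokensPerCall: 1600 });

console.log(lean.total);    // 3240
console.log(bloated.total); // 5040
```

At these assumed prices, 1,000 unnecessary prompt tokens per call add 1,800 USD a month without changing anything the user sees.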

They return unstructured text by default. Even when you ask for JSON, you don't always get valid JSON. Your parser will break. Your feature will break with it unless you design for this explicitly.

The correct mental model: you're integrating a powerful but non-deterministic, high-latency, cost-bearing external service into a system that your users expect to be fast, reliable, and consistent.

Architecture Decision: Where Does the LLM Call Live?

Before writing integration code, decide where in your architecture the LLM call lives. This decision is harder to change later than almost any other architectural decision.

Option 1: Direct client-to-LLM calls

Never do this. Your API key is exposed to anyone who opens browser dev tools. You have no ability to rate-limit users, log usage, inject server-side context, implement caching, or control costs.

Option 2: Backend proxy (the correct baseline)

Your frontend calls your backend. Your backend calls the LLM API. This is the minimum viable production architecture — it keeps your API key server-side, allows rate limiting, enables logging, and lets your backend augment prompts with server-side context.
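A minimal, framework-agnostic sketch of what that proxy does on each request. The names createChatHandler, callLLM, and loadUserContext are hypothetical; callLLM stands in for whatever wrapper holds your API key, and the system prompt text is invented for illustration.

```javascript
// Sketch of a backend proxy handler. Injected dependencies (assumed shapes):
//   callLLM(messages)       -> Promise<string>  (the only code that sees the API key)
//   loadUserContext(userId) -> Promise<string>  (server-side data for the prompt)
function createChatHandler({ callLLM, loadUserContext, maxMessageChars = 4000 }) {
  return async function handleChat({ userId, message }) {
    // Validate on the server; never trust the browser's payload.
    if (!userId) {
      return { status: 401, body: { error: 'Not authenticated' } };
    }
    if (typeof message !== 'string' || message.trim() === '') {
      return { status: 400, body: { error: 'Message is required' } };
    }
    if (message.length > maxMessageChars) {
      return { status: 413, body: { error: 'Message too long' } };
    }

    // Augment the prompt with context the client never sees or controls.
    const context = await loadUserContext(userId);
    const messages = [
      { role: 'system', content: `You are a support assistant. ${context}` },
      { role: 'user', content: message },
    ];

    try {
      const reply = await callLLM(messages);
      return { status: 200, body: { reply } };
    } catch {
      // Do not leak provider error details to the client.
      return { status: 502, body: { error: 'The assistant is unavailable right now' } };
    }
  };
}
```

Returning a plain { status, body } object keeps the logic independent of the web framework: an Express or Fastify route only has to pass in the authenticated user ID and the request body, then write the result back.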

Option 3: Backend proxy with edge middleware

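For applications with global users where latency is critical, an edge middleware layer handles authentication and rate limiting at the network edge, while the origin backend makes the LLM call. Unauthenticated or over-limit requests are rejected close to the user without a round trip to the origin. The LLM call itself is no faster, so the gain is in everything around it.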

Option 4: Async queue-based architecture

For LLM operations that don't need to return results in real time — generating reports, processing uploaded documents, batch analysis — the LLM call goes into a job queue. The user gets immediate acknowledgment. Results are delivered when ready. Use this when LLM tasks exceed 10–15 seconds.
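The submit-then-poll shape can be sketched in a few lines. This version keeps jobs in memory purely for illustration; a real deployment would more likely use a durable queue and store so jobs survive restarts. createJobQueue and runTask are hypothetical names, and runTask stands in for the slow LLM call.

```javascript
// In-memory sketch of the submit-then-poll pattern.
// runTask(input) -> Promise<result> is the slow LLM operation (assumed).
function createJobQueue(runTask) {
  const jobs = new Map();
  let nextId = 1;

  function submit(input) {
    const id = String(nextId++);
    const job = { id, status: 'pending', result: null, error: null, done: null };
    jobs.set(id, job);
    // Start the work without awaiting it, so submit() returns immediately.
    job.done = Promise.resolve()
      .then(() => {
        job.status = 'running';
        return runTask(input);
      })
      .then(
        (result) => { job.status = 'done'; job.result = result; },
        (err) => { job.status = 'failed'; job.error = String((err && err.message) || err); },
      );
    return id;
  }

  // What a polling endpoint would return to the client.
  function getStatus(id) {
    const job = jobs.get(id);
    if (!job) return null;
    return { id: job.id, status: job.status, result: job.result, error: job.error };
  }

  // Resolves when the job has settled; useful for tests and graceful shutdown.
  function waitFor(id) {
    const job = jobs.get(id);
    return job ? job.done : Promise.resolve();
  }

  return { submit, getStatus, waitFor };
}
```

The HTTP layer then needs two routes: one that calls submit() and returns the job ID straight away, and one that returns getStatus(id) until the status is done or failed.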

The Three Failure Modes You Will Encounter

Failure Mode 1: Timeouts Under Load

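Many Node.js HTTP clients have no timeout by default, or one so long that it offers no protection (the OpenAI Node SDK's default is 10 minutes). An LLM request that takes 45 seconds holds a connection and consumes server resources. Under load, this multiplies.

import OpenAI from 'openai';

const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment

async function callLLMWithTimeout(messages, timeoutMs = 30000) {
  const controller = new AbortController();
  const timeout = setTimeout(() => controller.abort(), timeoutMs);

  try {
    return await openai.chat.completions.create({
      model: 'gpt-4o',
      messages,
      max_tokens: 1000, // cap output length so cost and latency stay bounded
    }, { signal: controller.signal });
  } catch (err) {
    // The error thrown on abort differs between SDK versions, so check the
    // signal itself instead of relying on err.name.
    if (controller.signal.aborted) {
      throw new Error(`LLM request timed out after ${timeoutMs} ms`);
    }
    throw err;
  } finally {
    clearTimeout(timeout);
  }
}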

Failure Mode 2: Malformed JSON Crashes the Parser

When you ask the LLM to return JSON, it sometimes returns JSON with explanatory text before it, incomplete JSON, or valid JSON wrapped in a markdown code block. Each of these crashes a naive JSON.parse().

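class ParseError extends Error {}

function extractAndParseJSON(rawResponse) {
  // Remove markdown code fence markers, keeping the content between them
  const cleaned = rawResponse.replace(/```json\n?/g, '').replace(/```\n?/g, '');

  // Take everything from the first "{" to the last "}". This skips any
  // surrounding prose, but only handles top-level objects, not arrays.
  const jsonMatch = cleaned.match(/\{[\s\S]*\}/);
  if (!jsonMatch) {
    throw new ParseError('No JSON object found in response');
  }

  try {
    return JSON.parse(jsonMatch[0]);
  } catch {
    // Truncated or otherwise invalid JSON ends up here
    throw new ParseError('Extracted content is not valid JSON');
  }
}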

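Better: if your provider supports it, use JSON mode, which constrains the model to emit syntactically valid JSON. It does not enforce a particular set of fields, and the output can still be cut off if it hits max_tokens. Schema-enforcing structured outputs, where available, go a step further.

const response = await openai.chat.completions.create({
  model: 'gpt-4o',
  messages, // with JSON mode, the messages must also instruct the model to produce JSON
  response_format: { type: 'json_object' }, // valid JSON syntax unless the output is truncated
});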
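Syntactically valid JSON can still have missing or mistyped fields, so it is worth checking the shape before the rest of your code touches it, and degrading to a safe default instead of throwing. A minimal sketch, where parseOrFallback and the sentiment example are hypothetical:

```javascript
// Parse model output, check its shape, and fall back instead of throwing.
// `isValid` and `fallback` are supplied by the caller.
function parseOrFallback(raw, isValid, fallback) {
  try {
    const value = JSON.parse(raw);
    if (isValid(value)) return { ok: true, value };
    return { ok: false, value: fallback, reason: 'unexpected shape' };
  } catch {
    return { ok: false, value: fallback, reason: 'invalid JSON' };
  }
}

// Example: an imagined sentiment feature that expects { label, score }.
const isSentiment = (v) =>
  v !== null &&
  typeof v === 'object' &&
  ['positive', 'negative', 'neutral'].includes(v.label) &&
  typeof v.score === 'number';

const neutralFallback = { label: 'neutral', score: 0 };

console.log(parseOrFallback('{"label":"positive","score":0.9}', isSentiment, neutralFallback).ok); // true
console.log(parseOrFallback('{"label":"great"}', isSentiment, neutralFallback).reason); // unexpected shape
console.log(parseOrFallback('{"label":"posi', isSentiment, neutralFallback).reason); // invalid JSON
```

Logging the reason field for every fallback tells you whether the prompt, the model, or the token limit is the weak point.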

Failure Mode 3: Rate Limit Errors Reach Users

LLM API providers rate-limit at the account level, not per-user. When your application hits the rate limit, every user experiences the error simultaneously. Most implementations don't handle this gracefully.

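const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// callLLM(messages) is your own wrapper around the provider SDK
async function callLLMWithRetry(messages, maxRetries = 3) {
  for (let attempt = 1; attempt <= maxRetries; attempt++) {
    try {
      return await callLLM(messages);
    } catch (err) {
      if (err.status === 429 && attempt < maxRetries) {
        // Exponential backoff before the next attempt: 1s, then 2s.
        // With maxRetries = 3 there are three attempts and two waits.
        const delay = Math.pow(2, attempt - 1) * 1000;
        await sleep(delay);
        continue;
      }
      throw err; // non-429 errors and the final 429 propagate to the caller
    }
  }
}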
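Because the provider's limit is shared across your whole account, a per-user limiter in front of the LLM call keeps one heavy user from exhausting it for everyone else. The sketch below is a simple in-memory token bucket; createUserRateLimiter and its parameters are hypothetical, and a multi-instance deployment would need shared state (for example in Redis) rather than a local Map.

```javascript
// In-memory token bucket per user.
// `now` is injectable so the refill logic can be tested without waiting.
function createUserRateLimiter({ capacity, refillPerSecond, now = Date.now }) {
  const buckets = new Map(); // userId -> { tokens, updatedAt }

  return function tryConsume(userId) {
    const t = now();
    const bucket = buckets.get(userId) || { tokens: capacity, updatedAt: t };

    // Refill in proportion to elapsed time, capped at capacity.
    const elapsedSeconds = (t - bucket.updatedAt) / 1000;
    bucket.tokens = Math.min(capacity, bucket.tokens + elapsedSeconds * refillPerSecond);
    bucket.updatedAt = t;

    const allowed = bucket.tokens >= 1;
    if (allowed) bucket.tokens -= 1;
    buckets.set(userId, bucket);
    return allowed;
  };
}

// Usage: allow bursts of 5 requests, refilling one request every 10 seconds.
const allowRequest = createUserRateLimiter({ capacity: 5, refillPerSecond: 0.1 });
// In a route handler: if (!allowRequest(userId)) respond with HTTP 429 and a friendly message.
```

Rejecting a single user's excess requests with a clear message is a much better failure than letting every user hit the provider's 429 at once.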

Cost Management Architecture

Token costs are the most common surprise in LLM production deployments. The costs that developers underestimate:

System prompt tokens: If you have a 500-token system prompt and each user sends 50 tokens, you're sending 550 input tokens per request, and about 91% of them (500 of 550) are your own instructions.

Conversation history tokens: Each turn of a multi-turn conversation includes all previous turns. A 10-turn conversation sends the full history with each message. The size of each request grows linearly with the number of turns, so the total tokens billed over the whole conversation grow quadratically.
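A short calculation shows the effect. It assumes, for simplicity, that every user message and every reply is the same length and that nothing is trimmed; the function name and the numbers are illustrative only.

```javascript
// Total input tokens billed over a conversation when every request resends
// the system prompt plus the full history so far.
function cumulativeInputTokens(turns, tokensPerMessage, systemTokens) {
  let total = 0;
  for (let n = 1; n <= turns; n++) {
    // Request n carries n user messages and the n - 1 earlier replies.
    const messagesInRequest = 2 * n - 1;
    total += systemTokens + messagesInRequest * tokensPerMessage;
  }
  return total;
}

console.log(cumulativeInputTokens(10, 100, 500)); // 15000
console.log(cumulativeInputTokens(20, 100, 500)); // 50000
```

Doubling the conversation from 10 to 20 turns doubles the system prompt share (5,000 to 10,000 tokens) but quadruples the history share (10,000 to 40,000 tokens).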

Implement token budgets:

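// Pass only user and assistant turns here, then prepend the system prompt
// afterwards so it is never trimmed away.
function trimConversationHistory(history, maxTokens = 6000) {
  let tokenCount = 0;
  const trimmed = [];

  // Work backwards from most recent
  for (let i = history.length - 1; i >= 0; i--) {
    // Rough heuristic: about 4 characters per token for English text.
    // Use your provider's tokenizer if you need exact counts.
    const estimatedTokens = Math.ceil(history[i].content.length / 4);

    // Stop at the first message that would exceed the budget
    if (tokenCount + estimatedTokens > maxTokens) break;

    trimmed.unshift(history[i]);
    tokenCount += estimatedTokens;
  }

  return trimmed;
}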

Cache identical requests:

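import { createHash } from 'node:crypto';

const hash = (text) => createHash('sha256').update(text).digest('hex');

// Assumes `redis` is an ioredis client and callLLM is your own wrapper
async function getCachedOrCallLLM(userId, systemPrompt, userMessage) {
  // Key on the system prompt as well, so changing the prompt never serves
  // stale answers. Keeping userId in the key stops one user's response
  // from being served to another.
  const cacheKey = `llm:${userId}:${hash(systemPrompt)}:${hash(userMessage)}`;

  const cached = await redis.get(cacheKey);
  if (cached) return JSON.parse(cached);

  const response = await callLLM([
    { role: 'system', content: systemPrompt },
    { role: 'user', content: userMessage },
  ]);

  // Expire after one hour
  await redis.setex(cacheKey, 3600, JSON.stringify(response));
  return response;
}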

The Production Integration Checklist

Before deploying any LLM integration to production:

LLM calls run through your backend server, never directly from the browser
All LLM routes have explicit request timeouts (30s recommended)
max_tokens is set explicitly on every call (never unbounded)
JSON responses use structured output mode or are parsed with malformed-response handling
Rate limit errors are caught and retried with exponential backoff
Per-user rate limiting is implemented
Conversation history is trimmed to a token budget before each call
Token usage is logged per request for cost attribution
A cost ceiling is enforced per user per day/month
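The last two items can share one small component. The sketch below tracks tokens per user per day in memory and refuses further calls once a ceiling is reached; createUsageTracker and its parameters are hypothetical, and a real system would persist the counts. The token counts would come from the usage data your provider returns with each response (for OpenAI chat completions, usage.prompt_tokens and usage.completion_tokens).

```javascript
// Tracks per-user daily token usage and enforces a ceiling.
// In-memory for illustration; counts are lost when the process restarts.
function createUsageTracker({
  dailyTokenLimit,
  today = () => new Date().toISOString().slice(0, 10), // UTC date, e.g. "2025-01-31"
}) {
  const usage = new Map(); // "userId:date" -> tokens used
  const keyFor = (userId) => `${userId}:${today()}`;

  return {
    // Check before making the LLM call.
    canSpend(userId) {
      return (usage.get(keyFor(userId)) || 0) < dailyTokenLimit;
    },
    // Call after each response, with the counts the provider reported.
    record(userId, { promptTokens, completionTokens }) {
      const key = keyFor(userId);
      const total = (usage.get(key) || 0) + promptTokens + completionTokens;
      usage.set(key, total);
      return total;
    },
    used(userId) {
      return usage.get(keyFor(userId)) || 0;
    },
  };
}
```

Because the check happens before the call and the recording after it, a user can overshoot the limit by at most one request, which is usually an acceptable trade for the simplicity.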

The LLM works. The integration around it is where production systems fail. Get these nine items right and you move from "demo quality" to "production quality."