You Secured the API. The LLM Route You Added Last Week Wasn't.

The core API took six months to harden. Rate limiting. JWT authentication with short expiry windows. Input validation middleware that rejects malformed payloads before they touch business logic. WAF rules tuned to your traffic patterns. An audit log that captures every authenticated request. A security review that two senior engineers signed off on.

Then the product team asked for an AI feature. The sprint was two weeks. The LLM endpoint went in on Wednesday.

Here's what that endpoint has: an API key passed in the request header. A call to the LLM provider's API. A response returned to the client.

Here's what it doesn't have: the rate limiting your other endpoints have. Input size constraints. Any logging that captures what was actually sent to the model. Cost controls that would catch an abuse pattern before it generates a four-figure bill in a weekend. Authorization logic that checks whether the calling user is permitted to use the AI feature. A timeout that prevents the request from holding a connection open for 45 seconds.

The existing API is secured. The LLM route is a different application — one that happens to share the same domain and the same authentication header format.

Why LLM Routes Inherit the Wrong Security Assumptions

Every backend engineer applies a mental model to new endpoints based on the endpoints they've already built. That model covers authentication, authorization, input validation, and rate limiting. It works for REST endpoints operating on structured data with bounded response sizes.

LLM routes violate four assumptions simultaneously.

Assumption 1: Input size is bounded by payload type. A JSON endpoint receives a structured payload whose fields are checked against a schema. An LLM endpoint receives freeform text, and in many implementations the size limit that applies to the JSON body as a whole has no equivalent for the text field inside it.

A user who sends a 50KB wall of text as their "message" field may be:

Attempting to exhaust your context window to observe where truncation occurs
Padding a prompt injection payload with noise to bypass content filters
Using your endpoint as a proxy to process documents at your cost
Running a cost-drain attack — knowing that large inputs to large models generate large bills

Assumption 2: Response size is predictable. Standard APIs return responses bounded by your data model. LLM responses are bounded by max_tokens — which in many sprint-built implementations was never explicitly set, defaulting to the provider's maximum.

Assumption 3: Requests are atomic and fast. Your WAF and rate limiter were tuned on the assumption that requests complete in milliseconds. An LLM request that takes 30–45 seconds holds a connection open for its full duration, so a request rate that looks harmless to those protections can still exhaust connections and worker capacity.

Assumption 4: The endpoint behaviour is deterministic. Every other endpoint returns a response derived from your code and your data. The LLM endpoint returns a response derived from a third-party model that can be manipulated by its inputs.

The Security Hardening Checklist

1. Input size limits

const LLM_LIMITS = {
  maxMessageLength: 4000,      // characters
  maxConversationHistory: 10,  // turns
  maxTotalContextTokens: 8000, // estimated tokens
};

function validateLLMInput(userMessage, conversationHistory = []) {
  // Reject non-strings first: an array or object would slip past the length check
  if (typeof userMessage !== 'string') {
    throw new ValidationError('Message must be a string');
  }

  if (userMessage.length > LLM_LIMITS.maxMessageLength) {
    throw new ValidationError('Message exceeds maximum length');
  }

  // Keep only the most recent N turns; slice() leaves the caller's array untouched
  const recentHistory = conversationHistory.slice(-LLM_LIMITS.maxConversationHistory);

  return { userMessage, conversationHistory: recentHistory };
}
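The limits above define maxTotalContextTokens, which a character-length check does not enforce on its own. One way to apply it is sketched below, assuming a rough heuristic of about four characters per token for English text. The names fitToTokenBudget and CHARS_PER_TOKEN are illustrative, not from any library, and a real tokenizer would give more accurate counts.

```javascript
// Assumption: ~4 characters per token is a rough heuristic for English text.
const CHARS_PER_TOKEN = 4;
const MAX_TOTAL_CONTEXT_TOKENS = 8000; // mirrors LLM_LIMITS.maxTotalContextTokens

function estimateTokens(text) {
  return Math.ceil(text.length / CHARS_PER_TOKEN);
}

// Hypothetical helper: keep the newest turns that fit in the budget
// alongside the new message, and drop older turns.
function fitToTokenBudget(userMessage, conversationHistory, budget = MAX_TOTAL_CONTEXT_TOKENS) {
  let remaining = budget - estimateTokens(userMessage);
  if (remaining < 0) {
    throw new RangeError('Message alone exceeds the context token budget');
  }

  const kept = [];
  // Walk backwards so the most recent turns are considered first.
  for (let i = conversationHistory.length - 1; i >= 0; i--) {
    const cost = estimateTokens(conversationHistory[i].content);
    if (cost > remaining) break;
    remaining -= cost;
    kept.unshift(conversationHistory[i]);
  }
  return kept;
}
```

Calling fitToTokenBudget after validateLLMInput would give both a turn limit and a token limit. The estimate errs in either direction depending on the text, so leaving headroom below the model's real context limit is the safer choice.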

2. Per-user rate limiting

Standard API rate limiting counts requests per IP or per API key. LLM rate limiting needs to count along several dimensions at once, all keyed to the user: requests per minute, requests per day, and tokens per day.

const rateLimiter = {
  // Per-user limits
  requestsPerMinute: 10,
  tokensPerDay: 50000,
  requestsPerDay: 100,
};

async function checkRateLimit(userId) {
  // Each key embeds its time bucket (UTC), so counters reset when the bucket rolls over
  const minuteKey = `rate:${userId}:${Math.floor(Date.now() / 60000)}`;
  const dayKey = `rate:${userId}:${new Date().toISOString().slice(0, 10)}`;

  // Check the minute window first so throttled requests do not use up the daily quota
  const minuteCount = await redis.incr(minuteKey);
  if (minuteCount === 1) await redis.expire(minuteKey, 60); // expiry only cleans up stale keys
  if (minuteCount > rateLimiter.requestsPerMinute) {
    throw new RateLimitError('Rate limit exceeded: try again in a minute');
  }

  const dayCount = await redis.incr(dayKey);
  if (dayCount === 1) await redis.expire(dayKey, 86400);
  if (dayCount > rateLimiter.requestsPerDay) {
    throw new RateLimitError('Daily limit reached');
  }
}
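The configuration above includes tokensPerDay, which request counting alone does not enforce. A minimal sketch of a per-user daily token budget follows. It uses an in-memory Map as a stand-in for Redis so that it runs on its own, and the function names assertTokenBudget and recordTokenUsage are illustrative assumptions.

```javascript
const TOKENS_PER_DAY = 50000; // mirrors rateLimiter.tokensPerDay

// Assumption: in-memory stand-in for a shared store such as Redis.
const tokenUsage = new Map();

function dayKey(userId, now = new Date()) {
  // UTC date bucket, e.g. tokens:u1:2024-01-01
  return `tokens:${userId}:${now.toISOString().slice(0, 10)}`;
}

// Call before the LLM request with estimated input tokens plus max_tokens.
function assertTokenBudget(userId, estimatedTokens, now = new Date()) {
  const used = tokenUsage.get(dayKey(userId, now)) ?? 0;
  if (used + estimatedTokens > TOKENS_PER_DAY) {
    throw new Error('Daily token budget exceeded');
  }
}

// Call after the response with the token count the provider reported.
function recordTokenUsage(userId, actualTokens, now = new Date()) {
  const key = dayKey(userId, now);
  const total = (tokenUsage.get(key) ?? 0) + actualTokens;
  tokenUsage.set(key, total);
  return total;
}
```

Checking against an estimate before the call and recording the actual usage afterwards keeps the budget honest even when the estimate is off. With several server instances, the Map would need to be replaced by a shared counter.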

3. Explicit max_tokens and timeouts

async function callLLMSafely(messages, options = {}) {
  const controller = new AbortController();
  const timeout = setTimeout(() => controller.abort(), 30000); // 30-second timeout
  
  try {
    const response = await openai.chat.completions.create({
      model: 'gpt-4o',
      messages,
      max_tokens: options.maxTokens || 1000,  // Explicit limit, never unbounded
      temperature: 0.7,
      stream: options.stream || false,
    }, {
      signal: controller.signal,
    });
    
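    // Note: with stream: true the call typically resolves once the stream opens,
    // so this timer does not bound the time spent reading the stream afterwards.
    return response;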
  } finally {
    clearTimeout(timeout);
  }
}
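For streamed responses, a timer that is cleared as soon as the call returns stops protecting the request while chunks are still arriving. The sketch below keeps the deadline armed until the stream has been fully read. collectWithDeadline and makeStream are hypothetical names: makeStream is assumed to accept an AbortSignal and return an async iterable of text chunks, which is a simplification of what a provider SDK returns.

```javascript
// Hypothetical helper: read a whole stream under one wall-clock deadline.
// makeStream(signal) is assumed to return (a promise of) an async iterable of strings.
async function collectWithDeadline(makeStream, deadlineMs) {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), deadlineMs);

  try {
    const stream = await makeStream(controller.signal);
    let text = '';
    for await (const chunk of stream) {
      // Stop as soon as a chunk arrives after the deadline has passed.
      if (controller.signal.aborted) {
        throw new Error('LLM stream exceeded deadline');
      }
      text += chunk;
    }
    return text;
  } finally {
    // Cleared only after the stream has ended or failed.
    clearTimeout(timer);
  }
}
```

The in-loop check only runs when a chunk arrives, so a stream that stalls completely is interrupted only if the underlying request honors the signal passed to makeStream. Forwarding that signal to the provider call is what lets the abort cancel the stalled request.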

4. Authorization scoping

Not all authenticated users should have equal access to LLM features. Authorization must be granular:

const LLM_FEATURE_PERMISSIONS = {
  basic: ['chat_support'],
  pro: ['chat_support', 'document_analysis', 'code_review'],
  enterprise: ['chat_support', 'document_analysis', 'code_review', 'bulk_processing'],
};

async function authorizeFeature(userId, feature) {
  const user = await getUser(userId);
  // Unknown users and unknown plans both fall through to an empty allow-list
  const allowedFeatures = LLM_FEATURE_PERMISSIONS[user?.plan] ?? [];

  if (!allowedFeatures.includes(feature)) {
    throw new AuthorizationError(`Feature ${feature} not available on ${user?.plan ?? 'unknown'} plan`);
  }
}

5. Structured logging for LLM routes

Standard access logs capture method, path, status code, and response time. LLM routes need additional dimensions:

async function logLLMRequest(context) {
  await logger.info('llm_request', {
    userId: context.userId,
    feature: context.feature,
    inputTokensEstimated: estimateTokens(context.messages),
    outputTokens: context.response?.usage?.completion_tokens,
    totalTokens: context.response?.usage?.total_tokens,
    latencyMs: context.latencyMs,
    modelUsed: context.modelId,
    // Log a hash of the prompt, not the full content (PII concerns)
    promptHash: hash(context.messages[context.messages.length - 1].content),
    sessionId: context.sessionId,
    requestId: context.requestId,
  });
}
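The hash helper used for promptHash is not shown above. One possible version is sketched here with Node's built-in crypto module, using a keyed hash (HMAC) rather than a plain digest so that short or common prompts cannot be recovered by hashing guesses. The name hashPrompt and the PROMPT_HASH_KEY environment variable are assumptions for illustration.

```javascript
const { createHmac } = require('node:crypto');

// Assumption: the key comes from a secret manager; the fallback is for local use only.
const PROMPT_HASH_KEY = process.env.PROMPT_HASH_KEY ?? 'dev-only-key';

// Hypothetical helper: stable, non-reversible identifier for a prompt.
function hashPrompt(content, key = PROMPT_HASH_KEY) {
  return createHmac('sha256', key)
    .update(String(content), 'utf8')
    .digest('hex');
}
```

Identical prompts produce identical hashes, which is enough to spot replayed or templated abuse in the logs without storing the prompt text itself.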

Token logging is essential for cost attribution and abuse detection. A user generating 10× the typical token volume is either power-using the feature legitimately or running an automated abuse pattern — and you cannot tell which without the data.
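Once per-user token totals are logged, the 10× comparison can be automated. The sketch below flags users whose daily token count exceeds a multiple of the median. The function name flagHeavyUsers and the choice of the median as the "typical" value are assumptions, and a flagged user still needs a human or a second signal to separate heavy legitimate use from abuse.

```javascript
function median(values) {
  const sorted = [...values].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2;
}

// Hypothetical helper. dailyTokensByUser: Map of userId -> tokens used today.
// Returns the ids of users above multiplier x the median usage.
function flagHeavyUsers(dailyTokensByUser, multiplier = 10) {
  if (dailyTokensByUser.size === 0) return [];
  const threshold = median([...dailyTokensByUser.values()]) * multiplier;
  return [...dailyTokensByUser]
    .filter(([, tokens]) => tokens > threshold)
    .map(([userId]) => userId);
}
```

The median is used instead of the mean so that a single abusive account does not raise the baseline it is being compared against.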

6. Cost controls and circuit breakers

const COST_LIMITS = {
  maxDailyCostPerUser: 0.50,    // $0.50 per user per day
  maxMonthlyCostPerUser: 10.00, // $10 per user per month
  alertThresholdPercent: 80,    // Alert when 80% of limit reached
};

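async function checkCostLimit(userId, estimatedCost) {
  const [dailySpend, monthlySpend] = await Promise.all([
    getCachedUserSpend(userId, 'day'),
    getCachedUserSpend(userId, 'month'), // assumes the helper also accepts 'month'
  ]);

  if (dailySpend + estimatedCost > COST_LIMITS.maxDailyCostPerUser) {
    throw new CostLimitError('Daily AI usage limit reached');
  }

  if (monthlySpend + estimatedCost > COST_LIMITS.maxMonthlyCostPerUser) {
    throw new CostLimitError('Monthly AI usage limit reached');
  }

  // Fires on every request past the threshold; deduplicate alerts in production
  if (dailySpend >= COST_LIMITS.maxDailyCostPerUser * COST_LIMITS.alertThresholdPercent / 100) {
    await alertOpsChannel(`User ${userId} approaching daily AI cost limit`);
  }
}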
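Per-user caps do not help when abuse is spread across many accounts, which is where a route-wide circuit breaker comes in. The sketch below trips when total spend inside a sliding window passes a threshold, then stays open for a cooldown period. The class name SpendCircuitBreaker, its options, and the dollar figures in the usage note are illustrative assumptions.

```javascript
// Hypothetical route-wide breaker based on total spend in a sliding window.
class SpendCircuitBreaker {
  constructor({ maxSpendPerWindow, windowMs, cooldownMs }) {
    this.maxSpendPerWindow = maxSpendPerWindow;
    this.windowMs = windowMs;
    this.cooldownMs = cooldownMs;
    this.entries = []; // { at, cost } for requests inside the window
    this.openUntil = 0;
  }

  // Record the cost of a completed request; trip the breaker if the window total is too high.
  record(cost, now = Date.now()) {
    this.entries.push({ at: now, cost });
    this.entries = this.entries.filter((e) => now - e.at < this.windowMs);
    const spend = this.entries.reduce((sum, e) => sum + e.cost, 0);
    if (spend > this.maxSpendPerWindow) {
      this.openUntil = now + this.cooldownMs;
    }
    return spend;
  }

  // While open, the route should reject new LLM calls instead of forwarding them.
  isOpen(now = Date.now()) {
    return now < this.openUntil;
  }
}
```

A route might construct this with, say, a $20 limit per 10-minute window and a 5-minute cooldown, check isOpen() before each call, and call record() with the actual cost afterwards. As with the other in-memory sketches, multiple server instances would need the state in a shared store.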

The Security Gap Comparison

Security Dimension	Core API	Typical Sprint LLM Route	Hardened LLM Route
Authentication	JWT + short expiry	API key check	JWT + feature authorization
Input validation	Schema validation	None	Length + content limits
Rate limiting	Requests/minute	None	Requests + tokens/user/day
Output size control	Response bounded	Unbounded (no max_tokens)	Explicit max_tokens
Timeout	10s	None (45s+ possible)	30s hard abort
Logging	Full audit trail	None or basic	Token-level cost attribution
Cost controls	N/A	None	Per-user daily/monthly caps
Prompt injection	N/A	None	Input filtering + output validation

The LLM route needs every one of these controls to reach the same security posture as the API surrounding it. Most of them can be implemented in a day. None of them requires architectural rework. All of them should be in place before the feature ships.

The sprint built the feature. The follow-on sprint needs to build the security layer around it. The cost of that sprint is small. The cost of the incident it prevents is not.