You Secured the API. The LLM Route You Added Last Week Wasn't.
The core API took six months to harden. Rate limiting. JWT authentication with short expiry windows. Input validation middleware that rejects malformed payloads before they touch business logic. WAF rules tuned to your traffic patterns. An audit log that captures every authenticated request. A security review that two senior engineers signed off on.
Then the product team asked for an AI feature. The sprint was two weeks. The LLM endpoint went in on Wednesday.
Here's what that endpoint has: an API key passed in the request header. A call to the LLM provider's API. A response returned to the client.
Here's what it doesn't have: the rate limiting your other endpoints have. Input size constraints. Any logging that captures what was actually sent to the model. Cost controls that would catch an abuse pattern before it generates a four-figure bill in a weekend. Authorization logic that checks whether the calling user is permitted to use the AI feature. A timeout that prevents the request from holding a connection open for 45 seconds.
The existing API is secured. The LLM route is a different application — one that happens to share the same domain and the same authentication header format.

Why LLM Routes Inherit the Wrong Security Assumptions
Every backend engineer applies a mental model to new endpoints based on the endpoints they've already built. That model covers authentication, authorization, input validation, and rate limiting. It works for REST endpoints operating on structured data with bounded response sizes.
LLM routes violate four assumptions simultaneously.
Assumption 1: Input size is bounded by payload type. A JSON endpoint receives a structured payload. LLM endpoints receive freeform text — and in many implementations, the size constraint that exists for the JSON payload has no equivalent for the content within it.
A user who sends a 50KB wall of text as their "message" field may be:
- Attempting to exhaust your context window to observe where truncation occurs
- Padding a prompt injection payload with noise to bypass content filters
- Using your endpoint as a proxy to process documents at your cost
- Running a cost-drain attack — knowing that large inputs to large models generate large bills
Assumption 2: Response size is predictable. Standard APIs return responses bounded by your data model. LLM responses are bounded by max_tokens — which in many sprint-built implementations was never explicitly set, defaulting to the provider's maximum.
Assumption 3: Requests are atomic and fast. Your WAF and rate limiter were tuned assuming requests complete in milliseconds. An LLM request that takes 30–45 seconds holds a connection open and ties up resources for far longer than those protections were designed to account for.
Assumption 4: The endpoint behaviour is deterministic. Every other endpoint returns a response derived from your code and your data. The LLM endpoint returns a response derived from a third-party model that can be manipulated by its inputs.
The Security Hardening Checklist
1. Input size limits
const LLM_LIMITS = {
  maxMessageLength: 4000,       // characters
  maxConversationHistory: 10,   // turns
  maxTotalContextTokens: 8000,  // estimated tokens (not enforced below; needs a token estimate)
};

// ValidationError is assumed to be the application's existing error class.
function validateLLMInput(userMessage, conversationHistory = []) {
  if (typeof userMessage !== 'string' || userMessage.trim().length === 0) {
    throw new ValidationError('Message must be a non-empty string');
  }
  if (userMessage.length > LLM_LIMITS.maxMessageLength) {
    throw new ValidationError('Message exceeds maximum length');
  }
  // Keep only the most recent N turns (a no-op when the history is already shorter)
  const recentHistory = conversationHistory.slice(-LLM_LIMITS.maxConversationHistory);
  return { userMessage, conversationHistory: recentHistory };
}
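The limits above define maxTotalContextTokens, but a character check and a turn count do not bound the total context on their own. A minimal sketch of enforcing a token budget follows, assuming a rough heuristic of four characters per token (an approximation for English text, not a real tokenizer). The names `estimateTokens` and `trimToTokenBudget` are hypothetical, and messages are assumed to be `{ role, content }` objects with string content.

```javascript
// Rough heuristic: about 4 characters per token for English text.
// This is an assumption, not a tokenizer; use the provider's tokenizer for accuracy.
const CHARS_PER_TOKEN = 4;

function estimateTokens(text) {
  return Math.ceil(text.length / CHARS_PER_TOKEN);
}

// Keep the newest turns that fit in the budget after reserving room
// for the current user message.
function trimToTokenBudget(userMessage, conversationHistory, maxTotalTokens) {
  let remaining = maxTotalTokens - estimateTokens(userMessage);
  if (remaining < 0) {
    throw new Error('Message alone exceeds the context token budget');
  }
  const kept = [];
  // Walk backwards so the most recent turns are kept first.
  for (let i = conversationHistory.length - 1; i >= 0; i--) {
    const cost = estimateTokens(conversationHistory[i].content);
    if (cost > remaining) break;
    remaining -= cost;
    kept.unshift(conversationHistory[i]);
  }
  return kept;
}
```

Trimming from the oldest end keeps the turns most relevant to the current message. The heuristic undercounts for code and non-English text, so a safety margin below the model's real context limit is a reasonable precaution.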
2. Per-user rate limiting
Standard API rate limiting counts requests per IP or per API key. LLM rate limiting needs to count across several dimensions:
const rateLimiter = {
  // Per-user limits
  requestsPerMinute: 10,
  tokensPerDay: 50000,   // not enforced by checkRateLimit; needs token counts from responses
  requestsPerDay: 100,
};

// `redis` is assumed to be an already-connected client exposing incr and expire.
async function checkRateLimit(userId) {
  const minuteKey = `rate:${userId}:${Math.floor(Date.now() / 60000)}`;
  const dayKey = `rate:${userId}:${new Date().toISOString().slice(0, 10)}`;
  const [minuteCount, dayCount] = await Promise.all([
    redis.incr(minuteKey),
    redis.incr(dayKey),
  ]);
  // Set expiry on first increment. INCR and EXPIRE are separate commands, so a
  // crash between them leaves a key with no TTL; a MULTI transaction or a Lua
  // script closes that gap.
  if (minuteCount === 1) await redis.expire(minuteKey, 60);
  if (dayCount === 1) await redis.expire(dayKey, 86400);
  if (minuteCount > rateLimiter.requestsPerMinute) {
    throw new RateLimitError('Rate limit exceeded — try again in a minute');
  }
  if (dayCount > rateLimiter.requestsPerDay) {
    throw new RateLimitError('Daily limit reached');
  }
}
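The request counters above do not cover the tokensPerDay dimension, because token counts are only known once the provider responds. One way to track it is a check before the call and a record step after it. The sketch below uses an in-memory Map as a stand-in for a shared store such as Redis; the function names are hypothetical and the 50,000 figure mirrors rateLimiter.tokensPerDay.

```javascript
// In-memory stand-in for a shared store such as Redis; for illustration only.
const tokenUsage = new Map();

const TOKENS_PER_DAY = 50000; // mirrors rateLimiter.tokensPerDay

function tokenDayKey(userId, now = new Date()) {
  return `tokens:${userId}:${now.toISOString().slice(0, 10)}`;
}

// Call before the LLM request. Throws if the daily budget is already spent,
// otherwise returns the number of tokens remaining.
function assertTokenBudget(userId, now = new Date()) {
  const used = tokenUsage.get(tokenDayKey(userId, now)) || 0;
  if (used >= TOKENS_PER_DAY) {
    throw new Error('Daily token budget reached');
  }
  return TOKENS_PER_DAY - used;
}

// Call after the LLM response with the provider-reported total token count.
function recordTokenUsage(userId, totalTokens, now = new Date()) {
  const key = tokenDayKey(userId, now);
  const used = (tokenUsage.get(key) || 0) + totalTokens;
  tokenUsage.set(key, used);
  return used;
}
```

Because usage is recorded after the response, a user can overshoot the budget by one request. Capping max_tokens to the remaining budget is one way to tighten that. The Map also never evicts old days, which a store with key expiry would handle.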
3. Explicit max_tokens and timeouts
async function callLLMSafely(messages, options = {}) {
  const controller = new AbortController();
  const timeout = setTimeout(() => controller.abort(), 30000); // 30-second timeout
  try {
    const response = await openai.chat.completions.create({
      model: 'gpt-4o',
      messages,
      max_tokens: options.maxTokens || 1000, // Explicit limit, never unbounded
      temperature: 0.7,
      stream: options.stream || false,
    }, {
      signal: controller.signal,
    });
    return response;
  } finally {
    clearTimeout(timeout);
  }
}
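An explicit max_tokens only bounds the response if the value itself is bounded. If options.maxTokens can originate from the client request, a caller could ask for far more than the default. A small clamp, sketched here with a hypothetical server-side ceiling, keeps the limit under the server's control. The name `resolveMaxTokens` and the ceiling of 2000 are assumptions for illustration.

```javascript
const DEFAULT_MAX_TOKENS = 1000;  // same default as callLLMSafely
const MAX_TOKENS_CEILING = 2000;  // hypothetical server-side ceiling

function resolveMaxTokens(requested) {
  // Anything that is not a positive integer (undefined, NaN, strings,
  // zero, negatives) falls back to the default.
  if (!Number.isInteger(requested) || requested <= 0) {
    return DEFAULT_MAX_TOKENS;
  }
  // Valid requests are honoured up to the ceiling and no further.
  return Math.min(requested, MAX_TOKENS_CEILING);
}
```

The result would be passed as `max_tokens` in place of `options.maxTokens || 1000`, so a client asking for 100,000 tokens gets 2,000.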
4. Authorization scoping
Not all authenticated users should have equal access to LLM features. Authorization must be granular:
const LLM_FEATURE_PERMISSIONS = {
  basic: ['chat_support'],
  pro: ['chat_support', 'document_analysis', 'code_review'],
  enterprise: ['chat_support', 'document_analysis', 'code_review', 'bulk_processing'],
};

async function authorizeFeature(userId, feature) {
  const user = await getUser(userId);
  // Deny by default: a missing user or an unknown plan gets no features
  const allowedFeatures = (user && LLM_FEATURE_PERMISSIONS[user.plan]) || [];
  if (!allowedFeatures.includes(feature)) {
    throw new AuthorizationError(`Feature ${feature} not available on ${user?.plan ?? 'unknown'} plan`);
  }
}

5. Structured logging for LLM routes
Standard access logs capture method, path, status code, and response time. LLM routes need additional dimensions:
async function logLLMRequest(context) {
  await logger.info('llm_request', {
    userId: context.userId,
    feature: context.feature,
    // Estimate is available before the call; the provider-reported count arrives with the response
    inputTokensEstimated: estimateTokens(context.messages),
    inputTokens: context.response?.usage?.prompt_tokens,
    outputTokens: context.response?.usage?.completion_tokens,
    totalTokens: context.response?.usage?.total_tokens,
    latencyMs: context.latencyMs,
    modelUsed: context.modelId,
    // Log a hash of the prompt, not the full content (PII concerns)
    promptHash: hash(context.messages[context.messages.length - 1].content),
    sessionId: context.sessionId,
    requestId: context.requestId,
  });
}
Token logging is essential for cost attribution and abuse detection. A user generating 10× the typical token volume is either power-using the feature legitimately or running an automated abuse pattern — and you cannot tell which without the data.
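As a rough illustration of how the logged token counts can surface that pattern, the sketch below flags users whose total for a period exceeds a multiple of the median user's total. The function names and the input shape (a plain object mapping userId to total tokens, aggregated from the totalTokens log field) are assumptions, and the 10× multiplier is taken from the paragraph above, not a tuned threshold.

```javascript
function median(values) {
  const sorted = [...values].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 === 0
    ? (sorted[mid - 1] + sorted[mid]) / 2
    : sorted[mid];
}

// tokensByUser: { [userId]: totalTokensInPeriod }
function findHeavyUsers(tokensByUser, multiplier = 10) {
  const totals = Object.values(tokensByUser);
  if (totals.length === 0) return [];
  // The median is used as "typical" so a few heavy users cannot drag the baseline up.
  const typical = median(totals);
  return Object.entries(tokensByUser)
    .filter(([, total]) => total > typical * multiplier)
    .map(([userId, total]) => ({ userId, total, typical }));
}
```

A flagged user is a candidate for review, not proof of abuse. The output only narrows down which accounts to look at.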
6. Cost controls and circuit breakers
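const COST_LIMITS = {
  maxDailyCostPerUser: 0.50,     // $0.50 per user per day
  maxMonthlyCostPerUser: 10.00,  // $10 per user per month
  alertThresholdPercent: 80,     // Alert when 80% of limit reached
};

// Assumes getCachedUserSpend accepts 'month' as well as 'day'.
async function checkCostLimit(userId, estimatedCost) {
  const [dailySpend, monthlySpend] = await Promise.all([
    getCachedUserSpend(userId, 'day'),
    getCachedUserSpend(userId, 'month'),
  ]);
  if (dailySpend + estimatedCost > COST_LIMITS.maxDailyCostPerUser) {
    throw new CostLimitError('Daily AI usage limit reached');
  }
  if (monthlySpend + estimatedCost > COST_LIMITS.maxMonthlyCostPerUser) {
    throw new CostLimitError('Monthly AI usage limit reached');
  }
  // Fires on every request past the threshold; deduplicate per user per day in practice
  if (dailySpend >= COST_LIMITS.maxDailyCostPerUser * COST_LIMITS.alertThresholdPercent / 100) {
    await alertOpsChannel(`User ${userId} approaching daily AI cost limit`);
  }
}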
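Two pieces are left implicit above: where estimatedCost comes from, and what a circuit breaker looks like next to the per-user caps. The sketch below shows one possible shape for each. The prices are hypothetical placeholders in USD per 1,000 tokens, and `estimateCost` and `SpendCircuitBreaker` are illustrative names, not part of any library.

```javascript
// Hypothetical prices in USD per 1,000 tokens; check the provider's current pricing.
const PRICE_PER_1K = { input: 0.0025, output: 0.01 };

function estimateCost(inputTokens, maxOutputTokens) {
  // Worst case: assume the model uses its full max_tokens allowance.
  return (inputTokens / 1000) * PRICE_PER_1K.input
       + (maxOutputTokens / 1000) * PRICE_PER_1K.output;
}

// Global circuit breaker: opens when total spend in the current fixed window
// reaches a ceiling, and closes again when the window rolls over.
class SpendCircuitBreaker {
  constructor(maxSpendPerWindow, windowMs) {
    this.maxSpendPerWindow = maxSpendPerWindow;
    this.windowMs = windowMs;
    this.windowStart = 0;
    this.spend = 0;
  }

  // Record actual spend after each completed LLM call.
  record(cost, now = Date.now()) {
    this.rollWindow(now);
    this.spend += cost;
  }

  // Check before each LLM call; true means the request should be rejected.
  isOpen(now = Date.now()) {
    this.rollWindow(now);
    return this.spend >= this.maxSpendPerWindow;
  }

  rollWindow(now) {
    if (now - this.windowStart >= this.windowMs) {
      this.windowStart = now;
      this.spend = 0;
    }
  }
}
```

The per-user caps bound what any single account can spend. The breaker bounds the total, which matters when abuse is spread across many accounts. This version keeps state in process memory, so a deployment with several instances would need the counter in a shared store.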
The Security Gap Comparison
| Security Dimension | Core API | Typical Sprint LLM Route | Hardened LLM Route |
|---|---|---|---|
| Authentication | JWT + short expiry | API key check | JWT + feature authorisation |
| Input validation | Schema validation | None | Length + content limits |
| Rate limiting | Requests/minute | None | Requests + tokens/user/day |
| Output size control | Response bounded | Unbounded (no max_tokens) | Explicit max_tokens |
| Timeout | 10s | None (45s+ possible) | 30s hard abort |
| Logging | Full audit trail | None or basic | Token-level cost attribution |
| Cost controls | N/A | None | Per-user daily/monthly caps |
| Prompt injection | N/A | None | Input filtering + output validation |
The LLM route needs every one of these controls to reach the same security posture as the API surrounding it. Most of them can be implemented in a day. None of them requires architectural rework. All of them should be in place before the feature ships.
The sprint built the feature. The follow-on sprint needs to build the security layer around it. The cost of that sprint is small. The cost of the incident it prevents is not.





