The LLM Response Your Cache Stored Is Now Leaking to the Wrong Session

A fintech team built LLM response caching to cut costs by 62%. By Friday morning, users were seeing each other's account summaries. No breach. No attack. Just a cache key that didn't include the user ID. Here's the architectural mistake and how to prevent it.

By Meritshot Team | 7 min read
Tags: Security, Caching, LLM, Backend Development, Privacy, Architecture

The engineering team had done everything right — or so it appeared.

They'd built a conversational AI assistant for a fintech platform. Users could query their account balances, transaction history, and personalised financial recommendations. The system used response caching to reduce API costs and improve latency. Repeated or semantically similar queries would return cached responses rather than hitting the LLM endpoint every time.

In staging, it was clean. Fast. Cost-efficient. The latency dropped by 62%. The demo looked great. The feature shipped on Thursday evening.

By Friday morning, three users had filed support tickets saying they were seeing account summaries that didn't belong to them.

The LLM hadn't been breached. The database hadn't been accessed without authorisation. The authentication layer was functioning correctly.

The cache had been serving personalised, user-specific LLM responses to users who hadn't generated them — because the cache key was built on the query content alone, without the user identity as part of the key namespace.

Why This Problem Is Different From Standard Cache Poisoning

Classic cache poisoning — where an attacker injects a malicious response into a shared cache — is a well-documented class of web vulnerability. Practitioners know it and build against it.

The LLM session leakage problem is different in a way that makes it harder to catch.

In standard cache poisoning, an attacker does something deliberate to poison the cache. There is an adversarial action. In LLM session leakage, no adversarial action is required. The leakage happens through normal, legitimate user behaviour — because two users ask semantically similar questions, and a cache layer that doesn't distinguish between them serves one user's personalised response to the other.

The trigger is not malice. The trigger is similarity.

Three properties make LLM responses particularly dangerous to cache naively:

  • They are personalised by design. The value proposition of most LLM applications is that responses are tailored to the user's context, history, and data. That personalisation is exactly what makes a leaked response a privacy violation.
  • They carry implicit PII even when not explicitly queried. A response about "your typical spending pattern" or "based on your income level" contains personal financial data without the user ever asking for their PII directly.
  • They appear authoritative and specific. A cached response from a different user's context doesn't look wrong — it looks like accurate, specific information. The receiving user has no reason to flag it as incorrect.

How the Cache Key Gets Built Wrong

The typical implementation path that leads to this misconfiguration looks like this:

A developer wants to add caching to their LLM endpoint. They identify the natural cache key as the user's query:

// The first, naive implementation
async function getCachedResponse(userMessage) {
  const cacheKey = `llm:${hash(userMessage)}`;
  
  const cached = await redis.get(cacheKey);
  if (cached) return JSON.parse(cached);
  
  const response = await callLLM(userMessage);
  await redis.setex(cacheKey, 3600, JSON.stringify(response));
  return response;
}

This works for identical queries, and it appears correct in testing, where a single tester repeats their own queries. The flaw only shows up when multiple users with different data contexts send the same query text.

Two users both ask "what's my current balance?". The hash is identical. The cache serves user A's balance summary to user B.
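
The snippets in this post call a hash() helper without defining it. As a minimal sketch, assuming Node's built-in crypto module and SHA-256 (both are illustrative choices, not something the original implementation is known to use), the collision can be reproduced directly. The naive key takes only the message as input, so two users' keys cannot differ:

```javascript
const crypto = require('node:crypto');

// Hypothetical helper: hex-encoded SHA-256 of a string. Any stable hash would do.
function hash(text) {
  return crypto.createHash('sha256').update(text, 'utf8').digest('hex');
}

// Same shape as the naive key above: derived from the message alone.
function naiveCacheKey(userMessage) {
  return `llm:${hash(userMessage)}`;
}

// Two different users, one question.
const keyForAlice = naiveCacheKey("what's my current balance?");
const keyForBob = naiveCacheKey("what's my current balance?");

console.log(keyForAlice === keyForBob); // true: both users share one cache slot
```

Nothing about the caller reaches the key, so whichever user asks first decides what every later user reads back until the entry expires.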

The Architecture Pattern That Prevents It

Rule 1: Always scope cache keys to the user

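The cache key must include a stable user identifier that other users cannot supply or spoof. In practice that means reading it from the authenticated session on the server, never from a value the client sends in the request: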

async function getCachedResponse(userId, userMessage) {
  // Scope the cache key to the user
  const cacheKey = `llm:user:${userId}:${hash(userMessage)}`;
  
  const cached = await redis.get(cacheKey);
  if (cached) return JSON.parse(cached);
  
  const response = await callLLM(userId, userMessage);
  await redis.setex(cacheKey, 3600, JSON.stringify(response));
  return response;
}

The user ID is the namespace. Two users asking identical questions get separate cache entries. Cache entries from one user's session can never be served to another user.

Rule 2: Include response personalisation factors in the cache key

If your LLM responses vary based on factors beyond the user message — conversation history, user preferences, feature flags, A/B test cohorts — those factors must be part of the cache key:

function buildCacheKey(userId, userMessage, contextFactors) {
  const keyComponents = {
    userId,
    messageHash: hash(userMessage),
    // Include any factors that affect response personalisation
    planTier: contextFactors.subscriptionTier,
    locale: contextFactors.locale,
    conversationId: contextFactors.conversationId,
  };
  
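  // Keep the user prefix so per-user invalidation patterns (llm:user:<id>:*) still match
  return `llm:user:${userId}:${hash(JSON.stringify(keyComponents))}`;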
}

Rule 3: Never cache responses that contain personalised data without user scoping

This seems obvious after an incident. It needs to be a pre-deployment checklist item.

Categorise your LLM responses:

Response Type                 | Can Cache Globally? | Cache Key Scope
Generic factual answers       | Yes                 | Query only
Product information           | Yes                 | Query + version
User-specific recommendations | No                  | User ID + query
Account summaries             | No                  | User ID + query + timestamp
Anything with PII             | No                  | User ID only (or no cache)
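
One way to keep this categorisation from living only in a document is to encode it as a lookup that the caching layer consults. The sketch below is an assumption-heavy illustration: the response-type labels, the CACHE_POLICY shape, and the cacheKeyFor helper are all hypothetical names, and it takes the "no cache" option for PII. The property worth copying is the default: a response type nobody has classified is not cached at all.

```javascript
// Hypothetical response-type labels mirroring the table above.
const CACHE_POLICY = {
  genericFactual: { cacheable: true, scope: 'global' },
  productInfo: { cacheable: true, scope: 'global' },
  userRecommendation: { cacheable: true, scope: 'user' },
  accountSummary: { cacheable: true, scope: 'user' },
  containsPII: { cacheable: false, scope: 'none' }, // assumption: choose "no cache"
};

const NO_CACHE = Object.freeze({ cacheable: false, scope: 'none' });

// Unknown or unclassified types fall back to NO_CACHE (default deny).
function policyFor(responseType) {
  const known = Object.prototype.hasOwnProperty.call(CACHE_POLICY, responseType);
  return known ? CACHE_POLICY[responseType] : NO_CACHE;
}

// Returns a cache key, or null when the response must not be cached.
function cacheKeyFor(responseType, userId, messageHash) {
  const policy = policyFor(responseType);
  if (!policy.cacheable) return null;
  if (policy.scope === 'global') return `llm:global:${messageHash}`;
  if (!userId) throw new Error('user-scoped response requires a userId');
  return `llm:user:${userId}:${messageHash}`;
}

console.log(cacheKeyFor('accountSummary', 'u1', 'abc')); // llm:user:u1:abc
console.log(cacheKeyFor('somethingNew', 'u1', 'abc'));   // null
```

Throwing when a user-scoped type arrives without a user ID is deliberate: a missing identifier should fail loudly rather than quietly produce a shared key.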

Semantic Caching Is Higher Risk

Semantic caching — where you cache based on embedding similarity rather than exact query hash — amplifies this risk significantly.

With hash-based caching, two queries must be byte-for-byte identical to hit the same cache entry. With semantic caching, queries are considered equivalent if their vector representations are within a similarity threshold. "What's my balance?" and "Show me my account balance" might hash differently but embed similarly.

This is useful for reducing API costs. It also widens the surface for cross-user leakage considerably: a lookup no longer needs the same text to hit another user's entry, only a nearby vector.

If you implement semantic caching:

async function getSemanticallyCachedResponse(userId, userMessage) {
  const queryEmbedding = await embed(userMessage);
  
  // CRITICAL: Search only within this user's cache namespace
  const similarCached = await vectorDB.search({
    vector: queryEmbedding,
    filter: { userId },  // This filter is mandatory, not optional
    topK: 1,
    threshold: 0.95,
  });
  
  if (similarCached.length > 0) {
    return similarCached[0].response;
  }
  
  const response = await callLLM(userId, userMessage);
  
  // Store with user scoping
  await vectorDB.upsert({
    id: `${userId}:${generateId()}`,
    vector: queryEmbedding,
    metadata: { userId, response },
  });
  
  return response;
}

The filter: { userId } on every vector search is not optional. Without it, semantic search returns results from any user's cached entries.
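
The effect of that filter can be checked without a vector database. The following is a minimal in-memory sketch, not a production store: InMemorySemanticCache and its methods are hypothetical names, the three-dimensional vectors are made-up stand-ins for real embeddings, and the 0.95 threshold is simply carried over from the example above.

```javascript
// Cosine similarity between two equal-length numeric vectors.
function cosineSimilarity(a, b) {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

class InMemorySemanticCache {
  constructor(threshold = 0.95) {
    this.threshold = threshold;
    this.entries = []; // each entry: { userId, vector, response }
  }

  store(userId, vector, response) {
    this.entries.push({ userId, vector, response });
  }

  // Returns the most similar response owned by userId, or null.
  lookup(userId, vector) {
    let best = null;
    let bestScore = this.threshold;
    for (const entry of this.entries) {
      if (entry.userId !== userId) continue; // ownership check comes before similarity
      const score = cosineSimilarity(entry.vector, vector);
      if (score >= bestScore) {
        best = entry;
        bestScore = score;
      }
    }
    return best ? best.response : null;
  }
}

const cache = new InMemorySemanticCache();
cache.store('alice', [0.9, 0.1, 0.4], "Alice's balance summary");

console.log(cache.lookup('bob', [0.9, 0.1, 0.4]));      // null, even for an identical vector
console.log(cache.lookup('alice', [0.88, 0.12, 0.41])); // Alice's balance summary
```

Bob's query vector is identical to the stored one and still misses, because ownership is checked before similarity is even computed. Removing the `continue` line is enough to reproduce the leak.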

Invalidation: The Second Failure Mode

Even a correctly scoped cache can serve data that belongs to the right user but is stale, and so is still wrong.

A user's account balance at 9 AM is not the correct answer to "what's my balance?" at 4 PM after three transactions.

Implement TTL-based invalidation appropriate to your data's volatility:

const CACHE_TTL_BY_QUERY_TYPE = {
  accountBalance: 60,        // 1 minute — balance changes with transactions
  transactionHistory: 300,   // 5 minutes
  spendingInsights: 3600,    // 1 hour — slower-moving analytics
  genericProductInfo: 86400, // 24 hours — mostly static
};
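
The earlier snippets hard-code a 3600-second TTL, so a table like this only helps once the write path reads from it. A small sketch of that lookup follows. It assumes a queryType label is produced somewhere upstream (the article does not say how queries are classified), and ttlFor and DEFAULT_TTL_SECONDS are hypothetical names. Unknown types fall back to the shortest TTL rather than the longest.

```javascript
const CACHE_TTL_BY_QUERY_TYPE = {
  accountBalance: 60,
  transactionHistory: 300,
  spendingInsights: 3600,
  genericProductInfo: 86400,
};

// Assumption: an unclassified query is treated as highly volatile.
const DEFAULT_TTL_SECONDS = 60;

function ttlFor(queryType) {
  const ttl = CACHE_TTL_BY_QUERY_TYPE[queryType];
  // Guards against missing entries and inherited properties such as "constructor".
  return Number.isInteger(ttl) && ttl > 0 ? ttl : DEFAULT_TTL_SECONDS;
}

console.log(ttlFor('spendingInsights')); // 3600
console.log(ttlFor('notClassifiedYet')); // 60
```

The cache write then becomes `await redis.setex(cacheKey, ttlFor(queryType), JSON.stringify(response));` in place of the fixed 3600.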

And implement explicit invalidation on data change events:

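async function onTransactionCompleted(userId) {
  // Invalidate every cached LLM response for this user.
  // Assumes the ioredis client: scanStream walks the keyspace incrementally with SCAN,
  // whereas KEYS blocks Redis while it checks every key in the database.
  const pattern = `llm:user:${userId}:*`;
  const stream = redis.scanStream({ match: pattern, count: 100 });
  for await (const keys of stream) {
    if (keys.length > 0) {
      await redis.del(...keys);
    }
  }
}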
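
Pattern-based deletion has to enumerate every key the user owns. An alternative worth considering, which the original write-up does not describe, is to put a per-user version number inside the key and bump it on each data change. Old entries are never looked up again and simply age out through their TTL. The sketch below uses a Map as a stand-in for Redis so it runs on its own; in a real deployment the counter would more likely be a Redis key incremented with INCR. All function names here are hypothetical.

```javascript
// Stand-in for a per-user counter stored in Redis.
const userCacheVersion = new Map();

function currentVersion(userId) {
  return userCacheVersion.get(userId) ?? 0;
}

// The version is part of the key, so bumping it orphans every older entry.
function versionedCacheKey(userId, messageHash) {
  return `llm:user:${userId}:v${currentVersion(userId)}:${messageHash}`;
}

// Called from the data-change handler: one write, no key scan.
function invalidateUser(userId) {
  userCacheVersion.set(userId, currentVersion(userId) + 1);
}

const before = versionedCacheKey('alice', 'abc123');
invalidateUser('alice');
const after = versionedCacheKey('alice', 'abc123');

console.log(before); // llm:user:alice:v0:abc123
console.log(after);  // llm:user:alice:v1:abc123
```

The trade-off is one extra read per lookup to fetch the version, plus orphaned entries that occupy memory until they expire, so this only makes sense alongside sensible TTLs.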

The Pre-Deployment Checklist

Before shipping any LLM caching implementation:

  • Verify cache keys include the user ID for any personalised response
  • Verify that semantic search, if used, filters by user ID before similarity matching
  • Test with two separate user accounts making identical queries — confirm they receive different responses
  • Set TTLs appropriate to data volatility (not just the longest TTL that maximises hit rate)
  • Implement cache invalidation triggers for data change events
  • Audit what PII is present in cached responses and whether TTL is appropriate
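
The two-account check in this list is cheap to automate. What follows is a rough sketch under stated assumptions: FakeRedis is a hypothetical in-memory stand-in that implements only get and setex and ignores the TTL, callLLM is a stub that returns a user-specific string, and the keys use the raw message instead of a hash to keep the example short.

```javascript
// Minimal in-memory stand-in for the two Redis calls used in this post.
class FakeRedis {
  constructor() {
    this.store = new Map();
  }
  async get(key) {
    return this.store.has(key) ? this.store.get(key) : null;
  }
  async setex(key, ttlSeconds, value) {
    this.store.set(key, value); // TTL is ignored in this fake
    return 'OK';
  }
}

// Stub LLM: the response depends on who is asking.
async function callLLM(userId, userMessage) {
  return { text: `Summary for ${userId}` };
}

// Same read-through logic as the post's getCachedResponse, with the key builder injected.
function makeCachedResponder(redis, buildKey) {
  return async function getCachedResponse(userId, userMessage) {
    const cacheKey = buildKey(userId, userMessage);
    const cached = await redis.get(cacheKey);
    if (cached) return JSON.parse(cached);
    const response = await callLLM(userId, userMessage);
    await redis.setex(cacheKey, 3600, JSON.stringify(response));
    return response;
  };
}

const scopedKey = (userId, message) => `llm:user:${userId}:${message}`;
const unscopedKey = (_userId, message) => `llm:${message}`;

// True when the second user receives the first user's response.
async function leaksAcrossUsers(buildKey) {
  const respond = makeCachedResponder(new FakeRedis(), buildKey);
  const question = "what's my current balance?";
  const first = await respond('alice', question);
  const second = await respond('bob', question);
  return first.text === second.text;
}

leaksAcrossUsers(unscopedKey).then((leaks) => console.log('unscoped leaks:', leaks)); // true
leaksAcrossUsers(scopedKey).then((leaks) => console.log('scoped leaks:', leaks));     // false
```

Run against the real key builder in CI, an assertion that leaksAcrossUsers returns false would have flagged the incident described at the top of this post before it shipped.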

Cache poisoning requires an attacker. LLM session leakage requires only two users asking similar questions. The misconfiguration is easy to introduce and, unlike traditional cache poisoning, it leaves no attacker signature, so it can go undetected until a user files a support ticket about someone else's data.
