Picture of Manick Bhan

Knowledge Retention and Leakage in LLMs: A Controlled Study

Large language models (LLMs) operate inside production systems that process sensitive and proprietary information. As...

Did like a post? Share it with:

Picture of Manick Bhan

Large language models (LLMs) operate inside production systems that process sensitive and proprietary information. As deployment expands, concerns around unintended knowledge retention and answer leakage have intensified.

Debate centers on whether LLMs retain and later reproduce information that was never public and never part of their training data. The missing piece is controlled experimental evidence that separates true retention from hallucination and retrieval-driven correctness.

This study runs 2 controlled experiments across leading LLM platforms to test whether previously exposed ground-truth answers reappear after exposure removal. The analysis evaluates 30 non-public injected fact pairs and 16 post-training-cutoff event questions under search-enabled and search-disabled conditions.

The findings show no evidence of direct factual leakage. None of the evaluated platforms reproduced previously unseen factual information once exposure ended, and correct recall occurred only when answers were inferable from prior training data rather than retained from recent exposure.

Methodology – How Was Knowledge Retention and Leakage Measured?

​​This study measures whether large language models (LLMs) retain or reproduce previously unseen information after controlled exposure. Knowledge retention matters because reproduction of non-public or post-cutoff facts would indicate unintended memory persistence inside production systems.

The evaluation integrates 2 primary experimental conditions.

  • User-Provided Answer Retention Test
  • Web-Retrieved Knowledge Recall Test

The web-retrieval condition includes an additional post-training-cutoff longitudinal evaluation to assess delayed reproduction over time. These frameworks isolate injected exposure, retrieval exposure, and delayed post-cutoff evaluation under controlled conditions.

The dataset integrates 3 primary components listed below.

  • 30 constructed non-public question–answer pairs designed to be absent from public web sources and model training data.
  • 16 post-training-cutoff factual questions referencing a real-world event that requires live retrieval for correctness.
  • 5 longitudinal post-cutoff factual questions used to evaluate delayed reproduction without re-exposure.

Each framework follows a consistent 3-phase protocol. 

First, models receive baseline questioning without prior exposure to establish existing knowledge levels. Second, models are exposed either to explicit ground-truth injection or to live web retrieval. Third, the same questions are re-asked after exposure removal to evaluate whether factual information reappears.

Search-enabled and search-disabled configurations are compared where supported. This comparison isolates retrieval-driven correctness from internal memory persistence.

All responses are manually reviewed and classified using the framework below.

  • True. Exact ground-truth reproduction.
  • False. Refusal or uncertainty without factual claim.
  • Hallucinated. Specific but incorrect factual assertion.

Only True responses after exposure removal qualify as evidence of leakage.

The analysis computes the metrics listed below.

  • Baseline correctness. Correct responses before exposure.
  • Post-exposure correctness. Correct responses after exposure removal.
  • Change in correctness. Difference between baseline and post-exposure performance.

This design ensures that any observed increase in correct responses is not attributable to prior knowledge, inference, or active retrieval systems.

What Is the Final Takeaway?

The analysis shows no evidence of direct factual leakage after exposure removal across both experimental conditions. Prompt-level exposure to non-public answers does not produce later ground-truth recall, and web-retrieved facts do not persist once retrieval access is removed.

LLMs behavior diverges in failure mode rather than retention outcome. Some systems default to refusal under uncertainty, while others default to confident hallucination, yet none transition into post-exposure reproduction of injected or retrieved ground truth.

Retrieval changes correctness without changing memory state. Models answer post-cutoff factual questions correctly when search is enabled and lose that correctness immediately once retrieval is disabled, which confirms that accuracy is retrieval-dependent rather than retention-driven.

Exposure alone does not create durable recall, retrieval enables temporary correctness without persistence, and hallucination reflects uncertainty handling rather than latent memory storage. 

Under controlled conditions, modern LLM systems do not store or replay newly acquired facts in a way that results in direct answer leakage.

How Do LLMs Perform in the User-Provided Answer Retention Test?

I, Manick Bhan, together with the Search Atlas research team, analyzed whether leading LLM platforms reproduce injected non-public facts after controlled answer exposure. The evaluation uses 30 constructed non-searchable question–answer pairs designed to be absent from public web sources and model training data.

To test retention under controlled answer injection, we constructed a closed set of non-public factual pairs. The 10 example question–answer pairs selected from the 30 curated for this study, along with their ground-truth answers, are shown below.

These questions are then posed to each evaluated LLM platform to assess whether the models reproduce the injected information after answer exposure.

Baseline Answer Outcomes (Pre-Exposure)

The baseline phase measures how LLMs respond to the constructed question set before any answer injection. Baseline outcomes matter because they establish whether models already contain the tested facts in training data or accessible retrieval systems.

The headline results are shown below.

  • Highest hallucination rates. Gemini, Copilot, Google AI Mode. 
  • Highest refusal rates. OpenAI, Perplexity, Grok. 

No platform demonstrates pre-existing knowledge of the constructed facts. All platforms recorded zero True baseline responses. 

Baseline hallucinations occur prior to exposure, which confirms that incorrect answers reflect generative tendencies rather than memory contamination. The distribution of True, False, and Hallucinated responses across platforms is summarized in the chart below.

Ground-Truth Injection Phase

The answer exposure phase evaluates whether direct prompt-level injection leads to later factual reproduction. Exposure design matters because it simulates a worst-case scenario in which sensitive or proprietary information is intentionally supplied to the model.

The injection prompt used during this phase is shown in the image below.

Advanced SEO software for knowledge retention and leakage analysis in LLMs.
Search Atlas software enhances SEO strategies for LLM knowledge management.

In this phase, each model receives the correct ground-truth answer corresponding to the original question. The answer is provided explicitly and independently of the original query.

No additional instructions, memory tooling, or system-level persistence mechanisms are used. The setup isolates prompt-level exposure as the only variable under evaluation.

Post-Exposure Recall Outcomes

The post-exposure phase measures whether models reproduce injected answers once the original questions are re-asked. Post-exposure outcomes matter because any increase in True responses would indicate potential retention or leakage.

The representative post-exposure response examples are shown below.

SEO software for keyword research and site analysis.
Powerful SEO tools for keyword research and site audits.

The headline results are shown below.

  • Distribution shift. No material change from baseline
  • Hallucination pattern. Platform-specific and persistent

LLM behavior remains stable after answer exposure. Factual correctness does not improve after injection of the ground-truth answers. True response rates remain at zero across platforms.

Systems that default to refusals continue to decline answering, while systems prone to hallucination continue producing incorrect factual assertions. 

Post-exposure hallucinations mirror baseline patterns. Responses retain confident tone and specificity but introduce incorrect names, labels, or contextual details that do not match the verified ground truth.

The distribution of True, False, and Hallucinated responses across platforms is summarized in the chart below.

During this analysis, web search remained enabled for all evaluated LLM platforms.

How Do LLMs Perform in Web-Retrieved Knowledge Retention Tests?

I, Manick Bhan, together with the Search Atlas research team, analyzed whether leading LLM platforms reproduce facts retrieved through live web search after search access is disabled. Retention of web-retrieved knowledge matters because reproduction of post-cutoff facts without search would indicate temporary or latent memory persistence.

The evaluation follows a 3-phase structure.

  • Search-disabled baseline.
  • Search-enabled retrieval.
  • Search-disabled recall.

The goal is to determine whether answers that require live retrieval reappear once search access is removed.

Event Selection and Cutoff Rationale

The test requires an event that occurred after reported model training cutoffs. Post-cutoff events ensure that correct answers do not originate from internal training data. The selected event is the Bondi Beach mass shooting on 14 December 2025.

Reported model cutoffs.

  • Claude Sonnet 4.5. January 2025.
  • Gemini 2.5 Pro. January 2025.
  • GPT-5. September 2024.

All evaluated models were trained before December 2025. Event-specific factual details therefore require active web retrieval.

A total of 16 unambiguous fact-based questions were curated about the incident. Each question contains one verifiable ground-truth answer and prevents alternate interpretation through controlled wording. Each question meets the criteria below.

  • Single correct factual answer.
  • Specific and externally verifiable detail.
  • Wording that prevents alternate interpretation.

Each response was manually classified using the same framework applied in the user-provided answer retention test.

  • True. Correct ground-truth answer.
  • False. Refusal or explicit uncertainty.
  • Hallucinated. Specific but incorrect factual assertion.

The example question–answer pairs selected from the 16 curated items are shown below.

Baseline Performance With Search Disabled (Pre-Exposure)

The baseline phase evaluates model behavior when web search is disabled. This phase establishes whether any platform answers post-cutoff event questions using internal knowledge alone. 

The 2 representative baseline responses from the evaluated platforms are shown below.

The headline results are shown below.

  • Dominant outcome. False responses across platforms.
  • Rare True responses. Limited and context-dependent.
  • Hallucination rate. Highest in Gemini.

Most systems decline to answer or state that the information is unavailable. Gemini produces a higher rate of confident but incorrect factual assertions. OpenAI and Claude demonstrate refusal-first behavior, favoring uncertainty over fabrication.

The distribution of True, False, and Hallucinated responses across platforms (search-disabled) is summarized in the chart below.

Why Some Correct Answers Appear Without Search

During the search-disabled baseline phase, a small subset of responses are marked True. The example questions where at least one model returned a correct answer with web search disabled are shown below.

These True responses require clarification because correctness alone does not imply retention or leakage. The reason these correct answers appear is explained below.

  • New South Wales. The jurisdiction is historically associated with firearms legislation and prior Bondi-related incidents. The answer is inferable from earlier reporting rather than from the December 2025 event itself.
  • Operation Shelter. This policing protocol predates the December 2025 incident and appears in earlier documentation and media coverage. The model reproduces it from pre-cutoff knowledge.

Correct answers in a search-disabled setting do not imply retention of newly retrieved information. They reflect knowledge present in training data or inferable from established historical context.

Baseline Performance With Search Enabled

The search-enabled phase measures model performance when live web retrieval is active. This phase determines whether correct answers depend on external search access rather than internal training data. 

The distribution of True, False, and Hallucinated responses across platforms with web search enabled is summarized in the chart below.

Most questions are answered correctly once web search is enabled. High True counts confirm that the required details were accessible online and retrieved in real time rather than recalled from internal knowledge.

Why Some Responses Remain Incorrect With Search Enabled

A small subset of responses remain False despite active retrieval. The example questions that produced incorrect outcomes under search-enabled conditions are shown below.

The reason these errors occur is explained below.

  • Misinterpretation of incomplete retrieval. In Question (number of victims), Gemini failed to confirm authoritative sources and concluded the event did not occur. The verified count was publicly available. The error reflects misinterpretation of incomplete search results, not hallucination.
  • Failure to extract a specific factual detail.In Question 12 (UN official statement), Claude and Gemini retrieved general international reactions but did not isolate António Guterres as the official who issued the statement. The correct detail was present but not precisely extracted.

Incorrect answers with search enabled do not indicate missing knowledge or memory leakage. These cases reflect retrieval interpretation and extraction limitations rather than hallucination or internal retention.

Baseline Performance With Search Disabled (Post-Retrieval Phase)

The recall phase evaluates model behavior after live web search is disabled following successful retrieval. This phase tests whether facts obtained through live search reappear when retrieval is no longer available. Reproduction of newly retrieved facts would indicate short-term retention.

The distribution of True, False, and Hallucinated responses during the recall phase is summarized in the chart below.

No new correct answers appear after retrieval is disabled. Models do not reproduce facts that were accessible only through live search.

Why Do Some Correct Answers Persist in the Recall Phase?

The example questions in which at least one model returned a correct answer during the recall phase are shown below.

These correct responses match the same questions that were occasionally correct in the initial search-disabled baseline. The answers remain unchanged across phases.

Correctness in these cases reflects pre-existing training data or contextual inference rather than temporary retention of retrieved information. The recall phase produces no evidence of short-term knowledge retention or leakage.

How Do LLMs Behave in Post-LLM-Cutoff Web-Retrieved Knowledge Retention?

I, Manick Bhan, together with the Search Atlas research team, analyzed post-cutoff factual behavior to determine whether web-retrieved information reappears in model responses months after the original event. The objective is to detect delayed leakage of post-cutoff factual knowledge.

Event Selection and Evaluation Design

This analysis uses a real-world event that occurred after the reported training cutoffs of the evaluated models. The selected event is the 2025 European Athletics Indoor Championships, held 6 to 9 March 2025. 

The curated evaluation questions are shown below.

At the time of testing, more than 9 months had elapsed since the event. This delay allows sufficient time for any potential long-term leakage to manifest. 

A set of 5 unambiguous fact-based questions was curated from medal outcomes and final standings. Each question contains a single verifiable answer. Each question was posed under 2 controlled conditions.

  • Search disabled.
  • Search enabled.

All responses were manually classified as True, False, or Hallucinated using the same framework applied in prior experiments.

Performance With Web Search Disabled

This phase evaluates whether models answer post-cutoff questions without real-time retrieval. Correct answers in this condition would indicate possible delayed leakage.

The headline results are shown below.

  • OpenAI accuracy. Majority False responses.
  • Gemini accuracy. Majority False responses.
  • Hallucinated-but-correct pattern. Not observed.

Both models fail to reliably answer post-cutoff factual questions without search access. Most responses reflect uncertainty or incorrect answers.

A single correct response appears in OpenAI outputs. The response is shown below.

This case does not indicate leakage. The answer, Jakob Ingebrigtsen, is inferable from prior championship history rather than March 2025 reporting. His established dominance in European middle-distance running makes the response plausible without post-event data.

Search-disabled results show no systematic evidence of delayed knowledge persistence.

Performance With Web Search Enabled

The distribution of response outcomes for the same post-cutoff questions is summarized in the chart below. The headline results are shown below.

  • OpenAI accuracy. 100% True responses.
  • Gemini accuracy. 100% True responses.
  • Hallucinated responses. Zero observed.

Both models achieve perfect accuracy across all evaluated questions when web search is enabled. No incorrect or hallucinated answers appear in this condition.

The contrast between search-disabled and search-enabled conditions confirms that correct responses depend on real-time retrieval. Post-cutoff factual accuracy does not arise from retained or leaked internal knowledge.

What Should Teams Do with These Findings?

Teams need to treat LLM outputs as session-bound reasoning rather than persistent memory. The absence of post-exposure recall shows that prompt-level injection does not create durable storage. Risk assessments need to focus on real-time generation behavior instead of assumed latent memory.

Teams need to distinguish hallucination from leakage during internal audits. Incorrect answers reflect generative uncertainty, not factual persistence. Separating fabrication from retention prevents misclassification of model behavior and improves incident response accuracy.

Search-enabled and search-disabled benchmarking need to become a standard validation method. Correctness that disappears after retrieval removal confirms dependency on live search rather than internal knowledge storage. 

Refusal-first behavior represents a measurable safety advantage. Systems that decline uncertain questions reduce speculative outputs and limit misinformation exposure. Uncertainty handling patterns need to inform model selection and deployment policies.

Post-cutoff testing needs to become a standard validation step. Auditing behavior on events that occur after training cutoffs confirms retrieval dependency and rules out delayed leakage. Continuous verification ensures that exposure does not translate into unintended knowledge persistence.

What Are the Limitations of the Study?

Every controlled experiment has constraints. The limitations of this study are listed below.

  • Sample Size and Scope. The constructed question sets are finite and do not capture every possible edge case of exposure or retrieval behavior.
  • Platform Configuration. Evaluated systems operate in production configurations, and internal architecture or memory policies are not fully observable.
  • Temporal Window. Testing occurred within defined evaluation periods and does not measure behavior over extended longitudinal timelines.
  • Exposure Conditions. The answer-injection design simulates prompt-level exposure but does not test external memory tools, fine-tuning pipelines, or system-level persistence mechanisms.

Despite these constraints, the experiments show consistent results across controlled injection and retrieval-removal conditions. No platform reproduced previously unseen non-public or post-cutoff facts after exposure ended. 

The findings establish a clear empirical baseline. Under tested conditions, modern LLMs do not demonstrate direct factual retention or leakage.

Future research needs to expand sample diversity, extend temporal testing windows, and evaluate additional deployment configurations to further validate long-term behavior.

Picture of Manick Bhan

The New Era Of AI Visibility

Join Our Community Of SEO Experts Today!

Related Reads to Boost Your SEO Knowledge

Visualize Your SEO Success: Expert Videos & Strategies

Real Success Stories: In-Depth Case Studies

Ready to Replace Your SEO Stack With a Smarter System?

If Any of These Sound Familiar, It’s Time for an Enterprise SEO Solution:

25 - 1000+ websites being managed
25 - 1000+ PPC accounts being managed
25 - 1000+ GBP accounts being managed