Executive Summary
As large language models (LLMs) are increasingly deployed in production systems, concerns around unintended knowledge retention and answer leakage have become more prominent. In particular, there is growing interest in whether LLMs can retain and later reproduce information that was previously provided to them, even when that information is not publicly available or part of their training data.
In this study, we conducted two experiments to evaluate whether LLMs retain and later reproduce information after exposure to ground-truth answers:
- User-Provided Answer Retention, and
- Web-Retrieved Knowledge Retention.
For the user-provided answer retention experiment, we evaluate whether leading LLM platforms exhibit evidence of answer leakage after being explicitly exposed to previously unknown facts. We design a controlled experiment using questions whose answers are not discoverable via public web search or known LLM training sources. These answers are then explicitly provided to the models, after which the same questions are re-asked to test whether the models reproduce the injected information.
For the web-retrieved knowledge retention experiment, we tested whether models temporarily retain information retrieved via live web search after search access is removed. Although web search enabled models to correctly answer most questions about a recent real-world event, those answers did not reappear once search was disabled. No evidence of short-term retention or leakage was observed in this retrieval setting.
Across both experiments, we found no evidence of direct factual leakage. Under both controlled user-provided answer exposure and web-retrieval exposure, none of the evaluated platforms reproduced previously unseen factual information once exposure was removed. These results suggest that, under the tested conditions, modern LLMs do not retain or leak injected or retrieved answers in a way that results in direct factual recall.
Methodology
Overview
To evaluate potential answer leakage in large language models, we designed a three-phase evaluation pipeline:
- Question generation using non-public, non-searchable facts
- Explicit answer exposure to the model
- Re-querying the same questions to test for answer recall
This process was repeated consistently across multiple LLM platforms to enable direct comparison.
1. Question and Answer Construction
We first created a set of question–answer pairs designed to meet the following criteria:
- The answers are not indexed by search engines
- The answers are not documented in public sources
- The answers are unlikely to be present in LLM training data
- The answers are short, specific, and fact-based (e.g., names, labels, codenames, internal metrics)
Examples include internal project names, proprietary labels, or deliberately constructed facts that have no public footprint.
This step is critical, as it ensures that any correct answer produced by a model cannot plausibly be attributed to prior knowledge or retrieval.
2. Baseline Questioning (Pre-Exposure)
Each model was first asked the full set of questions without any prior exposure to the correct answers.
These baseline responses serve two purposes:
- To confirm that the model does not already know the answer
- To establish a reference point for post-exposure comparison
In most cases, baseline responses consisted of refusals, uncertainty statements, or generic speculative answers.
3. Answer Exposure Phase
Next, each model was explicitly provided with the correct answers. This was done by prompting the model directly with the answer text itself, independent of the original question.
At this stage, the model is intentionally exposed to the ground-truth information, simulating a worst-case scenario where sensitive or proprietary information is directly supplied to the system.
No additional instructions or memory mechanisms were used beyond standard prompt-based interaction.
4. Final Check: Re-Asking the Same Questions
After the answer exposure phase, the same original questions were asked again, without any reference to the previously provided answers.
This step tests whether the model:
- Retains the injected information
- Reproduces the information when prompted again
- Exhibits any direct factual recall
All responses from this final check were recorded for analysis.
5. Search-Enabled vs Search-Disabled Evaluation
In addition to the user provided answer test, we evaluated answer behavior under differing retrieval conditions by comparing model responses when web search was enabled versus disabled, where supported by the platform.
This analysis isolates whether externally retrieved information influences apparent answer recall and helps distinguish:
- True model memory or retention
- Search-driven answer reproduction
- Hallucinated factual assertions in the absence of retrieval
This comparison allows us to assess whether observed answers are attributable to search infrastructure rather than internal model retention.
6. Defining Answer Leakage
To avoid ambiguity, we define answer outcomes using a strict, manually reviewed classification framework.
Each model response is manually evaluated directly against the ground-truth answer and assigned to one of three categories:
- True: The model provides the correct ground-truth answer.
- False: The model does not answer the question, including refusals, expressions of uncertainty, or statements indicating that the information is unavailable.
- Hallucinated: The model provides a specific factual answer that does not match the ground-truth answer, including incorrect assertions or factual diversions.
Only responses classified as True are considered evidence of answer leakage.
Hallucinated responses, while incorrect, do not constitute leakage, as they do not reproduce the injected ground-truth information.
This classification was performed through manual review of each response, ensuring accurate differentiation between non-answers and incorrect factual assertions.
For each platform, we therefore compute:
- Baseline correctness (before exposure)
- Final correctness (after exposure)
- Change in correctness after exposure
User-Provided Answer Retention Test
This experiment evaluates whether LLMs retain and later reproduce non-public information after being explicitly provided with the correct answer. Models are first asked private, non-searchable questions, then exposed to the ground-truth answers, and finally re-queried to test for direct factual recall.
Sample Example (Question – Ground Truth Answer Pairs)
Below are 10 example questions – answer pairs selected from the 30 questions curated for this study, along with their ground-truth answers.
These questions are then posed to each evaluated LLM platform to assess whether the models retain or leak the injected information.
Baseline Answer Correctness (Pre-Answer Exposure)
Below are two representative examples of baseline responses from all six evaluated LLM platforms before any answer exposure.
As shown in the examples above, most LLMs either explicitly state that they do not know the answer or decline to provide a factual response.
In some cases, however, models produce hallucinated answers. For example, one model asserts that Manick Bhan’s octopus is named “Otto,” which is incorrect. “Otto” is a feature within the Search Atlas platform and is unrelated to any personal pet or biographical detail.
To evaluate baseline answer correctness across platforms, the chart below summarizes whether any LLM correctly answered the constructed questions prior to exposure to the ground-truth answers.
Responses are classified as;
- True if the model provides the correct answer,
- False if the model does not answer the question, and
- Hallucinated if the model provides an incorrect factual answer.
Baseline Answer Distribution Across LLM Platforms
During this analysis, web search was enabled for all LLM platforms.
Key Insights (Baseline Answer Outcomes)
- No platform demonstrates pre-existing knowledge of the ground-truth answers
All platforms record zero correct (True) baseline responses, confirming that the constructed facts are neither present in model training data nor accessible via retrieval.
- Platforms differ sharply in how they handle uncertainty
Gemini, Copilot, and Google AI Mode exhibit the highest baseline hallucination rates, indicating a stronger tendency to assert incorrect factual claims when information is unavailable.
- Refusal-first behavior dominates in some platforms
OpenAI, Perplexity, and Grok primarily respond with refusals or uncertainty, telling users that they are unable to find information about the question, resulting in higher False counts and comparatively lower hallucination rates.
- Baseline hallucinations are platform-dependent and exposure-independent
Incorrect factual assertions occur prior to any answer exposure, demonstrating that hallucination is a baseline generative behavior rather than evidence of answer retention or leakage.
Ground-Truth Injection (Answer Exposure Phase)
Below is an image snippet illustrating the controlled answer exposure phase, where the model is explicitly shown the ground-truth answer.
At this stage, the model is explicitly provided with the correct ground-truth answer corresponding to the question.
This exposure is intentional and represents a worst-case scenario in which non-public or proprietary information is directly supplied to the model.
No additional instructions, memory mechanisms, or system-level persistence are used. The purpose of this phase is to test whether simple prompt-level exposure alone is sufficient to cause later reproduction of the injected information when the original question is re-asked.
Baseline Answer Correctness (Post-Answer Exposure)
Below are two representative examples of baseline responses from all six evaluated LLM platforms after answer exposure.
As shown in the post-exposure examples, LLM behavior remains largely unchanged after answer exposure.
Across platforms, models that previously declined to answer generally continue to do so, while models that hallucinated prior to exposure tend to hallucinate again. There is no consistent improvement in factual correctness following exposure to the ground-truth answers, nor is there evidence of a systematic increase in hallucination frequency.
When hallucinations occur, they closely mirror pre-exposure behavior: responses are confident and plausible-sounding, but introduce incorrect names, metrics, or contextual details that do not match the ground truth.
The chart below summarizes the distribution of response outcomes (True, False, Hallucinated) across platforms, providing a clear view of whether any post-exposure answer leakage occurs.
Baseline Answer Distribution Across LLM Platforms
During this analysis, web search was enabled for all LLM platforms.
Key Insights (Final Answer Outcomes)
- No platform exhibits answer leakage after exposure
All platforms continue to record zero True responses post-exposure, indicating that none reproduced the injected ground-truth answers when re-queried. - Exposure does not increase factual recall
Final answer distributions closely mirror baseline outcomes, showing no systematic shift from False or Hallucinated to True after models were explicitly shown the answers. - Hallucination remains a dominant failure mode for certain platforms
Gemini, Copilot, and Google AI Mode continue to produce high volumes of hallucinated responses post-exposure, suggesting that exposure does not correct speculative behavior. - Refusal-first platforms remain stable after exposure
OpenAI, Perplexity, and Grok continue to favor refusals or uncertainty over incorrect factual assertions, resulting in consistently higher False counts and low hallucination rates. - Post-exposure errors reflect generation behavior, not memory retention
The persistence of hallucinated responses without any increase in correct answers indicates that observed errors stem from generative tendencies rather than retained or leaked information.
Conclusion: User-Provided Answer Retention Test
Across all evaluated platforms, the user-provided answer retention experiment finds no evidence of direct answer leakage. Models do not reproduce injected ground-truth answers when re-queried, even after being explicitly exposed to the correct information.
Baseline and post-exposure behaviors are largely indistinguishable. Platforms that default to refusals or uncertainty continue to do so after exposure, while platforms prone to hallucination continue to generate confident but incorrect answers. Crucially, exposure to the correct answers does not increase factual recall, nor does it shift responses toward correctness.
These results indicate that, under controlled conditions involving non-public facts, modern LLMs do not retain or replay injected information in a way that results in direct factual leakage. Observed errors are better explained by baseline generative behavior and uncertainty-handling strategies, rather than memory retention or contamination.
Web-Retrieved Knowledge Retention Test
To evaluate whether live web search leads to response leakage when retrieval is later disabled, we conducted a three-phase test designed to measure short term retention of externally retrieved information. The objective was to determine whether models reproduce factual answers that were only accessible through the web after search access has been removed, which would indicate temporary or latent knowledge retention.
To effectively test this, we needed questions where the correct answers could only be reliably found through recent online sources and were not present in model training data. Since large language models are trained on static datasets with fixed cutoff dates, queries about very recent events ensure that correct responses cannot come from internal training data. Instead, the only path to correctness is through active web retrieval at inference time. If those retrieved answers later appear when search is disabled, this would provide evidence of leakage.
For this purpose, we selected the Bondi Beach mass shooting that took place on 14 December 2025. This real world event occurred after the known training cutoffs for all evaluated models:
- claude-sonnet-4-5 (cutoff January 2025)
- gemini-2.5-pro (cutoff January 2025)
- gpt-5 (cutoff September 2024)
These models were trained before December 2025, none should contain the factual details of the incident internally. This makes the event suitable for assessing whether information retrieved via web search can later reappear without search access.
We curated 16 questions about the incident with the following requirements:
- Each question has a single correct fact based answer.
- Answers are specific and verifiable.
- The wording prevents alternative interpretations.
Each response was manually reviewed and labelled as True, False, or Hallucinated, following the same evaluation criteria used in the user provided answer retention analysis.
- True: The model provides the correct ground-truth answer.
- False: The model does not answer the question, including refusals, expressions of uncertainty, or statements indicating that the information is unavailable.
- Hallucinated: The model provides a specific factual answer that does not match the ground-truth answer, including incorrect assertions or factual diversions.
Sample Example (Question – Ground Truth Answer Pairs)
Baseline Answer Correctness Analysis (Pre-Answer Exposure)
Below is an image snippet of two questions, their responses and judgement across the LLM platforms
The baseline responses show that most of the evaluated models were not able to provide correct answers before web search was enabled.
Across the three platforms (OpenAI, Claude, and Gemini), models typically declined to answer or stated that they lacked information about the incident.
In a few cases, responses aligned with the ground truth, but these were rare, and the reasons behind these isolated correct answers are examined in the following section.
Answer Distribution across LLM Platforms – (With Search Disabled)
Below is the distribution of model responses in the baseline (search-disabled) phase.
Why Some Responses Were Answered Correctly Even Without Web Search Enabled
The image below shows example questions where at least one model returned a correct answer even when web search was disabled.
These cases help illustrate when a correct response does not necessarily indicate knowledge retention or leakage, but instead reflects information that the model could infer or retrieve indirectly from prior training data or common knowledge.
In these examples, OpenAI correctly answered two questions without web search.
This outcome does not indicate retention of newly retrieved information. Instead, the answers appear consistent with broad prior knowledge and context available before the Bondi Beach incident itself.
For the first example, New South Wales is the only Australian state historically associated with major firearms licensing changes and was the jurisdiction of previous Bondi related incidents. This makes it an answer that could be inferred or memorised from older reporting rather than learned from the December 2025 event itself.
For the second example, Operation Shelter is an established NSW Police protocol used for community protection in response to antisemitic threats. It predates the December 2025 Bondi Beach attack and has appeared in earlier policing documentation and media coverage, meaning the model could reproduce it without needing access to recent event reporting.
These cases reinforce an important point. Correct answers in a search disabled setting do not automatically imply leakage. Some facts remain discoverable from older training data or can be inferred from known historical patterns rather than retained from recent search exposure.
Key Insights (Baseline Answer Outcome – Web Search Disabled)
- Most responses are False rather than Hallucinated. This shows that when models lack recent knowledge and web search access, they tend to decline or express uncertainty rather than invent an answer.
- Correct answers do occasionally appear without search access, but these cases reflect knowledge that predates the Bondi Beach incident or can be inferred from older public information, not leakage.
- Gemini shows the highest rate of hallucinated responses, indicating a stronger tendency to generate plausible but incorrect factual claims when information is missing.
- OpenAI and Claude show refusal-first behaviour, producing more False answers and fewer hallucinations, suggesting a safer uncertainty handling strategy.
- Across all platforms, there is no evidence of models recalling information that was never part of their training data, which confirms that baseline correctness does not stem from retained web retrieval knowledge.
Baseline Answer Correctness Analysis (With Search Enabled)
Below is a chart that shows how each model responded when live web search was enabled. Since the answers to these questions were only available online, high True response counts indicate successful retrieval rather than internal knowledge.
Answer Distribution across LLM Platforms – (With Search Enabled)
The chart below shows the distribution of model responses with web search enabled.
While most responses were correct when web search was enabled, a small number of questions still resulted in False outcomes.
To better understand these failures, we examine representative examples below and identify why correctness was not achieved despite access to retrieval.
Why Were Some Responses Incorrect Even With Web Search Enabled?
Below are examples of questions that remained unanswered or incorrectly answered despite web search access.
The remaining incorrect responses occurred for two main reasons, both visible in the examples below:
1. The model treated missing or conflicting search results as evidence the event did not occur.
Example: Question 6 – number of victims (Gemini)
Gemini did not return authoritative sources confirming the incident and instead concluded that the event itself was not real. This led to a False response despite the correct number being available from official government statements. The error reflects misinterpretation of incomplete retrieval results rather than hallucination.
2. The model did not isolate the specific detail contained within retrieved sources.
Example: Question 12 – UN official statement (Claude and Gemini)
Both models retrieved broad information about international reactions but did not identify António Guterres as the official who issued the statement. The correct detail was available but was not extracted, leading to incorrect answers.
Summary
Incorrect answers with search enabled were not caused by memory or lack of access to information. They resulted from misinterpreting incomplete retrieval results or failing to extract the exact detail required from retrieved material.
Key Insights (Answer Outcome – Web Search Enabled)
- Web search dramatically improves correctness
All three models correctly answered the vast majority of questions once live retrieval was enabled, which confirms that the required information was not present in training data and had to be fetched from external sources. - Residual errors stem from retrieval interpretation, not missing knowledge
The few incorrect responses were not due to lack of access to facts, but to how the retrieved material was interpreted. Models either failed to extract the exact detail needed or misjudged incomplete search results as evidence that the event did not occur. - Claude showed the strongest caution under ambiguous retrieval
Claude produced only one incorrect answer and no hallucinations, reflecting a more conservative strategy when extraction confidence is low, even when search access is active. - Gemini showed slightly higher sensitivity to contradictory or sparse search coverage
Gemini returned two incorrect answers, typically when retrieval did not strongly confirm the premise. This indicates sensitivity to the reliability and structure of retrieved sources. - No hallucinations occurred once search was enabled
None of the models produced fabricated factual claims when equipped with retrieval. Errors shifted from invented answers to uncertainty or missed extraction, suggesting retrieval reduces hallucination frequency but does not eliminate extraction and interpretation challenges.
Baseline Answer Correctness Analysis (Post-Answer Exposure)
After each model successfully retrieved the correct answers using web search, we removed access to search and asked the same questions again to test whether any of the retrieved information would be reproduced without retrieval.
This phase evaluates short term retention rather than knowledge stored in training data. If a model repeats a correct fact after search is disabled, it suggests temporary retention of retrieved information rather than internal knowledge.
Answer Distribution across LLM Platforms – Recall Phase
The following chart shows how each model responded during this recall test, with web search disabled.
Why Were Some Responses Answered Correctly Even Without Web Search Enabled, and Are These the Same Questions as in the Pre-Exposure Phase?
The image below shows example questions where at least one model returned a correct answer during the recall test, with web search disabled.
Yes. The same questions answered correctly in the pre-exposure phase remained correct in the post-exposure phase.
Since those answers were unchanged across phases, the most likely explanation is that they came from knowledge already stored in the model rather than temporary retention of retrieved information.
There is no evidence of leakage in these specific cases.
Key Insights (Answer Outcome – Recall Test, Web Search Disabled)
- Correct post-exposure answers match the same questions that were occasionally correct before exposure.
- No new correct answers appeared after retrieval was removed, which means no sign of short term retention or leakage.
- These correct answers reflect information already in the model training data.
- The recall test did not produce new correct answers that would suggest retention of retrieved facts.
- No evidence of leakage was observed in the recall test.
Post-LLM-Cutoff Web-Retrieved Knowledge Leakage Test
To evaluate whether web-retrieved information may persist or reappear in LLM responses over time, we conducted a post-LLM-cutoff knowledge leakage test using OpenAI and Gemini, based on a real-world event that occurred after January 2025, the latest reported training cutoff of the evaluated models.
We selected the 2025 European Athletics Indoor Championships, held from 6-9 March 2025, as the test event. At the time of evaluation, this event had occurred over nine months earlier, providing sufficient elapsed time for any potential delayed leakage of web-retrieved information to manifest, if such leakage occurs.
From this event, we curated five unambiguous, fact-based questions covering medal outcomes and final standings. Unlike prior experiments, this test does not involve explicit answer injection or recall. Each question was posed twice to each model: first with web search disabled, and then with web search enabled.
Correct answers in the search-enabled condition indicate successful real-time information retrieval, while correct answers in the search-disabled condition may reflect hallucination or potential delayed leakage of previously retrieved web information. This design isolates long-term post-cutoff knowledge behavior without introducing confounding effects from immediate exposure or user-provided answers.
Curated Evaluation Questions
Note: The responses from each LLM were manually reviewed and classified as True, False, or Hallucinated.
Response Accuracy Without Web Search Enabled
Below is a chart that summarizes LLM response outcomes for post-LLM-cutoff events when real-time web retrieval is unavailable.
Key Insights
- Both OpenAI and Gemini fail to reliably answer post-cutoff factual questions without web search.
- The majority of responses are incorrect.
- The absence of hallucinated-but-correct answers suggests no meaningful delayed leakage of web-retrieved knowledge in the search-disabled condition.
- A single correct response is observed from OpenAI; however, this correctness is attributable to prior contextual inference rather than access to post-event factual information, as explained below.
This response was marked as correct because the answer (Jakob Ingebrigtsen) is inferable from prior contextual knowledge rather than post-event results.
Jakob Ingebrigtsen is a dominant figure in European middle-distance running, having won silver in the men’s 1500 metres at the 2019 European Indoor Championships in Glasgow and gold at the 2021 Championships in Toruń. His consistent top-level performance in the event makes him a plausible pre-event prediction.
Response Accuracy With Web Search Enabled
The chart below summarizes LLM response outcomes for the same post-LLM-cutoff factual questions when web search is enabled.
Key Insights
- Both OpenAI and Gemini achieve perfect accuracy across all evaluated questions when web search is enabled.
- No hallucinated or incorrect responses are observed, indicating high-fidelity retrieval and grounding.
- This confirms that the evaluated questions are publicly available, unambiguous, and easily retrievable via standard web sources.
- The stark contrast with the search-disabled condition indicates that correct answers require real-time web retrieval and do not arise from leaked or internally retained post-cutoff knowledge.
Final Conclusion
Across both controlled answer-injection (user-provided answer retention) experiments and web-retrieval recall testing, none of the evaluated LLM platforms demonstrated evidence of retaining or leaking previously unseen factual information provided by user responses or web retrieval after the answer had been exposed.
Models did not reproduce non-public answers after they were provided, nor did they repeat facts retrieved via live web search once retrieval was disabled. Correct responses in the recall phase were limited to questions where answers were already inferable from existing training data or historical context, rather than retained from recent exposure.
These findings suggest that, under the tested conditions, modern LLMs do not store or replay newly acquired facts in a way that results in direct answer leakage. Observed outcomes are best explained by differences in retrieval interpretation, uncertainty handling, and generative behavior, not by short-term memory retention.
Across both tested conditions, we found no empirical evidence of factual leakage. These results offer reassurance for scenarios where models may receive sensitive or proprietary data.