Manick Bhan
Founder CEO/CTO

Knowledge Retention and Leakage in LLMs: A Controlled Study

Large language models (LLMs) operate inside production systems that process sensitive and proprietary information. As...

Did like a post? Share it with:

Manick Bhan
Founder CEO/CTO

Large language models (LLMs) operate inside production systems that process sensitive and proprietary information. As deployment expands, concerns around unintended knowledge retention and answer leakage have intensified.

Debate centers on whether LLMs retain and later reproduce information that was never public and never part of their training data. The missing piece is controlled experimental evidence that separates true retention from hallucination and retrieval-driven correctness.

This study runs 2 controlled experiments across leading LLM platforms to test whether previously exposed ground-truth answers reappear after exposure removal. The analysis evaluates 30 non-public injected fact pairs and 16 post-training-cutoff event questions under search-enabled and search-disabled conditions.

The findings show no evidence of direct factual leakage. None of the evaluated platforms reproduced previously unseen factual information once exposure ended, and correct recall occurred only when answers were inferable from prior training data rather than retained from recent exposure.

Methodology – How Was Knowledge Retention and Leakage Measured?

This study measures whether large language models (LLMs) retain or reproduce previously unseen information after controlled exposure. Knowledge retention matters because reproduction of non-public or post-cutoff facts would indicate unintended memory persistence inside production systems.

The evaluation integrates 2 primary experimental conditions.

User-Provided Answer Retention Test
Web-Retrieved Knowledge Recall Test

The web-retrieval condition includes an additional post-training-cutoff longitudinal evaluation to assess delayed reproduction over time. These frameworks isolate injected exposure, retrieval exposure, and delayed post-cutoff evaluation under controlled conditions.

The dataset integrates 3 primary components listed below.

30 constructed non-public question–answer pairs designed to be absent from public web sources and model training data.
16 post-training-cutoff factual questions referencing a real-world event that requires live retrieval for correctness.
5 longitudinal post-cutoff factual questions used to evaluate delayed reproduction without re-exposure.

Each framework follows a consistent 3-phase protocol.

First, models receive baseline questioning without prior exposure to establish existing knowledge levels. Second, models are exposed either to explicit ground-truth injection or to live web retrieval. Third, the same questions are re-asked after exposure removal to evaluate whether factual information reappears.

Search-enabled and search-disabled configurations are compared where supported. This comparison isolates retrieval-driven correctness from internal memory persistence.

All responses are manually reviewed and classified using the framework below.

True. Exact ground-truth reproduction.
False. Refusal or uncertainty without factual claim.
Hallucinated. Specific but incorrect factual assertion.

Only True responses after exposure removal qualify as evidence of leakage.

The analysis computes the metrics listed below.

Baseline correctness. Correct responses before exposure.
Post-exposure correctness. Correct responses after exposure removal.
Change in correctness. Difference between baseline and post-exposure performance.

This design ensures that any observed increase in correct responses is not attributable to prior knowledge, inference, or active retrieval systems.

What Is the Final Takeaway?

The analysis shows no evidence of direct factual leakage after exposure removal across both experimental conditions. Prompt-level exposure to non-public answers does not produce later ground-truth recall, and web-retrieved facts do not persist once retrieval access is removed.

LLMs behavior diverges in failure mode rather than retention outcome. Some systems default to refusal under uncertainty, while others default to confident hallucination, yet none transition into post-exposure reproduction of injected or retrieved ground truth.

Retrieval changes correctness without changing memory state. Models answer post-cutoff factual questions correctly when search is enabled and lose that correctness immediately once retrieval is disabled, which confirms that accuracy is retrieval-dependent rather than retention-driven.

Exposure alone does not create durable recall, retrieval enables temporary correctness without persistence, and hallucination reflects uncertainty handling rather than latent memory storage.

Under controlled conditions, modern LLM systems do not store or replay newly acquired facts in a way that results in direct answer leakage.

How Do LLMs Perform in the User-Provided Answer Retention Test?

I, Manick Bhan, together with the Search Atlas research team, analyzed whether leading LLM platforms reproduce injected non-public facts after controlled answer exposure. The evaluation uses 30 constructed non-searchable question–answer pairs designed to be absent from public web sources and model training data.

To test retention under controlled answer injection, we constructed a closed set of non-public factual pairs. The 10 example question–answer pairs selected from the 30 curated for this study, along with their ground-truth answers, are shown below.

These questions are then posed to each evaluated LLM platform to assess whether the models reproduce the injected information after answer exposure.

Baseline Answer Outcomes (Pre-Exposure)

The baseline phase measures how LLMs respond to the constructed question set before any answer injection. Baseline outcomes matter because they establish whether models already contain the tested facts in training data or accessible retrieval systems.

The headline results are shown below.

Highest hallucination rates. Gemini, Copilot, Google AI Mode.
Highest refusal rates. OpenAI, Perplexity, Grok.

No platform demonstrates pre-existing knowledge of the constructed facts. All platforms recorded zero True baseline responses.

Baseline hallucinations occur prior to exposure, which confirms that incorrect answers reflect generative tendencies rather than memory contamination. The distribution of True, False, and Hallucinated responses across platforms is summarized in the chart below.

Ground-Truth Injection Phase

The answer exposure phase evaluates whether direct prompt-level injection leads to later factual reproduction. Exposure design matters because it simulates a worst-case scenario in which sensitive or proprietary information is intentionally supplied to the model.

The injection prompt used during this phase is shown in the image below.

Advanced SEO software for knowledge retention and leakage analysis in LLMs. — Search Atlas software enhances SEO strategies for LLM knowledge management.

In this phase, each model receives the correct ground-truth answer corresponding to the original question. The answer is provided explicitly and independently of the original query.

No additional instructions, memory tooling, or system-level persistence mechanisms are used. The setup isolates prompt-level exposure as the only variable under evaluation.

Post-Exposure Recall Outcomes

The post-exposure phase measures whether models reproduce injected answers once the original questions are re-asked. Post-exposure outcomes matter because any increase in True responses would indicate potential retention or leakage.

The representative post-exposure response examples are shown below.

SEO software for keyword research and site analysis. — Powerful SEO tools for keyword research and site audits.

The headline results are shown below.

Distribution shift. No material change from baseline
Hallucination pattern. Platform-specific and persistent

LLM behavior remains stable after answer exposure. Factual correctness does not improve after injection of the ground-truth answers. True response rates remain at zero across platforms.

Systems that default to refusals continue to decline answering, while systems prone to hallucination continue producing incorrect factual assertions.

Post-exposure hallucinations mirror baseline patterns. Responses retain confident tone and specificity but introduce incorrect names, labels, or contextual details that do not match the verified ground truth.

The distribution of True, False, and Hallucinated responses across platforms is summarized in the chart below.

During this analysis, web search remained enabled for all evaluated LLM platforms.

How Do LLMs Perform in Web-Retrieved Knowledge Retention Tests?

I, Manick Bhan, together with the Search Atlas research team, analyzed whether leading LLM platforms reproduce facts retrieved through live web search after search access is disabled. Retention of web-retrieved knowledge matters because reproduction of post-cutoff facts without search would indicate temporary or latent memory persistence.

The evaluation follows a 3-phase structure.

Search-disabled baseline.
Search-enabled retrieval.
Search-disabled recall.

The goal is to determine whether answers that require live retrieval reappear once search access is removed.

Event Selection and Cutoff Rationale

The test requires an event that occurred after reported model training cutoffs. Post-cutoff events ensure that correct answers do not originate from internal training data. The selected event is the Bondi Beach mass shooting on 14 December 2025.

Reported model cutoffs.

Claude Sonnet 4.5. January 2025.
Gemini 2.5 Pro. January 2025.
GPT-5. September 2024.

All evaluated models were trained before December 2025. Event-specific factual details therefore require active web retrieval.

A total of 16 unambiguous fact-based questions were curated about the incident. Each question contains one verifiable ground-truth answer and prevents alternate interpretation through controlled wording. Each question meets the criteria below.

Single correct factual answer.
Specific and externally verifiable detail.
Wording that prevents alternate interpretation.

Each response was manually classified using the same framework applied in the user-provided answer retention test.

True. Correct ground-truth answer.
False. Refusal or explicit uncertainty.
Hallucinated. Specific but incorrect factual assertion.

The example question–answer pairs selected from the 16 curated items are shown below.

Baseline Performance With Search Disabled (Pre-Exposure)

The baseline phase evaluates model behavior when web search is disabled. This phase establishes whether any platform answers post-cutoff event questions using internal knowledge alone.

The 2 representative baseline responses from the evaluated platforms are shown below.

The headline results are shown below.

Dominant outcome. False responses across platforms.
Rare True responses. Limited and context-dependent.
Hallucination rate. Highest in Gemini.

Most systems decline to answer or state that the information is unavailable. Gemini produces a higher rate of confident but incorrect factual assertions. OpenAI and Claude demonstrate refusal-first behavior, favoring uncertainty over fabrication.

The distribution of True, False, and Hallucinated responses across platforms (search-disabled) is summarized in the chart below.

Why Some Correct Answers Appear Without Search

During the search-disabled baseline phase, a small subset of responses are marked True. The example questions where at least one model returned a correct answer with web search disabled are shown below.

These True responses require clarification because correctness alone does not imply retention or leakage. The reason these correct answers appear is explained below.

New South Wales. The jurisdiction is historically associated with firearms legislation and prior Bondi-related incidents. The answer is inferable from earlier reporting rather than from the December 2025 event itself.
Operation Shelter. This policing protocol predates the December 2025 incident and appears in earlier documentation and media coverage. The model reproduces it from pre-cutoff knowledge.

Correct answers in a search-disabled setting do not imply retention of newly retrieved information. They reflect knowledge present in training data or inferable from established historical context.

Baseline Performance With Search Enabled

The search-enabled phase measures model performance when live web retrieval is active. This phase determines whether correct answers depend on external search access rather than internal training data.

The distribution of True, False, and Hallucinated responses across platforms with web search enabled is summarized in the chart below.

Most questions are answered correctly once web search is enabled. High True counts confirm that the required details were accessible online and retrieved in real time rather than recalled from internal knowledge.

Why Some Responses Remain Incorrect With Search Enabled

A small subset of responses remain False despite active retrieval. The example questions that produced incorrect outcomes under search-enabled conditions are shown below.

The reason these errors occur is explained below.

Misinterpretation of incomplete retrieval. In Question (number of victims), Gemini failed to confirm authoritative sources and concluded the event did not occur. The verified count was publicly available. The error reflects misinterpretation of incomplete search results, not hallucination.
Failure to extract a specific factual detail.In Question 12 (UN official statement), Claude and Gemini retrieved general international reactions but did not isolate António Guterres as the official who issued the statement. The correct detail was present but not precisely extracted.

Incorrect answers with search enabled do not indicate missing knowledge or memory leakage. These cases reflect retrieval interpretation and extraction limitations rather than hallucination or internal retention.

Baseline Performance With Search Disabled (Post-Retrieval Phase)

The recall phase evaluates model behavior after live web search is disabled following successful retrieval. This phase tests whether facts obtained through live search reappear when retrieval is no longer available. Reproduction of newly retrieved facts would indicate short-term retention.

The distribution of True, False, and Hallucinated responses during the recall phase is summarized in the chart below.

No new correct answers appear after retrieval is disabled. Models do not reproduce facts that were accessible only through live search.

Why Do Some Correct Answers Persist in the Recall Phase?

The example questions in which at least one model returned a correct answer during the recall phase are shown below.

These correct responses match the same questions that were occasionally correct in the initial search-disabled baseline. The answers remain unchanged across phases.

Correctness in these cases reflects pre-existing training data or contextual inference rather than temporary retention of retrieved information. The recall phase produces no evidence of short-term knowledge retention or leakage.

How Do LLMs Behave in Post-LLM-Cutoff Web-Retrieved Knowledge Retention?

I, Manick Bhan, together with the Search Atlas research team, analyzed post-cutoff factual behavior to determine whether web-retrieved information reappears in model responses months after the original event. The objective is to detect delayed leakage of post-cutoff factual knowledge.

Event Selection and Evaluation Design

This analysis uses a real-world event that occurred after the reported training cutoffs of the evaluated models. The selected event is the 2025 European Athletics Indoor Championships, held 6 to 9 March 2025.

The curated evaluation questions are shown below.

At the time of testing, more than 9 months had elapsed since the event. This delay allows sufficient time for any potential long-term leakage to manifest.

A set of 5 unambiguous fact-based questions was curated from medal outcomes and final standings. Each question contains a single verifiable answer. Each question was posed under 2 controlled conditions.

Search disabled.
Search enabled.

All responses were manually classified as True, False, or Hallucinated using the same framework applied in prior experiments.

Performance With Web Search Disabled

This phase evaluates whether models answer post-cutoff questions without real-time retrieval. Correct answers in this condition would indicate possible delayed leakage.

The headline results are shown below.

OpenAI accuracy. Majority False responses.
Gemini accuracy. Majority False responses.
Hallucinated-but-correct pattern. Not observed.

Both models fail to reliably answer post-cutoff factual questions without search access. Most responses reflect uncertainty or incorrect answers.

A single correct response appears in OpenAI outputs. The response is shown below.

This case does not indicate leakage. The answer, Jakob Ingebrigtsen, is inferable from prior championship history rather than March 2025 reporting. His established dominance in European middle-distance running makes the response plausible without post-event data.

Search-disabled results show no systematic evidence of delayed knowledge persistence.

Performance With Web Search Enabled

The distribution of response outcomes for the same post-cutoff questions is summarized in the chart below. The headline results are shown below.

OpenAI accuracy. 100% True responses.
Gemini accuracy. 100% True responses.
Hallucinated responses. Zero observed.

Both models achieve perfect accuracy across all evaluated questions when web search is enabled. No incorrect or hallucinated answers appear in this condition.

The contrast between search-disabled and search-enabled conditions confirms that correct responses depend on real-time retrieval. Post-cutoff factual accuracy does not arise from retained or leaked internal knowledge.

What Should Teams Do with These Findings?

Teams need to treat LLM outputs as session-bound reasoning rather than persistent memory. The absence of post-exposure recall shows that prompt-level injection does not create durable storage. Risk assessments need to focus on real-time generation behavior instead of assumed latent memory.

Teams need to distinguish hallucination from leakage during internal audits. Incorrect answers reflect generative uncertainty, not factual persistence. Separating fabrication from retention prevents misclassification of model behavior and improves incident response accuracy.

Search-enabled and search-disabled benchmarking need to become a standard validation method. Correctness that disappears after retrieval removal confirms dependency on live search rather than internal knowledge storage.

Refusal-first behavior represents a measurable safety advantage. Systems that decline uncertain questions reduce speculative outputs and limit misinformation exposure. Uncertainty handling patterns need to inform model selection and deployment policies.

Post-cutoff testing needs to become a standard validation step. Auditing behavior on events that occur after training cutoffs confirms retrieval dependency and rules out delayed leakage. Continuous verification ensures that exposure does not translate into unintended knowledge persistence.

What Are the Limitations of the Study?

Every controlled experiment has constraints. The limitations of this study are listed below.

Sample Size and Scope. The constructed question sets are finite and do not capture every possible edge case of exposure or retrieval behavior.
Platform Configuration. Evaluated systems operate in production configurations, and internal architecture or memory policies are not fully observable.
Temporal Window. Testing occurred within defined evaluation periods and does not measure behavior over extended longitudinal timelines.
Exposure Conditions. The answer-injection design simulates prompt-level exposure but does not test external memory tools, fine-tuning pipelines, or system-level persistence mechanisms.

Despite these constraints, the experiments show consistent results across controlled injection and retrieval-removal conditions. No platform reproduced previously unseen non-public or post-cutoff facts after exposure ended.

The findings establish a clear empirical baseline. Under tested conditions, modern LLMs do not demonstrate direct factual retention or leakage.

Future research needs to expand sample diversity, extend temporal testing windows, and evaluate additional deployment configurations to further validate long-term behavior.

Manick Bhan
Founder CEO/CTO

Manick Bhan is a 3x INC 5000 Founder CEO/CTO of Search Atlas which is an AI SEO automation platform used by thousands of brands and agencies and awarded Best SEO Platform by the Global Search Awards, Shortlisted by Capterra, Front Runners by Software Advice, Category Leaders by GetApp, and best tool for customer satisfaction and usability by Gartner.

Manick Bhan founded LinkGraph, a digital marketing firm that helps enterprise brands and agencies scale through data-driven SEO with clients like Shutterfly and Samsung. LinkGraph is listed as one of the Fastest Growing Private Companies in the US by inc.5000, as one of the Best Workplaces in Advertising & Marketing by Fortune, as New York’s B2B Leaders by Clutch, won no.1 Spot in Nevada’s Top Workplaces, Best B2B SEO Campaign by The Drum Awards for Search, and named Best Start-Up Agency at U.S. Search Awards.

Manick Bhan is the owner for Signal Genesys, the leading platform for automated press release distribution and digital presence management, and LinkLaboratory, the largest online publisher catalog in the world.

With 10+ years of experience in SEO from the in-house and agency side, Manick Bhan has taught both startups and Fortune 500 companies how to scale their brands with a data-driven SEO strategy that can break into any market and outrank even the biggest of competitors. Bhan’s innovative approach to SEO has helped Search Atlas and LinkGraph scale to multiple 8 figures.

Manick's thought leadership has appeared in leading publications like Forbes, Search Engine Journal (SEJ), VentureBeat, G2, Digital Summit, Wordstream, Wix SEO Hub, Wordable, Inc. Masters, AllBusiness, SEO Blog, Jumpstory, Serpstat, Outbrain, Improvado, Unstack, Clickbank, Built in, Martechseries, Smartbrief, Marketingprofs, Readwrite, Honeybook, Content Marketing Institute, LocalIQ, CXL, Oncrawl, Venture Beat, Addicted2Success, Search Engine Watch, Business 2 Community, Digital Connect MAG, and VegasInc.

Manick Bhan is a speaker at events like TechCrunch Disrupt, Traffic & Conversion Summit, Ad World, HighLevel Summit, Chiang Mai SEO, Merchant Mastery, SEO Week, AI Bot Summit, SEO Spring Training, LeadSnap Mansion Mastermind, SEOROCKSTARS, LeadSnapEvents, DigiMarCon, brightonSEO, Affiliate Summit West, Traffic and Conversion Summit, Outranking Summit, TES Affiliate Conference, billo Summit, ContentTECH Summit, Content Marketing Conference, VEGPRENEUR Expert Hour, Ai4 Conference, SMX West, and Affiliate Summit West.

Manick Bhan is the Founder CEO/CTO of the SEOTheory community, a community designed for agency owners looking to increase their SEO results.

Manick Bhan enjoys writing and speaking on topics that range from digital marketing to artificial intelligence and machine learning to social impact in the animal welfare and environmental space.

Manick lives in Medellin, Colombia with his wife Sophia Deluz-Bhan, daughter Ruby, and a house full of animals including Voodoo the SEO cat.

The New Era Of AI Visibility

Join Our Community Of SEO Experts Today!

Visualize Your SEO Success: Expert Videos & Strategies

Play

Real Success Stories: In-Depth Case Studies

Business name:

Dr. David McInnis Orthodontics (dmsmile.com)

472% Organic Traffic Growth & 380% More Patient Conversions in 6 Months

The Challenge:

Dr. David McInnis Orthodontics struggled with low search visibility and inconsistent patient inquiries. Despite offering premium orthodontic services, their online presence failed to generate steady leads.

472% increase in organic traffic

380% growth in patient inquiries & conversions

250+ high-intent keywords ranking on Page 1

53% lower cost-per-acquisition

How We Did It:

By implementing Search Atlas’s advanced SEO strategy, we restructured their website for search intent alignment, optimized local SEO, and enhanced technical performance to dominate Google rankings.

Now, Dr. David McInnis Orthodontics enjoys a steady stream of organic leads and a powerful online presence, making them the go-to orthodontic practice in their area.

Business name:

Rehab Facility

Rehab Facility Dominates SERP with 1400+ Keywords in Top 3

The Challenge:

Their mission is to provide clients with all the tools necessary to tackle addiction at its source. To do this, they needed to significantly increase their online presence and support their crucial mission.

+277% Organic Traffic

+ 135% Organic Keywords

1400 + Keywords Ranking Top 3

659% referring domains increased

How We Did It:

The client utilized Search Atlas to identify and resolve technical flaws, including broken links, slow loading times, and navigation issues. With OTTO, they performed these fixes and optimizations in one day.

Business name:

DUI Law Firm

Making an Austin DUI Law Firm a Local Reference with OTTO

The Challenge:

In Austin’s bustling legal market, standing out as a DUI law firm is challenging due to intense competition. Achieving local search visibility requires an innovative strategic SEO approach.

+100% Pins Improved

+88% Locations Ranking Top 3

+88% Higher Positions in Local Searches

How We Did It:

To improve search rankings for their keywords, we incorporated these terms into the website and Google Business Profile (GBP) over 4 weeks using OTTO. After OTTO implementation, 100% of the pins are ranking either in top 3 or top 5 local search positions.

OTTO’s automated SEO optimization process simplifies SEO efforts, reducing manual labor and allowing the team to focus on other crucial tasks.

Business name:

nonprofit sensory learning center

Nonprofit Climbs from #27 to #1 and Doubles Traffic with OTTO

The Challenge:

This center is dedicated to providing essential resources and programs for children with special needs and their families. Despite their valuable mission, the center’s website traffic had stalled for months, preventing them from connecting with potential clients.

+ 111% Organic Traffic

+75.5% Organic Keywords

Top 1 Ranking for Target Keyword

How We Did It:

To drive more traffic to their site, the client implemented OTTO’s recommendations. This included enhancing content quality, optimizing technical aspects of the site, refining on-page SEO elements, and building authority through the publication of 2 press releases.

The results were astounding. The client transitioned from being relatively obscure online to becoming a go-to resource in local search results for families seeking support.

Ready to Replace Your SEO Stack With a Smarter System?

If Any of These Sound Familiar, It’s Time for an Enterprise SEO Solution:

25 - 1000+ websites being managed

25 - 1000+ PPC accounts being managed

25 - 1000+ GBP accounts being managed

Knowledge Retention and Leakage in LLMs: A Controlled Study

Did like a post? Share it with:

Methodology – How Was Knowledge Retention and Leakage Measured?

What Is the Final Takeaway?

How Do LLMs Perform in the User-Provided Answer Retention Test?

Baseline Answer Outcomes (Pre-Exposure)

Ground-Truth Injection Phase

Post-Exposure Recall Outcomes

How Do LLMs Perform in Web-Retrieved Knowledge Retention Tests?

Event Selection and Cutoff Rationale

Baseline Performance With Search Disabled (Pre-Exposure)

Why Some Correct Answers Appear Without Search

Baseline Performance With Search Enabled

Why Some Responses Remain Incorrect With Search Enabled

Baseline Performance With Search Disabled (Post-Retrieval Phase)

Why Do Some Correct Answers Persist in the Recall Phase?

How Do LLMs Behave in Post-LLM-Cutoff Web-Retrieved Knowledge Retention?

Event Selection and Evaluation Design

Performance With Web Search Disabled

Performance With Web Search Enabled

What Should Teams Do with These Findings?

What Are the Limitations of the Study?

The New Era Of AI Visibility

Join Our Community Of SEO Experts Today!

Related Reads to Boost Your SEO Knowledge

28 Best Similarweb Alternatives (Free, Paid and Cheaper) in 2026

29 Best Semrush Alternatives (Free, Paid and Cheaper) in 2026

30 Best Serpstat alternatives (Free, Paid and Cheaper) in 2026

30 Best Rytr Alternatives (Free, Paid and Cheaper) in 2026

28 Perplexity AI Alternatives (Free, Paid and Cheaper) in 2026

31 Best SEO Powersuite Alternatives (Free, Paid, and Cheaper) in 2026

Visualize Your SEO Success: Expert Videos & Strategies

Real Success Stories: In-Depth Case Studies

472% Organic Traffic Growth & 380% More Patient Conversions in 6 Months

The Challenge:

472% increase in organic traffic

380% growth in patient inquiries & conversions

250+ high-intent keywords ranking on Page 1

53% lower cost-per-acquisition

How We Did It:

Rehab Facility Dominates SERP with 1400+ Keywords in Top 3

The Challenge:

+277% Organic Traffic

+ 135% Organic Keywords

1400 + Keywords Ranking Top 3

659% referring domains increased

How We Did It:

Making an Austin DUI Law Firm a Local Reference with OTTO

The Challenge:

+100% Pins Improved

+88% Locations Ranking Top 3

+88% Higher Positions in Local Searches

How We Did It:

Nonprofit Climbs from #27 to #1 and Doubles Traffic with OTTO

The Challenge:

+ 111% Organic Traffic

+75.5% Organic Keywords

Top 1 Ranking for Target Keyword

How We Did It:

Ready to Replace Your SEO Stack With a Smarter System?