Domain Industry Analysis in LLM Responses

You may also read a concise version of this research in our blog: Industry Patterns in LLM Responses: A 5.17 Million Domain-Citation Analysis

Overview

This study examines 5,173,673 domain citations generated by three large language model (LLM) platforms: OpenAI, Gemini, and Perplexity.

The dataset covers responses collected between August 25 and September 26, 2025. Web search was disabled for OpenAI and Gemini but enabled for Perplexity. The analysis investigates how these models cite external web sources, focusing on which industries and domain types are most frequently referenced.

Results show that commercial websites account for the majority of citations, while academic and government domains are significantly underrepresented. This pattern suggests that LLMs rely more on publicly available commercial and media content than on authoritative or peer-reviewed sources. The findings highlight the importance of improving source diversity and credibility within LLM responses to ensure balanced and trustworthy information generation.

Executive Summary:

  • Purpose: Identify which industries are most represented among the web domains cited by large language models (LLMs).
  • Scope: 5.17M citations from OpenAI, Gemini, and Perplexity, collected between Aug 25 and Sep 26, 2025.
  • Finding: Most citations come from commercial and news-media domains, while academic and government sites form a very small share.
  • Notable patterns:
    • News: concentration around major global outlets (Reuters, FT, Axios).
    • Reviews: strong reliance on structured reputation sources such as BBB.org and Yelp.
    • .com domains dominate; .gov and .edu appear rarely.
  • Interpretation: These results show where LLMs source information, not how correct or relevant those citations are.
  • Implication: Query intent and platform design both influence citation mix, with retrieval-based models surfacing broader and more current web material.

Methodology

Data Integration

Two datasets were used: the LLM response dataset and the domain classification dataset. The LLM response dataset contains detailed metadata describing each citation, while the classification dataset provides industry labels for known domains.

The combined dataset spans August 25 – September 26, 2025, covering 5.17 million citation records from 907,003 unique domains. Perplexity accounts for approximately 80% of all citations, followed by OpenAI (15%) and Gemini (5%).

Domain Normalization

All URLs were standardized to their base domain using the tldextract library. This ensured that URLs such as https://www.nytimes.com/section/tech were reduced to nytimes.com for consistent matching across datasets.
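The normalization step can be sketched roughly as follows. The study used tldextract, which consults the full Public Suffix List. To stay dependency-free, this illustration swaps in a simplified standard-library stand-in with a small, hypothetical set of multi-part suffixes (MULTI_PART_SUFFIXES), so it shows the idea rather than replacing the real library.

```python
from urllib.parse import urlparse

# Hypothetical, deliberately tiny stand-in for the Public Suffix List
# that tldextract consults; real data would need the full list.
MULTI_PART_SUFFIXES = {"co.uk", "com.au", "co.jp"}


def base_domain(url: str) -> str:
    """Reduce a URL to its registrable base domain (simplified)."""
    # urlparse only fills in the host when a scheme or '//' is present.
    if "//" not in url:
        url = "//" + url
    host = (urlparse(url).hostname or "").lower().rstrip(".")
    labels = host.split(".")
    if len(labels) <= 2:
        return host
    # Keep three labels when the last two form a known multi-part suffix.
    keep = 3 if ".".join(labels[-2:]) in MULTI_PART_SUFFIXES else 2
    return ".".join(labels[-keep:])


print(base_domain("https://www.nytimes.com/section/tech"))  # nytimes.com
```

Reducing every URL to one key per site is what lets citation records and classification labels be joined on the same domain string.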

Domain Classification using an LLM

To categorize the 907,003 unique domains into their respective industries, a large language model (OpenAI GPT) was used to perform domain-level classification.

The model was prompted with domain names and limited metadata to infer the most relevant industry category from 24 major categories (e.g., Technology, Finance, Healthcare, Media, or Government).

The resulting classifications were manually validated: a spot check of 200 randomly sampled domains confirmed correct labeling.
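A spot check of this kind could be drawn reproducibly along the lines below. This is a minimal sketch: the function names, the fixed seed, and the toy labels are assumptions for illustration, and the study does not describe how its sample was actually drawn or scored.

```python
import random


def spot_check_sample(labels: dict[str, str], k: int = 200, seed: int = 0) -> list[tuple[str, str]]:
    """Draw a reproducible random sample of (domain, label) pairs for manual review."""
    rng = random.Random(seed)   # fixed seed so every reviewer sees the same sample
    domains = sorted(labels)    # sort first so the draw does not depend on dict order
    picked = rng.sample(domains, min(k, len(domains)))
    return [(d, labels[d]) for d in picked]


def observed_accuracy(reviewed: list[bool]) -> float:
    """Share of sampled labels that a human reviewer marked as correct."""
    return sum(reviewed) / len(reviewed) if reviewed else 0.0


# Toy data: hypothetical labels, not taken from the study.
toy = {f"site{i}.com": "Technology" for i in range(1000)}
sample = spot_check_sample(toy)
print(len(sample))  # 200
```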

Domains labeled as “Other” by the LLM were excluded after verification; they accounted for 0.09% of the data.

Merging and Cleaning

The normalized domain data were merged with the classification results to assign an industry label to each citation record.

Rows labeled as “Other” or with missing industry data accounted for less than 0.01% of the dataset and were excluded to maintain consistency in the analysis.
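The merge and cleaning step amounts to a left join followed by a filter. The sketch below uses plain dictionaries, and the field names ("domain", "platform", "industry") and toy labels are assumptions for illustration, not the study's actual schema.

```python
# Toy citation records; field names are hypothetical.
citations = [
    {"domain": "reuters.com", "platform": "openai"},
    {"domain": "healthline.com", "platform": "perplexity"},
    {"domain": "unknown-site.example", "platform": "gemini"},
    {"domain": "misc-site.example", "platform": "perplexity"},
]

# Toy classification output: domain -> industry label.
industry_by_domain = {
    "reuters.com": "Media",
    "healthline.com": "Healthcare",
    "misc-site.example": "Other",
}


def merge_and_clean(citations, industry_by_domain):
    """Attach an industry label to each citation; drop 'Other' and unlabeled rows."""
    kept, dropped = [], 0
    for row in citations:
        industry = industry_by_domain.get(row["domain"])  # None if never classified
        if industry is None or industry == "Other":
            dropped += 1
            continue
        kept.append({**row, "industry": industry})
    return kept, dropped


labeled, dropped = merge_and_clean(citations, industry_by_domain)
print(len(labeled), dropped)  # 2 2
```

Returning the dropped count alongside the kept rows makes it easy to report what share of the data the exclusion removed.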

Feature Engineering

New analytical variables were created to enable deeper exploration.

  • TLD Category: Each domain was grouped based on its top-level domain (e.g., .com, .org, .edu, .gov).
  • Source Category: Industries were mapped into four broader categories: Commercial, Academic/Government, News/Media, and Social/Blog.
  • Month Variable: Each timestamp was converted into a monthly period to facilitate trend analysis.
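The three derived variables could be computed roughly as shown here. The industry-to-source-category mapping is only partially illustrated and is hypothetical (the study's full mapping is not given), and the record fields and ISO timestamp format are likewise assumptions.

```python
from datetime import datetime

# TLDs named in the text; anything else falls into "other".
TRACKED_TLDS = {"com", "org", "edu", "gov"}

# Hypothetical partial mapping from industry label to source category.
SOURCE_CATEGORY = {
    "Technology": "Commercial",
    "Finance": "Commercial",
    "Government": "Academic/Government",
    "Media": "News/Media",
}


def tld_category(domain: str) -> str:
    """Group a domain by its final label, e.g. 'nytimes.com' -> '.com'."""
    tld = domain.rsplit(".", 1)[-1].lower()
    return "." + tld if tld in TRACKED_TLDS else "other"


def month_period(timestamp: str) -> str:
    """Convert an ISO-8601 timestamp to a YYYY-MM monthly period."""
    return datetime.fromisoformat(timestamp).strftime("%Y-%m")


def add_features(row: dict) -> dict:
    """Return a copy of the row with the three engineered variables added."""
    return {
        **row,
        "tld_category": tld_category(row["domain"]),
        "source_category": SOURCE_CATEGORY.get(row["industry"], "Unmapped"),
        "month": month_period(row["timestamp"]),
    }


row = {"domain": "nytimes.com", "industry": "Media", "timestamp": "2025-08-25T14:30:00"}
features = add_features(row)
print(features["tld_category"], features["source_category"], features["month"])
```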

Data Distribution

The dataset is dominated by Perplexity citations, with smaller shares from OpenAI and Gemini. To ensure balanced insights, subsequent analyses are presented both overall and per platform.

distribution of citations by LLM platform

Perplexity contributes the vast majority of citations in the dataset, accounting for about 80% of all references. OpenAI follows with roughly 15%, while Gemini represents about 5%.

This imbalance reflects differences in model design and data collection volume rather than performance. Subsequent analyses account for this distribution by comparing results both overall and on a per-platform basis.
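To illustrate why per-platform views matter under this imbalance, the sketch below computes industry shares both overall and within each platform. The toy rows mirror the 80/15/5 split, but the industry mix is invented for illustration and the field names are assumptions.

```python
from collections import Counter, defaultdict


def industry_shares(rows):
    """Return overall and per-platform industry shares (fractions summing to 1)."""
    overall = Counter()
    per_platform = defaultdict(Counter)
    for row in rows:
        overall[row["industry"]] += 1
        per_platform[row["platform"]][row["industry"]] += 1

    def normalize(counter):
        total = sum(counter.values())
        return {key: count / total for key, count in counter.items()}

    return normalize(overall), {p: normalize(c) for p, c in per_platform.items()}


# Toy rows: 80 Perplexity, 15 OpenAI, 5 Gemini citations (made-up industries).
rows = (
    [{"platform": "perplexity", "industry": "Technology"}] * 60
    + [{"platform": "perplexity", "industry": "Media"}] * 20
    + [{"platform": "openai", "industry": "Media"}] * 10
    + [{"platform": "openai", "industry": "Technology"}] * 5
    + [{"platform": "gemini", "industry": "Technology"}] * 5
)
overall, by_platform = industry_shares(rows)
print(overall["Media"])                # 30 of 100 citations overall
print(by_platform["openai"]["Media"])  # 10 of 15 OpenAI citations
```

In this toy case Media is a minority overall but the majority within OpenAI, which is the kind of pattern an overall-only view would hide.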

Industry Representation in LLM Citations

top industries cited in LLM answers

The chart shows that Technology, Healthcare, and Consumer Goods dominate LLM citations, indicating a strong bias toward commercially relevant and product-focused domains.

Media and Construction also appear prominently, reflecting frequent references to informational and service-based content.

In contrast, academic, government, and specialized sectors such as Agriculture or Telecommunications are sparsely cited, suggesting limited reliance on institutional or niche sources of information.

News Domains

5 news domains cited by LLMs

The chart shows that LLMs most frequently cite mainstream, high-credibility outlets such as reuters.com, ft.com, and axios.com.

These domains dominate citation counts, indicating that LLMs rely heavily on established global and financial news sources for factual grounding.

Beyond major media, a noticeable share of citations goes to press release and syndication platforms like openpr.com, einpresswire.com, and newswire.com, suggesting that models often pull from aggregated corporate or PR content.

Overall, the distribution reflects a mix of reputable journalism and promotional sources, emphasizing LLMs’ dependence on readily accessible, news-oriented content rather than original investigative or academic reporting.

Across all three models, a few domains consistently appear: reuters.com, ft.com, cnn.com, apnews.com, usnews.com.

But the quantity and ranking vary significantly:

  • Perplexity cites many more second-tier outlets (regional news sites, niche industry publishers, and press-release aggregators)
  • OpenAI sticks to a narrow core of highly established global media
  • Gemini references the same global media, but far less often, with overall lower citation counts.

Below are the top 25 News Domains cited per platform:

Perplexity

top 25 news domains cited by perplexity

Perplexity cites a wide range of niche and mid-tier news sources rather than primarily traditional global outlets. The top-referenced domains (like startus-insights.com, sfgate.com, cbsnews.com, and openpr.com) indicate a strong bias toward press-release distributors, regional publishers, and business-specific news wires.

This suggests Perplexity favors frequently updated, high-volume content sources rather than legacy editorial institutions.

OpenAI

top 25 news domains cited by openAI

OpenAI cites more established and reputable news outlets like Reuters, FT, APNews, and Axios, showing a stronger orientation toward mainstream journalism.

It also references analytical sites such as MarketResearchIntellect.com and TheGuardian.com, indicating a balance between factual reporting and specialized industry commentary.

Gemini

top 25 news domains cited by gemini

Gemini’s news citations are fewer and more selective, focused mainly on top-tier global publishers like NYTimes.com, Reuters.com, BBC.com, WSJ.com, and FT.com.

This pattern suggests that Gemini prioritizes authoritative, editorially verified news sources, reflecting tighter retrieval filters and limited breadth compared to Perplexity or OpenAI.

Review Domains

top 25 review domains cited by all LLMs

LLMs overwhelmingly cite bbb.org (Better Business Bureau), which dominates the review category by a wide margin, followed by yelp.com and chamberofcommerce.com.

This suggests that LLMs heavily depend on reputation and business-directory style platforms when referencing reviews or evaluations.

The strong presence of Trustpilot, Manta, and Trustburn further indicates a focus on customer feedback aggregators and credibility-oriented sources rather than consumer forums or informal review blogs.

Overall, the pattern shows that LLMs favor structured, organization-verified review sources over subjective or user-generated commentary.

Below are the top 25 review domains cited per platform:

Perplexity

top 25 review domains cited by perplexity

Perplexity overwhelmingly cites BBB.org, far ahead of all other review sites, showing a strong reliance on the Better Business Bureau as a credibility source.

Secondary domains like Fixr.com, Homewyse.com, and Eventective.com appear but at much lower volumes, indicating that Perplexity draws primarily from large, structured business and contractor directories rather than user-driven review platforms.

OpenAI

top 25 review domains cited by openAI

OpenAI demonstrates a broader mix of consumer and business review platforms. Yelp.com and BBB.org lead, followed by ChamberofCommerce.com, Trustpilot.com, and Trustburn.com.

This distribution reflects OpenAI’s preference for mainstream, user-oriented review sites, balancing between public feedback aggregators and business credibility sources.

Gemini

top 25 review domains cited by gemini

Gemini’s citations are fewer but show a regional and consumer-service focus, highlighting domains like Yelp.com, BBB.org, and several Australian review sites such as ProductReview.com.au and ServiceSeeking.com.au.

This suggests Gemini’s retrieval scope emphasizes localized, service-based, and product evaluation platforms over large-scale global review aggregators.

What Major Industry Sectors Do LLMs Cite Most?

This section explores which industries are most frequently cited by large language models (LLMs). Each cited domain was categorized into its corresponding industry group, including Technology, Healthcare, Finance, and Media.

This breakdown helps identify the sectors that most influence LLM-generated responses and shows whether citations are concentrated within a few commercial industries or distributed more evenly across academic, institutional, and public sources.

share of cited industry groups in all LLMs

The chart reveals that LLMs overwhelmingly rely on commercial and industry-oriented web sources, with the most cited sectors being:

  • Technology (20.8%)

E.g. mapquest.com, google.com, birdeye.com, indeed.com, zoominfo.com.

  • Consumer & Retail (19.4%)

E.g. houzz.com, homeadvisor.com, homeguide.com, trane.com

  • Healthcare (10.3%)

E.g. healthline.com, mayoclinic.org, clevelandclinic.org, webmd.com

  • Business & Finance (9.2%)

E.g. nerdwallet.com, bankrate.com, deloitte.com, fastercapital.com

  • Construction & Manufacturing (9.2%)

E.g. servpro.com, billraganroofing.com, roof-crafters.com, carrier.com

Together, these five sectors account for nearly 70% of all citations, highlighting a strong bias toward general web and corporate content.
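As a quick arithmetic check, the shares reported in this section can be summed directly. The figures below are copied from the text; only the variable names are ours.

```python
# Top five sectors, as reported above (percent of all citations).
top_five = {
    "Technology": 20.8,
    "Consumer & Retail": 19.4,
    "Healthcare": 10.3,
    "Business & Finance": 9.2,
    "Construction & Manufacturing": 9.2,
}

# Remaining groups, as reported later in this section.
other_groups = {
    "Academic/Government": 9.0,
    "News/Media": 8.1,
    "Transportation & Travel": 6.8,
    "Social/Blog": 2.8,
    "Environmental & Agriculture": 2.6,
    "Review/Directory": 1.8,
}

combined_top_five = round(sum(top_five.values()), 1)
combined_all = round(sum(top_five.values()) + sum(other_groups.values()), 1)
print(combined_top_five)  # 68.9
print(combined_all)       # 100.0
```

The top five sum to 68.9%, consistent with "nearly 70%", and the eleven reported groups together account for 100% of citations.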

Academic and Institutional Sources

  • Academic/Government (9.0%) makes up a much smaller slice of total citations, indicating that peer-reviewed and official institutional sources remain underrepresented in LLM-generated responses.
  • This aligns with the trend that models favor freely accessible web content over subscription-based or restricted academic material.

Media and Informal Sources

  • News/Media (8.1%) plays a secondary role, providing event-driven and journalistic context rather than scholarly grounding.
  • Social/Blog (2.8%) and Review/Directory (1.8%) show that LLMs occasionally pull from user-generated or crowd-sourced content, though these sources remain marginal.

Specialized and Niche Domains

  • Transportation & Travel (6.8%) and Environmental & Agriculture (2.6%) show modest but notable representation. These categories likely stem from informational or commercial content related to travel, logistics, and sustainability topics.

Key Insights

  • LLMs prioritize breadth and accessibility by drawing from commercial, industry, and media ecosystems rather than academic, peer-reviewed, or government databases.
  • The Technology and Retail sectors dominate not because they’re inherently more credible, but because they produce vast volumes of web-accessible informational content.
  • Academic/Government sources, while small in volume, remain critical for factual accuracy and authority, suggesting potential gaps in LLM source diversity.

Citation Industry Composition by LLM Platform

The distribution of citations across industry groups shows clear differences in the types of domains each LLM relies on when forming responses.

This section highlights each platform’s citation behavior and reveals which industries dominate the information each model relies on.

citation industry composition by LLM platform

Platform Design Differences

Perplexity is a retrieval-first engine that depends heavily on live web access, whereas OpenAI and Gemini operate as general-purpose LLMs with a stronger reliance on internal reasoning, cached knowledge, and curated training data.

These architectural differences influence not just how many citations each model produces, but which industries they pull information from most frequently.

Industry-Level Citation Patterns Across Platforms

  1. Technology, Consumer, and Finance Sectors Dominate All Three Models

Across Gemini, OpenAI, and Perplexity, the largest portion of citations comes from Technology, Business & Finance, Consumer & Retail, and other commercially driven sectors.

These industries collectively make up the majority of citations across every platform, highlighting the structure of the open web: corporate, commercial, and product-oriented content is the most abundant and frequently indexed.

Perplexity shows the strongest emphasis on these commercial industries, consistent with its real-time scraping and live retrieval design.

  2. Academic and Government Industries Appear in Small Proportions

Academic/Government industries represent less than 10 percent of citations across all three platforms.

Gemini includes a modestly higher proportion of institutional sources compared to OpenAI and Perplexity, indicating a slightly broader citation base that incorporates more authoritative, non-commercial material.

  3. News and Media-Related Industries Are More Prominent in OpenAI

OpenAI cites News/Media content at higher rates than Gemini and Perplexity. This suggests a stronger emphasis on journalistic, event-based, and reporting-oriented sources, aligning with OpenAI’s balanced use of both internal knowledge and curated external information.

Media/Social/Blog industries, which include platforms like YouTube, Twitter/X, Medium, Substack, and similar domains, remain relatively small across all models, typically under 5 percent of total citations.

  4. Perplexity Shows the Most Distinct Industrial Footprint

Perplexity’s distribution shows:

  • higher reliance on Consumer & Retail, Construction & Manufacturing, and Technology industries
  • lower reliance on institutional or social categories
  • more pronounced commercial orientation compared to Gemini and OpenAI

This pattern aligns with Perplexity’s architecture as a real-time search-powered engine that retrieves large volumes of commercially indexed content.

Key Insights

  • Commercial industries such as Technology, Business & Finance, Consumer & Retail, and Manufacturing dominate citation behavior across all platforms, reflecting the general structure of the internet rather than any platform-specific bias.
  • Academic/Government sources are consistently small, meaning LLMs rarely cite university research, government data, or formal academic material unless the query specifically requires it.
  • OpenAI has the most even spread across industries but shows the strongest emphasis on News/Media.
  • Perplexity is the most commercially skewed, and Gemini is strongly skewed toward Technology and Academic/Government.
  • These patterns suggest that LLMs tend to rely on broadly available public web material, especially technology, finance, retail, product, and consumer-oriented domains.

Citation Industry Composition by Query Intent 

The stacked bar chart below compares how different query intents (acquisition, comparison, discovery, evaluation, informative, and understanding) draw from various industry groups.

This shows which types of industries LLMs tend to reference depending on what the user is trying to achieve.

citation industry composition by query intent

Industry Patterns Across Intent Types

  1. Technology is the Dominant Industry for Every Intent

Across all query intents, Technology is consistently the single largest industry cited. This includes:

  • Software companies
  • Developer tools
  • Tech platforms
  • Product documentation

Whether a query is informational (“informative”), evaluative (“comparison”), or transactional (“acquisition”), LLMs pull heavily from technology-related sources.

  2. Consumer & Retail and Business & Finance Are Major Secondary Contributors

Nearly all query intents show strong contributions from:

  • Consumer & Retail (product pages, e-commerce sites, shopping guides)
  • Business & Finance (corporate pages, business explanations, financial service providers)

These industries form a significant share of citations for acquisition, comparison, and evaluation queries, where users often ask about products, services, tools, or solutions.

  3. Healthcare, Environmental, and Construction/Manufacturing Play Supporting Roles

Other industries appear across all intent types but at lower levels, usually between 5 and 15 percent depending on the intent:

  • Healthcare
  • Environmental & Agriculture
  • Construction & Manufacturing

These industries tend to surface when queries require subject-specific information rather than broad explanations.

  4. Informative and Understanding Intents Show Greater Presence of News/Media

The informative and understanding intents show noticeably higher contributions from:

  • News/Media (journalism, reporting, informational outlets)
  • Social/Blog (public platforms such as YouTube, Medium, Twitter/X, Substack)

These intents often require background context or real-world explanations, which naturally draw on journalistic or publicly authored sources.

  5. Academic/Government Remains a Small Minority Across All Intents

Across all query types, the Academic/Government industry group, which includes universities, government bodies, and public-sector information, accounts for a very small share of citations.

Even for analytical or explanatory intents (“understanding,” “informative”), institutional domains remain rare.

Key Insights

  • Technology dominates every intent category, reflecting its central role in modern web content and its relevance to a wide variety of prompts.
  • Consumer & Retail, Business & Finance, and Construction & Manufacturing are consistently important for decision-oriented intents such as acquisition, comparison, and evaluation.
  • News/Media industries appear more often for informative and understanding intents, which depend more on contextual or narrative information.
  • Academic/Government citations are limited across all intents, indicating that LLMs rarely rely on institutional sources unless a query explicitly requires them.
  • Overall, query intent influences emphasis, but not the overall structure: every intent type still draws primarily from technology-driven and commercially oriented industries simply because those industries dominate the publicly available web ecosystem.

How Trophy Content Types Vary Across Industry Categories

Trophy Content is a Search Atlas framework that describes high-authority content formats found in webpage URLs, such as reviews, listicles, comparisons, case studies, awards, and media coverage. These formats help anchor a brand’s entity graph across the web.

In the context of LLM citation behavior, Trophy Content provides a structured way to analyze which types of authoritative webpages LLMs tend to cite and which industries produce these formats more frequently.

By examining the distribution of Trophy Content across industries, we can identify patterns in how LLMs reference evaluative, comparative, or evidence-driven content depending on the industry.

We scraped 17,351 cited URLs and passed their content to an LLM to categorize each page into one of the ten Trophy Content formats. The chart below shows the distribution of Trophy Content across industries.

trophy content distribution by industry

Key Insights

1. Listicles dominate across almost all industries:

LLMs most frequently cite “Top X / Best Y” style pages, regardless of industry.

2. Research and Reports are strongest in technical and regulated industries:

Academic, Government, Healthcare, Finance, and Environmental sectors show the largest share of research-driven citations.

3. Press & Media Coverage spikes only in media-heavy industries:

Media, News, and Real Estate have noticeably higher proportions of press coverage citations.

4. Reviews are concentrated in consumer-facing verticals:

Retail, Consumer Goods, Hospitality, and E-commerce show the strongest presence of review content.

5. Awards, Case Studies, and Comparisons appear at low levels:

These formats exist but contribute only a small share across most industries.

6. Each industry shows a predictable content pattern:

  • Regulated industries → Research
  • Consumer industries → Reviews + Listicles
  • Media categories → Press coverage
  • Services and B2B → Case studies + Comparisons (small but present)

Signal Genesys Citation Analysis in LLM Outputs

Signal Genesys is a Search Atlas product that enables clients to publish content across a large network of distribution partners, including local news outlets, broadcast media sites, financial newswire networks, and regional publishers.

To evaluate how frequently this network appears in LLM outputs, we matched all Signal Genesys domains (299 total) against the 5.17M citations data.

Domain Coverage in LLM Citations

Out of 299 total Signal Genesys domains, 258 appeared at least once in the sample of 5.17M LLM citations.

The remaining 41 did not appear in this sample, though they may still surface in other prompt distributions or model contexts.

signal genesys domains observed

This represents high network visibility, with approximately 86 percent of the distribution network appearing in LLM citations.
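The coverage figure is a set intersection followed by a ratio. The sketch below shows the matching on made-up domains (the function name and toy sets are assumptions) and then recomputes the reported share from the 258 and 299 counts given in the text.

```python
def network_coverage(network_domains: set[str], cited_domains: set[str]) -> tuple[int, float]:
    """Count network domains seen among cited domains, and the share covered."""
    observed = network_domains & cited_domains
    share = len(observed) / len(network_domains) if network_domains else 0.0
    return len(observed), share


# Toy example with made-up domains.
network = {"a-news.example", "b-news.example", "c-news.example", "d-news.example"}
cited = {"a-news.example", "c-news.example", "reuters.com"}
print(network_coverage(network, cited))  # (2, 0.5)

# Reported figures: 258 of 299 Signal Genesys domains observed.
reported_share = round(258 / 299 * 100, 1)
print(reported_share)  # 86.3
```

Both sides of the intersection need to be normalized to base domains first, otherwise subdomain variants would be missed.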

Top Signal Genesys Domains Cited by LLMs

We examined which individual Signal Genesys partner sites appear most frequently as citations.

Across the dataset, we identified 1,258 total citations referencing Signal Genesys domains.

Below is a chart of the top 15 most cited Signal Genesys domains.

Top Signal Genesys Domains Cited by LLMs

The most frequently cited domains belong primarily to regional news outlets and editorial publications within the Signal Genesys syndication network. These include News Channel Nebraska affiliates, city magazines, business journals, and local newspapers.

Most-Cited Signal Genesys Content Groups

Signal Genesys domains were also grouped by their associated media category (e.g., broadcast news, digital news, financial newswire). The chart below shows citation volume by group.

most-cited signal genesys content groups in LLM citations

Key Insights

  • Broadcast News partners make up the largest share of citations, followed by Local & National Digital News Media.
  • Financial newswire and Canadian broadcast partners appear less frequently but are still represented.
  • Overall, LLMs tend to reference Signal Genesys domains in alignment with where Signal Genesys content most commonly appears: regional news sites, broadcast affiliates, and syndicated editorial outlets.

In conclusion, Signal Genesys distribution partners appear prominently in LLM outputs.

More than 85 percent of Signal Genesys domains surfaced in our 5.17M-citation sample, with regional news affiliates driving the majority of appearances.

This shows that content syndicated through Signal Genesys frequently reaches domains cited in LLMs, increasing the likelihood that press releases and media placements distributed through Signal Genesys will be incorporated into LLM-generated answers.

Link Laboratory Citation Analysis in LLM Outputs

Link Laboratory is a Search Atlas product that connects clients with more than 50,000 publisher domains across a wide range of authoritative websites.

The platform functions as a publisher exchange marketplace, enabling users to place sponsored or editorial content directly on high-authority websites to grow brand visibility and domain authority.

To understand how often these domains appear organically in LLM outputs, we matched 50,902 LinkLab publisher domains against the 713,615 unique domains referenced across the 5.17M URL citations in our data sample (with search disabled for OpenAI and Gemini, and enabled for Perplexity).

LLM Platforms Citing LinkLab Domains

Out of 50,902 LinkLab publisher domains, 3,512 were cited at least once in our sample dataset, appearing in a total of 85,357 URL citations.
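The two counts reported here (distinct publisher domains cited, and total citations pointing at them) can be produced in one pass, as in this hedged sketch. The function name and toy domains are hypothetical. The ratios at the end are simple arithmetic on the figures stated in the text, not additional findings.

```python
from collections import Counter


def publisher_hits(citation_domains: list[str], publishers: set[str]) -> tuple[int, int]:
    """Return (distinct publisher domains cited, total citations pointing at them)."""
    hits = Counter(d for d in citation_domains if d in publishers)
    return len(hits), sum(hits.values())


# Toy example with made-up domains: one entry per citation.
citation_domains = ["pub-a.example", "pub-a.example", "pub-b.example", "other.example"]
publishers = {"pub-a.example", "pub-b.example", "pub-c.example"}
print(publisher_hits(citation_domains, publishers))  # (2, 3)

# Ratios implied by the reported figures.
pct_publishers_cited = round(3_512 / 50_902 * 100, 1)     # percent of LinkLab domains cited
avg_citations_per_domain = round(85_357 / 3_512, 1)       # average citations per cited domain
pct_of_all_citations = round(85_357 / 5_173_673 * 100, 2) # percent of the 5.17M citations
print(pct_publishers_cited, avg_citations_per_domain, pct_of_all_citations)  # 6.9 24.3 1.65
```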

The chart below shows how often each LLM platform cited LinkLab domains (with search disabled for OpenAI and Gemini, and enabled for Perplexity).

LLM platforms citing linklab domains

Key Insights

  • Perplexity’s real-time retrieval dramatically increases LinkLab domain visibility.
  • OpenAI, even with search disabled, still cites LinkLab domains frequently, highlighting the authority of these publishers and how LinkLab placements can strengthen brand visibility within LLM outputs.

Conclusion

The analysis shows a consistent pattern across platforms, query intents, and industries: most LLM citations originate from commercial domains, reflecting the predominantly commercial nature of the web.

Academic and government sources remain marginal, while news and review domains appear as secondary citation types. This suggests that the dominance of commercial content in LLM responses stems less from model bias or user intent and more from the structure and availability of online information.

Both Search Atlas products (Signal Genesys and Link Laboratory) showed meaningful presence in LLM outputs. In our sample data, 86 percent of Signal Genesys domains appeared in URL citations, and 3,512 Link Laboratory domains surfaced across 85,357 URL citations.

This indicates that content published through Signal Genesys and link-building across Link Laboratory’s domains aligns with the types of sites LLMs frequently reference, helping brands strengthen their online visibility.

Finally, this study analyzes a one-month sample of LLM responses. Future research should extend the timeframe, include more platforms, and analyze a broader range of query intents to track changes in citation diversity.
