A Comparative Evaluation of LLM Responses from Gemini, OpenAI, and Perplexity

You may also read a concise version of this research in our blog: Comparative Analysis of LLM Citation Behavior: SEO Strategy Implications

Introduction

This study compares how large language models (LLMs) reference external sources when responding to identical queries. By examining their domain citation behavior, we assess whether differences in web search capability and model architecture influence how information is retrieved and attributed.

The dataset comprises 5,504,399 LLM responses across 748,425 unique user queries, collected over a 30-day period between August 25 and September 25, 2025. Among the models studied, Perplexity Sonar operates with web search enabled, while Gemini-2.0-Flash-Lite and OpenAI’s GPT-4o-mini generate responses without live retrieval. This configuration provides a controlled framework to evaluate citation breadth, overlap, and agreement across systems with distinct access to external data sources.

Summary of Dataset

  • Total responses: 5,504,399 (Gemini, OpenAI, and Perplexity combined)
  • Unique prompt queries: 748,425
  • Data collection period: August 25 – September 25, 2025
  • Models analyzed:
    • Perplexity Sonar – Web search enabled
    • Gemini-2.0-Flash-Lite – Web search disabled
    • OpenAI GPT-4o-mini – Web search disabled

Methodology

Data Source and Collection

The analysis draws from a dataset of 5.5 million LLM-generated responses spanning 748,425 queries, collected between August 25 and September 25, 2025. The dataset includes outputs from Gemini, OpenAI, and Perplexity Sonar, representing models both with and without active web retrieval.

Data Normalization and Filtering

All citations were extracted from model outputs, and cited domains were standardized to a normalized domain.tld format to ensure cross-model consistency. For fair comparison, only queries where all three LLMs produced citations were retained for analysis.
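For illustration, a minimal sketch of the normalization and filtering steps is shown below. The helper names, the naive two-label domain heuristic, and the data layout are assumptions for this sketch, not the study's actual pipeline.

```python
from urllib.parse import urlparse

def normalize_domain(url: str) -> str:
    """Reduce a cited URL to a bare domain.tld string (naive sketch).

    Assumption: keeping the last two labels of the hostname is good enough;
    multi-part suffixes such as .co.uk would need a public-suffix list.
    """
    url = url.strip()
    host = urlparse(url).netloc or urlparse("//" + url).netloc  # handle scheme-less URLs
    host = host.lower().split(":")[0]                           # drop any port
    if host.startswith("www."):
        host = host[4:]
    parts = host.split(".")
    return ".".join(parts[-2:]) if len(parts) >= 2 else host

def complete_queries(citations: dict[str, dict[str, set[str]]]) -> set[str]:
    """Return query IDs for which every model cited at least one domain.

    `citations` maps model name -> {query_id -> set of normalized domains}.
    """
    per_model = list(citations.values())
    return {q for q in per_model[0] if all(q in m and m[q] for m in per_model)}
```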

Analytical Framework

Citation behavior was evaluated using three primary metrics (a computation sketch follows the list):

  1. Domain citation count – Number of unique domains cited per query.
  2. Jaccard similarity – Ratio of shared to total unique domains between model pairs.
  3. Agreement rate – Percentage of queries where at least one domain overlapped across models.
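The sketch below shows one way to compute these three metrics from per-query sets of normalized domains. The model names, domains, and data layout are placeholders, not the code or data used in the study.

```python
from itertools import combinations

def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard similarity: |A ∩ B| / |A ∪ B| (0 when both sets are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def query_metrics(cited: dict[str, set[str]]) -> dict:
    """Per-query metrics, where `cited` maps model name -> set of cited domains."""
    pairs = list(combinations(sorted(cited), 2))
    return {
        "domain_count": {m: len(d) for m, d in cited.items()},
        "jaccard": {f"{a}/{b}": jaccard(cited[a], cited[b]) for a, b in pairs},
        "agreement": {f"{a}/{b}": bool(cited[a] & cited[b]) for a, b in pairs},
    }

# Placeholder example for a single query:
print(query_metrics({
    "gemini": {"example.com", "wikipedia.org"},
    "openai": {"example.com", "nytimes.com"},
    "perplexity": {"example.com", "reuters.com", "nytimes.com"},
}))
```

Averaging the pairwise Jaccard values over all retained queries gives the similarity figures reported below, and the share of queries where the agreement flag is true for a pair gives its agreement rate.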

Extended Analyses

Complementary evaluations examined response length, citation density, and URL freshness to assess whether verbosity or publication recency influenced retrieval diversity and citation breadth.
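As a rough illustration of these extended metrics, citation density can be expressed as citations per 1,000 characters, and a crude freshness proxy can be read from a year embedded in a URL path. Both definitions below are assumptions made for illustration, not the study's exact measures.

```python
import re

def citation_density(response_text: str, citations: list[str]) -> float:
    """Citations per 1,000 characters of response text."""
    return 1000 * len(citations) / max(len(response_text), 1)

def url_year(url: str) -> int | None:
    """Rough freshness proxy: first plausible 4-digit year in the URL path, if any."""
    match = re.search(r"/((?:19|20)\d{2})(?:[/-]|$)", url)
    return int(match.group(1)) if match else None
```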

Domain Citation Behavior Across LLM Models

Total Domains Cited by Each LLM

For samples where all three LLMs cited domains for the same query, the chart below plots the total number of domains cited by each model.

[Chart: Total domains cited by each LLM]

Distribution of Domain Citations per Query

[Chart: Distribution of domain citations per query]

Average Domains Cited per Query

For samples where all three LLMs cited domains for the same queries, the chart below shows the average number of domains cited per query.

[Chart: Avg domains cited per query (mean)]

The median number of domains cited per query shows the typical number of sources each model references for a given question, giving a clearer picture of usual citation behavior without the influence of outliers.

[Chart: Avg domains cited per query (median)]

AI Cited Domain Agreement Across LLM Models

For samples where all three LLMs cited domains for the same queries, this chart shows the average similarity between each pair’s cited domains, measured using Jaccard similarity (calculated as the size of the intersection divided by the size of the union of their cited domains).

Note on Web Search: Web search was disabled for Gemini and OpenAI. It remained enabled for Perplexity, whose web search is always on and cannot be deactivated.

This may explain why Perplexity returns more domains per query.

Formula: Jaccard Similarity

To measure the similarity between two sets of LLM responses for the same query, we used Jaccard similarity.


The formula is:

J(A, B) = |A ∩ B| / |A ∪ B|

Example:

Consider the following sets of cited domains:

  • Gemini = {A, B, C, D, E},
  • OpenAI = {A, B, C, D, E, F, G}

Domains cited by both Gemini and OpenAI (intersection) = {A, B, C, D, E} → 5 domains

All unique domains across both models (union) = {A, B, C, D, E, F, G} → 7 domains

Jaccard similarity = intersection (5) / union (7) ≈ 0.714, or 71.4%
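The same worked example, expressed with Python set operations (the letter domains are placeholders):

```python
gemini = {"A", "B", "C", "D", "E"}
openai = {"A", "B", "C", "D", "E", "F", "G"}

intersection = gemini & openai   # {'A', 'B', 'C', 'D', 'E'} -> 5 domains
union = gemini | openai          # {'A', 'B', 'C', 'D', 'E', 'F', 'G'} -> 7 domains

print(len(intersection) / len(union))   # 0.714..., i.e. about 71.4%
```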

Average Domain Overlap (Agreement Rate) Between LLMs

The chart below shows the percentage of queries where each model pair agreed on at least one citation.

[Chart: Queries with at least one shared domain between LLM pairs]

Across LLM pairs, Gemini and OpenAI show the strongest citation alignment, sharing roughly 42% of cited domains on average.

While overall overlap is modest, most queries (roughly 60–65%) still contain at least one shared domain, indicating partial convergence in source selection even when full citation sets differ.

Distribution of Domain Overlap Scores

[Chart: Distribution of domain overlap scores]

Gemini and OpenAI show the highest and most stable domain overlap, while overlaps involving Perplexity are lower and more dispersed, likely because Perplexity’s active web search retrieves a wider and more diverse set of sources.

Venn Diagram: Domain Citations Overlap Between LLMs

For samples where all three LLMs (Perplexity, OpenAI, and Gemini) cited domains for the same set of queries, the chart below shows the domain citation overlap between all three models.

[Venn diagram: Domain citation overlap between the three LLMs]

LLM Output Citation Count & Length Comparison

For samples where all three LLMs cited domains for the same queries, we analyzed the number of citations generated by each model and the corresponding output length. This comparison helps assess whether longer responses tend to include more citations and whether citation density varies significantly across LLMs.
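One simple way to test the length-versus-citations question is to correlate per-response character counts with citation counts. The sketch below uses placeholder numbers and a plain Pearson correlation; it is not the study's data or its exact method.

```python
import numpy as np

def length_citation_correlation(lengths: list[int], citation_counts: list[int]) -> float:
    """Pearson correlation between response length (characters) and citation count."""
    return float(np.corrcoef(lengths, citation_counts)[0, 1])

# Hypothetical per-response values for a single model:
lengths = [1200, 3400, 800, 5600, 2100]
citations = [3, 2, 4, 1, 3]
print(length_citation_correlation(lengths, citations))   # negative here: longer is not automatically more cited
```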

Citation Count by Platform & Model

Key Insights:

OpenAI GPT‑4o‑mini‑2024‑07‑18 cites far less frequently than the other models analyzed, with no extreme outliers in citation count.

For consistent source attribution, Perplexity (Sonar) stands out as the clear choice, providing citations in nearly every response.

OpenAI GPT‑5‑nano‑2025‑08‑07 and Gemini‑2.0‑Flash‑lite behave similarly: they rarely cite, and when they do, it is often in the form of citation-heavy outliers.

Example of Queries with Low Domain Citations


Insights from Single-Citation Examples

  1. Local or Brand-Specific Queries
    Several prompts directly reference a single company or service provider, such as “Action Pest Control bed bug removal Mid-South,” “Rick Lucas Plumbing maintenance service,” and “Toyota of Boerne best-selling new vehicles.” In these cases, the cited domain represents the official business website, which fully satisfies the information intent.
  2. Product and Service Reviews
    Queries such as “Appliances Connection reviews on brand selection” or “entrepreneur tool reviews and comparisons” often point to domains that either host reviews (reviewed.com) or represent platforms that publish comparison-oriented marketing content (monday.com). In both cases, one authoritative domain provides sufficient coverage for the topic.
  3. General Informational or How-To Queries
    Some queries (e.g., “How to improve with Chess.com puzzles,” “how AI changes SEO for marketers”) rely on instructional or standardized knowledge that can be supported by one high-trust source (e.g., chess.com, schema.org).
  4. Web Search Availability and Single-Citation Behavior Across Models
    Although Perplexity operates with web search enabled while Gemini and OpenAI do not, all three models display single-citation behavior for certain query types. This suggests that enabling web search does not necessarily expand citation breadth in every case, particularly for narrow, brand-specific, or self-contained prompts where one domain sufficiently answers the query.

LLM Response Length (Character Count) by Platform & Model

[Chart: LLM response length (character count) by platform & model]

Key Insights:

– For concise answers, Perplexity (Sonar) performs best.

– For verbose/detailed outputs, Gemini is the clear leader (though sometimes excessively long).

– OpenAI models strike a middle ground between verbosity and brevity, with GPT-4o-mini leaning more toward brevity, while GPT-5-nano tends to be slightly more verbose.

LLM Response Length vs Citation Relationship

  1. Response Length Variability Across Models:
    Gemini (Gemini 2.0 Flash-Lite) shows the widest range in output length, occasionally exceeding 60,000 characters, indicating a tendency toward longer and more verbose responses. In contrast, Perplexity (Sonar) consistently produces shorter outputs with minimal variation.
  2. Citation Density Differences:
    Despite shorter responses, Perplexity tends to include relatively more citations per output than Gemini or GPT-4o-mini, suggesting a stronger focus on referencing. OpenAI’s GPT-5-nano occasionally produces very high citation counts, though these are outliers.
  3. Length vs. Citation Count Relationship:
    While longer outputs (e.g., Gemini) generally provide more opportunity for citations, the data does not show a direct correlation between response length and citation count. Some models (like Perplexity) achieve higher citation density even within shorter responses, implying that citation behavior is model-specific rather than purely a function of length.
  4. Model Output Characteristics:
    • Gemini: Longest, most variable responses; fewer citations per length.
    • GPT-5-nano: Moderate length with occasional bursts of high citation counts.
    • GPT-4o-mini: Shorter, concise responses with balanced citation levels.
    • Perplexity Sonar: Shortest responses but most citation-dense overall.

Conclusion

This analysis reveals clear differences in how major LLMs cite external sources, shaped by their architecture and access to web search.

  • Perplexity Sonar consistently delivers the highest citation density and broadest domain diversity. Its always-on web search allows it to cite more recent URLs and multiple domains per query, making it the most transparent and retrieval-aligned model.
  • Gemini 2.0 Flash-Lite produces the longest responses, occasionally exceeding 60,000 characters, but cites fewer sources. While it shares some domain overlap with OpenAI, it favors verbosity over breadth of citations.
  • OpenAI GPT models (GPT-4o-mini, GPT-5-nano) offer a balanced approach: moderate response lengths, consistent citation behavior, and strong overlap with Gemini on trusted domains. GPT-5-nano occasionally shows bursts of high citation count.

Key Observations:

  • Citation overlap across models is limited but meaningful, with about 60–65% of shared queries including at least one common domain.
  • Models without web search (e.g., Gemini, OpenAI) still cite authoritative sources effectively, especially for niche or brand-specific queries.
  • There is no direct correlation between response length and number of citations, indicating that verbosity does not necessarily translate to citation richness.

Implications

Models with web retrieval capabilities (like Perplexity) are more suitable for tasks requiring up-to-date, citation-rich, and diverse content. In contrast, non-retrieval models (such as Gemini and OpenAI) may perform adequately in structured or narrowly scoped tasks. 

Ultimately, model selection should align with the specific trade-offs between citation fidelity, verbosity, and content freshness.
