You taught AI. Now it works for your competitor. Whoops.
Reverse citations aren’t a measurement bug. They’re a value capture problem.
Google AI Overviews got more accurate in the last year. Citations got less honest. Both numbers come from the same study.
Oumi ran a benchmark for the New York Times across 4,326 queries, twice. Once when AI Overviews ran on Gemini 2 (October 2025). Again when it ran on Gemini 3 (February 2026). Accuracy went from 85% to 91%. Ungrounded citations went from 37% to 56%. “Ungrounded” means the cited sources don’t actually support what the answer says. [1]
People keep framing this as a contradiction. It isn’t. Accuracy lives in one layer. Grounding lives in another. The model can improve one and degrade the other inside a single upgrade. That’s not a glitch. That’s the architecture doing exactly what it’s built to do.
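Here is a minimal sketch of why the two numbers can move in opposite directions. The field names (`answer_correct`, `citations_support_answer`) and the toy data are assumptions for illustration. This is not Oumi's grading methodology, just the shape of the idea: each answer gets two independent grades.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GradedAnswer:
    answer_correct: bool            # the answer text matches the reference fact
    citations_support_answer: bool  # the cited pages actually back the answer


def accuracy(rows):
    """Share of answers graded correct."""
    return sum(r.answer_correct for r in rows) / len(rows)


def ungrounded_rate(rows):
    """Share of answers whose citations do not support them."""
    return sum(not r.citations_support_answer for r in rows) / len(rows)


# Toy data, not the Oumi results: ten graded answers per model version.
old_model = (
    [GradedAnswer(True, True)] * 6
    + [GradedAnswer(True, False)] * 2
    + [GradedAnswer(False, False)] * 2
)
new_model = (
    [GradedAnswer(True, True)] * 4
    + [GradedAnswer(True, False)] * 5
    + [GradedAnswer(False, False)] * 1
)

for name, rows in (("old", old_model), ("new", new_model)):
    print(name, accuracy(rows), ungrounded_rate(rows))
```

In the toy data, accuracy rises from 0.8 to 0.9 while the ungrounded rate rises from 0.4 to 0.6. Nothing forces the two columns to agree.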
You taught AI. Now it works for your competitor.
That’s not a metaphor. That’s the system.
Ann Smarty named it
Ann Smarty (@seosmarty) wrote about this in April. She gave it a name: reverse citations.
The mechanic isn’t hers. It’s how transformer-based RAG works. But the naming is what made the phenomenon visible to people working in the space, including me. Before reverse citations, this was “something weird about how AI cites stuff.” After, it was a thing you could point at.
The shape of the thing she named: the model writes the answer first, from parametric memory. Then a separate process picks URLs that look topically related and staples them on as citations. The citation didn’t source the answer. It got picked after the fact because it looked plausible.
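A toy version of that answer-first, staple-later flow might look like the sketch below. Everything in it is hypothetical: the function names, the example URLs, and the word-overlap scoring are stand-ins, not how any real platform picks citations. The point is only that the picker never checks whether a page supports the answer.

```python
def answer_from_memory(question, memory):
    """Stand-in for the model answering from what it already knows."""
    return memory[question]


def topical_overlap(answer, page_text):
    """Fraction of the answer's distinct words that also appear on the page."""
    answer_words = set(answer.lower().split())
    page_words = set(page_text.lower().split())
    return len(answer_words & page_words) / len(answer_words)


def staple_citations(answer, pages, k=1):
    """Pick the k pages that look most related. No check that they support the answer."""
    ranked = sorted(
        pages,
        key=lambda url: topical_overlap(answer, pages[url]),
        reverse=True,
    )
    return ranked[:k]


# Hypothetical data: the first page explains the idea in its own words,
# the second page is a keyword-dense lookalike.
memory = {
    "what is crawl budget": "crawl budget is the number of pages a search engine will crawl on a site",
}
pages = {
    "https://origin.example/guide": "we explain how often bots visit and why large sites get throttled",
    "https://lookalike.example/faq": "crawl budget pages search engine crawl site number faq",
}

answer = answer_from_memory("what is crawl budget", memory)
cited = staple_citations(answer, pages)
print(cited)
```

The answer step never reads `pages` at all. The citation step only sees surface similarity, so the lookalike page wins the slot.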
Ahrefs found 14% of AI responses contain factual errors and 8% of cited links are hallucinated. [12] The Oumi study found 56% of Gemini-powered AI Overviews (AIO) answers ungrounded as of February. [1] This isn’t an edge case. This is the operating mode of an entire layer of how AI does information retrieval.
Ann’s piece is worth reading. [2] She framed it cleanly. What follows here is the architecture underneath the name, and what that architecture costs you.
The architecture
Two papers from earlier this year explain why reverse citations exist.
The first is Yeh and Li at UW-Madison. They studied how retrieved documents change what’s happening inside three production-scale LLMs. Their finding: later transformer layers favor parametric knowledge over retrieved evidence. Retrieved documents largely confirm what the model already believes. They don’t teach it new things. For hard questions where the model genuinely doesn’t know the answer, retrieved relevant documents fail to fix the wrong answer 35.6% of the time. [3]
The model already knows. The retrieval is decoration on top of the knowing.
The second is Khan and colleagues at Max Planck and Microsoft. They tested 12 LLMs by swapping source labels on identical content. The model’s selections shifted substantially based on who said it, not what was said. Source preferences override content. The preferences are also resistant to instruction: explicitly prompting the model to “not be biased” had no meaningful effect. [4]
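The label-swap idea can be sketched as a small harness. This is an assumption-heavy illustration, not the protocol from the paper: `choose` is a hypothetical callable standing in for a model call, and the two stub choosers exist only to show what the two extremes look like.

```python
def label_follow_rate(choose, content_pairs, preferred_label, other_label):
    """Share of trials where the pick carries the preferred label.

    Each pair of texts is shown twice with the labels swapped, so a chooser
    that only reads content scores 0.5 and one that only reads labels scores 1.0.
    """
    followed = 0
    trials = 0
    for text_a, text_b in content_pairs:
        for label_a, label_b in (
            (preferred_label, other_label),
            (other_label, preferred_label),
        ):
            picked_label, _picked_text = choose([(label_a, text_a), (label_b, text_b)])
            followed += picked_label == preferred_label
            trials += 1
    return followed / trials


def content_driven(options):
    """Stub chooser: ignores labels, picks the longer text."""
    return max(options, key=lambda option: len(option[1]))


def label_driven(options):
    """Stub chooser: ignores content, picks whatever is labeled BigBrand."""
    return next((o for o in options if o[0] == "BigBrand"), options[0])


pairs = [
    ("Short claim.", "A longer, more detailed claim."),
    ("Detailed answer with steps.", "Brief."),
]

print(label_follow_rate(content_driven, pairs, "BigBrand", "SmallBlog"))
print(label_follow_rate(label_driven, pairs, "BigBrand", "SmallBlog"))
```

The further a real model's score sits above 0.5 on identical content, the more the label is doing the choosing.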
Together those two papers tell you: the model already knows the answer before it retrieves anything, and the URL it staples on at the end is selected by a separate preference that was baked in during training. That preference doesn’t care which content actually shaped the answer. It cares which source the model has been trained to prefer.
This is not a flaw being patched. This is how it works. Most of you GEO/SEO bros and sis’s should reread that paragraph above.
The two layers don’t move together
The training layer is what the model knows. The retrieval layer is what the model finds when it goes looking.
These can improve and degrade independently. Inside a single model upgrade. With no warning.
Gemini 2 to Gemini 3 is the live example. Accuracy improved 6 points. Grounding quality dropped 19 points. Same brand, same content, same web. Different model, different behavior, no input changed on your end. [1]
Resoneo’s framework calls this parametric vs dynamic visibility. [5] DEJAN’s Brand Authority Index measures it through 200,000 queries against Gemini 3. [6] Both confirm the same architecture from different angles: what the model knows is one thing, what the model retrieves is another, and the connection between them is whatever the platform decides it is on a given upgrade.
The implication people don’t want to face: retrieval-layer behavior is not a stable measurement target. It can change underneath you between upgrades, with no change to your underlying presence in the model.
The swap
Reverse citations work in both directions.
If the citation isn’t sourcing the answer, then the citation can be reassigned to anyone the platform’s heuristic prefers. The schema density of the cited page, the freshness of the cited page, the source preference baked into training. All of it runs after the answer is generated. The staple is a separate decision from the knowing. It’s a mf’ing free for all.
So three things can happen, and they all look identical on a citation-tracking dashboard.
The first: you lose preference. The model used to know you, now it doesn’t. Real loss.
The second: the platform changed citation policy. The model still knows you. The visible credit just stopped surfacing. No actual loss in the parametric layer, full apparent loss in the retrieval layer.
The third: someone else became the preferred source. The platform’s heuristic now prefers their schema, their freshness, their domain authority pattern. They get credited for an answer the model is generating from your knowledge.
That third one is the meanest. Growth Memo (Kevin Indig, @Kevin_Indig) published research in April measuring this. They found 61.7% of LLM citations are ghost citations. The domain gets a source link but the brand isn’t mentioned in the answer text at all. Only 13.2% of brand appearances convert into both a citation and a mention together. Gemini mentions brands 83.7% of the time but only generates a citation link 21.4% of the time. ChatGPT cites 87% of the time but mentions brands in only 20.7% of answers. [7]
The brand getting cited and the brand whose knowledge is being used are not the same brand most of the time. The decoupling is measured. It is not theoretical.
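If you want to count this on your own response logs, a rough sketch follows. The category names, the substring match for brand mentions, and the example responses are all assumptions. This is not Growth Memo's methodology, just one way to split citation from mention.

```python
from collections import Counter
from urllib.parse import urlparse


def on_domain(url, domain):
    """True if the URL's host is the domain or one of its subdomains."""
    host = (urlparse(url).hostname or "").lower()
    return host == domain or host.endswith("." + domain)


def classify(brand, domain, answer_text, cited_urls):
    """Place one AI response into a citation-versus-mention bucket."""
    mentioned = brand.lower() in answer_text.lower()  # naive substring match
    cited = any(on_domain(url, domain) for url in cited_urls)
    if cited and mentioned:
        return "cited_and_mentioned"
    if cited:
        return "ghost_citation"  # link to the domain, brand absent from the text
    if mentioned:
        return "mention_only"
    return "absent"


# Hypothetical logged responses: (answer text, cited URLs).
responses = [
    ("Acme makes the most durable anvils.", ["https://www.acme.example/anvils"]),
    ("Look for drop-forged steel.", ["https://acme.example/guide"]),
    ("Acme is a common pick.", ["https://rival.example/review"]),
    ("Check the weight rating.", ["https://rival.example/specs"]),
    ("Hardened faces last longer.", ["https://blog.acme.example/faces"]),
]

tally = Counter(
    classify("Acme", "acme.example", text, urls) for text, urls in responses
)
cited_total = tally["ghost_citation"] + tally["cited_and_mentioned"]
ghost_share = tally["ghost_citation"] / cited_total
print(dict(tally), ghost_share)
```

A substring match will miss misspellings and product names, so treat the counts as a floor for mentions rather than a precise figure.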
Profound (@tryprofound) found ChatGPT often cites direct competitors together. It doesn’t pick one winner. [8] AirOps (@AirOpsHQ) found ChatGPT only cites about 15% of the pages it retrieves. 85% retrieved, never cited. [9] There’s an experiment from April where someone published a blank webpage with rich structured data and no content. Within 36 hours, Perplexity cited it as a top source. The rich content of every brand that had been informing that category got displaced from the citation slot by a page that had nothing in it.
You taught AI. Now it works for your competitor.
You can spend years building the model’s understanding of your category. You can write the content, run the press, earn the mentions, do the work. The model learns. The model answers. And the model can answer with your knowledge while pointing at someone whose contribution to that knowledge is zero.
The previous era of SEO had a version of this problem. Competitors could rank for queries you educated the market on. But the SEO version had a tell: the user clicked a link and read content. They could see who was actually saying what. The AI version has no tell. The user reads the answer. The answer is shaped by your knowledge. The citation has someone else’s name. The user doesn’t know.
This is value capture, not measurement
Yes, the dashboards are unreliable. That’s true. Citation tracking can’t separate “you got worse” from “the platform changed citation policy” from “your competitor improved their schema.” Click-vs-impression decomposition was already a known problem in this space. @SeerInteractive published on it in April. [10]
The measurement issue is real. But the measurement issue is downstream of the bigger problem.
The bigger problem is what the metrics are measuring. Citation share measures who got the staple. It does not measure who built the answer. Treating citation share as a proxy for influence is treating the visible decoration as a proxy for the load-bearing thing.
The brands funding the category education are not necessarily the brands the model credits. The infrastructure was paid for collectively. The visibility is awarded individually. And the awarding criteria are things like schema density, freshness, source preference. They have very little to do with which brand actually contributed to the model knowing what it knows.
This is structurally similar to the Wikipedia problem. Many sources contribute, but the canonical reference is what people cite. Except the canonical reference in AI isn’t a stable thing. It’s a stapling heuristic that re-rolls every model upgrade.
The retrieval layer is smaller than the dashboards make it look
Semrush ran a clickstream analysis across 17 months. Over 1 billion lines of US data, October 2024 through February 2026. They tracked how often ChatGPT actually triggers a web search. [11]
In late 2024, the share was 46%. By February 2026, it was 34.5%. The retrieval layer didn’t grow as the platform matured. It shrank.
Roughly two-thirds of ChatGPT queries are now being answered from parametric memory alone. No retrieval, no citations, no URL stapled on at the end. Just whatever the model knows about the topic. The Oumi finding on Gemini AIO points the same direction: even on queries where retrieval does run, 56% of the answers aren’t grounded in what was retrieved. [1]
Stack the two findings together. ChatGPT: ~65% of queries answered without retrieval at all. Gemini AIO: 56% of retrieved answers not actually grounded in retrieval. The retrieval layer is smaller and less reliable than visibility tools assume. The training layer is doing more work than it is getting credit for.
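The arithmetic behind those figures, spelled out. The inputs are the numbers quoted above. The variable names are just labels for this sketch, and the two platforms are kept separate on purpose: the rates come from different products, so multiplying them together would not be a measured result.

```python
# Figures quoted in the text above.
chatgpt_search_share_late_2024 = 46.0  # % of ChatGPT queries triggering web search [11]
chatgpt_search_share_feb_2026 = 34.5   # same metric, February 2026 [11]
gemini_aio_ungrounded = 56.0           # % of AIO answers not supported by citations [1]

# ChatGPT: share of queries answered with no retrieval at all.
chatgpt_no_retrieval = 100 - chatgpt_search_share_feb_2026

# ChatGPT: how far the retrieval share fell, in percentage points.
search_share_drop = chatgpt_search_share_late_2024 - chatgpt_search_share_feb_2026

# Gemini AIO: share of answers that were grounded in what was cited.
gemini_aio_grounded = 100 - gemini_aio_ungrounded

print(chatgpt_no_retrieval, search_share_drop, gemini_aio_grounded)
```

That is 65.5% of ChatGPT queries with no retrieval, an 11.5 point drop in retrieval share, and 44% of AIO answers grounded.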
What to do about it
The training-layer presence you build is durable. The retrieval-layer presence you have today can be reassigned next upgrade.
The work that builds training-layer presence is the work that’s always been called PR, earned media, thought leadership, category education. Press placements, brand mentions, authoritative co-occurrence, Wikipedia, reference-grade content, content that gets cited by other people in the space. Recent research suggests distributing content across a wide range of publications can lift AI citations by hundreds of percent compared to publishing only on your own site. The mechanism is the training layer. The model’s understanding of your category gets built from where you appear, who mentions you, what context you appear in.
Retrieval-layer optimization still matters. On-page structure, schema density, freshness, the technical work of being citable. It’s just newly understood as the layer that can be reassigned to someone else without warning. You don’t own the citation slot. You influence it on a budget set by someone else. It’s a fickle bitch.
The portfolio question is: where are you investing? If only retrieval-layer, you’re optimizing for a shrinking, drifting, reassignable surface. If only training-layer, you’re invisible until the platform’s heuristic happens to credit you. The honest answer is both, in proportion to how durable you want your visibility to be.
The diagnostic that distinguishes “you got worse” from “the platform changed citation policy” requires measuring both layers. Search-off audits paired with retrieval-side citation tracking, decomposed properly. Almost nobody does this yet.
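One way that decomposition could be wired up is sketched below. All of it is assumption: the inputs are rates you would have to measure yourself (brand mention rate with web search turned off as a proxy for the training layer, citation rate with search on for the retrieval layer), the 5-point tolerance is arbitrary, and the function name is made up for illustration.

```python
def diagnose(
    param_before, param_after,          # search-off brand mention rate, 0 to 1
    cite_before, cite_after,            # your citation rate with search on, 0 to 1
    rival_cite_before, rival_cite_after,  # a competitor's citation rate, 0 to 1
    tol=0.05,                           # arbitrary noise threshold (assumption)
):
    """Map a citation drop to one of the three scenarios described earlier."""
    param_drop = (param_before - param_after) > tol
    cite_drop = (cite_before - cite_after) > tol
    rival_gain = (rival_cite_after - rival_cite_before) > tol

    if not cite_drop:
        return "no citation loss"
    if param_drop:
        return "lost preference in the model"
    if rival_gain:
        return "citation reassigned to a competitor"
    return "platform changed citation policy"


# Hypothetical before/after measurements across one model upgrade.
print(diagnose(0.40, 0.20, 0.30, 0.10, 0.10, 0.12))
print(diagnose(0.40, 0.39, 0.30, 0.10, 0.10, 0.11))
print(diagnose(0.40, 0.41, 0.30, 0.10, 0.10, 0.35))
```

The order of the checks is a design choice. A drop in the search-off rate is treated as the real loss and reported first, because it is the only one of the three that changes what the model knows about you.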
The training-era work is still the durable work. It just isn’t the work that earns the visible credit anymore.
You can build the model’s understanding of your category over years. You can spend money you can’t get back. The model can answer with your knowledge, and the user reading the answer will see someone else’s name on the citation.
You taught AI. Now it works for your competitor.
Decide which layer you’re building toward. Both are real. Only one of them is yours.
This article was originally published on X by Aaron Haynes. Aaron is the CEO of Loganix, a visibility + SEO platform for brands and agencies.
Sources
[1] Oumi / The New York Times — Google AI Overviews accuracy and grounding analysis. April 7, 2026. https://www.nytimes.com/2026/04/07/technology/google-ai-overviews-accuracy.html
[2] Ann Smarty — “Reverse Citations: What SEOs/GEOs Need to Know.” SEO & AI Newsletter, April 14, 2026. https://annsmarty.com/p/reverse-citations-what-seosgeos-need
[3] Yeh & Li — “How Retrieved Context Shapes Internal Representations in RAG.” UW-Madison, February 2026. https://arxiv.org/abs/2602.20091
[4] Khan et al. — “In Agents We Trust, but Who Do Agents Trust? Latent Source Preferences Steer LLM Generations.” MPI-SWS / Microsoft, February 2026. https://arxiv.org/abs/2602.15456
[5] Resoneo — “Two types of LLM visibility.” March 2026. https://think.resoneo.com/chatgpt/5.3-5.4/
[6] DEJAN AI — Brand Authority Index, Dan Petrovic. March 28, 2026. https://dejan.ai/blog/brands/
[7] Growth Memo — Ghost citations and platform-specific brand mention vs. citation patterns. April 2026. https://www.growth-memo.com/
[8] Profound — ChatGPT competitor co-citation patterns. February 2026.
[9] AirOps — ChatGPT retrieval-to-citation rate analysis. March 2026. https://www.airops.com/
[10] Seer Interactive — Brand-cited CTR decomposition on AIO SERPs. April 2026. https://www.seerinteractive.com/
[11] Semrush — “ChatGPT traffic analysis: Insights from 17 months of clickstream data.” February 2026. https://www.semrush.com/blog/chatgpt-search-insights/
[12] Ahrefs — AI response factual error and hallucinated link rates. 2025. https://ahrefs.com/blog/
Written by Aaron Haynes on May 6, 2026