GEO studies expire. The retrieval system doesn’t.
The macro case for tactical: foundation first, signals second.
Ahrefs (@ahrefs) published the cleanest causal study the AI search discourse has produced. Difference-in-differences methodology, 1,885 pages that added JSON-LD schema between August 2025 and March 2026, matched against 4,000 control pages. Four separate tests, all pointing the same direction. [1]
The finding: adding schema didn’t lift citations on pages already being cited. Google AI Mode +2.4% (noise). ChatGPT +2.2% (noise). Google AI Overviews (AIO) -4.6% (small, statistically significant, unexplained).
That’s real data. The methodology is sound. Take it seriously.
AI takeaway: SM data has a half-life. The architecture doesn’t.
What the study found
The dependent variable was citation count change on pages that already had 100+ AI Overview citations before schema was added. Ahrefs paired each treated page with three control pages matched on pre-period citation levels. They ran two-sample t-tests, difference-in-differences, event studies, and re-ran DiD with adjusted windows to make sure the result wasn’t sensitive to definitions of “before” and “after.”
All four tests converged. Schema added after the fact didn’t move citation share on already-cited pages.
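For readers who haven’t worked with difference-in-differences, here is a minimal sketch of the core calculation. It is not Ahrefs’ code. The citation counts are invented for illustration, and the function name `diff_in_diff` is hypothetical. The idea: subtract the change in the control group from the change in the treated group, so a trend that lifts everyone isn’t credited to schema.

```python
from statistics import mean

def diff_in_diff(treated_pre, treated_post, control_pre, control_post):
    """Change in the treated group minus change in the control group."""
    treated_change = mean(treated_post) - mean(treated_pre)
    control_change = mean(control_post) - mean(control_pre)
    return treated_change - control_change

# Hypothetical citation counts per page, invented for illustration only.
treated_pre = [120, 150, 180]    # before schema was added
treated_post = [126, 159, 186]   # after schema was added
control_pre = [118, 152, 180]    # matched pages, same "before" window
control_post = [125, 158, 185]   # matched pages, same "after" window

estimate = diff_in_diff(treated_pre, treated_post, control_pre, control_post)
print(estimate)  # treated pages rose by 7 on average, controls by 6
```

In this toy data both groups rose. Net of the control trend, the treated pages moved by about one citation. A real analysis would also need a significance test, which this sketch leaves out.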
Ahrefs’ own caveat, stated clearly in the piece: the study tested pages already inside the consideration set. It did not test cold-start pages, pages not yet getting AI citations, or pages that might benefit from schema for initial crawling, parsing, or indexing.
They also reference a separate searchVIU experiment confirming that during direct retrieval, AI systems read visible HTML and ignore JSON-LD, hidden Microdata, and hidden RDFa. [2]
The combined picture: for pages already getting cited, schema is not the lever. Whether schema helps pages get into the consideration set in the first place remains an open question Ahrefs explicitly didn’t test.
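The searchVIU observation is easy to picture with a small sketch. This is not how any specific AI system parses pages. It is an assumption-laden illustration using Python’s standard `html.parser`, with a hypothetical class `VisibleTextParser` and an invented page. It collects only text outside `<script>` and `<style>` tags, so a JSON-LD block never reaches the output.

```python
from html.parser import HTMLParser

class VisibleTextParser(HTMLParser):
    """Collects text outside <script> and <style>, so JSON-LD is skipped."""
    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is not inside a skipped tag.
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())

def visible_text(html):
    parser = VisibleTextParser()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)

# Invented example page: the product name in JSON-LD differs from the heading.
page = """
<html><head>
<script type="application/ld+json">{"@type": "Product", "name": "Hidden Widget"}</script>
</head><body>
<h1>Acme Widget</h1>
<p>Ships in two days.</p>
</body></html>
"""

text = visible_text(page)
print(text)
```

A reader that works this way sees the heading and the paragraph. It never sees "Hidden Widget". The sketch does not handle hidden Microdata or RDFa, which would need CSS-aware rendering.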
The system has layers
The conversation about AI citations keeps treating the citation pipeline as a single decision. It isn’t. It’s a sequence of operations, each governed by different signals, each capable of changing independently across model upgrades.
There’s training-layer presence. What the model parametrically knows about a brand, built from years of distributed content across the open web. This is what the model already “knows” before any retrieval happens on a given query.
There’s retrieval-eligibility. Whether the retrieval pipeline considers a page at all when answering a query. Crawlability, indexing, parseability, accessibility. The eligibility gate that determines what’s even in the candidate set.
There’s retrieval-ranking. Among eligible candidates, which ones get pulled into the model’s context for that specific query.
There’s extraction. What content gets pulled from retrieved pages and placed into the answer text. Visible HTML, headings, structured passages, the part of the page the model actually reads.
There’s citation-slot assignment. Among the sources that informed the answer, which URL gets stamped on the visible citation. This is where reverse citations happen. This is where source preference, schema density, freshness, and brand authority compete to decide who gets the visible credit.
Five different operations. Five different signal sets. The system can improve at one layer and degrade at another inside a single model upgrade.
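One way to keep the layers separate in your own tracking is to record them as distinct fields rather than one score. The sketch below is a simplification: it reduces each layer to a pass/fail flag, and the names `Layer` and `compare` plus the before/after snapshots are hypothetical. It shows the point of the last sentence, that one upgrade can improve one layer and degrade another.

```python
from enum import Enum
from typing import Dict, List, Tuple

class Layer(Enum):
    TRAINING_PRESENCE = "training-layer presence"
    RETRIEVAL_ELIGIBILITY = "retrieval-eligibility"
    RETRIEVAL_RANKING = "retrieval-ranking"
    EXTRACTION = "extraction"
    CITATION_SLOT = "citation-slot assignment"

def compare(
    before: Dict[Layer, bool], after: Dict[Layer, bool]
) -> Tuple[List[Layer], List[Layer]]:
    """Return (improved, degraded) layers between two snapshots."""
    improved = [layer for layer in Layer if after[layer] and not before[layer]]
    degraded = [layer for layer in Layer if before[layer] and not after[layer]]
    return improved, degraded

# Hypothetical status of one page before and after a model upgrade.
before_upgrade = {
    Layer.TRAINING_PRESENCE: True,
    Layer.RETRIEVAL_ELIGIBILITY: True,
    Layer.RETRIEVAL_RANKING: False,
    Layer.EXTRACTION: True,
    Layer.CITATION_SLOT: True,
}
after_upgrade = dict(before_upgrade)
after_upgrade[Layer.RETRIEVAL_RANKING] = True   # ranking got better
after_upgrade[Layer.CITATION_SLOT] = False      # visible credit got worse

improved, degraded = compare(before_upgrade, after_upgrade)
print([layer.value for layer in improved])
print([layer.value for layer in degraded])
```

A single blended score would report "no change" here. The per-layer view reports two changes in opposite directions.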
What Ahrefs measured, and what it didn’t
Mapping the study to the layers makes the finding sharper.
The dependent variable was citation count change on pages already cited 100+ times. That places the test at citation-slot assignment. The treated pages were already past retrieval-eligibility. They were already in the retrieval-ranking pool. The question Ahrefs asked was whether adding schema after the fact would shift which URLs got stamped on answers those pages were already informing.
Answer: no meaningful shift.
That’s a finding about the citation-slot assignment layer for already-eligible pages. It’s not a finding about the other four layers.
Retrieval-eligibility wasn’t tested. Ahrefs flagged this themselves. Pages not yet in the consideration set might still benefit from schema for crawling, parsing, or indexing. The study can’t speak to that.
Extraction wasn’t tested directly, but the searchVIU experiment Ahrefs cites covers it. JSON-LD isn’t read during direct retrieval. AI systems extract from visible HTML. Schema doesn’t influence what content gets pulled into the answer.
Training-layer presence wasn’t tested. Schema markup applied to a domain might influence how the model’s parametric understanding of that brand develops over years of crawls. Different timescale, different mechanism, not what this study was designed to measure.
The framework treats each layer independently. Ahrefs measured one. The finding is consistent with what the layered model says about that layer. Schema may still operate somewhere in the stack. On this evidence, it doesn’t operate at citation-slot assignment for pages already in the pool.
Tactical data has a half-life
Ahrefs ran the study during a specific window. August 2025 to March 2026. AI Overviews on Gemini 2 transitioning to Gemini 3. Pages being measured were behaving the way they were behaving during that period.
The system doesn’t hold still. Oumi and the New York Times measured the Gemini 2 to Gemini 3 transition and found that the share of ungrounded citations rose from 37% to 56% inside a single upgrade. [3] The Semrush 17-month replication tracked ChatGPT’s search-on share dropping from 46% in late 2024 to 34.5% in February 2026. [4] The retrieval layer is shrinking. The grounding behavior is changing. The source preferences are evolving.
Any causal finding on a system that moves this much has a built-in expiration. Today’s rigorous result is tomorrow’s correlation and next year’s null. This isn’t a critique of Ahrefs – data like theirs actually helps move us along. It’s a structural property of doing causal research on a system that updates faster than studies can replicate.
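To put numbers on this, here is a small sketch. The first part is plain arithmetic on the figures cited above. The second part takes the "half-life" metaphor literally as exponential decay, which is an assumption of this sketch and not something any of the cited studies claim. The six-month half-life and the function names are hypothetical.

```python
def relative_change(old, new):
    """Fractional change from old to new, e.g. -0.25 means a 25% drop."""
    return (new - old) / old

def remaining_weight(age_months, half_life_months):
    """Exponential decay: the weight halves every half_life_months."""
    return 0.5 ** (age_months / half_life_months)

# Arithmetic on the cited figures.
ungrounded_shift_points = 56 - 37                 # percentage points
search_on_change = relative_change(46.0, 34.5)    # ChatGPT search-on share

# Hypothetical: how much weight a finding keeps if its half-life were 6 months.
weights = [remaining_weight(months, 6) for months in (0, 6, 12)]

print(ungrounded_shift_points)
print(search_on_change)
print(weights)
```

The ungrounded share moved 19 points and the search-on share fell by a quarter of its starting value. Under the assumed six-month half-life, a finding would carry half its weight after six months and a quarter after a year. The real decay rate is unknown, which is the point.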
Here’s the way to picture it: a causal study is one flight out of an airport to one destination. The airport has hundreds of flights, routes, arrivals and destinations. The flight was real. It departed, it arrived, it carried passengers. Causally clean. The airport doesn’t care. The airport keeps running. Tomorrow’s schedule is different from yesterday’s. New routes open. Some flights get cancelled. The airport’s existence as a system that routes flights doesn’t depend on which specific flights ran on a given Tuesday.
Tactical findings are flights. The architecture is the airport. Confusing the two is the structural error the discourse keeps making. Then again, the point of the discourse is to have an opinion 🙄
None of this is a flaw in the methodology. The studies are the best evidence anyone has. We use them ourselves. The half-life isn’t a problem to fix. It’s a property of being early. The whole field is in motion because the underlying system is still being built. Findings shift because the system shifts. The studies are doing exactly what early-stage empirical work should do: capturing what’s measurable, knowing it’ll need revision, contributing to a map being drawn in real time.
The layers themselves are stable. The mechanics of how the system produces an answer don’t change. Parametric knowledge, retrieval, extraction, source preference. Those describe the architecture, not a tactical signal within it. What changes is which specific signal moves which specific layer at which specific moment.
Building on tactical findings means building on a foundation that shifts faster than the work can adapt. Building on architecture means building on a foundation that survives whichever specific tactic gets falsified next.
The macro case for tactical
None of this is an argument against tactical work. Tactical work is real, measurable, and useful. Ahrefs (@ahrefs), Cyrus Shepard (@CyrusShepard), AirOps (@AirOpsHQ), Profound (@tryprofound), Growth Memo / Kevin Indig (@Kevin_Indig), Semrush (@semrush), SparkToro / Rand Fishkin (@randfish), Dan Petrovic at DEJAN, and Mike King / iPullRank (@iPullRank) are all producing valuable evidence about how specific layers are currently behaving. The evidence matters.
We use these studies ourselves. Our own corpus is built from this work, and the architectural picture didn’t drop out of theory. It emerged from doing tactical synthesis at scale. The pattern was: collect enough cases, score them, look for what worked, and the structure underneath kept surfacing. The architecture is what we found when the tactics started rhyming with each other. Any process of careful tactical synthesis run long enough surfaces the same thing.
The argument is about what the evidence is for.
Tactical findings tell you how a specific layer is responding to a specific signal at a specific moment. They don’t tell you what to bet the position on. A brand operating only on tactical optimization is betting that the tactics they’re optimizing for will remain durable. Not all of them will. Some will hold. Most will shift. A few will reverse. It ain’t sexy folks, it’s data.
There’s a second issue worth naming. Tactical findings are usually measured in isolation. A single tactic, tested against control. The system being measured doesn’t actually operate in isolation. Schema operates alongside training-layer presence, freshness, entity authority, content structure, and ten other signals at once. Testing one tactic against control measures what that tactic does by itself, which is rarely how the system actually works. Tactics in isolation behave differently than tactics in context. The same tactic operating inside a system of other tactics may unlock effects that don’t show up when it’s tested alone.
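A toy model shows what an interaction effect looks like. Everything here is invented: the function `toy_lift`, the coefficients, and the choice of "authority" as the second signal. It is not a claim that schema behaves this way. It only shows how a signal can measure zero when the other signal is absent and still matter when the other signal is present.

```python
def toy_lift(schema, authority):
    """Hypothetical response surface with an interaction term.

    schema and authority are 0 (absent) or 1 (present).
    Schema contributes nothing alone but amplifies authority.
    """
    return 0.0 * schema + 2.0 * authority + 3.0 * schema * authority

# Effect of adding schema when the other signal is absent.
isolated = toy_lift(1, 0) - toy_lift(0, 0)

# Effect of adding schema when the other signal is present.
in_context = toy_lift(1, 1) - toy_lift(0, 1)

print(isolated)
print(in_context)
```

In this invented model the isolated effect is 0.0 and the in-context effect is 3.0. Whether any real signal pair behaves like this is an empirical question that single-tactic tests aren’t designed to answer.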
A brand operating only on architectural presence without tactical attention misses the operational details that matter quarter to quarter. Which schema types are working. Which content structures are getting extracted. Which freshness signals are firing. The tactical layer is where today’s operations live.
A brand operating on both is robust. The architecture survives whichever tactic gets falsified. The tactical work captures current behavior of each layer. Both reinforce each other.
This is what retrieval optimization actually means. Not “this schema, this freshness signal, this content format.” Not the latest playbook from the latest study. The whole system, optimized as a system, with tactical attention paid to how each layer is currently behaving.
The macro case for tactical: foundation first, signals second. Architecture is the foundation. Tactical evidence is the operational layer on top. They’re different jobs and they reinforce each other.
What this looks like in practice
Invest in training-layer presence. Press, mentions, authoritative co-occurrence, Wikipedia, reference-grade content, category-defining contributions. The work that has always been called PR, earned media, and thought leadership. Durable. Builds the model’s parametric understanding of you across years of crawls, well before any specific query gets asked.
Invest in retrieval-eligibility. Crawlable site, indexed content, parseable structure, fast rendering, working canonicals. Schema may still do work here. Ahrefs’ own caveat says the study didn’t test this layer, so it neither confirms nor refutes it.
Invest in extraction-readiness. Clear structure, self-contained passages, visible HTML, answer near the top, factual specificity. Cyrus Shepard’s (@CyrusShepard) recent meta-analysis captures most of this layer well. [5]
Engage tactical evidence as it comes. Every causal study tells you something about how a specific layer is behaving right now. Update operations. Don’t update foundation.
And measure both layers when possible. Search-off audits for parametric presence. Retrieval-side citation tracking for visibility. Decomposed properly so you can tell “you lost preference” from “the platform changed citation policy” from “your competitor improved their schema.” Almost nobody does this. The diagnostic gap is real and it’s the operational consequence of the architecture-versus-tactical confusion the discourse keeps making.
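A rough sketch of what "decomposed properly" could mean on the retrieval side. The function `diagnose`, its labels, and the 5-point tolerance are all hypothetical, and the inputs are assumed to be fractional changes in citation counts over the same window. A search-off audit for parametric presence would be tracked as a separate series and is not modeled here.

```python
def diagnose(own_change, platform_change, competitor_change, tolerance=0.05):
    """Hypothetical first-pass label for a change in citation counts.

    All inputs are fractional changes over the same window,
    e.g. -0.20 means citations fell by 20%.
    """
    # You moved with the whole platform: likely a policy or model change.
    if abs(own_change - platform_change) <= tolerance:
        return "platform-wide shift"
    # You fell behind the platform while a competitor pulled ahead of it.
    if own_change < platform_change and competitor_change > platform_change + tolerance:
        return "competitor gained"
    # You fell behind the platform and nobody obvious took the slot.
    if own_change < platform_change:
        return "lost preference"
    return "gained relative to platform"

# Invented scenarios, one per label in the paragraph above.
policy_case = diagnose(-0.20, -0.18, -0.19)
competitor_case = diagnose(-0.20, 0.00, 0.25)
preference_case = diagnose(-0.20, 0.00, 0.01)

print(policy_case, competitor_case, preference_case, sep="\n")
```

The same 20% drop gets three different labels depending on what the platform and the competitor did. A heuristic like this only sorts cases for a closer look. It doesn’t establish cause.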
Ever try to build a lot of muscle in a short period? Slow, steady, methodical, systematic. That’s how the body responds to systemic load over time. Spot intervention on one muscle group for two weeks doesn’t move the lift numbers. The body adapts to long-term systemic stress, not to short-burst intensity on isolated parts. Architectural presence works the same way. Tactical spikes are the cram session. The architecture is the training program.
Closer
The discourse keeps measuring tactical outputs. Each new study reveals which specific tactic doesn’t hold under causal scrutiny. The pattern will continue. The next study will find another lever that was assumed to work and doesn’t. The one after that will find a lever that was assumed not to work and does.
What survives every causal study is the architecture itself. The layer structure. The mechanics of how the system produces an answer. Those don’t change because they describe the system, not a specific tactical signal within the system.
The airport keeps running, regardless of which flights are operating on a given Tuesday.
Nothing about this field is fully established. We’re all early. The findings are partial, the methods are evolving, and the system itself is still being built. Optimizing against any single finding is betting on a moving target. Agreeing on the architecture, what the layers are, how the system actually produces an answer, where each signal operates, may be how we map this faster. Together.
Sources
[1] Ahrefs / Louise Linehan + Xibeijia Guan — “We Tracked 1,885 Pages Adding Schema. AI Citations Barely Moved.” May 11, 2026. https://ahrefs.com/blog/schema-ai-citations/
[2] searchVIU — “Schema Markup and AI in 2025: What ChatGPT, Claude, Perplexity, Gemini Really See.” 2025. https://www.searchviu.com/en/schema-markup-and-ai-in-2025-what-chatgpt-claude-perplexity-gemini-really-see/
[3] Oumi / The New York Times — Google AI Overviews accuracy and grounding analysis. April 7, 2026. https://www.nytimes.com/2026/04/07/technology/google-ai-overviews-accuracy.html
[4] Semrush — “ChatGPT traffic analysis: Insights from 17 months of clickstream data.” February 2026. https://www.semrush.com/blog/chatgpt-search-insights/
[5] Cyrus Shepard / Zyppy — “AI Citation Ranking Factors Analysis: Evidence-Based Analysis of 54 Experiments, Patents, and Case Studies.” May 2026. https://signal.zyppy.com/p/ai-citation-ranking-factors/
Written by Aaron Haynes on May 13, 2026
CEO and partner at Loganix, I believe in taking what you do best and sharing it with the world in the most transparent and powerful way possible. If I am not running the business, I am neck deep in client SEO.



