
DNS and AI Discovery: How AI Crawlers and Agents Find and Trust Your Domain

“DNS for AI discovery” sounds like a single new standard. It is not. As of 2026 it is two separate layers that answer two different questions: an HTTP file layer that AI crawlers and agents fetch over the web today, and a DNS layer that establishes who your domain really is. This guide explains both honestly, separates what is shipping from what is still a proposal, and shows where a DNS owner should actually spend their effort.


Two layers, one domain

When people ask how to make a domain “discoverable to AI”, they are usually mixing two things that live in different places:

  • The HTTP file layer. Files an AI system fetches over the web: robots.txt, an llms.txt map, a Content Signals line, and agent manifests under /.well-known/. This is where almost all of the deployed controls and current proposals live.
  • The DNS layer. Records that prove identity and carry trust: the email-authentication records (SPF, DKIM, DMARC), MTA-STS, and DNSSEC as the cryptographic anchor underneath them. This is where verifiable trust already works, and where a record built specifically for AI is just emerging as an early draft (DNS-AID), not yet a ratified standard.

The honest headline: most of what is called “AI discovery” today is plain HTTP fetching governed by robots.txt, not a new AI-specific protocol. Some of the rest is preference signalling that is honoured only voluntarily, and some is still an unratified draft. Knowing which is which stops you from chasing things that do nothing.

Who actually crawls you: the AI user-agents that matter in 2026

Before you can control AI access, you need to know who is asking. Each major AI vendor runs several distinct bots, and they do different jobs. Treating them as one blob is the most common mistake. The reliable rule of thumb: bots that crawl to build a training set or a search index generally honour robots.txt, while fetchers triggered by a human in real time often do not, because a person asked for that specific page.

| Token | Operator | What it does | Honours robots.txt |
|---|---|---|---|
| GPTBot | OpenAI | Crawls content that may train foundation models | Yes |
| OAI-SearchBot | OpenAI | Surfaces sites in ChatGPT search results | Yes |
| ChatGPT-User | OpenAI | User-initiated fetch from a ChatGPT session | User-triggered, rules may not apply |
| ClaudeBot | Anthropic | Crawls content for model training | Yes |
| Claude-SearchBot | Anthropic | Builds the Claude search index | Yes |
| Claude-User | Anthropic | User-initiated fetch from a Claude session | Yes |
| Google-Extended | Google | Control token to opt out of Gemini training and grounding (not a crawler) | Control token only |
| PerplexityBot | Perplexity | Indexes and links sites in answers | Yes |
| Perplexity-User | Perplexity | User-initiated fetch from a Perplexity session | Generally does not |
| CCBot | Common Crawl | Builds the public Common Crawl corpus, a long-standing training source for LLMs | Yes |

Two details catch people out. First, Google-Extended is not a crawler. It has no user-agent string of its own and never fetches a page. It is a label you put in robots.txt to say whether content Google already crawled may be used to train Gemini or for grounding. Google states it does not affect your inclusion or ranking in Google Search. Applebot-Extended works the same way for Apple's AI training.

Second, match on the stable token, not the version string. Anthropic's old tokens anthropic-ai and claude-web are deprecated and no longer affect live Claude traffic, so a rule written against them does almost nothing today. The effective control for Anthropic is ClaudeBot. Block or allow by the stable name and ignore the version suffix in the full user-agent string.
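
As a rough illustration of matching on the stable token, the sketch below looks for a known token anywhere in a user-agent header and ignores whatever version suffix follows it. The token list comes from the table in this guide; the role labels, the function name identify_ai_bot and the sample user-agent strings in the usage note are assumptions made for the example, not vendor-published values. Google-Extended is left out on purpose because it never appears in a user-agent string.

```python
# Stable tokens from the table in this guide. The role labels are
# shorthand invented for this sketch, not vendor terminology.
KNOWN_AI_TOKENS = {
    "GPTBot": "training",
    "OAI-SearchBot": "search",
    "ChatGPT-User": "user-initiated",
    "ClaudeBot": "training",
    "Claude-SearchBot": "search",
    "Claude-User": "user-initiated",
    "PerplexityBot": "search",
    "Perplexity-User": "user-initiated",
    "CCBot": "training",
}


def identify_ai_bot(user_agent: str):
    """Return (token, role) for the first known token found, else None.

    Matching is a case-insensitive substring test on the stable token,
    so a version suffix such as "/1.2" never affects the result.
    """
    lowered = user_agent.lower()
    for token, role in KNOWN_AI_TOKENS.items():
        if token.lower() in lowered:
            return token, role
    return None
```

Called with an illustrative header such as "Mozilla/5.0 (compatible; ClaudeBot/1.0; +https://example.com/bot)", the function reports ClaudeBot regardless of the version number, while an ordinary browser string returns None.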

robots.txt is still the lever

For all the talk of new AI standards, the single most broadly honoured control is the oldest one. The Robots Exclusion Protocol (standardised as RFC 9309) is what every major AI vendor points site owners to when they ask how to opt in or out. You allow or disallow each bot by its token:

User-agent: GPTBot
Disallow: /private/

User-agent: ClaudeBot
Disallow: /private/

User-agent: CCBot
Disallow: /

The thing to internalise is what robots.txt is not. It is advisory. Honouring it is voluntary, the user-initiated fetchers above may bypass it, and a scraper with no interest in the rules will simply ignore the file. It is a sign on the door, not a lock. If you need to actually prevent access you need edge controls such as a web application firewall, bot management or rate limiting, which sit in front of the site and enforce in a way a text file never can.
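
To see how a compliant crawler would read the rules above, here is a minimal sketch using Python's standard urllib.robotparser module. It only models a bot that chooses to obey the file; nothing here enforces anything. The example.com URLs and the SomeOtherBot name are placeholders.

```python
from urllib.robotparser import RobotFileParser

# The same rules as the robots.txt example in this guide.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /private/

User-agent: ClaudeBot
Disallow: /private/

User-agent: CCBot
Disallow: /
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text held in memory (no network fetch)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# Ask the question a well-behaved crawler would ask before fetching.
for bot, url in [
    ("GPTBot", "https://example.com/private/report.html"),
    ("GPTBot", "https://example.com/blog/"),
    ("CCBot", "https://example.com/blog/"),
    ("SomeOtherBot", "https://example.com/private/report.html"),
]:
    print(bot, url, parser.can_fetch(bot, url))
```

The last case is the one to notice: a bot with no matching group and no wildcard group is allowed everywhere, which is another reason to review the file per token rather than assume a default.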

The practical hygiene point matters more than any single rule: keep robots.txt, your Content Signals line and any llms.txt consistent. It is surprisingly easy to disallow a path in one place while advertising the same path in another, which tells AI operators two different stories about the same content.

Content Signals: stating how your content may be used

robots.txt controls whether a bot may access a page. It says nothing about what happens to the content afterwards. Cloudflare's Content Signals Policy, launched on 24 September 2025 as a free, openly licensed extension to robots.txt, fills that gap. It adds a Content-Signal line with three independent yes or no signals:

  • search: building a search index and returning links and short excerpts. This explicitly excludes AI-generated summaries, so search=yes does not permit AI answers.
  • ai-input: real-time use such as retrieval-augmented generation or grounding a generated answer.
  • ai-train: training or fine-tuning AI models.

Content-Signal: search=yes, ai-input=yes, ai-train=no

Like robots.txt, this is a stated preference, not a technical block. Cloudflare is explicit that content signals express preferences and are not countermeasures against scraping. A non-compliant operator can ignore the line with no technical barrier, and Cloudflare frames the restriction as a reservation of rights rather than a tested legal ruling. Two subtleties are worth getting right: a blank signal means no preference was expressed through this mechanism, not that you have no preference; and because the three signals are independent, you can welcome search indexing while declining model training.
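
The distinction between "no" and "no preference expressed" is easy to lose when you process these lines, so a small sketch may help. It assumes the comma-separated key=value form shown in the example line above and maps each of the three signals to True, False or None, where None means the signal was left blank. The function name parse_content_signal is hypothetical, and this is not Cloudflare's reference parser.

```python
SIGNALS = ("search", "ai-input", "ai-train")


def parse_content_signal(line: str) -> dict:
    """Parse one Content-Signal line into {signal: True | False | None}.

    None means no preference was expressed for that signal, which is
    deliberately kept distinct from an explicit "no".
    """
    name, _, value = line.partition(":")
    if name.strip().lower() != "content-signal":
        raise ValueError("not a Content-Signal line")

    prefs = {signal: None for signal in SIGNALS}
    for item in value.split(","):
        key, sep, setting = item.partition("=")
        key = key.strip().lower()
        setting = setting.strip().lower()
        # Unknown keys and malformed values are ignored in this sketch.
        if sep and key in prefs and setting in ("yes", "no"):
            prefs[key] = setting == "yes"
    return prefs
```

Because the three values are stored independently, a line that only says ai-train=no leaves search and ai-input as None rather than silently treating them as allowed or refused.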

This space is also where the only live standards body work is happening. The IETF AI Preferences working group is standardising a vocabulary for exactly these usage preferences, carried over robots.txt, HTTP response headers and /.well-known/ URIs. It is still draft work, and notably it uses no DNS at all.

llms.txt: a useful map, not a ranking signal

llms.txt is a Markdown file served from your site root, proposed by Jeremy Howard of Answer.AI in September 2024. It borrows its location convention from robots.txt and sitemap.xml: a single H1 with your site name, an optional summary, then curated links to your most useful pages. A companion llms-full.txt is a different idea again, a single concatenated dump of your content meant to be loaded straight into a context window.

Here is the part the hype usually leaves out. It is an unratified one-author proposal, not a web standard, and no major search engine or AI answer engine consumes a third-party llms.txt as a crawl, ranking or citation signal. Google has said on the record that it does not support the file and is not planning to, and independent server-log analysis shows AI bots request it very rarely. When you see Anthropic or Perplexity publishing their own llms.txt, that is documentation delivery for coding agents (often generated automatically by their docs platform), not their crawlers consuming other sites' files.

So why publish one at all? Because it is cheap, and it is a clean, structured map of your guides and tools for any agent that does choose to fetch it. Treat it as a convenience with a small upside, do not expect it to move visibility, and keep it consistent with robots.txt so you never list a page you also disallow. Anyone selling llms.txt as a ranking lever is overstating the evidence.
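
One way to catch the "listed but disallowed" contradiction is to extract the links from llms.txt and test each against your robots.txt. The sketch below does that with the standard library. It assumes Markdown-style [title](url) links, treats every link as if it were on your own host (robots.txt only governs the host that serves it), and uses made-up sample content; disallowed_llms_links is a hypothetical helper name.

```python
import re
from urllib.robotparser import RobotFileParser

# Matches Markdown links of the form [title](url) and captures the URL.
LINK_PATTERN = re.compile(r"\[[^\]]*\]\(([^)\s]+)\)")


def disallowed_llms_links(llms_txt: str, robots_txt: str, user_agent: str) -> list:
    """Return llms.txt links that robots.txt disallows for user_agent."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    links = LINK_PATTERN.findall(llms_txt)
    return [url for url in links if not parser.can_fetch(user_agent, url)]


SAMPLE_LLMS_TXT = """\
# Example Site

> Guides on email authentication.

## Guides

- [DMARC guide](https://example.com/guides/dmarc): how to deploy DMARC
- [Internal notes](https://example.com/private/notes): should not be listed
"""

SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

conflicts = disallowed_llms_links(SAMPLE_LLMS_TXT, SAMPLE_ROBOTS_TXT, "GPTBot")
print(conflicts)
```

Running the check once per bot token you care about is a reasonable extension, since a page can be open to one crawler and closed to another.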

Agents, not crawlers: discovery via .well-known

Crawlers read your pages. AI agents need something different: a machine-readable description of what a service can do and how to call it. That discovery uses the /.well-known/ path, reserved by RFC 8615. That RFC is generic plumbing, not an AI-specific standard, and it predates the current wave by years.

  • Agent Cards (A2A). The Agent2Agent protocol publishes a JSON Agent Card, describing an agent's identity, endpoint, capabilities and required authentication, fetched by a plain HTTP GET from a well-known URI. A2A originated at Google in April 2025 and was contributed to the Linux Foundation in June 2025, so it now sits under vendor-neutral governance. The path moved from /.well-known/agent.json to /.well-known/agent-card.json in version 0.3.0, with the older path still widely supported.
  • The Model Context Protocol (MCP). A common misconception is that MCP servers are discovered through a public well-known file. They are not. An MCP client is pointed at a known server URL or a local process. Where /.well-known/ appears in MCP it is OAuth authorisation metadata, used only after the server URL is already known, and authorisation is optional.
  • The retired plugin manifest. The old ChatGPT plugin system used a /.well-known/ai-plugin.json manifest. That system was sunset in 2024 and replaced by custom GPTs. Stale manifest files still litter the web, but they are not a live discovery method.

None of these is an IETF standard. A2A and MCP are open industry specifications. Only the underlying plumbing (RFC 8615 and the OAuth metadata RFCs) is standards-track.
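
Because the Agent Card path changed between A2A versions, a client that wants to be tolerant can try the current path first and fall back to the older one. The following is a minimal sketch of that idea using only the standard library. The function names and the agents.example.com origin in the usage note are assumptions; real A2A client libraries may handle discovery differently, and no validation of the card's contents is attempted here.

```python
import json
from urllib.request import urlopen

# Current path first, then the pre-0.3.0 path described in this guide.
AGENT_CARD_PATHS = (
    "/.well-known/agent-card.json",
    "/.well-known/agent.json",
)


def agent_card_urls(origin: str) -> list:
    """Build candidate Agent Card URLs for an origin like https://host."""
    base = origin.rstrip("/")
    return [base + path for path in AGENT_CARD_PATHS]


def fetch_agent_card(origin: str, timeout: float = 5.0):
    """Return the first Agent Card that loads as JSON, or None.

    Network errors, HTTP errors and invalid JSON all cause a fall
    through to the next candidate path.
    """
    for url in agent_card_urls(origin):
        try:
            with urlopen(url, timeout=timeout) as response:
                return json.load(response)
        except (OSError, ValueError):
            # URLError and HTTPError are OSError subclasses;
            # json.JSONDecodeError is a ValueError subclass.
            continue
    return None
```

Calling fetch_agent_card("https://agents.example.com") performs at most two plain HTTP GETs, which matches the "fetched by a plain HTTP GET from a well-known URI" model described above.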

The DNS layer: identity and trust, with email as the precedent

Everything above answers “can you read this, and how may you use it?” DNS answers a harder question: “is this domain really who it claims to be?” That is where cryptographically anchored identity already works today, and email authentication is the proven pattern any future AI-trust record would most likely follow. Each control is a DNS record in a predictable place:

| Control | Where it lives in DNS |
|---|---|
| SPF (RFC 7208) | A TXT record at the domain apex, starting v=spf1 |
| DKIM (RFC 6376) | A TXT record at selector._domainkey |
| DMARC | A TXT record at _dmarc |
| MTA-STS (RFC 8461) | A TXT pointer at _mta-sts, policy served over HTTPS |
| BIMI | A TXT record at default._bimi pointing to an SVG logo |
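
The "predictable place" point can be made concrete with a few lines of code. This sketch builds the TXT query name for each control in the table from a domain and a DKIM selector. The function name and the sample selector are illustrative; your real selector depends on your mail provider, and the sketch performs no DNS lookups.

```python
def identity_record_names(domain: str, dkim_selector: str) -> dict:
    """Map each control to the DNS name where its TXT record is published."""
    domain = domain.rstrip(".").lower()
    return {
        "SPF": domain,  # TXT at the apex, value starts with v=spf1
        "DKIM": f"{dkim_selector}._domainkey.{domain}",
        "DMARC": f"_dmarc.{domain}",
        "MTA-STS": f"_mta-sts.{domain}",
        "BIMI": f"default._bimi.{domain}",
    }
```

For example.com with a selector of s1, the DKIM key would be looked up at s1._domainkey.example.com and the DMARC policy at _dmarc.example.com.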

MTA-STS is the cleanest analogy for how an AI-trust record would probably work. A short TXT record at _mta-sts.<domain> carries only a version and an id, just enough for a sender to notice that a policy exists and whether its cached copy is current. The enforceable policy itself is served over HTTPS from /.well-known/mta-sts.txt with a valid certificate. DNS announces, HTTPS serves. If a future standard ever lets an AI agent verify a domain's policy, that DNS-announces-HTTPS-serves shape, anchored by DNSSEC, is the obvious template. You can read the full mechanism in our MTA-STS and TLS-RPT guide.
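
The announce-then-serve split is small enough to sketch. The code below parses the TXT value into its version and id, builds the HTTPS policy URL (RFC 8461 serves it from the mta-sts host under the domain), and decides whether a cached policy is stale by comparing ids. The function names and the sample id values are assumptions for illustration, and the sketch does no DNS or HTTPS fetching itself.

```python
def parse_mta_sts_txt(txt: str) -> dict:
    """Parse a value such as 'v=STSv1; id=20250101000000Z' into fields."""
    fields = {}
    for part in txt.split(";"):
        key, sep, value = part.partition("=")
        if sep:
            fields[key.strip().lower()] = value.strip()
    if fields.get("v") != "STSv1" or "id" not in fields:
        raise ValueError("not a valid MTA-STS TXT record")
    return fields


def policy_url(domain: str) -> str:
    """HTTPS location of the enforceable policy for a domain."""
    return f"https://mta-sts.{domain}/.well-known/mta-sts.txt"


def needs_refetch(txt: str, cached_id: str) -> bool:
    """True when the id announced in DNS differs from the cached policy id."""
    return parse_mta_sts_txt(txt)["id"] != cached_id
```

The DNS record stays tiny because it carries no policy, only a change signal; everything enforceable sits behind a certificate-validated HTTPS fetch.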

The piece that makes any of this trustworthy is DNSSEC. Without it, DNS answers can be forged in transit, so a signal published in DNS is only as dependable as the chain of trust above it. DANE, which binds a TLS certificate to a domain through a TLSA record, provides no security at all unless DNSSEC is in place. That is why DNSSEC, not any AI-specific record, is the foundation a domain owner should care about first.

BIMI shows both the appeal and the limits of this layer. A TXT record at default._bimi points to your logo, but inbox display is gated behind a paid mark certificate referenced in the record. Gmail will not show a logo from a self-asserted record alone: it needs either a Verified Mark Certificate, which requires a registered trademark and also earns the verified checkmark, or, since October 2024, a Common Mark Certificate, which needs no trademark but shows the logo without the checkmark. Those certificates cost roughly £600 to £1,300 a year. Some inboxes, including Apple Mail and Yahoo, display self-asserted logos with no certificate. Our BIMI guide covers the trade-offs.

The DNS layer is catching up: DNS for AI Discovery (DNS-AID)

Until recently there was no way to announce your AI or agent endpoints in DNS the way _dmarc announces your email policy. That gap is now being filled. The leading proposal is DNS for AI Discovery (DNS-AID), an active IETF Internet-Draft (draft-mozleywilliams-dnsop-dnsaid), and it is worth understanding even though it is early.

DNS-AID does not invent a new record type. It is a naming convention layered on the existing SVCB service-binding record (RFC 9460). You publish ServiceMode SVCB records under an _agents leaf, with _index._agents.example.com listing the agents you choose to advertise. Each agent's record carries parameters such as alpn (the protocol suite, for example mcp,h2,h3), a bap agent-protocol hint such as a2a or mcp, and a well-known path (RFC 8615) pointing at the agent manifest. An agent resolves the _agents records, then follows them to the endpoint. The records should be DNSSEC-signed so a resolver can confirm they are authentic, which is exactly why DNSSEC matters here too.
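
To make the SVCB shape less abstract, here is a sketch that reads one record in RFC 9460 presentation format and separates AliasMode (priority 0) from ServiceMode (priority above 0). The owner name assistant._agents.example.com, the target host and the parser itself are hypothetical illustrations. Only the generic alpn and port parameters are shown; the draft-specific parameters are left out because their exact presentation is defined by the draft, which may still change.

```python
import shlex

# Hypothetical record for illustration; the owner and target are made up.
EXAMPLE_RECORD = (
    'assistant._agents.example.com. 3600 IN SVCB 1 '
    'agents.example.com. alpn="mcp,h2,h3" port=443'
)


def parse_svcb_record(line: str) -> dict:
    """Parse 'owner TTL IN SVCB priority target key=value ...'."""
    tokens = shlex.split(line)  # strips the quotes around parameter values
    owner, ttl, rr_class, rr_type, priority, target, *params = tokens
    if rr_class.upper() != "IN" or rr_type.upper() != "SVCB":
        raise ValueError("expected an IN SVCB record")

    parsed_params = {}
    for param in params:
        key, _, value = param.partition("=")
        parsed_params[key] = value

    return {
        "owner": owner,
        "ttl": int(ttl),
        "priority": int(priority),
        # RFC 9460: priority 0 is AliasMode, anything higher is ServiceMode.
        "service_mode": int(priority) > 0,
        "target": target,
        "params": parsed_params,
    }


record = parse_svcb_record(EXAMPLE_RECORD)
print(record["service_mode"], record["params"]["alpn"].split(","))
```

In practice you would resolve these records with a DNSSEC-validating resolver rather than parse zone-file text, but the fields a client has to read are the same.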

Keep the status in proportion. DNS-AID is an individual draft that, in its authors' own words, has no formal standing in the IETF process. It is not the only proposal in this space, and mainstream AI platforms are not yet resolving these records at scale. So treat it as a forward-looking, early-adopter control: it is cheap to publish (DNSSEC-signed SVCB records, using a record type your DNS provider may already support) and it positions you for agent-to-agent discovery, but do not expect broad consumption yet, and do not let it crowd out the controls that AI systems actually read today.

Whether or not you publish DNS-AID now, the signals that are broadly consumed today are where most of the value sits:

  1. Get your AI-crawler rules right. Decide your stance per bot and write robots.txt rules against the current stable tokens, not deprecated ones.
  2. Add a Content Signals line that matches your intent. Make the search, ai-input and ai-train values agree with your per-bot allow and disallow rules.
  3. Publish a consistent llms.txt map if you want one, with no page that contradicts robots.txt.
  4. Fix the DNS identity layer. Clean SPF, DKIM and DMARC, MTA-STS where it fits, and DNSSEC underneath. This is the part that genuinely establishes who your domain is.

Because every one of these signals is advisory and your DNS posture drifts over time, the durable control is continuous monitoring. Records get edited, certificates lapse, a Content Signals line and a robots.txt rule quietly fall out of step, and a DNSSEC signature can expire under an _agents entry point you published months ago. You want to catch that drift rather than discover it from an AI answer that misrepresents your brand.
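
At its simplest, drift detection is a comparison between a baseline you approved and what is published now. The sketch below compares two snapshots, each a mapping from a record or file name to its current text, and reports what was added, changed or removed. How you collect the snapshots (DNS lookups, HTTP fetches) is out of scope here; the names and values in the usage note are invented sample data.

```python
def detect_drift(baseline: dict, current: dict) -> list:
    """Compare two {name: value} snapshots and list the differences."""
    findings = []
    for name in sorted(baseline.keys() | current.keys()):
        before = baseline.get(name)
        after = current.get(name)
        if before == after:
            continue
        if after is None:
            findings.append(f"missing: {name}")
        elif before is None:
            findings.append(f"added: {name}")
        else:
            findings.append(f"changed: {name}")
    return findings
```

For instance, comparing a baseline that holds a DMARC record with p=reject against a current snapshot where it reads p=none yields a single "changed: _dmarc.example.com" finding, which is exactly the kind of quiet regression worth alerting on.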

Frequently asked questions

Do AI crawlers read my llms.txt to decide how to rank or cite my site?

No. As of 2026 no major search engine or AI answer engine treats a third-party site's llms.txt as a crawl, indexing, ranking or citation signal. Google has stated on the record that it does not support it and is not planning to, and server-log analysis shows AI bots request the file very rarely. It is an unratified community proposal, useful as a convenience map but not a visibility lever.

Is there a DNS record I can publish so AI systems discover and trust my domain?

There is an emerging one. DNS for AI Discovery (DNS-AID), an active IETF Internet-Draft (draft-mozleywilliams-dnsop-dnsaid), lets you advertise AI agents in DNS using SVCB records (RFC 9460) under an _agents label, for example _index._agents.example.com, signed with DNSSEC. It reuses an existing record type rather than inventing a new one. It is still an individual draft with no formal IETF standing and only early adoption, so it is a forward-looking option rather than something AI systems broadly consume yet. The DNS records that establish verifiable identity today remain the email-authentication ones (SPF, DKIM, DMARC) plus DNSSEC.

What is the difference between robots.txt and Cloudflare Content Signals?

robots.txt controls whether a crawler may access your pages. The Content Signals Policy, launched in September 2025, adds a Content-Signal line stating how content may be used after access, across three signals: search, ai-input and ai-train. Both are stated preferences, not technical blocks; a non-compliant crawler can ignore either, and real blocking needs edge controls such as a web application firewall.

Is anthropic-ai still Anthropic's crawler?

No. anthropic-ai and claude-web are deprecated legacy tokens that no longer affect current Claude traffic. Anthropic's current crawlers are ClaudeBot for training, Claude-User for user-initiated fetches and Claude-SearchBot for search, all of which honour robots.txt. Target ClaudeBot for the effective control.

How does MTA-STS show the way a future AI trust record might work?

MTA-STS splits the job cleanly. A short DNS TXT record at _mta-sts announces that a policy exists and which version it is, while the enforceable policy is served over HTTPS at a fixed well-known path with a valid certificate. DNS announces, HTTPS serves. That pattern, anchored by DNSSEC for trust, is the most likely template for any future AI or agent trust record, even though none has been standardised yet.

Is your DNS identity layer actually clean?

The AI file layer will keep changing, but the records that prove who your domain is are here now. Check SPF, DKIM, DMARC, MTA-STS, DNSSEC and TLS in one scan with our free Security Grade check. ShieldMarc is built for UK MSPs and multi-domain IT teams: start with a single domain for free, then monitor every client domain for drift, expiry and policy regressions from one place.
