Two layers, one domain
When people ask how to make a domain “discoverable to AI”, they are usually mixing two things that live in different places:
- The HTTP file layer. Files an AI system fetches over the web: robots.txt, an llms.txt map, a Content Signals line, and agent manifests under /.well-known/. This is where almost all of the deployed controls and current proposals live.
- The DNS layer. Records that prove identity and carry trust: the email-authentication records (SPF, DKIM, DMARC), MTA-STS, and DNSSEC as the cryptographic anchor underneath them. This is where verifiable trust already works, and where a record built specifically for AI is just emerging as an early draft (DNS-AID), not yet a ratified standard.
The honest headline: most of what is called “AI discovery” today is plain HTTP fetching governed by robots.txt, not a new AI-specific protocol. Some of the rest is preference signalling that is honoured only voluntarily, and some is still an unratified draft. Knowing which is which stops you from chasing things that do nothing.
Who actually crawls you: the AI user-agents that matter in 2026
Before you can control AI access, you need to know who is asking. Each major AI vendor runs several distinct bots, and they do different jobs. Treating them as one blob is the most common mistake. The reliable rule of thumb: bots that crawl to build a training set or a search index generally honour robots.txt, while fetchers triggered by a human in real time often do not, because a person asked for that specific page.
| Token | Operator | What it does | Honours robots.txt |
|---|---|---|---|
| GPTBot | OpenAI | Crawls content that may train foundation models | Yes |
| OAI-SearchBot | OpenAI | Surfaces sites in ChatGPT search results | Yes |
| ChatGPT-User | OpenAI | User-initiated fetch from a ChatGPT session | User-triggered, rules may not apply |
| ClaudeBot | Anthropic | Crawls content for model training | Yes |
| Claude-SearchBot | Anthropic | Builds the Claude search index | Yes |
| Claude-User | Anthropic | User-initiated fetch from a Claude session | Yes |
| Google-Extended | Google | Control token to opt out of Gemini training and grounding (not a crawler) | Control token only |
| PerplexityBot | Perplexity | Indexes and links sites in answers | Yes |
| Perplexity-User | Perplexity | User-initiated fetch from a Perplexity session | Generally does not |
| CCBot | Common Crawl | Builds the public Common Crawl corpus, a long-standing training source for LLMs | Yes |
Two details catch people out. First, Google-Extended is not a crawler. It has no user-agent string of its own and never fetches a page. It is a label you put in robots.txt to say whether content Google already crawled may be used to train Gemini or for grounding. Google states it does not affect your inclusion or ranking in Google Search. Applebot-Extended works the same way for Apple's AI training.
Second, match on the stable token, not the version string. Anthropic's old tokens anthropic-ai and claude-web are deprecated and no longer affect live Claude traffic, so a rule written against them does almost nothing today. The effective control for Anthropic is ClaudeBot. Block or allow by the stable name and ignore the version suffix in the full user-agent string.
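To make "match on the stable token" concrete, here is a minimal Python sketch. The token lists come from the table and text above; the function names and the sample user-agent string are made up for illustration and are not any vendor's published format.

```python
from typing import List, Optional

# Stable tokens from the table above. Google-Extended is left out on purpose:
# it is a robots.txt control token and never appears in a User-Agent header.
STABLE_TOKENS = (
    "GPTBot", "OAI-SearchBot", "ChatGPT-User",
    "ClaudeBot", "Claude-SearchBot", "Claude-User",
    "PerplexityBot", "Perplexity-User", "CCBot",
)

# Deprecated Anthropic tokens named in the text.
DEPRECATED_TOKENS = ("anthropic-ai", "claude-web")


def match_ai_token(user_agent: str) -> Optional[str]:
    """Return the stable token found in a User-Agent header, ignoring versions."""
    ua = user_agent.lower()
    for token in STABLE_TOKENS:
        if token.lower() in ua:
            return token
    return None


def deprecated_tokens_in(robots_txt: str) -> List[str]:
    """List deprecated tokens that still have User-agent lines in a robots.txt."""
    found: List[str] = []
    for line in robots_txt.splitlines():
        name, sep, value = line.partition(":")
        if sep and name.strip().lower() == "user-agent":
            token = value.split("#")[0].strip().lower()
            if token in DEPRECATED_TOKENS and token not in found:
                found.append(token)
    return found


# Illustrative header only; the URL and version number are placeholders.
SAMPLE_UA = "Mozilla/5.0 (compatible; ClaudeBot/1.0; +https://example.invalid/bot)"
print(match_ai_token(SAMPLE_UA))
print(deprecated_tokens_in("User-agent: anthropic-ai\nDisallow: /\n"))
```

Substring matching on the token is a deliberate simplification: it survives version bumps, but it also trusts whatever the client claims to be, so it is useful for log analysis rather than enforcement.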
robots.txt is still the lever
For all the talk of new AI standards, the single most broadly honoured control is the oldest one. The Robots Exclusion Protocol (standardised as RFC 9309) is what every major AI vendor points site owners to when they ask how to opt in or out. You allow or disallow each bot by its token:
User-agent: GPTBot
Disallow: /private/

User-agent: ClaudeBot
Disallow: /private/

User-agent: CCBot
Disallow: /
The thing to internalise is what robots.txt is not. It is advisory. Honouring it is voluntary, the user-initiated fetchers above may bypass it, and a scraper with no interest in the rules will simply ignore the file. It is a sign on the door, not a lock. If you need to actually prevent access you need edge controls such as a web application firewall, bot management or rate limiting, which sit in front of the site and enforce in a way a text file never can.
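You can check what a well-behaved bot would conclude from your rules before you publish them. The sketch below feeds the example rules to Python's standard-library urllib.robotparser; the test paths are invented, and the result only describes a compliant client, which is exactly the limitation described above.

```python
from urllib.robotparser import RobotFileParser

# The example rules from this section, one directive per line.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /private/

User-agent: ClaudeBot
Disallow: /private/

User-agent: CCBot
Disallow: /
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text locally, without fetching anything."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# Hypothetical paths, chosen only to exercise each rule.
checks = [
    ("GPTBot", "/private/report.html"),
    ("GPTBot", "/blog/"),
    ("CCBot", "/blog/"),
    ("OAI-SearchBot", "/private/report.html"),  # no rule group names this token
]
for agent, path in checks:
    verdict = "allowed" if parser.can_fetch(agent, path) else "disallowed"
    print(f"{agent:15} {path:25} {verdict}")
```

Note the last case: a token with no matching group and no wildcard group is allowed by default, so a rule for GPTBot does nothing for OAI-SearchBot.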
The practical hygiene point matters more than any single rule: keep robots.txt, your Content Signals line and any llms.txt consistent. It is surprisingly easy to disallow a path in one place while advertising the same path in another, which tells AI operators two different stories about the same content.
Content Signals: stating how your content may be used
robots.txt controls whether a bot may access a page. It says nothing about what happens to the content afterwards. Cloudflare's Content Signals Policy, launched on 24 September 2025 as a free, openly licensed extension to robots.txt, fills that gap. It adds a Content-Signal line with three independent yes or no signals:
- search: building a search index and returning links and short excerpts. This explicitly excludes AI-generated summaries, so search=yes does not permit AI answers.
- ai-input: real-time use such as retrieval-augmented generation or grounding a generated answer.
- ai-train: training or fine-tuning AI models.
Content-Signal: search=yes, ai-input=yes, ai-train=no
Like robots.txt, this is a stated preference, not a technical block. Cloudflare is explicit that content signals express preferences and are not countermeasures against scraping. A non-compliant operator can ignore the line with no technical barrier, and Cloudflare frames the restriction as a reservation of rights rather than a tested legal ruling. Two subtleties are worth getting right: a blank signal means no preference was expressed through this mechanism, not that you have no preference; and because the three signals are independent, you can welcome search indexing while declining model training.
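The "blank means no preference" rule is easy to get wrong in tooling, so here is a small Python sketch of a parser that keeps the three states apart. It assumes the comma-separated key=value shape shown in the example line; the function name is hypothetical and this is not Cloudflare's reference implementation.

```python
from typing import Dict, Optional

SIGNALS = ("search", "ai-input", "ai-train")


def parse_content_signal(line: str) -> Dict[str, Optional[bool]]:
    """Parse one Content-Signal line.

    True means yes, False means no, and None means no preference was
    expressed through this mechanism for that signal.
    """
    result: Dict[str, Optional[bool]] = {name: None for name in SIGNALS}
    field, sep, value = line.partition(":")
    if not sep or field.strip().lower() != "content-signal":
        return result
    for item in value.split(","):
        key, eq, raw = item.strip().partition("=")
        key, raw = key.strip().lower(), raw.strip().lower()
        if eq and key in result and raw in ("yes", "no"):
            result[key] = raw == "yes"
    return result


print(parse_content_signal("Content-Signal: search=yes, ai-input=yes, ai-train=no"))
print(parse_content_signal("Content-Signal: search=yes"))
```

Returning None rather than False for a missing signal is the design point: downstream checks can then tell "declined" from "never stated".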
This space is also where the only live standards body work is happening. The IETF AI Preferences working group is standardising a vocabulary for exactly these usage preferences, carried over robots.txt, HTTP response headers and /.well-known/ URIs. It is still draft work, and notably it uses no DNS at all.
llms.txt: a useful map, not a ranking signal
llms.txt is a Markdown file served from your site root, proposed by Jeremy Howard of Answer.AI in September 2024. It borrows its location convention from robots.txt and sitemap.xml: a single H1 with your site name, an optional summary, then curated links to your most useful pages. A companion llms-full.txt is a different idea again, a single concatenated dump of your content meant to be loaded straight into a context window.
Here is the part the hype usually leaves out. It is an unratified one-author proposal, not a web standard, and no major search engine or AI answer engine consumes a third-party llms.txt as a crawl, ranking or citation signal. Google has said on the record that it does not support the file and is not planning to, and independent server-log analysis shows AI bots request it very rarely. When you see Anthropic or Perplexity publishing their own llms.txt, that is documentation delivery for coding agents (often generated automatically by their docs platform), not their crawlers consuming other sites' files.
So why publish one at all? Because it is cheap, and it is a clean, structured map of your guides and tools for any agent that does choose to fetch it. Treat it as a convenience with a small upside, do not expect it to move visibility, and keep it consistent with robots.txt so you never list a page you also disallow. Anyone selling llms.txt as a ranking lever is overstating the evidence.
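The consistency check is simple to automate. This sketch pulls Markdown links out of an llms.txt and reports any that robots.txt disallows for the bots you care about. The sample files, the example.com URLs and the function name are all illustrative assumptions, and the link regex is deliberately simple.

```python
import re
from typing import Iterable, List, Tuple
from urllib.robotparser import RobotFileParser

# Matches Markdown links of the form [text](url); good enough for a sketch.
LINK_RE = re.compile(r"\[[^\]]*\]\(([^)\s]+)\)")


def conflicting_links(
    llms_txt: str, robots_txt: str, agents: Iterable[str]
) -> List[Tuple[str, List[str]]]:
    """Return (url, blocked_agents) for every llms.txt link robots.txt disallows."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    agent_list = list(agents)
    conflicts = []
    for url in LINK_RE.findall(llms_txt):
        blocked = [a for a in agent_list if not parser.can_fetch(a, url)]
        if blocked:
            conflicts.append((url, blocked))
    return conflicts


# Made-up sample content for illustration.
LLMS_TXT = """\
# Example Site

> Guides and tools.

- [Getting started](https://example.com/guides/start)
- [Internal notes](https://example.com/private/notes)
"""

ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /private/
"""

for url, blocked in conflicting_links(LLMS_TXT, ROBOTS_TXT, ["GPTBot", "ClaudeBot"]):
    print(f"{url} is listed in llms.txt but disallowed for {', '.join(blocked)}")
```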
Agents, not crawlers: discovery via .well-known
Crawlers read your pages. AI agents need something different: a machine-readable description of what a service can do and how to call it. That discovery uses the /.well-known/ path, reserved by RFC 8615. That RFC is generic plumbing, not an AI-specific standard, and it predates the current wave by years.
- Agent Cards (A2A). The Agent2Agent protocol publishes a JSON Agent Card, describing an agent's identity, endpoint, capabilities and required authentication, fetched by a plain HTTP GET from a well-known URI. A2A originated at Google in April 2025 and was contributed to the Linux Foundation in June 2025, so it now sits under vendor-neutral governance. The path moved from /.well-known/agent.json to /.well-known/agent-card.json in version 0.3.0, with the older path still widely supported.
- The Model Context Protocol (MCP). A common misconception is that MCP servers are discovered through a public well-known file. They are not. An MCP client is pointed at a known server URL or a local process. Where /.well-known/ appears in MCP it is OAuth authorisation metadata, used only after the server URL is already known, and authorisation is optional.
- The retired plugin manifest. The old ChatGPT plugin system used a /.well-known/ai-plugin.json manifest. That system was sunset in 2024 and replaced by custom GPTs. Stale manifest files still litter the web, but they are not a live discovery method.
None of these is an IETF standard. A2A and MCP are open industry specifications. Only the underlying plumbing (RFC 8615 and the OAuth metadata RFCs) is standards-track.
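Because the Agent Card path changed in 0.3.0, a client that wants to find a card has to try both locations. The sketch below shows one way to do that: newer path first, older path as fallback. The HTTP call is injected as a plain function so the example runs offline; find_agent_card, fake_get and the demo card contents are hypothetical and not part of the A2A specification.

```python
import json
from typing import Callable, List, Optional, Tuple
from urllib.parse import urlunsplit

AGENT_CARD_PATHS = (
    "/.well-known/agent-card.json",  # path used from A2A 0.3.0
    "/.well-known/agent.json",       # older path, still widely supported
)


def agent_card_urls(host: str) -> List[str]:
    """Candidate Agent Card URLs for a host, newest path first."""
    return [urlunsplit(("https", host, path, "", "")) for path in AGENT_CARD_PATHS]


def find_agent_card(
    host: str, http_get: Callable[[str], Optional[str]]
) -> Optional[Tuple[str, dict]]:
    """Try each candidate URL; http_get returns the body text, or None on failure."""
    for url in agent_card_urls(host):
        body = http_get(url)
        if body is None:
            continue
        try:
            return url, json.loads(body)
        except json.JSONDecodeError:
            continue  # not valid JSON, try the next path
    return None


def fake_get(url: str) -> Optional[str]:
    """Stand-in for a real HTTP GET; serves one made-up card on the older path."""
    pages = {
        "https://agents.example.com/.well-known/agent.json": '{"name": "demo-agent"}',
    }
    return pages.get(url)


print(agent_card_urls("agents.example.com"))
print(find_agent_card("agents.example.com", fake_get))
```

In real use you would pass a function that performs the GET with your HTTP client of choice and returns None on a non-200 response.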
The DNS layer: identity and trust, with email as the precedent
Everything above answers “can you read this, and how may you use it?” DNS answers a harder question: “is this domain really who it claims to be?” That is where cryptographically anchored identity already works today, and email authentication is the proven pattern any future AI-trust record would most likely follow. Each control is a DNS record in a predictable place:
| Control | Where it lives in DNS |
|---|---|
| SPF (RFC 7208) | A TXT record at the domain apex, starting v=spf1 |
| DKIM (RFC 6376) | A TXT record at selector._domainkey |
| DMARC | A TXT record at _dmarc |
| MTA-STS (RFC 8461) | A TXT pointer at _mta-sts, policy served over HTTPS |
| BIMI | A TXT record at default._bimi pointing to an SVG logo |
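The "predictable place" in the table translates directly into lookup names. As a rough Python sketch (the function name is hypothetical, and the DKIM selector has to be supplied because it differs per sending service):

```python
from typing import Dict


def auth_record_names(
    domain: str, dkim_selector: str, bimi_selector: str = "default"
) -> Dict[str, str]:
    """Build the DNS names to query (all TXT) for each control in the table."""
    domain = domain.rstrip(".").lower()
    return {
        "SPF": domain,  # apex TXT record starting v=spf1
        "DKIM": f"{dkim_selector}._domainkey.{domain}",
        "DMARC": f"_dmarc.{domain}",
        "MTA-STS": f"_mta-sts.{domain}",
        "BIMI": f"{bimi_selector}._bimi.{domain}",
    }


# "s1" is a placeholder selector, not a recommendation.
for control, name in auth_record_names("example.com", "s1").items():
    print(f"{control:8} TXT {name}")
```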
MTA-STS is the cleanest analogy for how an AI-trust record would probably work. A short TXT record at _mta-sts.<domain> carries only a version and an id, just enough for a sender to notice that a policy exists and whether its cached copy is current. The enforceable policy itself is served over HTTPS from /.well-known/mta-sts.txt with a valid certificate. DNS announces, HTTPS serves. If a future standard ever lets an AI agent verify a domain's policy, that DNS-announces-HTTPS-serves shape, anchored by DNSSEC, is the obvious template. You can read the full mechanism in our MTA-STS and TLS-RPT guide.
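The "DNS announces" half of that pattern is small enough to sketch. The Python below parses the TXT value and decides whether a cached policy needs re-fetching over HTTPS. It is a simplified reading of RFC 8461 (it does not enforce field order or the id character set), and the function names and sample ids are invented.

```python
from typing import Optional


def parse_mta_sts_txt(txt: str) -> Optional[str]:
    """Return the policy id from an _mta-sts TXT value, or None if it is not valid."""
    fields = {}
    for part in txt.split(";"):
        key, sep, value = part.strip().partition("=")
        if sep:
            fields[key.strip().lower()] = value.strip()
    if fields.get("v") != "STSv1" or not fields.get("id"):
        return None
    return fields["id"]


def policy_needs_refresh(txt: str, cached_id: Optional[str]) -> bool:
    """True when DNS announces a policy whose id differs from the cached copy."""
    current = parse_mta_sts_txt(txt)
    return current is not None and current != cached_id


record = "v=STSv1; id=20260101000000Z;"  # made-up example value
print(parse_mta_sts_txt(record))
print(policy_needs_refresh(record, cached_id="20250101000000Z"))
print(policy_needs_refresh(record, cached_id="20260101000000Z"))
```

Only when the id changes does the sender go back to /.well-known/mta-sts.txt, which is what keeps the DNS record tiny and the policy itself behind a certificate.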
The piece that makes any of this trustworthy is DNSSEC. Without it, DNS answers can be forged in transit, so a signal published in DNS is only as dependable as the chain of trust above it. DANE, which binds a TLS certificate to a domain through a TLSA record, provides no security at all unless DNSSEC is in place. That is why DNSSEC, not any AI-specific record, is the foundation a domain owner should care about first.
BIMI shows both the appeal and the limits of this layer. A TXT record at default._bimi points to your logo, but inbox display is gated behind a paid mark certificate referenced in the record. Gmail will not show a logo from a self-asserted record alone: it needs either a Verified Mark Certificate, which requires a registered trademark and also earns the verified checkmark, or, since October 2024, a Common Mark Certificate, which needs no trademark but shows the logo without the checkmark. Those certificates cost roughly £600 to £1,300 a year. Some inboxes, including Apple Mail and Yahoo, display self-asserted logos with no certificate. Our BIMI guide covers the trade-offs.
The DNS layer is catching up: DNS for AI Discovery (DNS-AID)
Until recently there was no way to announce your AI or agent endpoints in DNS the way _dmarc announces your email policy. That gap is now being filled. The leading proposal is DNS for AI Discovery (DNS-AID), an active IETF Internet-Draft (draft-mozleywilliams-dnsop-dnsaid), and it is worth understanding even though it is early.
DNS-AID does not invent a new record type. It is a naming convention layered on the existing SVCB service-binding record (RFC 9460). You publish ServiceMode SVCB records under an _agents leaf: _index._agents.example.com lists the agents you choose to advertise, and each agent's record carries parameters such as alpn (the protocol suite, for example mcp,h2,h3), a bap agent-protocol hint such as a2a or mcp, and a well-known path (RFC 8615) pointing at the agent manifest. A client agent resolves the _agents records, then follows them to the endpoint. The records should be DNSSEC-signed so a resolver can confirm they are authentic, which is exactly why DNSSEC matters here too.
Keep the status in proportion. DNS-AID is an individual draft that, in its authors' own words, has no formal standing in the IETF process. It is not the only proposal in this space, and mainstream AI platforms are not yet resolving these records at scale. So treat it as a forward-looking, early-adopter control: it is cheap to publish (DNSSEC-signed SVCB records, using a record type your DNS provider may already support) and it positions you for agent-to-agent discovery, but do not expect broad consumption yet, and do not let it crowd out the controls that AI systems actually read today.
Whether or not you publish DNS-AID now, the signals that are broadly consumed today are where most of the value sits:
- Get your AI-crawler rules right. Decide your stance per bot and write robots.txt rules against the current stable tokens, not deprecated ones.
- Add a Content Signals line that matches your intent. Make the search, ai-input and ai-train values agree with your per-bot allow and disallow rules.
- Publish a consistent llms.txt map if you want one, with no page that contradicts robots.txt.
- Fix the DNS identity layer. Clean SPF, DKIM and DMARC, MTA-STS where it fits, and DNSSEC underneath. This is the part that genuinely establishes who your domain is.
Because every one of these signals is advisory and your DNS posture drifts over time, the durable control is continuous monitoring. Records get edited, certificates lapse, a Content Signals line and a robots.txt rule quietly fall out of step, and a DNSSEC signature can expire under an _agents entry point you published months ago. You want to catch that drift rather than discover it from an AI answer that misrepresents your brand.
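At its simplest, drift detection is a comparison between a stored baseline and what you fetch today. The sketch below hashes each artefact and reports what was added, removed or changed. How the artefacts are fetched is out of scope, and the names and sample values are placeholders rather than a description of any particular monitoring product.

```python
import hashlib
from typing import Dict


def fingerprint(artifacts: Dict[str, str]) -> Dict[str, str]:
    """Map each artefact name (for example 'robots.txt') to a SHA-256 of its content."""
    return {
        name: hashlib.sha256(body.encode("utf-8")).hexdigest()
        for name, body in artifacts.items()
    }


def detect_drift(baseline: Dict[str, str], current: Dict[str, str]) -> Dict[str, str]:
    """Compare two fingerprint maps and label each difference."""
    drift = {}
    for name in sorted(set(baseline) | set(current)):
        if name not in current:
            drift[name] = "removed"
        elif name not in baseline:
            drift[name] = "added"
        elif baseline[name] != current[name]:
            drift[name] = "changed"
    return drift


# Placeholder snapshots; in practice these come from HTTP fetches and DNS lookups.
last_week = fingerprint({
    "robots.txt": "User-agent: GPTBot\nDisallow: /private/\n",
    "_dmarc TXT": "v=DMARC1; p=reject",
})
today = fingerprint({
    "robots.txt": "User-agent: GPTBot\nDisallow: /\n",
    "_dmarc TXT": "v=DMARC1; p=reject",
    "llms.txt": "# Example Site\n",
})

print(detect_drift(last_week, today))
```

A hash comparison only tells you that something moved; deciding whether the change was intended still needs a human or a policy check on top.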
Frequently asked questions
Do AI crawlers read my llms.txt to decide how to rank or cite my site?
No. As of 2026 no major search engine or AI answer engine treats a third-party site's llms.txt as a crawl, indexing, ranking or citation signal. Google has stated on the record that it does not support it and is not planning to, and server-log analysis shows AI bots request the file very rarely. It is an unratified community proposal, useful as a convenience map but not a visibility lever.
Is there a DNS record I can publish so AI systems discover and trust my domain?
There is an emerging one. DNS for AI Discovery (DNS-AID), an active IETF Internet-Draft (draft-mozleywilliams-dnsop-dnsaid), lets you advertise AI agents in DNS using SVCB records (RFC 9460) under an _agents label, for example _index._agents.example.com, signed with DNSSEC. It reuses an existing record type rather than inventing a new one. It is still an individual draft with no formal IETF standing and only early adoption, so it is a forward-looking option rather than something AI systems broadly consume yet. The DNS records that establish verifiable identity today remain the email-authentication ones (SPF, DKIM, DMARC) plus DNSSEC.
What is the difference between robots.txt and Cloudflare Content Signals?
robots.txt controls whether a crawler may access your pages. The Content Signals Policy, launched in September 2025, adds a Content-Signal line stating how content may be used after access, across three signals: search, ai-input and ai-train. Both are stated preferences, not technical blocks; a non-compliant crawler can ignore either, and real blocking needs edge controls such as a web application firewall.
Is anthropic-ai still Anthropic's crawler?
No. anthropic-ai and claude-web are deprecated legacy tokens that no longer affect current Claude traffic. Anthropic's current crawlers are ClaudeBot for training, Claude-User for user-initiated fetches and Claude-SearchBot for search, all of which honour robots.txt. Target ClaudeBot for the effective control.
How does MTA-STS show the way a future AI trust record might work?
MTA-STS splits the job cleanly. A short DNS TXT record at _mta-sts announces that a policy exists and which version it is, while the enforceable policy is served over HTTPS at a fixed well-known path with a valid certificate. DNS announces, HTTPS serves. That pattern, anchored by DNSSEC for trust, is the most likely template for any future AI or agent trust record, even though none has been standardised yet.
Is your DNS identity layer actually clean?
The AI file layer will keep changing, but the records that prove who your domain is are here now. Check SPF, DKIM, DMARC, MTA-STS, DNSSEC and TLS in one scan with our free Security Grade check. ShieldMarc is built for UK MSPs and multi-domain IT teams: start with a single domain for free, then monitor every client domain for drift, expiry and policy regressions from one place.
Next steps
- Inspect any domain's live records with our DNS Lookup tool.
- Confirm your chain of trust with the DNSSEC Checker.
- Read DNS record types explained for a full reference on the records this guide touches.
- Understand how the email-authentication records fit together in SPF vs DKIM vs DMARC.