AI Crawl Accessibility · Lesson 2 of 4

Robots.txt & Crawl Budget for AI

Configure robots.txt and manage crawl budget specifically for AI crawlers so your most important pages get indexed.

A textile exporter in Bangladesh noticed that after adding a new product catalogue with hundreds of pages, their server started slowing down noticeably during business hours. Checking the logs revealed the culprit: GPTBot was crawling deep into old archived blog posts, PDF price lists from three years ago, and dozens of admin-adjacent URLs that should never have been publicly accessible. The bot was consuming precious server resources on irrelevant pages while the exporter's new flagship product pages went unvisited by AI crawlers. Without a proper robots.txt strategy, the exporter had no control over which content AI crawlers prioritized.

Robots.txt is the oldest protocol for communicating with web crawlers, but its application to AI crawlers requires a fresh approach. Traditional search bots generally respect robots.txt directives and can be managed with broad rules. AI crawlers are a more diverse ecosystem with varying levels of compliance, different user-agent strings, and different crawl patterns. Understanding how to configure robots.txt for each major AI crawler — while balancing crawl budget across your most important export pages — is essential for ensuring your best content gets indexed and your server stays responsive.

Understanding AI Crawler User Agents

Each major AI company operates its own crawler with a distinct user-agent string. OpenAI's crawler is identified as GPTBot and also uses ChatGPT-User for interactions initiated through the ChatGPT interface. Google uses Google-Extended, which is not a separate crawler but a robots.txt token honored by Googlebot that controls whether fetched content can be used for AI training and Gemini-based products. Anthropic operates Claude-Web (its documentation also lists ClaudeBot), which crawls content for Claude's training and retrieval. Perplexity uses PerplexityBot, and Common Crawl — a major data source for many AI models — uses CCBot. Additional crawlers include Bytespider (ByteDance), Applebot-Extended (Apple Intelligence), and FacebookBot (Meta AI).
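
As a minimal sketch, the robots.txt grouped syntax lets you apply one rule set to several named crawlers at once. The directory paths below are placeholders; substitute the sections of your own site you want to restrict.

    # One rule group shared by several AI crawlers (placeholder paths)
    User-agent: GPTBot
    User-agent: Claude-Web
    User-agent: PerplexityBot
    User-agent: CCBot
    Disallow: /staging/
    Disallow: /internal/

    # Google-Extended is addressed the same way, even though Googlebot does the fetching
    User-agent: Google-Extended
    Disallow: /staging/
    Disallow: /internal/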

These user agents have different default behaviors and levels of robots.txt compliance. GPTBot and Google-Extended are generally well-behaved and respect Disallow directives consistently. Claude-Web and PerplexityBot have shown varying levels of compliance in practice, though both claim to follow robots.txt standards. CCBot, which powers Common Crawl, has historically been one of the most aggressive crawlers in terms of request volume. It is important to test your robots.txt configuration rather than assuming compliance. Use your server access logs to verify that each crawler is respecting your directives — if a crawler ignores your rules, you may need to block it at the server level using IP-based restrictions.
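
One way to run this check is to count requests per AI user agent and flag any hits on paths you have disallowed. The sketch below assumes a standard combined-format Nginx or Apache access log; the log path, agent list, and disallowed prefixes are example values to adapt to your own server.

    # Tally AI crawler requests and flag hits on disallowed paths (combined log format).
    from collections import Counter

    AI_AGENTS = ["GPTBot", "ChatGPT-User", "Claude-Web", "PerplexityBot", "CCBot"]
    # Google-Extended is a robots.txt token, not a request user agent, so it never appears in logs.
    LOG_PATH = "/var/log/nginx/access.log"      # placeholder; point at your access log
    DISALLOWED_PREFIXES = ("/pdf/", "/admin/")  # example paths your robots.txt blocks

    hits = Counter()
    disallowed_hits = Counter()

    with open(LOG_PATH, encoding="utf-8", errors="replace") as log:
        for line in log:
            for agent in AI_AGENTS:
                if agent in line:
                    hits[agent] += 1
                    # The combined format wraps the request in quotes: "GET /path HTTP/1.1"
                    try:
                        path = line.split('"')[1].split()[1]
                        if path.startswith(DISALLOWED_PREFIXES):
                            disallowed_hits[agent] += 1
                    except IndexError:
                        pass  # malformed line; skip the path check
                    break  # a request matches at most one crawler

    for agent in AI_AGENTS:
        print(f"{agent}: {hits[agent]} requests, {disallowed_hits[agent]} to disallowed paths")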

Exporters should also be aware that AI crawler user agents evolve rapidly. OpenAI introduced GPTBot in mid-2023 and has updated its behavior multiple times since. Google-Extended launched in late 2023 and has expanded its scope as Gemini products have grown. Keeping your robots.txt current requires periodic review of the latest crawler documentation from each provider. A quarterly audit of your robots.txt file against current AI crawler user-agent lists is a sensible maintenance routine.

Managing Crawl Budget for AI Bots

Crawl budget refers to the number of pages an AI crawler will visit on your site within a given time period. Unlike Googlebot, which allocates crawl budget based on site authority, page freshness, and server capacity, AI crawlers often apply simpler heuristics. They may crawl based on a fixed rate limit, a maximum number of pages per session, or a time-constrained window. If your site has thousands of pages but only fifty of them contain your core export offerings, AI crawlers may waste their budget on low-value pages if you do not explicitly guide them.

The most effective way to conserve crawl budget is to use the Disallow directive in robots.txt to block AI crawlers from accessing non-essential sections. Common candidates for blocking include archive pages, tag and category filters, internal search results, pagination sequences beyond page two, PDF files, image galleries, and any admin or staging directories. For example, a directive like Disallow: /pdf/ prevents GPTBot from downloading every price list and catalogue PDF, preserving its crawl budget for your actual product pages. Similarly, Disallow: /blog/page/ blocks deep pagination into blog archives that rarely contain primary export content.
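
A sketch of such a rule set for GPTBot, using the sections mentioned above; the directory names are placeholders and should match your site's actual URL structure. Repeat the group, or add further User-agent lines, for other AI crawlers.

    User-agent: GPTBot
    Disallow: /pdf/
    Disallow: /blog/page/
    Disallow: /tag/
    Disallow: /category/
    Disallow: /search/
    Disallow: /wp-admin/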

You can also use the Crawl-Delay directive specifically for AI crawlers to reduce server load. This tells the crawler to wait a specified number of seconds between requests. For aggressive crawlers like CCBot, setting a Crawl-Delay of 10 to 30 seconds can dramatically reduce server strain while still allowing the crawler to access your content. Be aware that Crawl-Delay is not part of the core robots.txt standard and support varies: some AI crawlers honor it, others ignore it, and support is often undocumented, so verify the effect in your server logs rather than assuming it works. The directive is nonetheless worth including as a low-cost precaution.
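
For example, a per-crawler delay might look like the sketch below; the 20-second value is illustrative, and you should confirm in your logs whether the crawler actually slows down.

    User-agent: CCBot
    Crawl-delay: 20
    Disallow: /pdf/
    Disallow: /tag/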

Prioritizing Pages for AI Crawling

Once you have blocked non-essential content, you need to actively promote your most important pages. The primary mechanism for this is your XML sitemap. While robots.txt tells crawlers what to avoid, your sitemap tells them what matters most. Submit a clean, well-organized sitemap that includes only your priority pages — product catalogues, category hubs, service pages, and key landing pages. Use the <priority> and <changefreq> tags to signal relative importance, though note that these are hints rather than commands. A sitemap with 50 well-chosen URLs is far more effective than one with 5,000 URLs that includes every blog post and tag page.
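
A trimmed example of a priority-only sitemap; the domain, dates, and values are placeholders.

    <?xml version="1.0" encoding="UTF-8"?>
    <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
      <url>
        <loc>https://www.example-exporter.com/products/organic-cotton-towels/</loc>
        <lastmod>2026-01-15</lastmod>
        <changefreq>monthly</changefreq>
        <priority>0.9</priority>
      </url>
      <!-- one <url> entry per priority page; keep the list short and current -->
    </urlset>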

Your sitemap should be referenced in your robots.txt file using the Sitemap: directive. This is the standard way to point all crawlers — including AI crawlers — to your preferred content. The Sitemap directive is independent of user-agent groups, so it applies to every crawler no matter where it appears in the file; placing it at the top, before any user-agent blocks, simply keeps it easy to find and maintain. For exporters with multilingual sites, include separate sitemaps for each language version and use hreflang annotations to help AI crawlers understand which version to index for which audience.
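
In practice this is one or more Sitemap lines with absolute URLs, typically kept at the top of the file; the domain and file names below are placeholders for a two-language setup.

    Sitemap: https://www.example-exporter.com/sitemap-en.xml
    Sitemap: https://www.example-exporter.com/sitemap-bn.xml

    User-agent: GPTBot
    Disallow: /pdf/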

Finally, use the Allow directive in combination with Disallow to create precise inclusion rules. For example, you might disallow all AI crawlers from your entire blog section but then specifically allow access to a single high-value comparison article that ranks well in AI search results. The pattern Disallow: /blog/ followed by Allow: /blog/export-market-comparison-2026/ gives you surgical control over crawl access. This granularity is especially valuable for exporters with mixed-content sites where product pages and blog content coexist but should be treated differently by AI crawlers.
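
A sketch of that pattern follows; major crawlers resolve conflicts between Allow and Disallow by the most specific (longest) matching rule, so the narrower Allow wins here.

    User-agent: GPTBot
    Disallow: /blog/
    Allow: /blog/export-market-comparison-2026/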

Do This Now
  1. Review your current robots.txt file and add specific user-agent blocks for GPTBot, Google-Extended, Claude-Web, PerplexityBot, and CCBot (a consolidated sketch follows this list).
  2. Disallow AI crawlers from non-essential sections (archives, PDF directories, tag pages, search results) to conserve crawl budget.
  3. Create a clean XML sitemap containing only your top 50 priority export pages and reference it in robots.txt using the Sitemap directive.
  4. Set Crawl-Delay directives for aggressive crawlers and verify compliance by monitoring your server logs over the following week.
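
Putting the four steps together, a consolidated robots.txt might look like the sketch below; every path, file name, and the 20-second delay are placeholders to adapt to your own site and policies.

    Sitemap: https://www.example-exporter.com/sitemap-priority.xml

    # Shared rules for the major AI crawlers (placeholder paths)
    User-agent: GPTBot
    User-agent: Google-Extended
    User-agent: Claude-Web
    User-agent: PerplexityBot
    Disallow: /pdf/
    Disallow: /tag/
    Disallow: /search/
    Disallow: /blog/
    Allow: /blog/export-market-comparison-2026/

    # CCBot gets the same blocks plus a delay to limit server load
    User-agent: CCBot
    Crawl-delay: 20
    Disallow: /pdf/
    Disallow: /tag/
    Disallow: /search/
    Disallow: /blog/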

Frequently Asked Questions

Will blocking pages with robots.txt hurt my visibility in AI search?
Only if you block the wrong pages. Blocking low-value content like archives, duplicate pages, and admin URLs preserves crawl budget for your important pages. The key is to carefully select which sections to block and which to prioritize.

How can I tell which AI crawlers are visiting my site?
Check your server access logs for known AI crawler user-agent strings. You can also use tools like Cloudflare's bot management dashboard to identify AI crawler traffic patterns; note that Google Search Console's crawl stats report covers only Google's own crawlers.

Should I block all AI crawlers from my site?
Generally not recommended unless you have specific concerns about data usage. Each AI crawler represents a potential channel through which buyers discover your products. Selective blocking is better than blanket blocking. Focus on guiding crawlers to your best content rather than excluding them entirely.