Make sure Googlebot can discover every page on your export site and understand which content belongs in its index -- especially across language and country variants.
Carlos Rojas runs a Colombian coffee export company that sources from small farms in the Huila region. His website had beautifully translated Spanish and English versions, detailed product pages for each coffee lot, and an extensive blog about sustainable farming. Yet after six months, Google had indexed only 30 of his 200 English-language pages. A quick inspection revealed that his English subdirectory was blocked by a wildcard rule in robots.txt that his developer had added during a staging migration and never removed. Googlebot was being turned away from the entire English section of the site without Carlos ever knowing.
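The exact rule on Carlos's site isn't shown here, but a leftover wildcard like the following hypothetical one produces exactly this symptom: every URL under /en/ is blocked, even though the rule was meant for a staging path.

```
# Added during a staging migration and never removed.
# Intended to block /en-staging/, but the wildcard also matches /en/.
User-agent: *
Disallow: /en*
```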
Crawlability refers to the ability of search engine bots to access and navigate your website. If Googlebot cannot reach a page, it cannot be indexed, and if it cannot be indexed, it cannot rank. For export sites with multiple language versions, specialized product catalogs, or complex navigation, crawlability issues are among the most common and damaging technical SEO problems. A single misconfigured robots.txt rule or a stray noindex tag can keep your most important export pages invisible to search engines for months.
This lesson covers crawl budget management, robots.txt best practices for international sites, and how to monitor indexing status in Google Search Console. You will learn how to ensure that every page you want ranked is discoverable and eligible for indexing.
Crawl budget is the number of URLs Googlebot will crawl on your site within a given timeframe. For large export sites with dozens of language versions and thousands of product pages, crawl budget is a real constraint. Google allocates more crawl budget to sites it considers important and well-maintained, but it will still prioritize certain pages over others. If Googlebot spends its crawl budget on duplicate category filter pages, printer-friendly templates, or paginated archives, it may never reach your newest product pages.
Maximize your crawl budget by blocking low-value URLs in robots.txt. Common candidates include sort and filter parameters (e.g., ?sort=price&color=red), internal search result pages, printer-friendly versions, and staging or test directories. For multi-language sites, be careful not to block entire language subdirectories unless they contain truly duplicate content. Use the noindex directive sparingly -- prefer robots.txt blocking for low-value crawl fodder, and reserve meta robots noindex for pages you want kept out of the index but still crawlable (such as thin affiliate pages).
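A sketch of what such blocking rules might look like -- the paths and parameter names here are hypothetical and should be adapted to your own URL structure:

```
User-agent: *
# Faceted navigation and sort parameters that generate duplicate URLs
Disallow: /*?sort=
Disallow: /*?*filter=
# Internal search results and printer-friendly templates
Disallow: /search/
Disallow: /*/print/
# Test directories that should never be crawled in production
Disallow: /staging/
```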
Monitor your crawl statistics in Google Search Console under the Settings > Crawl Stats report. Look for trends in total crawl requests, average response time, and total download size. A sudden drop in crawl activity can indicate a technical problem, while a high average response time suggests your server may be too slow for Googlebot, causing it to back off. Export sites hosted in one region but targeting users in another should consider using a CDN or a server closer to their target market to improve crawl efficiency.
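Your own server logs are an independent way to watch crawl activity between Search Console updates. A minimal sketch, assuming the common Apache/Nginx combined log format (adjust the regex and file path to your setup):

```python
import re
from collections import Counter
from datetime import datetime

# Matches the timestamp and request path of a combined-format log line.
LOG_LINE = re.compile(r'\S+ \S+ \S+ \[(?P<ts>[^\]]+)\] "\S+ (?P<path>\S+)')

def googlebot_requests_per_day(log_path):
    """Count requests per day from clients claiming to be Googlebot."""
    daily = Counter()
    with open(log_path) as log:
        for line in log:
            # Crude user-agent filter; verify via reverse DNS before trusting it.
            if "Googlebot" not in line:
                continue
            match = LOG_LINE.match(line)
            if match:
                # Timestamps look like "10/Mar/2024:13:55:36 +0000".
                day = datetime.strptime(
                    match.group("ts").split(":")[0], "%d/%b/%Y").date()
                daily[day] += 1
    return daily

for day, count in sorted(googlebot_requests_per_day("access.log").items()):
    print(day, count)
```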
Your robots.txt file is the first thing Googlebot checks when it arrives on your site. A well-configured robots.txt tells the crawler which areas to explore and which to skip. For export sites, the most common mistakes are blocking entire language subdirectories unintentionally and using disallow rules that are too broad. Always test your robots.txt rules before deploying -- Google Search Console's robots.txt report shows how Google parses your live file, and you can sanity-check proposed rules locally, as in the sketch below.
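A minimal local check using Python's standard-library robotparser. Note its limits: it implements the original exclusion protocol and does not understand Google's * and $ wildcard extensions, so wildcard rules still need to be verified in Search Console. The rules and URLs here are hypothetical.

```python
from urllib import robotparser

# Proposed rules to sanity-check before deploying (prefix rules only --
# urllib.robotparser does not support Google's wildcard syntax).
RULES = """\
User-agent: *
Disallow: /search/
Disallow: /en/blog/tags/
Disallow: /admin/
"""

rp = robotparser.RobotFileParser()
rp.parse(RULES.splitlines())

for url in (
    "https://example.com/en/products/huila-lot-12",     # should be crawlable
    "https://example.com/en/blog/tags/sustainability",  # should be blocked
    "https://example.com/search/?q=coffee",             # should be blocked
):
    print(rp.can_fetch("Googlebot", url), url)
```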
Structure your robots.txt to allow all language versions of your main content while blocking administrative and duplicate content areas. A typical export site robots.txt might allow crawling of /en/, /de/, /fr/, /es/ but disallow /en/blog/tags/, /de/search/, /admin/, and any URL parameters that generate duplicate or thin content. Use the Allow directive to override a broader Disallow when needed. For example, you might disallow /en/ but then allow /en/products/ -- though this is rarely the right pattern for export sites.
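Put together, a robots.txt along those lines might look like this (the domain and parameter names are hypothetical):

```
User-agent: *
# Administrative and duplicate-content areas
Disallow: /admin/
Disallow: /en/blog/tags/
Disallow: /de/search/
# Parameterized duplicates (Google supports * wildcards in these rules)
Disallow: /*?sort=
Disallow: /*?*filter=

# Allow can carve an exception out of a broader Disallow when needed --
# rarely the right pattern for export sites, as noted above:
# Disallow: /en/
# Allow: /en/products/

Sitemap: https://example.com/sitemap.xml
```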
Remember that robots.txt is a directive, not a guarantee. Google may still index pages blocked by robots.txt if it discovers them through external links, but because it cannot crawl them, it indexes only the URL itself, typically with no description. Never rely on robots.txt to keep sensitive pages private -- use authentication for that. For export sites with a staging or development subdomain, add a robots.txt with Disallow: / on the staging environment to prevent Google from indexing test content.
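The staging file is two lines -- just make sure it never ships to production:

```
# robots.txt for staging.example.com only (hypothetical hostname)
User-agent: *
Disallow: /
```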
Google Search Console's Indexing report shows you exactly which pages from your site are in Google's index and which are not, along with the reason for exclusion. For export sites, filter this report by language version or subdirectory to see how each market's content is performing. If your German pages show a high number of "Crawled - currently not indexed" entries, it could mean Google found the pages but deemed them low quality or duplicative of other content.
Use the URL Inspection tool to check individual pages. Enter a URL from each language version and verify that Google can access it, render it, and understand its canonical URL. The tool will also show you the last crawl date, any indexing errors, and the page's mobile usability status. For export sites, check at least one representative URL from each language version every week to catch issues early.
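These weekly spot checks can be scripted against the Search Console URL Inspection API. A minimal sketch using google-api-python-client, assuming a service account with access to the property; the property URL, credentials file name, and page URLs are hypothetical, and the response fields you need may differ from those shown:

```python
from google.oauth2 import service_account
from googleapiclient.discovery import build

SCOPES = ["https://www.googleapis.com/auth/webmasters.readonly"]
creds = service_account.Credentials.from_service_account_file(
    "service-account.json", scopes=SCOPES)
service = build("searchconsole", "v1", credentials=creds)

SITE = "https://example.com/"
# One representative URL per language version.
urls = [f"https://example.com/{lang}/products/" for lang in ("en", "de", "fr", "es")]

for url in urls:
    response = service.urlInspection().index().inspect(
        body={"inspectionUrl": url, "siteUrl": SITE}).execute()
    status = response["inspectionResult"]["indexStatusResult"]
    print(url, status.get("coverageState"), status.get("lastCrawlTime"))
```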
Set up email alerts in Google Search Console for critical issues like a sudden drop in indexed pages, a spike in 404 errors, or manual actions. These alerts can save you weeks of lost visibility. When you discover indexing issues, fix them immediately and use the "Request Indexing" button in the URL Inspection tool to prompt Google to recrawl the affected pages. For large batches of updated pages, submit a new sitemap, or -- for eligible content types such as job postings -- ping Google via the Indexing API.
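Sitemap resubmission can also be automated through the Search Console API. A minimal sketch, with a hypothetical property and sitemap URL, assuming a service account with write access:

```python
from google.oauth2 import service_account
from googleapiclient.discovery import build

SCOPES = ["https://www.googleapis.com/auth/webmasters"]
creds = service_account.Credentials.from_service_account_file(
    "service-account.json", scopes=SCOPES)
service = build("searchconsole", "v1", credentials=creds)

# Resubmit the sitemap after a large batch of pages has changed.
service.sitemaps().submit(
    siteUrl="https://example.com/",
    feedpath="https://example.com/sitemap.xml",
).execute()
```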
For small sites with under 500 pages, crawl budget is rarely an issue. But as you add multiple language versions and a large product catalog, the total URL count can grow quickly. If you have 3,000 total URLs across five languages and Google only crawls 200 per day, important pages may wait weeks to be discovered, as the rough calculation below shows. Blocking low-value URLs helps focus the budget on pages that matter.
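The back-of-the-envelope math, using the figures above:

```python
total_urls = 3000       # five language versions combined
crawled_per_day = 200   # observed daily crawl rate
print(total_urls / crawled_per_day)  # 15.0 days for one full pass, at best
```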
Use noindex for pages you want Google to crawl but not index -- for example, thin affiliate pages, thank-you pages, or content with little unique value. Use robots.txt Disallow for pages Google should not even crawl -- like admin sections, staging environments, or infinite pagination archives. The key difference is that noindex keeps the page out of search results while still letting Google crawl it and follow its links, whereas Disallow prevents crawling entirely. The two do not combine: a page blocked in robots.txt cannot be crawled, so Google will never see its noindex tag.
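The noindex directive lives on the page itself (or in an X-Robots-Tag HTTP header), in contrast to the robots.txt rules shown earlier:

```html
<!-- In the <head> of the page: Google may crawl this URL and follow its
     links, but the page is dropped from search results. -->
<meta name="robots" content="noindex">
```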
Recrawl frequency depends on Google's assessment of your site's importance and how often content changes. A typical export site can expect recrawls every few days to a few weeks for the homepage and important product pages. Less important or rarely updated pages may go months between recrawls. You can prompt faster recrawling by submitting sitemap updates, getting new external links pointing to the URL, or -- for eligible content types -- pinging Google via the Indexing API.