What is robots.txt in SEO? 2026 Guide with Examples & AI Bots

robots.txt explained for 2026 — syntax, examples, AI crawler controls (GPTBot, Google-Extended, PerplexityBot), and the dangerous mistakes that quietly deindex your site.

robots.txt is one of the smallest files on your website — but a single typo can quietly de-rank your entire site for months. Here's a clear 2026 guide for marketers, SEOs and developers, including the new AI crawlers you now need to think about.

Quick answer

robots.txt sits at yourdomain.com/robots.txt and tells well-behaved crawlers what they can and can't crawl. In 2026 it is also where you set rules for AI bots like GPTBot, Google-Extended, ClaudeBot and PerplexityBot. It does NOT prevent indexing of pages that are linked from other sites. For that, use a noindex meta tag on a page crawlers are still allowed to fetch.

What robots.txt actually does (and doesn't)

✓ Tells crawlers which paths to skip
✓ Saves crawl budget on huge sites
✓ Points search engines to your XML sitemap
✗ Does NOT block indexing of externally-linked URLs
✗ Does NOT enforce security — anyone can read your robots.txt
✗ Does NOT stop malicious or non-compliant bots

Basic syntax

Each rule has a User-agent (which bot) and one or more Allow / Disallow lines (what it can do). Example for a typical SMB site:

User-agent: *
Disallow: /admin/
Disallow: /cart/
Allow: /

Sitemap: https://yourdomain.com/sitemap.xml
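
As a rough way to sanity-check the example above, Python's standard-library urllib.robotparser can parse the same rules and report what a crawler would be allowed to fetch. This is a minimal sketch, not a model of how any particular search engine evaluates the file: the standard-library parser applies the first matching rule in file order, while Google documents a longest-match rule, so results can differ on files with overlapping Allow and Disallow lines. The test paths (/admin/login, /blog/post) are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# The example file from the section above, one directive per line.
ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/
Disallow: /cart/
Allow: /

Sitemap: https://yourdomain.com/sitemap.xml
"""


def build_parser(text):
    """Parse robots.txt text that is already in memory (no network request)."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# Hypothetical URLs, used only to exercise the rules.
print(parser.can_fetch("Googlebot", "https://yourdomain.com/admin/login"))  # False
print(parser.can_fetch("Googlebot", "https://yourdomain.com/blog/post"))    # True
print(parser.site_maps())  # ['https://yourdomain.com/sitemap.xml']
```

To check a live site instead, the same class offers set_url() and read(), which fetch the file over the network before you call can_fetch().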

AI crawlers you should know in 2026

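GPTBot: OpenAI's crawler for collecting model training data (ChatGPT's user-initiated browsing uses a separate agent, ChatGPT-User)
Google-Extended: a control token rather than a separate crawler; it governs whether content Google crawls may be used for Gemini training and grounding, and it does not affect inclusion in Google Search
ClaudeBot: Anthropic's crawler
PerplexityBot: Perplexity's crawler for live answers and citations
CCBot: Common Crawl's crawler, whose public dataset is used to train many open-source models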

Should you allow or block AI bots?

For most service businesses (agencies, clinics, schools, B2B), allow them: being cited inside ChatGPT and Perplexity now sends real referral traffic and shapes brand perception. For premium publishers and original-research sites, blocking GPTBot/CCBot is a valid choice if you don't want your work training competitors' models. Google-Extended only governs whether your content is used for Gemini. Appearing in AI Overviews depends on normal Googlebot access and Search eligibility, so blocking Google-Extended does not remove you from AI Overviews, and allowing it does not get you in.

Example: allow Google + Perplexity, block training-only bots

User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: PerplexityBot
Allow: /

User-agent: Google-Extended
Allow: /

User-agent: *
Disallow: /admin/

Sitemap: https://yourdomain.com/sitemap.xml
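
One way to confirm that a file like this does what you intend is to ask, per bot, whether a given path is fetchable. The sketch below uses Python's standard-library urllib.robotparser; the helper name access_report and the test paths are assumptions made for illustration, and the parser's matching is simpler than what production crawlers implement.

```python
from urllib.robotparser import RobotFileParser

# The example file above, one directive per line.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: PerplexityBot
Allow: /

User-agent: Google-Extended
Allow: /

User-agent: *
Disallow: /admin/

Sitemap: https://yourdomain.com/sitemap.xml
"""


def access_report(robots_text, agents, path):
    """Return {agent: True/False} for whether each agent may fetch path."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return {agent: parser.can_fetch(agent, path) for agent in agents}


AGENTS = ["GPTBot", "CCBot", "PerplexityBot", "Googlebot"]

print(access_report(ROBOTS_TXT, AGENTS, "/blog/post"))
# {'GPTBot': False, 'CCBot': False, 'PerplexityBot': True, 'Googlebot': True}

print(access_report(ROBOTS_TXT, AGENTS, "/admin/"))
# {'GPTBot': False, 'CCBot': False, 'PerplexityBot': True, 'Googlebot': False}
```

Note the second result: PerplexityBot may fetch /admin/ because a bot that matches its own named group follows only that group and ignores the * group. If /admin/ should stay off-limits to a named bot, repeat the Disallow line inside that bot's group.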

Common commands

User-agent: the bot the rule group applies to (* = all bots without a group of their own)
Disallow: path prefixes the bot is asked not to crawl
Allow: re-opens a path inside a disallowed folder
Sitemap: absolute URL of your XML sitemap (can be repeated)
Crawl-delay: suggested delay in seconds between requests (Googlebot ignores it; Bing respects it)
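
When Allow and Disallow both match a URL, the Robots Exclusion Protocol standard (RFC 9309) says the most specific rule, meaning the longest matching path, should win, with Allow preferred on a tie. The sketch below illustrates that precedence for plain path prefixes only; it deliberately leaves out the * and $ wildcards, and the function name and the /private/ paths are hypothetical.

```python
def is_allowed(path, rules):
    """Decide crawl permission using longest-match precedence.

    rules is a list of (directive, prefix) pairs for ONE user-agent group,
    e.g. ("disallow", "/private/"). Wildcards are not handled in this sketch.
    """
    best_len = -1
    allowed = True  # no matching rule means the path may be crawled
    for directive, prefix in rules:
        if not prefix or not path.startswith(prefix):
            continue  # an empty value or a non-matching prefix has no effect
        is_allow = directive.lower() == "allow"
        longer = len(prefix) > best_len
        tie_goes_to_allow = len(prefix) == best_len and is_allow
        if longer or tie_goes_to_allow:
            best_len = len(prefix)
            allowed = is_allow
    return allowed


# Hypothetical group: block a folder but re-open one subfolder inside it.
RULES = [
    ("disallow", "/private/"),
    ("allow", "/private/press-kit/"),
]

print(is_allowed("/private/notes.html", RULES))          # False
print(is_allowed("/private/press-kit/logo.png", RULES))  # True
print(is_allowed("/blog/", RULES))                       # True
```

Because precedence depends on path length rather than line order, the Allow line works the same whether it appears above or below the Disallow line for crawlers that follow this rule.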

Best practices for 2026

Keep the file under 500KB (Google ignores anything past 500 KiB) and at the root of every domain & subdomain
Always include a Sitemap directive so crawlers can discover new URLs sooner
Use noindex meta tags (not robots.txt) to remove pages from search, and leave those pages crawlable so the tag can be read
Test in Google Search Console's robots.txt report after every deploy
Decide your AI bot stance deliberately; don't leave it to defaults
Review at every site migration; a leftover staging Disallow: / is one of the most common reasons new sites don't rank
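
Several of the checks above can be automated as part of a deploy step. The following is a minimal sketch, assuming the robots.txt content has already been read into a string; the function name audit_robots_txt, the choice of Googlebot as the probe agent, and the warning wording are all illustrative. It complements the Search Console report rather than replacing it.

```python
from urllib.robotparser import RobotFileParser

MAX_BYTES = 500 * 1024  # the size guideline from the list above


def audit_robots_txt(text, probe_agent="Googlebot"):
    """Return a list of human-readable problems found in robots.txt text."""
    problems = []

    if len(text.encode("utf-8")) > MAX_BYTES:
        problems.append("file is larger than 500KB")

    parser = RobotFileParser()
    parser.parse(text.splitlines())

    # A leftover staging "Disallow: /" shows up as a blocked homepage.
    if not parser.can_fetch(probe_agent, "/"):
        problems.append("homepage is blocked for " + probe_agent)

    # site_maps() returns None when no Sitemap line is present.
    if not parser.site_maps():
        problems.append("no Sitemap directive found")

    return problems


STAGING_LEFTOVER = "User-agent: *\nDisallow: /\n"
HEALTHY = (
    "User-agent: *\n"
    "Disallow: /admin/\n"
    "Sitemap: https://yourdomain.com/sitemap.xml\n"
)

print(audit_robots_txt(STAGING_LEFTOVER))
# ['homepage is blocked for Googlebot', 'no Sitemap directive found']
print(audit_robots_txt(HEALTHY))
# []
```

In a deploy pipeline, a non-empty result could fail the build so that a staging file never reaches production unnoticed.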

How robots.txt and XML sitemaps work together

robots.txt tells crawlers what NOT to visit. XML sitemaps tell them what TO visit. Together they shape your crawl budget. For most SMB sites under 500 pages, crawl budget isn't a concern — but the sitemap declaration in robots.txt still helps search engines discover new content faster.
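
Because the two files pull in opposite directions, a useful consistency check is to look for URLs that the sitemap advertises but robots.txt blocks. Here is a rough sketch, assuming both files are already loaded as strings; the function name blocked_sitemap_urls and the sample URLs are hypothetical, and sitemap index files are not handled.

```python
import xml.etree.ElementTree as ET
from urllib.robotparser import RobotFileParser

# Namespace used by standard XML sitemaps.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def blocked_sitemap_urls(robots_text, sitemap_xml, agent="Googlebot"):
    """Return sitemap URLs that robots.txt disallows for the given agent."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())

    root = ET.fromstring(sitemap_xml)
    urls = [
        loc.text.strip()
        for loc in root.findall("sm:url/sm:loc", SITEMAP_NS)
        if loc.text
    ]
    return [url for url in urls if not parser.can_fetch(agent, url)]


ROBOTS = "User-agent: *\nDisallow: /cart/\n"
SITEMAP = (
    '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
    "<url><loc>https://yourdomain.com/blog/</loc></url>"
    "<url><loc>https://yourdomain.com/cart/checkout</loc></url>"
    "</urlset>"
)

print(blocked_sitemap_urls(ROBOTS, SITEMAP))
# ['https://yourdomain.com/cart/checkout']
```

Any URL this returns is sending crawlers a mixed signal: either remove it from the sitemap or lift the Disallow rule.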

Debugging when something breaks

If pages drop from Google: (1) check robots.txt for an accidental Disallow, (2) inspect the URL in Search Console, (3) check X-Robots-Tag response headers, (4) check meta robots tags in the HTML. In our experience, most "my site disappeared" cases trace back to one of these four.
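
Steps (3) and (4) are easy to script once you have a page's response headers and HTML in hand. The sketch below uses only Python's standard library; the class and function names are illustrative, and it expects the headers as a plain dict and the HTML as a string, however you choose to fetch them.

```python
from html.parser import HTMLParser


class MetaRobotsFinder(HTMLParser):
    """Collect the content of <meta name="robots"> and <meta name="googlebot">."""

    def __init__(self):
        super().__init__()
        self.directives = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {name.lower(): (value or "") for name, value in attrs}
        if attr.get("name", "").lower() in ("robots", "googlebot"):
            self.directives.append(attr.get("content", "").lower())


def find_noindex(headers, html):
    """Return a list describing every noindex signal found for a page."""
    findings = []

    # Step 3: the X-Robots-Tag response header (header names are case-insensitive).
    for name, value in headers.items():
        if name.lower() == "x-robots-tag" and "noindex" in value.lower():
            findings.append("X-Robots-Tag header: " + value)

    # Step 4: meta robots tags in the HTML.
    finder = MetaRobotsFinder()
    finder.feed(html)
    for content in finder.directives:
        if "noindex" in content:
            findings.append("meta robots tag: " + content)

    return findings


print(find_noindex({"X-Robots-Tag": "noindex, nofollow"}, "<html></html>"))
# ['X-Robots-Tag header: noindex, nofollow']
print(find_noindex({}, '<head><meta name="ROBOTS" content="NOINDEX"></head>'))
# ['meta robots tag: noindex']
```

An empty list means neither signal is present, which points the investigation back to robots.txt or to the URL inspection result in Search Console.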

robots.txt is 5 lines of text that can either save your SEO or quietly destroy it. Treat it like a deployment checkpoint, not a one-time setup file.

Need help with technical SEO?

Pacewalk handles full technical SEO for Indian and overseas brands — robots.txt, sitemaps, schema, Core Web Vitals, AEO. Explore our SEO services or talk to our team for a free technical audit. Also read Indexing vs crawling in SEO and The crucial role of sitemaps in SEO.

Frequently asked

Questions people ask about this

What is robots.txt and why does it matter?

robots.txt is a small text file at yourdomain.com/robots.txt that tells web crawlers which parts of your site they're allowed (or asked) to crawl. It doesn't enforce security or stop indexing on its own — it's a polite instruction respected by good bots like Googlebot, ignored by malicious ones.

Should I block AI bots like GPTBot in 2026?

Depends on your goal. Block them if you don't want your content used to train AI models (publishers often do this). Allow them if you want to be cited inside ChatGPT and Perplexity; most service businesses benefit from being cited. Google AI Overviews follow your normal Googlebot rules rather than a separate AI bot rule.

Will robots.txt stop a page from showing in Google?

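No. Disallow tells Google not to crawl, but Google can still index the URL based on external links, usually as a bare URL with no description. To stop indexing, use a noindex meta tag or X-Robots-Tag header instead, and make sure the page is not disallowed in robots.txt, because Google has to crawl the page to see the noindex.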

What's the most dangerous robots.txt mistake?

"Disallow: /" — it blocks the entire site from crawling. We've seen brand-new sites stay invisible for months because a developer forgot to remove it after staging. Always validate at yourdomain.com/robots.txt and Google Search Console after every deploy.

Written by

Pacewalk Editorial

Experts in SEO, websites, branding and digital growth solutions. 12+ years helping Indian businesses scale online.
