
How to limit AI access to your content without disappearing from the web (Part 2 - Technical strategies)

Total block, intermediate block, and alternative content: how to stay visible without giving away the substance (or the tokens).


From theory to practice

In Part 1 I defended a simple idea: if you block blindly, you can disappear; if you leave everything open, you give away the substance and we all pay the cost with no control over it. And that cost is not just “tokens”: it is also an environmental cost (electricity, the emissions associated with computation and, in many cases, water for cooling), plus the repeated traffic and storage every time a bot crawls the site again.

In this second part I get into the technical side: two realistic strategies (total block and intermediate block) and a third key piece, showing alternative content. Part 3 goes deeper into how to adapt that alternative content to further reduce token consumption.

Important: no measure is 100% reliable. The goal is to reduce surface and cost, and leave clear signals.

Signals before blocking: llms.txt and other clues

Before raising the wall, it is worth deciding what you want the AI to “understand” about your site.

llms.txt (useful signal, not standard)

/llms.txt is a proposal to give LLMs a “friendly” and controlled version of your site (context + key links). Today there are no guarantees of adoption: use it as a complementary hint, not as security. Even so, it is very useful in an intermediate strategy: if you limit the full HTML, it can be the “door” to offer exactly what you want them to understand.
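
As a reference, this is a minimal sketch of what an /llms.txt file can look like under the current proposal; the site name, descriptions, and URLs are placeholders to adapt to your own content:

# Your site name
> One-sentence description of what the site is about.

## Posts
- [How to limit AI access (Part 2)](https://example.com/posts/limit-ai-access-part-2.html): technical strategies to limit AI bots
- [Another article](https://example.com/posts/another-article.html): short description of what it covers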

Other useful options (without reinventing anything)

robots.txt: what it is and what it is NOT

robots.txt is a good-faith agreement: it indicates what a bot should crawl, but it does not prevent it by itself.

Two quick ideas:

  1. If a bot respects robots.txt, your file is the simplest (and compatible) way to control its behavior.

  2. If a bot does not respect robots.txt, you need enforcement layers: server rules, WAF, rate limiting, etc. (see the sketch below).
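
As a sketch of that second idea, this is roughly what a first enforcement layer can look like in Nginx; the agent names and limits are illustrative, not a curated list (use ai-robots-txt or your own logs for the real one):

# http {} context: flag unwanted bots by User-Agent (illustrative names)
map $http_user_agent $deny_bot {
    default 0;
    ~*(GPTBot|CCBot|Bytespider) 1;
}

# Basic per-IP rate limiting for everyone else
limit_req_zone $binary_remote_addr zone=perip:10m rate=10r/s;

server {
    # Reject flagged bots outright
    if ($deny_bot) {
        return 403;
    }

    location / {
        limit_req zone=perip burst=20 nodelay;
        try_files $uri $uri/ =404;
    }
}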

Compliance note (reputation and reporting)

As of today, OpenAI and Anthropic state that their bots are controlled via robots.txt and document how to allow/block their agents:

If you detect that a bot from these providers is not respecting your rules (and you have already verified syntax, caches, and that the User-Agent is not a spoof), report it: it is important to fix possible bugs and also for their reputation (if they promise to respect it, they have to comply).

Personal note (and a dose of realism)

My impression is that not all bots or scrapers will respect robots.txt, so it should not be your only defense.

Strategy 1: total block

This is the most direct reaction: completely block most AI bots, even though it puts us at risk of becoming invisible to those who use AI as a search engine.

Use ai-robots-txt (and do not reinvent the wheel)

The ai-robots-txt repository already has all the work done: a living list of user-agents + ready-to-copy examples. In addition to robots.txt, it includes server snippets (and the repo itself explains how to apply each one depending on your stack).

Ready-to-copy guides: robots.txt, Apache, Nginx, Caddy, HAProxy.

Practical recommendation: copy the file that fits you (or combine them: robots.txt + server blocking) and keep it updated with the project’s releases.
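
To give an idea of the shape of that file, here is a heavily trimmed sketch of the generated robots.txt; the real list is much longer and changes over time, so always copy the current version from the repository:

User-agent: GPTBot
User-agent: ClaudeBot
User-agent: CCBot
User-agent: Google-Extended
User-agent: PerplexityBot
Disallow: /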

IP blocking (real enforcement)

If you want a harder layer than robots.txt and user-agent control, some providers publish official IP ranges. With them you can block (403) specific bots or allowlist search bots and block the rest.

Tip: always use the official URLs (lists change); do not copy third-party lists.

Where to get official IPs (JSON)

Nginx

To block AI bots by IP in Nginx, you can create a file with the ranges to block and then integrate it into the server configuration. This way, any request from those IPs will be denied automatically.

  1. Create the file /etc/nginx/ai_bot_ips.conf with the IP ranges you want to block, for example:
geo $block_ai_ip {
    default 0;

    # OpenAI GPTBot (examples)
    132.196.86.0/24 1;
    52.230.152.0/24 1;

    # PerplexityBot (examples)
    3.224.62.45/32 1;
    107.20.236.150/32 1;
}
  2. Include this file at the http {} level (for example from /etc/nginx/conf.d/, which most setups load inside http {}), since the geo block is not allowed inside server {}, and apply the block in your server {}:
# http {} context (e.g. nginx.conf or a conf.d file)
include /etc/nginx/ai_bot_ips.conf;

server {
    # Deny requests coming from the listed ranges
    if ($block_ai_ip) {
        return 403;
    }
}

Remember: if you have a CDN or reverse proxy in front, make sure Nginx receives the real visitor IP so the block works correctly.
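
A minimal sketch of that adjustment with the realip module, assuming a reverse proxy or CDN in front; the range and header are placeholders, use the ones your provider documents:

# http {} context: trust the proxy/CDN and take the client IP from its header
set_real_ip_from 203.0.113.0/24;    # placeholder: your proxy/CDN ranges
real_ip_header X-Forwarded-For;     # or the provider-specific header (e.g. CF-Connecting-IP)
real_ip_recursive on;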

Apache 2.4+

In Apache 2.4+ you can block IP ranges easily with the Require not ip directive. It lets you deny access to specific addresses or ranges both in the VirtualHost configuration and at the .htaccess level, if your hosting allows it. You only need to list the ranges or addresses to block; clients outside them keep accessing normally.

Example in VirtualHost:

<Directory "/var/www/html">
  <RequireAll>
    Require all granted

    # IP/CIDR block (examples)
    Require not ip 132.196.86.0/24
    Require not ip 52.230.152.0/24
    Require not ip 3.224.62.45/32
  </RequireAll>
</Directory>

Example in .htaccess (requires AllowOverride to be enabled):

<RequireAll>
  Require all granted
  Require not ip 132.196.86.0/24
  Require not ip 52.230.152.0/24
</RequireAll>
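
As a reminder of that requirement, this is the server-side setting that lets the .htaccess rules take effect (the path is a placeholder; AuthConfig is the override class that covers the Require directives):

<Directory "/var/www/html">
  # Allow .htaccess to use authorization directives such as Require
  AllowOverride AuthConfig
</Directory>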

Helper script (bot-ip-ranges.sh)

To make the work easier, I created bot-ip-ranges.sh, a script designed specifically for this series: it downloads and normalizes IP ranges of bots from official sources, so you can integrate them into your configuration without doing it by hand.

Repository: https://github.com/Len4m/bot-ip-ranges.sh

Quick command to download, grant permissions, and run (returns the list of IPs):

curl -fsSL https://raw.githubusercontent.com/Len4m/bot-ip-ranges.sh/main/bot-ip-ranges.sh -o /tmp/bot-ip-ranges.sh && chmod +x /tmp/bot-ip-ranges.sh && /tmp/bot-ip-ranges.sh

Real cost of total blocking

Strategy 2: intermediate block (stay visible)

The key is to differentiate between search bots (allow) and training bots (block), applying different rules to each type in order to achieve visibility in AI results without facilitating training with your content.

This way, you can keep appearing in AI Search without all your content ending up in training datasets.

1) Bot control (search vs training)

More and more providers separate their bots by purpose. OpenAI, for example, distinguishes GPTBot (training), OAI-SearchBot (search), and ChatGPT-User (user-initiated requests), and Anthropic does the same with ClaudeBot, Claude-SearchBot, and Claude-User.

An example robots.txt with this philosophy (adjust it to the live list of ai-robots-txt):

# Training (block)
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

# Search (allow, if you want visibility)
User-agent: OAI-SearchBot
Allow: /

User-agent: Claude-SearchBot
Allow: /

# User-initiated requests (optional)
# (These agents are usually better handled with rate limiting and/or IP allowlist)
User-agent: ChatGPT-User
Allow: /

User-agent: Claude-User
Allow: /

2) Route control (teaser vs full content)

The cleanest strategy (if you can adapt it) is to separate routes: keep the full articles under one path (for example /blog/) and publish a reduced, AI-oriented version under another (for example /ia-content/).

Then, in robots.txt, point each type of bot at the version you want it to see: training bots only get the reduced routes, while search bots can see whichever version you prefer.

Example:

# AI bots: only AI content
User-agent: GPTBot
Disallow: /blog/
Allow: /ia-content/

User-agent: ClaudeBot
Disallow: /blog/
Allow: /ia-content/

# Search bot (option A: allow everything)
User-agent: OAI-SearchBot
Allow: /

# Search bot (option B: allow only AI content)
# User-agent: OAI-SearchBot
# Disallow: /blog/
# Allow: /ia-content/

User-agent: *
Allow: /

In addition to User-Agent control, you can also apply IP blocking. As we saw in IP blocking (real enforcement), some providers publish JSON lists with the IP ranges of their bots. You can filter those ranges to keep only the IPs of the training bots (excluding search and user bots) and block just the ones you really do not want accessing your site.

To make this filtering easier, you can use the script mentioned earlier:

https://github.com/Len4m/bot-ip-ranges.sh

For example, to get only the IPs of training bots, run:

./bot-ip-ranges.sh --exclude-search --exclude-user

That gives you a list to block directly from Nginx, Apache, or your WAF, keeping you visible to search AI and users but not for training.
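
For example, assuming the script prints one IP or CIDR per line (check the repository for the exact output format), you could turn that output into entries ready for the geo include used in the advanced Nginx example later in this article:

./bot-ip-ranges.sh --exclude-search --exclude-user | awk '{print $1 " 1;"}' > /etc/nginx/ai_bot_ip_entries.conf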

Show alternative content to bots

In Part 3 I go into detail on how to adapt that alternative content to minimize tokens (what to include, what to omit, and how to structure it without losing context). Here I limit myself to two implementation patterns. They are not mutually exclusive.

Prepare alternative content

You create an alternative version per article (plain text, Markdown, or an ultra-minimal format). Ideally it should be short, structured (title, summary, key points, tags), and it should always point back to the canonical URL of the full article.

Example of a minimal TOON representation (it could also be plain text, JSON, or Markdown) of this same article, optimized for LLMs and low cost:

id: limitar-acceso-ia-parte-2
url: https://len4m.github.io/es/posts/limitar-acceso-ia-contenido-sin-desaparecer-parte-2.html
lang: es
title: Limitar acceso IA a tu contenido (Parte 2 - Estrategias técnicas)
summary: Bloqueo total, bloqueo intermedio y contenido alternativo para bots.
points: bloqueo selectivo, contenido alternativo, robots.txt
tags: ai, web, robots, ia-limitation
updated: 2026-01-17
cta: Lee el contenido completo en la URL canónica.

Create an alternative file with the .toon format (for example, limitar-acceso-ia-contenido-sin-desaparecer-parte-2.toon), although it could also be plain text, Markdown, or JSON, and link it both in llms.txt and in the HTML using <link rel="alternate" type="text/plain" href="/ia-content/limitar-acceso-ia-contenido-sin-desaparecer-parte-2.toon">.
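
If you also expose it through llms.txt, an entry along these lines can point to the compact version (the section name and description are placeholders):

## AI-friendly content
- [Limitar acceso IA a tu contenido (Parte 2)](https://len4m.github.io/ia-content/limitar-acceso-ia-contenido-sin-desaparecer-parte-2.toon): compact version of this article for LLMs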

I talk about all of this (how to design and compact alternative content, further reduce exposed tokens, …) in depth in Part 3: More token reduction.

Same URL, different response by User-Agent

If you do not want to create new routes, you can serve something else when you detect a bot.

Nginx example (conceptual)

  1. Detect bots by User-Agent (ideally relying on a list like ai-robots-txt):
map $http_user_agent $is_ai_bot {
    default 0;
    ~*(GPTBot|ClaudeBot|ChatGPT-User|Claude-User|OAI-SearchBot|Claude-SearchBot) 1;
}
  2. On article routes, if it is a bot, try to serve a .toon file (or txt, json, md):
location ^~ /blog/ {
    # Humans (default): full content
    try_files $uri $uri/ /index.html;

    # Bots: serve the .toon alternative; "last" re-routes the request so the
    # /ia-content/ location below handles it (with "break" it would not)
    if ($is_ai_bot) {
        rewrite ^/blog/(.*)$ /ia-content/$1.toon last;
    }
}

location ^~ /ia-content/ {
    default_type text/plain;
    try_files $uri =404;
}

Nginx example (advanced: User-Agent + IP)

If you want more reliability, you can combine User-Agent and IP ranges (see IP blocking (real enforcement)):

# UA bot detection
map $http_user_agent $is_ai_bot {
    default 0;
    ~*(GPTBot|ClaudeBot|ChatGPT-User|Claude-User|OAI-SearchBot|Claude-SearchBot) 1;
}

# IP ranges (generated list with "1;" values). Note: here the included file must
# contain only "CIDR 1;" entries, without the geo { } wrapper used in the earlier
# standalone example, so keep it in a separate file.
geo $is_ai_ip {
    default 0;
    include /etc/nginx/ai_bot_ip_entries.conf;
}

# Require UA or IP to serve alternate content
map "$is_ai_bot:$is_ai_ip" $serve_ai_alt {
    default 0;
    "1:0" 1;
    "0:1" 1;
    "1:1" 1;
}

location ^~ /blog/ {
    try_files $uri $uri/ /index.html;

    if ($serve_ai_alt) {
        # "last" so the /ia-content/ location below serves the file
        rewrite ^/blog/(.*)$ /ia-content/$1.toon last;
    }
}

location ^~ /ia-content/ {
    default_type text/plain;
    try_files $uri =404;
}

Important points:

Maintenance and monitoring (must-have)

Regardless of the strategy you choose (total block, intermediate block, or alternative content), maintenance and monitoring are essential for the measures to remain effective: keep your robots.txt and server snippets in sync with the ai-robots-txt releases (new agents appear constantly), refresh the IP ranges from the official sources from time to time (they change), and review your access logs to check which bots actually hit the site and whether they respect your rules.
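
As a minimal example of that last point, a quick look at the access log shows which AI user-agents are hitting the site and from how many IPs; the log path and agent names are illustrative, adapt them to your setup:

# Requests per client IP for a few known AI user-agents (combined log format assumed)
grep -Ei 'GPTBot|ClaudeBot|OAI-SearchBot|PerplexityBot' /var/log/nginx/access.log \
  | awk '{print $1}' | sort | uniq -c | sort -rn | head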

And I return to the idea from the beginning: the goal is not “to win a war”. It is to decide what you show, to whom, and at what cost.

To make the process easier, I created the script bot-ip-ranges.sh, which I used to run tests while writing this article. It is verified as of the date of writing and greatly simplifies blocking bots by IP compared to doing it by hand. Check the repository for instructions and examples.

If you want to squeeze token reduction even more, continue with Part 3.

References and resources