blocking AI crawlers with robots.txt

You've probably heard about those big nasty tech companies that crawl the internet and use anyone's content to train their Large Language Models. They might be crawling your blog or website without caring about any licensing or copyright restrictions you've listed.

Well, you can use your robots.txt file to stop them from using your content to train their AI models. At least, that works if they follow proper netiquette and actually respect your robots.txt rules. There's no way to know for sure, but it doesn't hurt to try.
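
One thing you can do is check your web server's access logs to see whether these bots are hitting your site at all, and whether they keep doing so after you block them. Here's a minimal sketch in Python that scans a log for known AI crawler user agents. The log path and the nginx-style format are assumptions, so adjust them for your own setup.

# Scan a web server access log for known AI crawler user agents.
# The path below is a hypothetical nginx default; change it as needed.
AI_BOTS = [
    "Amazonbot", "anthropic-ai", "Bytespider", "CCBot", "ChatGPT-User",
    "ClaudeBot", "Claude-Web", "cohere-ai", "Diffbot", "FacebookBot",
    "Google-Extended", "GPTBot", "magpie-crawler", "omgili",
    "PerplexityBot",
]

with open("/var/log/nginx/access.log") as log:
    for line in log:
        # User agents appear verbatim in common/combined log formats.
        if any(bot in line for bot in AI_BOTS):
            print(line.rstrip())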

Here’s an example of how to block a number of AI crawlers, including those from OpenAI, Facebook, Amazon, Google and more.

Add the following to your robots.txt file and upload it to the root of your website. I’m assuming you know what a robots.txt file is, but in case you don’t, you can read more about it on Wikipedia and various other resources.

User-agent: Amazonbot
Disallow: /
User-agent: anthropic-ai
Disallow: /
User-agent: Bytespider
Disallow: /
User-agent: CCBot
Disallow: /
User-agent: ChatGPT-User
Disallow: /
User-agent: ClaudeBot
Disallow: /
User-agent: Claude-Web
Disallow: /
User-agent: cohere-ai
Disallow: /
User-agent: Diffbot
Disallow: /
User-agent: FacebookBot
Disallow: /
User-agent: Google-Extended
Disallow: /
User-agent: GPTBot
Disallow: /
User-agent: magpie-crawler
Disallow: /
User-agent: omgili
Disallow: /
User-agent: omgilibot
Disallow: /
User-agent: PerplexityBot
Disallow: /
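
Once the file is live, you can sanity-check the rules with Python's built-in robots.txt parser. This is just a quick sketch: the bot name comes from the list above, and example.com is a placeholder for your own site.

from urllib.robotparser import RobotFileParser

# Parse the rules directly from a string. To check the live file instead,
# use rp.set_url("https://example.com/robots.txt") followed by rp.read().
rules = """
User-agent: GPTBot
Disallow: /
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())

print(rp.can_fetch("GPTBot", "https://example.com/"))         # False: blocked
print(rp.can_fetch("SomeRandomBot", "https://example.com/"))  # True: no matching rule

Note that matching is purely voluntary on the crawler's side; the parser only tells you what a well-behaved bot should do with your rules.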
