Can anyone guess when we denied LLM scraper bots access to one of our Projects...
For context: all media is hosted on S3 and most assets are served via CDN. These bots seem to hit your sitemap first (and they don't stop; they stick around and repeat their scraping over and over). This Project is a directory with thousands of entries, so most of the bandwidth consumed by these bots is plain text. That is one hell of a lot of text (the page layouts themselves weigh very little)... These bots are a modern-day pest and out of control...
I've just had a similar issue with bots which targeted an events calendar and were relentless. We had to block them just to bring the server load back to normal levels.
This was similar: the load went up dramatically (first we saw a significant increase in database activity), as did the bandwidth, which was a clear sign something was happening. Despite a very healthy user base, we have never hit those levels even on our busiest days. I'm just very glad we no longer have to pay for bandwidth the way we used to. In the 'old' days we ran an incredibly popular gaming website that cost us thousands of pounds a month to host purely because of bandwidth usage; happily, with unlimited bandwidth these days, that is no longer an issue. Still, we don't want this activity and in no way have we agreed to it. Essentially it is intellectual property theft... These LLM companies really need reining in.
How did you deny the bots?
We did this using a new Cloudflare feature:
Previously we had tried using robots.txt to do this, but the bots simply ignore any directives specified within it (they are a law unto themselves). Recently Cloudflare announced the feature above; enabling it had an immediate and dramatic effect, as seen in the graph above.
This setting can be found here (it's available on the Free tier too):
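If you want finer control than the one-click toggle, Cloudflare WAF custom rules can also block on user agent. This is just a sketch of that alternative, not the setting used above; it assumes the standard Rules-language syntax, the bot names are taken from the robots.txt list further down, and the rule's action would still need to be set to Block in the dashboard:

(http.user_agent contains "GPTBot") or
(http.user_agent contains "ClaudeBot") or
(http.user_agent contains "CCBot") or
(http.user_agent contains "Bytespider")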
Does that block search engine crawlers?
No, not at all. The site is indexed correctly with no loss in indexed/served pages.
Damn, I sound like a Cloudflare pimp... They have an enhancement for that too (accessible on the Free tier), which works incredibly well. We see new pages indexed in around four to six hours, and blog entries even faster, sometimes within an hour or two (the site is built entirely in Wappler).
I use this in robots.txt:
User-agent: CCBot
Disallow: /
User-agent: ChatGPT-User
Disallow: /
User-agent: GPTBot
Disallow: /
User-agent: Google-Extended
Disallow: /
User-agent: Google-Other
Disallow: /
User-agent: Google-Other-image
Disallow: /
User-agent: Applebot-Extended
Disallow: /
User-agent: anthropic-ai
Disallow: /
User-agent: ClaudeBot
Disallow: /
User-agent: Claude-Web
Disallow: /
User-agent: Omgilibot
Disallow: /
User-agent: Omgili
Disallow: /
User-agent: FacebookBot
Disallow: /
User-agent: facebookexternalhit
Disallow: /
User-agent: Diffbot
Disallow: /
User-agent: Bytespider
Disallow: /
User-agent: ImagesiftBot
Disallow: /
User-agent: PerplexityBot
Disallow: /
User-agent: cohere-ai
Disallow: /
User-agent: Meta-ExternalAgent
Disallow: /
User-agent: Meta-ExternalFetcher
Disallow: /
User-agent: Timpibot
Disallow: /
User-agent: Webzio-Extended
Disallow: /
User-agent: YouBot
Disallow: /
User-agent: img2dataset
Disallow: /
User-agent: scrapy
Disallow: /
User-agent: PetalBot
Disallow: /
User-agent: FriendlyCrawler
Disallow: /
User-agent: Ai2Bot
Disallow: /
User-agent: Ai2Bot-Dolma
Disallow: /
User-agent: ValenPublicWebCrawler
Disallow: /
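As a quick sanity check that the file parses the way you intend, Python's standard-library robotparser can spot-check a few of the listed agents. A minimal sketch, assuming Python 3; the example.com URL is a placeholder for your own site:

from urllib.robotparser import RobotFileParser

# Placeholder URL; point this at your own robots.txt
rp = RobotFileParser("https://example.com/robots.txt")
rp.read()

# Check a few blocked AI agents plus a normal search crawler for comparison
for agent in ["GPTBot", "ClaudeBot", "CCBot", "Googlebot"]:
    allowed = rp.can_fetch(agent, "https://example.com/")
    print(f"{agent}: {'allowed' if allowed else 'blocked'}")

The AI agents should report blocked while Googlebot stays allowed, confirming the directives only affect the bots you listed.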
An update nightmare every time a new bot appears/disappears
Nobody said life as a webdev was easy