
On improving Perlmonks site availability

by Corion (Pilgrim)
on Mar 25, 2026 at 13:30 UTC ( [id://11167470] )

For at least six months, the site has suffered greatly from residential proxies scraping it, most likely to gather data for AI training. This makes the site intermittently unavailable for humans.

Plans for mitigating the situation

Moving the anonymous parts of the site behind a CDN

Moving the anonymous parts of the site behind a CDN protects the machine(s) hosting the site from random invalid requests and also puts a far larger cache than we have between the site and the bots feeding the AI. This is not ideal, as I dislike adding intermediate services, but it is better to have an accessible site than no accessible site.

The CDN should be mildly smart, in the sense that we want to prevent whole classes of (invalid) requests from hitting the backend at all (a sketch of such a filter follows the list):

  • requests matching for example qr!^/www\.cpan\.org/!
  • requests not matching qr!^/index\.pl!
  • ... many more old / outdated / never valid URLs that the bots have ingested and mindlessly scrape
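
For illustration, here is a minimal Perl sketch of the kind of request classification meant here, assuming the CDN (or an edge worker) lets us express it as path patterns. The patterns mirror the examples above and are illustrative, not the final rule set:

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Classify a request path (including the query string) as 'reject'
    # or 'pass'. A real CDN would express these as its own edge rules;
    # the deny list here holds one of the examples from above.
    my @never_valid = (
        qr!^/www\.cpan\.org/!,    # mirrored CPAN paths that were never ours
    );
    my $entry_point = qr!^/index\.pl!;  # the only entry point the backend serves

    sub edge_decision {
        my ($path) = @_;
        return 'reject' if grep { $path =~ $_ } @never_valid;
        return 'reject' unless $path =~ $entry_point;
        return 'pass';
    }

    print edge_decision('/www.cpan.org/modules/'), "\n";      # reject
    print edge_decision('/index.pl?node_id=11167470'), "\n";  # pass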

Moving the site behind a CDN would mildly imply changing some settings for Anonymous Monk. In particular, the short-lived parts of a page, like the chatterbox, the CPAN nodelet and some other nodelets, would no longer be shown. Depending on the CDN setup, the nodelets could potentially be included dynamically.
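
If the CDN supports Edge Side Includes (ESI), which Fastly does, dynamic inclusion could look like the following minimal sketch; the fragment URL and its parameters are made up for illustration:

    # Hypothetical: the page template emits an ESI tag instead of the
    # rendered chatterbox; the CDN fetches and splices in the fragment
    # per request while the surrounding page stays cached.
    print qq{<esi:include src="/index.pl?node=chatterbox;displaytype=fragment"/>};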

Options for CDNs

  • Fastly - I think they are Perl-adjacent. People have mentioned them as competent and good to work with.
  • CloudFlare - I think I understand their setup. I don't like that they want to terminate SSL on their end, which to me mildly implies generating a fresh certificate and arranging for it to be shared between Pair and CloudFlare.

Moving the logged-in parts of the site to another machine

Having logged-in users access the site via (say) user.perlmonks.org on a different machine+IP address will ideally prevent bots from clogging the access lane of logged-in users, provided that the URL does not leak to the vibe-coded scrapers too much. That different machine could also be far more aggressive with its CDN/firewall/whatever rules and outright reject all requests that do not contain a valid session cookie.
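
As a rough illustration, such an outright rejection could be done with a mod_perl access handler, which fits the current managed Apache + mod_perl setup. This is a sketch only: the cookie name is a placeholder, and it merely checks that a session cookie is present rather than validating it against the database, which alone would already turn away the bulk of anonymous bot traffic.

    package PerlMonks::RequireSession;
    use strict;
    use warnings;
    use Apache2::RequestRec ();
    use Apache2::Const -compile => qw(FORBIDDEN OK);

    # Reject any request that carries no session cookie, so bots are
    # turned away before the Everything engine ever runs.
    sub handler {
        my ($r) = @_;
        my $cookies = $r->headers_in->{'Cookie'} || '';
        # 'usersession' is a stand-in for whatever cookie the engine
        # actually issues at login.
        return Apache2::Const::FORBIDDEN
            unless $cookies =~ /\busersession=\w+/;
        return Apache2::Const::OK;
    }
    1;

    # Enabled in the Apache configuration for user.perlmonks.org with:
    #   PerlAccessHandler PerlMonks::RequireSession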

This second machine might or might not be necessary, depending on whether the CDN already takes the brunt of the bots, in which case we could keep using the existing webserver machine. A separate machine would allow us to configure it far more aggressively for the expected kind of requests.

How can I help?

The site is currently hosted by pair.com, with managed Apache using mod_perl on one machine and a managed MySQL database on a second machine. If your contribution does not keep these parts, think really hard about whether your approach is tenable before posting it.

Please refrain from offering solutions unless you have proven experience with the Everything engine and with integrating your solution into it. Also refrain from posting your thoughts / brainstorming in the expectation that anybody will read and respond to it, unless you have applicable and actionable points to contribute.

What helps us:

  • concrete contacts and experience with a CDN and how it could work with our setup
  • concrete cost indications for a CDN setup (setup, bandwidth, number of requests, ...)
  • sponsoring contacts

What does not help:

  • Adding some random webserver program that "has better performance". The main problem is exhaustion of TCP connections, as the host cannot close and recycle the rejected TCP connections fast enough.
  • Blocking "the bots" / more IPs / ASNs. Residential proxies are, by design, distributed across a huge number of IP addresses.
  • Suggestions like "Just add them to robots.txt".
  • Adding something like Anubis in front of Apache. This would be very difficult with the current managed setup with Pair. Also, while it takes some load off the actual webserver, it doesn't change anything in terms of TCP connections.
  • Rewriting the site to move away from the Everything engine. While that might solve the problems, and the Everything engine is not great, I'm not interested in spending time on a long-running effort that might go nowhere once you lose interest.
  • Moving Things To The Cloud (AWS, GCP, Azure, whatever) - while this might scale the responses, the costs of running this site would explode, as none of the cloud hosters offers a sane package with a defined upper cost limit.

Re: On improving Perlmonks site availability
by dissident (Beadle) on Mar 30, 2026 at 21:44 UTC
    I haven't seen this idea in your list; no idea whether it is viable:
    What about providing a zipped download of the whole site, updated weekly or so?
    Maybe stripped of unnecessary/redundant info, so that using the download is actually easier than scraping?
      If there were any "intelligence" behind the scraping, it would have realized long ago that we only get dozens of posts with new content per week.

      The AI business is hungry and greedily tries to swallow as much content as possible, and many actors are throwing billions into training more bots.

      (Update: Moved the rest to Redirecting anonymous requests to offline copy on other servers)

      Cheers Rolf
      (addicted to the Perl Programming Language :)
      see Wikisyntax for the Monastery
