Endeavor to detect and counteract rampant web spiders and crawlers in Apache

Locutus · December 27, 2012, 12:03pm

Hey Virtualminers!

Recently my Virtualmin system fell “victim” to a few “web crawler attacks”, i.e. bots or users scraping entire websites at far too high a pace. The worst incident was a spider making 15+ requests per second against a PHP-heavy website, from several dozen IP addresses in two /27 subnets, totaling 14,500+ requests within 20 minutes, hogging the system’s CPU and triggering high-load alerts.

So I’m now looking for ways to automatically react to such incidents, instead of having to manually block IP addresses in firewalls when I get overload alerts. Of course I don’t want to adversely affect regular users in the process.

Basically, I’m looking for a way to counteract mild cases of (D)DoS-attacks on websites.

I pondered a few things… using an Apache module like mod_evasive or mod_qos to regulate page requests, using iptables to rate-limit connections, or using a log analyzer like fail2ban to block IPs that fetch too many pages too quickly.

All of those have some disadvantages. My current idea is a combination of sorts: analyze the Apache logs, detect rapid and uniform requests for many different pages from one IP or a set of similar IPs, and, if any are found, create temporary iptables rules that apply severe rate limiting to the offending IPs.
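To make the detection half of that idea concrete, here is a minimal Python sketch. It assumes the standard Apache common/combined log format, and the function name `find_rampant_ips` and all three thresholds are hypothetical values chosen for illustration, not tuned recommendations. It only checks request rate and the number of distinct pages per IP; it does not measure how uniform the request spacing is, and it does not group similar IPs.

```python
import re
from collections import defaultdict, deque
from datetime import datetime

# Matches the start of an Apache common/combined log line:
# client IP, timestamp, and the requested path.
LOG_RE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<ts>[^\]]+)\] "(?:\S+) (?P<path>\S+)[^"]*"'
)


def find_rampant_ips(lines, window=60, max_requests=300, min_paths=100):
    """Return IPs that made more than max_requests requests for at least
    min_paths different paths within any window of `window` seconds.
    All default thresholds are assumptions for illustration."""
    recent = defaultdict(deque)  # ip -> deque of (timestamp, path)
    flagged = set()
    for line in lines:
        m = LOG_RE.match(line)
        if not m:
            continue  # skip lines that are not in the expected format
        ts = datetime.strptime(m.group("ts"), "%d/%b/%Y:%H:%M:%S %z")
        hits = recent[m.group("ip")]
        hits.append((ts, m.group("path")))
        # Drop requests that have fallen out of the sliding window.
        while (ts - hits[0][0]).total_seconds() > window:
            hits.popleft()
        if len(hits) > max_requests:
            distinct = {path for _, path in hits}
            if len(distinct) >= min_paths:
                flagged.add(m.group("ip"))
    return flagged
```

It could be fed from a log file with something like `find_rampant_ips(open(path))`, where `path` points at the virtual host's access log.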

The goal is to leave regular users unscathed, and severely slow down (not completely block) rampant spiders or crawlers.
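As a rough illustration of “slow down, don’t block”, the following Python sketch builds (but does not run) an iptables command using the `hashlimit` match. The helper name `throttle_rule`, the bucket name `spider`, and the default rate and burst values are all assumptions. Note that it limits new TCP connections, so a bot that reuses keep-alive connections would be slowed less than the numbers suggest.

```python
import ipaddress


def throttle_rule(source, per_second=1, burst=5, port=80):
    """Return an iptables argument list that drops NEW connections from
    `source` once they exceed `per_second` (after an initial `burst`).
    Rate, burst, and port defaults are hypothetical."""
    # Validates the input; a bare address becomes a /32 network.
    net = ipaddress.ip_network(source, strict=False)
    return [
        "iptables", "-I", "INPUT",
        "-s", str(net),
        "-p", "tcp", "--dport", str(port),
        "-m", "conntrack", "--ctstate", "NEW",
        "-m", "hashlimit",
        "--hashlimit-name", "spider",      # assumed bucket name
        "--hashlimit-mode", "srcip",       # one bucket per source IP
        "--hashlimit-above", f"{per_second}/second",
        "--hashlimit-burst", str(burst),
        "-j", "DROP",
    ]
```

The list could be passed to `subprocess.run` as root. To make the rule temporary, the same arguments with `-D` in place of `-I` would remove it again after a chosen timeout.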

My question: Does anyone know of (a combination of) software that can already do the trick? Or would anyone be willing to participate in an endeavor to create a set of scripts to accomplish the goal? Might this (“Rampant Spider Protection”) even be something the Virtualmin team could be interested in for inclusion in their software?

tpnsolutions · December 28, 2012, 2:21am

Locutus,

== CLOUDFLARE ==

One of the things we’ve recently found useful against this and many other types of attack (as we’ll call them for the moment) is “CloudFlare”. Their service basically acts as a middle-man between visitors and your server, and helps protect against DoS and DDoS attacks, among other types of incidents.

== ROBOTS.TXT ==

You should read up on and deploy a “robots.txt” file at the root of your websites (http://www.robotstxt.org).

== OSSEC ==

Aside from that, a well-trained “intrusion detection system” will help mitigate such attacks. We use OSSEC on our systems, which has been super helpful in reducing malicious activity.

== FIREWALL ==

Finally, if you find a few frequent spiders hitting your site, and can confirm that they are using a block of IPs, you could easily block all their IPs by range in the firewall.
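To confirm that offending addresses really share a block, a small Python sketch like the one below can group them by subnet. The function name `busy_subnets`, the /27 prefix, and the `min_hosts` cutoff are assumptions for illustration, and the default prefix only makes sense for IPv4 addresses.

```python
import ipaddress
from collections import Counter


def busy_subnets(ips, prefix=27, min_hosts=3):
    """Return subnets (as CIDR strings) that contain at least `min_hosts`
    distinct addresses from `ips`. Prefix and cutoff are hypothetical."""
    counts = Counter(
        # strict=False masks off the host bits, giving the enclosing subnet.
        ipaddress.ip_network(f"{ip}/{prefix}", strict=False)
        for ip in set(ips)
    )
    return sorted(str(net) for net, n in counts.items() if n >= min_hosts)
```

Each returned CIDR string can then be used as the source of a single firewall rule instead of one rule per address.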

-Peter

Locutus · December 28, 2012, 10:28am

Thanks for your input!

I don’t wish to use a third-party service like Cloudflare for my purposes. Robots.txt doesn’t help if the bot to be slowed down doesn’t adhere to it.

Can OSSEC have rules like “detect accesses to DIFFERENT URLs, over an extended period, with a steady flow and high per-second rate”?

As I wrote, I know that I can block spiders in my firewall, that’s what I’m doing now. But that’s just reacting… I want to be proactive and block the stuff while it happens, not skim through logs when I receive “CPU overload” warnings.