Attack of the Bots
It sounds like the plot of a low-budget sci-fi adventure movie, but the internet is under attack by hordes of bad bots. The attack has been underway for years, and it has noticeably worsened over the last few months. These automated crawlers pay no attention to the established robot exclusion protocol (so your robots.txt file is of no help), and when they hit, they hit hard: you may see your site traffic jump threefold or more as the bad bots sink their greedy teeth into your site. They may hit your site with literally thousands of page requests in a couple of minutes. Why? Bad bots serve a number of purposes, some of which include:
- Scraping for content
- Gathering commercial intelligence
- Checking for copyright violations
- Building subject-specific search databases for the purpose of generating spammy lists of links
- Advertising sites in server logs, where webmasters will find them
With others, who knows what they’re doing? The bottom line for you, as a site owner, is that many of these bad bots do nothing of any value whatsoever for you.
In fact, it’s not just that they do nothing of value for you. They do plenty that will cost you. More non-human traffic from bad bots means lower apparent conversion rates, higher bandwidth bills, and a lack of reliable tracking of the most fundamental measure of site performance — page views. I know of several colleagues who boast some pretty darned impressive traffic numbers but who have no idea what I’m talking about when I ask which bots they regularly exclude from their site. The trouble is, unless you are taking some very specific steps to handle bot activity, it’s likely you actually have no idea at all how much genuine traffic your site receives!
To make matters worse, the bigger and more well-positioned in the search engines your site is, the more significant the impact of bad bots becomes.
Risk Factors for Bad Bot Hits
From our experience, it’s been easy to identify several risk factors that can make your bad bot experience even worse:
- High numbers of pages
  - Because bad bots have a greater appetite for pages than regular human visitors, the bot problem is worst if your site has a high number of pages. The most obnoxious bots will hit every single page on your entire site, sometimes more than once. If your site has only 10 pages, this might barely register as a blip; a normal human visitor could easily rack up 10 or 20 or 30 page views on a single visit. On the other hand, if your site has 10,000 pages, the impact could be comparatively huge; no normal human visitor will rack up 10 or 20 or 30 thousand page views on a single visit.
- Good search engine positioning
  - If your site has a high profile in the search engines for keywords that bot-wielding fiends covet, your site is more likely to be discovered and crawled for untoward purposes like content scraping or spammy link directory building.
- Hosting far up the food chain
  - Most serious affiliates host multiple websites either via a reseller account, or on a virtual private server or dedicated server. The farther up the food chain you buy your hosting, the more you are on your own for handling server-level and network-level maintenance tasks, and the less likely it is that someone else is going to be dedicating time every single week to keep up with the bad bot problem.
What Can You Do to Exclude Bad Bots?
To my knowledge, the only reliable way to exclude bad bots from your site as an ordinary user (as distinct from those who run datacenters) is at the level of the server software. For Apache, that means mod_rewrite rules either in your site’s root level .htaccess file or in httpd.conf. The advantage of using .htaccess is that you can do it yourself, individually for any site, while the advantage of httpd.conf is that your rewrite rules will run more quickly (httpd.conf is read once when the server starts, whereas .htaccess files are re-read and re-parsed on each request) and can apply across all sites hosted on that server.
Excluding Bad Bots Yourself
If you want to handle bad bots yourself, be warned that the bot population mutates on a weekly basis — to keep up the battle, you’ll need to pay attention to your server logs and potentially modify your mod_rewrite rules at least once a week. Before you commit to that sort of a battle, it’s obviously worth examining your server logs in some detail and trying to get some estimate of the severity of the problem you’re facing. Also compare your site against the risk factors described above to get a feel for how much worse the problem might become.
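As a starting point for that estimate, here is a minimal Python sketch that tallies requests per user agent. It assumes your server writes the common "combined" log format, where the user agent is the last quoted field. The function names and sample log lines are hypothetical, made up purely for illustration.

```python
import re
from collections import Counter

# Assumption: "combined" log format, i.e. each line ends with
#   "request" status bytes "referer" "user-agent"
# Lines that do not fit this shape are simply skipped.
LOG_LINE = re.compile(r'"[^"]*" \d{3} \S+ "[^"]*" "(?P<agent>[^"]*)"\s*$')


def count_by_user_agent(lines):
    """Return a Counter mapping each user-agent string to its request count."""
    counts = Counter()
    for line in lines:
        match = LOG_LINE.search(line)
        if match:
            counts[match.group("agent")] += 1
    return counts


def top_agents(counts, n=10):
    """Return (agent, hits, share_of_total) tuples for the n busiest agents."""
    total = sum(counts.values())
    return [(agent, hits, hits / total) for agent, hits in counts.most_common(n)]


# Hypothetical sample lines, not taken from any real log.
SAMPLE = [
    '203.0.113.5 - - [01/Jan/2024:00:00:01 +0000] "GET /a.html HTTP/1.1" 200 512 "-" "ExampleScraper/1.0"',
    '203.0.113.5 - - [01/Jan/2024:00:00:01 +0000] "GET /b.html HTTP/1.1" 200 734 "-" "ExampleScraper/1.0"',
    '203.0.113.5 - - [01/Jan/2024:00:00:02 +0000] "GET /c.html HTTP/1.1" 200 291 "-" "ExampleScraper/1.0"',
    '198.51.100.7 - - [01/Jan/2024:00:00:09 +0000] "GET / HTTP/1.1" 200 1043 "-" "Mozilla/5.0 (hypothetical browser)"',
]

if __name__ == "__main__":
    for agent, hits, share in top_agents(count_by_user_agent(SAMPLE)):
        print(f"{hits:6d}  {share:6.1%}  {agent}")
```

To run it against a real log, pass an open file object to count_by_user_agent instead of SAMPLE. An unfamiliar agent that accounts for a large share of hits is a candidate for a closer look, though the share alone does not prove it is a bad bot.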
If you decide to go ahead with it, the single best resource I know for crafting mod_rewrite rules for your site is a multi-year discussion on excluding bad bots over at WebMasterWorld. Here’s an example of what you’ll be after, a partial set of rules taken from one of my own .htaccess files:
RewriteEngine On
# Address harvesters
RewriteCond %{HTTP_USER_AGENT} ^(autoemailspider|ExtractorPro) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^E?Mail.?(Collect|Harvest|Magnet|Reaper|Siphon|Sweeper|Wolf) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} (DTS.?Agent|Email.?Extrac) [NC,OR]
RewriteCond %{HTTP_REFERER} iaea\.org [NC,OR]
# Download managers
RewriteCond %{HTTP_USER_AGENT} ^(Alligator|DA.?[0-9]|DC\-Sakura|Download.?(Demon|Express|Master|Wonder)|FileHound) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^(Flash|Leech)Get [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^(Fresh|Lightning|Mass|Real|Smart|Speed|Star).?Download(er)? [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^(Gamespy|Go!Zilla|iGetter|JetCar|Net(Ants|Pumper)|SiteSnagger|Teleport.?Pro|WebReaper) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^(My)?GetRight [NC,OR]
# Image-grabbers
RewriteCond %{HTTP_USER_AGENT} ^(AcoiRobot|FlickBot|webcollage) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^(Express|Mister|Web).?(Web|Pix|Image).?(Pictures|Collector)? [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Image.?(fetch|Stripper|Sucker) [NC,OR]
# Gray-hats
RewriteCond %{HTTP_USER_AGENT} ^(Atomz|BlackWidow|BlogBot|EasyDL|Marketwave|Sqworm|SurveyBot|Webclipping\.com) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} (girafa\.com|gossamer\-threads\.com|grub\-client|Netcraft|Nutch) [NC,OR]
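and more… Again, this is just a part of one of my files (the last condition shown still ends in [OR], and the RewriteRule that closes the chain is not included), so do not copy this text into your own .htaccess file and expect it to work!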
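Before deploying patterns like these, it can help to try them against a few user-agent strings offline. The sketch below copies three conditions from the excerpt and checks them with Python's re module, using re.IGNORECASE to stand in for the [NC] flag. Apache evaluates its patterns with PCRE, a different engine, so treat this as a rough sanity check and not a guarantee of identical behavior. The is_blocked helper and the sample strings are hypothetical.

```python
import re

# Three conditions copied from the excerpt; [NC] is mirrored by re.IGNORECASE.
PATTERNS = [
    r"^(Flash|Leech)Get",
    r"^(My)?GetRight",
    r"(girafa\.com|gossamer\-threads\.com|grub\-client|Netcraft|Nutch)",
]
COMPILED = [re.compile(pattern, re.IGNORECASE) for pattern in PATTERNS]


def is_blocked(user_agent):
    """True if any pattern matches, mirroring an [OR] chain of RewriteCond lines."""
    return any(regex.search(user_agent) for regex in COMPILED)


# Hypothetical user-agent strings, for illustration only.
SAMPLES = [
    "FlashGet",
    "mygetright/6.3",
    "Mozilla/5.0 (compatible; grub-client-1.4)",
    "Mozilla/5.0 (compatible; FlashGet)",
    "Mozilla/5.0 (hypothetical browser)",
]

if __name__ == "__main__":
    for agent in SAMPLES:
        print("BLOCK" if is_blocked(agent) else "allow", agent)
```

Note the effect of the ^ anchor: a pattern that starts with ^ only catches agents whose string begins with that name, so the fourth sample slips through, while the unanchored grub\-client pattern catches the third sample even though the name appears mid-string.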
Getting Your Hosting Provider to Exclude Bad Bots for You
Alternatively, if you’re not prepared to dive into this battle yourself, you can try to persuade your hosting provider to do it for you at the httpd.conf level or possibly even at the network level. There are two main problems with this approach:
- It is very labor-intensive, and your web host will either know this or discover it quickly, and
- While it is easy to agree on bots that nobody likes, some bots might actually appeal to some other users in a shared hosting environment.
Therefore, your mileage may vary.
In my experience, the best way to persuade a web host to take the problem seriously is to take your observations from your server logs and quantify the extent of the resource waste for your web host. For example, if you have blips of 3x traffic due to bad bot activity, it can’t hurt to point out to your web host that this means only one third of their investment in server and bandwidth capacity is actually being used by their customers, while two thirds of it is being wasted by bots. Hearing that two thirds of an investment is being wasted is usually enough to get most people’s attention!
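The arithmetic behind that one-third, two-thirds split is simple enough to put in a few lines of Python. This is only a sketch: it assumes genuine traffic stays at its normal level during the spike, so that everything above the baseline is bot traffic, and the function name is made up for illustration.

```python
def bot_share(traffic_multiple):
    """Fraction of total traffic attributable to bots during a spike.

    Assumption: genuine traffic holds steady at its baseline, so a spike
    to `traffic_multiple` times normal is baseline plus bot traffic.
    """
    if traffic_multiple < 1:
        raise ValueError("traffic_multiple must be at least 1")
    return (traffic_multiple - 1) / traffic_multiple


if __name__ == "__main__":
    for multiple in (1, 2, 3, 5):
        share = bot_share(multiple)
        print(f"{multiple}x traffic: {share:.0%} bots, {1 - share:.0%} genuine")
```

Plugging in the multiple you actually observe in your own logs gives you a figure to put in front of your web host.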
Will the Bot Battle Ever End?
The problem of bots that don’t obey the robot exclusion protocol is unlikely to go away. For webmasters who take no steps to address the problem, their apparent traffic figures will continue to climb beyond all reason, their apparent conversions will continue to plummet (unless bots start buying things), and they will continue to have no real idea of how much living and breathing traffic their sites receive.
For webmasters who do take steps to address the problem, however, non-distorted performance metrics await — and with them, all the advantages that come with having a clue about what’s really going on with your site audience.
