Combating Referer Spam

Recently I noticed that my server stats page (generated using webalizer) listed two porn sites among the referers and showed incest as the top search term for my website. Looking into the logs, I found a ton of accesses from an IP belonging to some ISP (I think; the webpage didn't really explain what the company did). In any case, a bit of googling turned up the phenomenon of referer spam. Since these accesses were screwing up my stats and filling up my logs, something needed to be done.

The simple solution is to block requests that carry specific referers. Alternatively, block the IPs making the requests that are clearly spam related. To do this I used Apache's mod_rewrite engine. Essentially, mod_rewrite looks at an incoming request, checks whether some part of it (source IP, referer, destination path) matches a user-defined regular expression (1, 2, 3) and, if so, rewrites the requested URL. Very, very powerful! So I wanted to block any request that had a referer of http://www.123-123-1221121.com. This is a made-up referer, since I don't want the real scumbag's referer to be indexed by Google. To do this I added the following to my httpd.conf file:

<IfModule mod_rewrite.c>
  RewriteEngine   on
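  # Dots are escaped so they match a literal "." rather than any character
  RewriteCond     %{HTTP_REFERER}   ^http://www\.123-123-1221121\.com [NC,OR]
  # Unanchored, so a keyword anywhere in the referer matches
  RewriteCond     %{HTTP_REFERER}   (pattern1|pattern2|pattern3) [NC,OR]
  RewriteCond     %{REMOTE_ADDR}    ^38\.113\.204\.238$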
  RewriteRule     ^(.*)     http://%{REMOTE_ADDR}/      [L,E=dontlog:1]
</IfModule>
For the first pattern I specified two flags. The NC flag indicates that the matching should ignore case. The OR flag joins this condition to the next one with a logical OR (by default, consecutive RewriteCond lines are ANDed), so the URL is rewritten if any one of the conditions matches. The second regex shows how you can make a list of keywords which, if present in the referer, indicate a spammer. Finally, the actual rewrite rule substitutes the remote IP for the requested URL, essentially redirecting spammers to their own IP address. For this line the flags indicate that this is the last rewrite rule (L) and that the dontlog environment variable should be set (E=dontlog:1) so that these requests are not logged. The dontlog variable can really be called anything. You just need to ensure that the log file specification says not to log the request when the variable is set. So in my config file I have
CustomLog logs/access_log combined env=!dontlog
A good example of using this approach is described here. The problem with this approach is that I need to regularly check my logs for spam-related entries. Furthermore, as the number of abusers increases, the number of RewriteCond terms is going to increase, thus stressing the server. What would be nice is some sort of automated and distributed method to block referer spam, robots and so on.
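Checking the logs by hand gets tedious, so a small script can do the first pass. The following is only a rough sketch, assuming the log uses Apache's standard combined format; the names top_referers and own_host are made up for illustration, and request lines containing escaped quotes will simply be skipped.

```python
import re
from collections import Counter

# Matches the tail of a combined-format line:
#   "request" status bytes "referer" "user-agent"
LINE_RE = re.compile(
    r'"[^"]*" \d{3} \S+ "(?P<referer>[^"]*)" "[^"]*"\s*$'
)


def top_referers(lines, own_host, limit=10):
    """Count external referers in an iterable of access log lines.

    own_host is a hypothetical parameter: referers containing it are
    treated as internal links and ignored.
    """
    counts = Counter()
    for line in lines:
        match = LINE_RE.search(line)
        if not match:
            continue  # not combined format, or an unusual request line
        referer = match.group("referer")
        if referer in ("", "-") or own_host in referer:
            continue  # no referer, or a link from our own site
        counts[referer] += 1
    return counts.most_common(limit)
```

Something like top_referers(open("logs/access_log"), "www.myserver.com") would then list the most frequent outside referers, which still need to be judged by eye before they go into a RewriteCond.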

In any case, you can get my current set of RewriteConds here.
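Before adding new patterns to httpd.conf, it can help to try them offline. Here is a minimal sketch that mimics the OR-joined conditions in Python. The patterns are the same made-up placeholders used above, is_blocked is a hypothetical helper, and Python's re module is close to, but not identical to, the regex engine Apache uses, so treat it as a sanity check only.

```python
import re

# Placeholder patterns mirroring the RewriteCond lines; the [NC] flag
# corresponds to re.IGNORECASE.
REFERER_PATTERNS = [
    re.compile(r"^http://www\.123-123-1221121\.com", re.IGNORECASE),
    re.compile(r"(pattern1|pattern2|pattern3)", re.IGNORECASE),
]
ADDR_PATTERN = re.compile(r"^38\.113\.204\.238$")


def is_blocked(referer, remote_addr):
    """Return True if any condition matches, like conditions joined by [OR]."""
    if any(p.search(referer) for p in REFERER_PATTERNS):
        return True
    return bool(ADDR_PATTERN.search(remote_addr))
```

For example, is_blocked("http://example.org/", "10.0.0.1") should come back False, while any referer containing one of the keywords should come back True regardless of the IP.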

Some other discussions of this problem can be found here and here. There is also a useful tool for checking that the regexes to catch referer spam are correct, but at the moment it doesn't seem to be working. Instead, you can use this command line Python script to spoof User-Agent or Referer headers and test that your regexes are working. Usage is

python refcheck.py -h www.someserver.com -r http://www.somereferer.com \
        -u "A User Agent"
More info for this program is available here.
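
If you just want the general idea, the same kind of check can be sketched with Python's standard http.client module. To be clear, this is not the refcheck.py script itself; build_headers and check are names I made up for illustration. http.client does not follow redirects, so a blocked request should show up as a redirect status with a Location header pointing at your own IP.

```python
import http.client


def build_headers(referer=None, user_agent=None):
    """Build the spoofed request headers, leaving out any that are not given."""
    headers = {}
    if referer:
        headers["Referer"] = referer
    if user_agent:
        headers["User-Agent"] = user_agent
    return headers


def check(host, path="/", referer=None, user_agent=None, timeout=10):
    """Send one GET request and return (status, Location header or None)."""
    conn = http.client.HTTPConnection(host, timeout=timeout)
    try:
        conn.request("GET", path, headers=build_headers(referer, user_agent))
        resp = conn.getresponse()
        return resp.status, resp.getheader("Location")
    finally:
        conn.close()
```

Calling check("www.someserver.com", referer="http://www.somereferer.com", user_agent="A User Agent") sends a single request with those headers and reports what the server answered.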