Combating Referer Spam
The simple solution is to block requests that carry specific referers. Alternatively, block the IPs that are making the requests that are clearly spam related. To do this I used Apache's mod_rewrite engine. Essentially, mod_rewrite looks at an incoming request, checks whether some part of the request (source IP, referer, destination path) matches a user-defined regular expression (1, 2, 3) and, if so, rewrites the requested URL. Very, very powerful! So I wanted to block any request that had a referer of http://www.123-123-1221121.com. This is a made-up referer, since I don't want the real scumbag's referer to be indexed by Google. To do this I added the following to my httpd.conf file:
<IfModule mod_rewrite.c>
RewriteEngine on
RewriteCond %{HTTP_REFERER} ^http://www\.123-123-1221121\.com [NC,OR]
RewriteCond %{HTTP_REFERER} (pattern1|pattern2|pattern3) [NC,OR]
RewriteCond %{REMOTE_ADDR} ^38\.113\.204\.238$
RewriteRule ^(.*) http://%{REMOTE_ADDR}/ [L,E=dontlog:1]
</IfModule>
For the first pattern I specified two flags. The NC flag indicates that the matching should ignore case. The OR flag joins this condition to the next one with a logical OR, so the URL is rewritten if this pattern or any of the subsequent patterns matches; without it, all of the conditions would have to match. The second regex shows how you can make a list of keywords which, if present in the referer, indicate a spammer. Finally, the actual rewrite rule says that we should return the remote IP as the request, essentially redirecting them to their own IP address. For this line the flags indicate that this is the last rewrite rule (L) and that the dontlog environment variable should be set so that these requests are not logged. The dontlog variable can really be called anything. You just need to ensure that the log file specification says that, if the variable is set, the request is not logged.
So in my config file I have
CustomLog logs/access_log combined env=!dontlog
A good example of using this approach is described here. The problem with this approach is that I need to regularly check my logs to look for spam-related entries. Furthermore, as the number of abusers increases, the number of RewriteCond terms is going to increase, thus stressing the server. What would be nice is some sort of automated and distributed method to block referer spam, robots and so on.
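Until something automated exists, the log checking can at least be scripted. The following is only a rough sketch, assuming the access log uses Apache's combined format; the function name `count_referers` and the `own_host` parameter are made up for illustration and are not part of any existing tool.

```python
import re
from collections import Counter

# Combined log format:
#   host ident user [time] "request" status bytes "referer" "user-agent"
# This simple pattern does not handle escaped quotes inside a field.
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[[^\]]*\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) \S+ "(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)


def count_referers(lines, own_host):
    """Count external referers, skipping empty ones and our own site."""
    counts = Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if match is None:
            continue  # not a combined-format line
        referer = match.group("referer")
        if referer in ("", "-") or own_host in referer:
            continue
        counts[referer] += 1
    return counts
```

Calling `count_referers(open("logs/access_log"), "www.example.org").most_common(20)` would then list the twenty most frequent outside referers, which is usually where the spammers show up. The host name here is a placeholder.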
In any case, you can get the current set of RewriteConds that I use here.
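Before adding new conditions to a live server, it can help to try the patterns offline. The sketch below imitates the OR-ed RewriteCond chain in Python, using the made-up patterns from this post with the literal dots escaped. It assumes Python's `re` module behaves like Apache's regex engine for patterns this simple, which should hold here but is not guaranteed in general. `CONDITIONS` and `is_blocked` are hypothetical names.

```python
import re

# Each entry mirrors one RewriteCond: (request field, pattern, ignore case).
# These are the made-up example patterns, not a real blocklist.
CONDITIONS = [
    ("referer", r"^http://www\.123-123-1221121\.com", True),
    ("referer", r"(pattern1|pattern2|pattern3)", True),
    ("remote_addr", r"^38\.113\.204\.238$", False),
]


def is_blocked(referer="", remote_addr=""):
    """Return True if any condition matches (the conditions are OR-ed)."""
    fields = {"referer": referer, "remote_addr": remote_addr}
    for field, pattern, ignore_case in CONDITIONS:
        flags = re.IGNORECASE if ignore_case else 0
        # RewriteCond patterns are unanchored unless they use ^ or $,
        # so re.search is the closest equivalent.
        if re.search(pattern, fields[field], flags):
            return True
    return False
```

Feeding it a handful of known-good and known-bad referers from the logs is a quick way to catch a pattern that is too greedy.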
Some other discussions of this problem can be found here and here. There is also a useful tool to check that the regexes to catch referer spam are correct, but at the moment it doesn't seem to be working. So you can also use this command-line Python script to spoof User-Agent or Referer headers to test that your regexes are working. Usage is
python refcheck.py -h www.someserver.com -r http://www.somereferer.com \
    -u "A User Agent"
More info for this program is available here.
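To give an idea of what such a check involves, here is a minimal sketch using only Python's standard library. It is not the refcheck.py script itself, just an illustration of the same idea; `build_headers` and `fetch_status` are names I made up for it.

```python
import http.client


def build_headers(referer=None, user_agent=None):
    """Build the spoofed request headers, leaving out any that are not given."""
    headers = {}
    if referer:
        headers["Referer"] = referer
    if user_agent:
        headers["User-Agent"] = user_agent
    return headers


def fetch_status(host, path="/", referer=None, user_agent=None, timeout=10):
    """Send one GET with spoofed headers; return (status, Location header).

    http.client does not follow redirects, so a redirect issued by a
    rewrite rule is visible in the returned status and Location value.
    """
    conn = http.client.HTTPConnection(host, timeout=timeout)
    try:
        conn.request("GET", path, headers=build_headers(referer, user_agent))
        response = conn.getresponse()
        return response.status, response.getheader("Location")
    finally:
        conn.close()
```

With a rule like the one in this post, a call such as `fetch_status("www.someserver.com", referer="http://www.123-123-1221121.com")` should come back with a redirect status and a Location pointing at your own IP address, while a harmless referer should get the normal response.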