Written by Paul Bourke
April 2001
There are a number of software packages, both freeware and commercial, that will automatically copy all the pages and associated files from a remote WWW server. These are often called "offline browsers": a user who finds a site or group of pages they like makes a local copy for later exploration. There are legitimate reasons for wanting to do this, as well as some not so legitimate ones. The most commonly quoted reason relates to internet connectivity limitations: instead of staying online and browsing the site in "human time", the computer can quickly copy the site or group of pages, which can then be browsed at the user's leisure without being online.
There are some cases, however, where the content creator does not want this to happen. Some reasons are given below; they mostly apply to large sites, and most rest on the likelihood that only a small percentage of the downloaded files will ever be looked at.
Many people and organisations pay for traffic (or traffic level) on their internet connection. Most users of offline browsers leave them running without supervision; for sites with a large number of files this can result in significant download volumes.
Many WWW servers have limited bandwidth, and the last thing their operators want is for that bandwidth to be used for mindless copying. Bandwidth is often calculated and paid for on an average basis; large peaks can have a significant adverse effect on other users browsing at the same time.
Many sites are designed to be experienced online and may even be interactive. These sites are meaningless when accessed locally, especially if they rely on server-side interactivity.
Some sites provide online databases. While the data may be freely available online, it may be desirable to limit the ease with which users can make a copy of the whole database.
The following method, which relies on the Apache WWW server and its mod_rewrite module, is a straightforward and reliable way of stopping commonly used offline browsers from copying a site. The directives below, after being modified for your local site, should be placed in the .htaccess file in the directory they are intended to protect.
RewriteEngine on
# (testing purposes) RewriteCond %{HTTP_USER_AGENT} ^Mozilla [OR]
# Each condition matches the start of the client's User-Agent string.
RewriteCond %{HTTP_USER_AGENT} ^FAST\-WebCrawler [OR]
RewriteCond %{HTTP_USER_AGENT} ^ia_archiver [OR]
RewriteCond %{HTTP_USER_AGENT} ^Dart [OR]
RewriteCond %{HTTP_USER_AGENT} ^Pockey [OR]
RewriteCond %{HTTP_USER_AGENT} ^NetMechanic [OR]
RewriteCond %{HTTP_USER_AGENT} ^SuperBot [OR]
RewriteCond %{HTTP_USER_AGENT} ^QRVA [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebMiner [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebCopier [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebDownloader [OR]
RewriteCond %{HTTP_USER_AGENT} ^Web\ Downloader [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebMirror [OR]
RewriteCond %{HTTP_USER_AGENT} ^Offline [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebZIP [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebReaper [OR]
RewriteCond %{HTTP_USER_AGENT} ^Anarchie [OR]
RewriteCond %{HTTP_USER_AGENT} ^Mass\ Down [OR]
RewriteCond %{HTTP_USER_AGENT} ^Slurp [OR]
RewriteCond %{HTTP_USER_AGENT} ^BlackWidow [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebStripper [OR]
RewriteCond %{HTTP_USER_AGENT} ^Wget [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebHook [OR]
RewriteCond %{HTTP_USER_AGENT} ^Scooter [OR]
RewriteCond %{HTTP_USER_AGENT} ^Teleport
# Any client matched above is served the explanation page instead.
RewriteRule ^.*$ /pbourke/errors/robots.html [L]
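To get a feel for which clients the conditions in the listing catch, the matching can be mimicked outside Apache. The following is a minimal Python sketch, not part of the method itself. It assumes each rule is meant as a literal, case-sensitive match on the start of the User-Agent string, and the names BLOCKED_PREFIXES and is_blocked are hypothetical.

```python
import re

# Prefixes copied from the RewriteCond lines. Treating each one as a literal,
# case-sensitive prefix is an assumption about the intent of the rules.
BLOCKED_PREFIXES = [
    "FAST-WebCrawler", "ia_archiver", "Dart", "Pockey", "NetMechanic",
    "SuperBot", "QRVA", "WebMiner", "WebCopier", "WebDownloader",
    "Web Downloader", "WebMirror", "Offline", "WebZIP", "WebReaper",
    "Anarchie", "Mass Down", "Slurp", "BlackWidow", "WebStripper",
    "Wget", "WebHook", "Scooter", "Teleport",
]

# A single anchored alternation plays the role of the chain of [OR] conditions.
BLOCKED_RE = re.compile(
    "^(?:" + "|".join(re.escape(prefix) for prefix in BLOCKED_PREFIXES) + ")"
)


def is_blocked(user_agent: str) -> bool:
    """Return True if the User-Agent starts with one of the listed prefixes."""
    return BLOCKED_RE.search(user_agent) is not None
```

Under these assumptions a client announcing itself as "Wget/1.5.3" is caught, while one whose string merely contains "Wget" somewhere after the start, or spells it "wget", is not. That is worth keeping in mind when adding names to the list.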
Note
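The RewriteCond line that is commented out with a "#" is intended for testing purposes, given that I am using Netscape, which reports a USER_AGENT beginning with "Mozilla". Remove the leading "# (testing purposes)" temporarily to make sure the filtering works, then restore it, since leaving that line active would also block ordinary browsers.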
The final RewriteRule line should point to a document on your local site that informs the user why they can't copy your site. At the time of writing my page looked like this: robots.html.
The robots.html file must of course be OUTSIDE the protected directory, or the blocked client can never receive it.
This method does rely on these offline browsers reporting their USER_AGENT correctly.
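The testing step and the USER_AGENT caveat in the notes above can both be exercised from a script rather than by editing the .htaccess file. The following is a minimal Python sketch, not part of the original method: it requests a page while claiming an arbitrary User-Agent. The function names and the example URL are hypothetical.

```python
import urllib.error
import urllib.request


def build_request(url: str, user_agent: str) -> urllib.request.Request:
    """Build a GET request that announces itself with the given User-Agent."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch_as(url: str, user_agent: str, timeout: float = 10.0):
    """Fetch url while claiming to be user_agent; return (status, body)."""
    try:
        request = build_request(url, user_agent)
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status, response.read()
    except urllib.error.HTTPError as err:
        # Error responses (403, 404, ...) still carry a status code and a body.
        return err.code, err.read()


# Example (hypothetical URL; substitute a page inside your protected directory):
#   url = "http://www.example.org/protected/index.html"
#   _, as_copier = fetch_as(url, "Wget/1.5.3")
#   _, as_browser = fetch_as(url, "Mozilla/4.7 [en]")
#   print(as_copier != as_browser)
```

If the filter is active, the body fetched under a blocked name should be the explanation page rather than the page requested. The same script also shows the limitation: a copier that announces itself as an ordinary browser passes straight through.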
And finally, this document is not intended to imply that these offline browsers are inherently undesirable. There are, however, circumstances where their behavior is undesirable.