Internet Security - Blocking Bots from harvesting your website using the 'user agent' to selectively filter unwanted information collection automation.

Introduction

Not all bots are made alike. Bots, or robots, are automated programs that search the web for information. No humans are involved, and these programs can process and consume large amounts of data. Typically bots are used to:

  • Gather webpage information for indexing and processing by search engines
  • Scan web pages for email addresses, used by spammers
  • Scan web sites for open-source and free software/plugins with known vulnerabilities, used by hackers.
  • Scan web site content for analytics, used by SEO companies.

Of the four cases above, only one is of interest to you and your website; the rest should be avoided completely. But not all search engines are created alike either.

Information gathering by search engines should be respectful and of benefit to all parties: the information supplier, the information indexer and the people searching for information. From the perspective of a website owner, the questions should be: how often does the bot visit, how much information does it consume and, most importantly, how many referrals do I get from this 'search engine'?

Of the hundreds of (bona fide) search engines, 95% never send any traffic. The search engines with a healthy consumption/referral (take/give) index I can count on one hand.

Although the worst automation tries to hide behind 'standard user browser' user agents, a lot of unnecessary traffic can be avoided by blocking access to your sites with user-agent filtering.

Bot UA Blocking Botlist

Spider Wasp

First we need to store and retrieve the user agent strings we want to scan for. For this purpose we'll use an editable XML file, which is read when the website starts or whenever the XML file is changed, and kept in memory as a string array for fast lookups. An up-to-date BotList XML is available here.

<?xml version="1.0" encoding="utf-8" ?>
<root>
  <bot>Add Catalog</bot>
  <bot>admantx</bot>
  <bot>Baiduspider</bot>
  <bot>bigdatacorp</bot>
  <bot>CATExplorador</bot>
  <bot>coccoc</bot>
  <bot>EMail Exractor</bot>
  <bot>Ezooms</bot>
  <bot>GrapeshotCrawler</bot>
  <bot>IstellaBot</bot>
  <bot>loadedweb</bot>
  <bot>Mail.RU</bot>
  <bot>majestic</bot>
........

Bot UA Filter Class

using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;
using System.Runtime.Caching;

namespace JeroenSteeman
{
    // Public so the class can be used from projects that reference this library.
    public class BotChecker
    {
        string[] storedBots;
        MyCache objCache = new MyCache();

        // Returns true when the user agent contains any keyword from BotList.xml.
        public bool DoesBotExist(string User_Agent)
        {
            // Requests without a User-Agent header have nothing to match against.
            if (string.IsNullOrEmpty(User_Agent))
            {
                return false;
            }

            if (!objCache.MyCacheContains("bots"))
            {
                // Not cached (first call, expired, or BotList.xml changed): load from disk.
                storedBots = GetStringArray(AppDomain.CurrentDomain.BaseDirectory + "BotList.xml");
                List<String> lstFiles = new List<string>();
                lstFiles.Add(AppDomain.CurrentDomain.BaseDirectory + "BotList.xml");
                objCache.AddToMyCache("bots", storedBots, MyCachePriority.Default, lstFiles);
            }
            else
            {
                storedBots = (string[])objCache.GetMyCachedItem("bots");
            }

            // Case-sensitive substring match against every keyword in the list.
            foreach (string bot in storedBots)
            {
                if (User_Agent.Contains(bot))
                {
                    return true;
                }
            }
            return false;
        }

        public string[] GetStringArray(string path)
        {
            var doc = XDocument.Load(path);
            // Select all bot entries
            var services = from service in doc.Descendants("bot")
                           select service.Value;
            return services.ToArray();
        }
    }

    public enum MyCachePriority
    {
        Default,
        NotRemovable
    }

    public class MyCache
    {
        // Get a reference to default MemoryCache instance.
        private static ObjectCache cache = MemoryCache.Default;
        private CacheItemPolicy policy = null;
        private CacheEntryRemovedCallback callback = null;

        public void AddToMyCache(String CacheKeyName, Object CacheItem, MyCachePriority MyCacheItemPriority, List<String> FilePath)
        {
            policy = new CacheItemPolicy();
            policy.Priority = (MyCacheItemPriority == MyCachePriority.Default) ? CacheItemPriority.Default : CacheItemPriority.NotRemovable;
            policy.AbsoluteExpiration = DateTimeOffset.Now.AddDays(1);
            policy.RemovedCallback = callback; // not implemented
            // Evict the entry as soon as one of the monitored files changes.
            policy.ChangeMonitors.Add(new HostFileChangeMonitor(FilePath));

            // Add inside cache
            cache.Set(CacheKeyName, CacheItem, policy);
        }

        public Object GetMyCachedItem(String CacheKeyName)
        {
            return cache[CacheKeyName];
        }

        public void RemoveMyCachedItem(String CacheKeyName)
        {
            if (cache.Contains(CacheKeyName))
            {
                cache.Remove(CacheKeyName);
            }
        }

        public Boolean MyCacheContains(String CacheKeyName)
        {
            return cache.Contains(CacheKeyName);
        }
    }
}

 

How the Botlist works

Up to ASP.net 3.5, the in-memory application cache could be used anywhere and anytime in your web process, but was not meant for WPF or desktop application development. Since .net 4.0 this has changed with the inclusion of System.Runtime.Caching, which can be used from any application type.

Make sure to add System.Runtime.Caching as a reference to your project if you're using VS2010 (like I do here).

How it works

BotChecker is set up as a class library (not complete in this example). It can be used in any application: websites (Forms and WCF), WPF and desktop applications. Its purpose here is to check whether a list of bots is already cached; if not, the list is loaded, otherwise the cached string array is used to compare against the inbound user agent strings.

Using an editable XML file as source, LINQ is used to extract the keywords used to check against the incoming UA strings from website requests.
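The extraction and matching steps can also be sketched in isolation. The following is a minimal sketch, not the BotChecker code itself: the class name BotListSketch and its two methods are hypothetical, it parses the XML from a string instead of a file, and it assumes you would want to skip empty entries and ignore case when matching (the class above matches case-sensitively).

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

// Hypothetical helper for illustration; not part of the BotChecker class.
public static class BotListSketch
{
    // Extract the text of every <bot> element, trimming whitespace and
    // dropping empty entries (an empty keyword would match every request).
    public static string[] ParseBots(string xml)
    {
        return XDocument.Parse(xml)
            .Descendants("bot")
            .Select(e => e.Value.Trim())
            .Where(v => v.Length > 0)
            .ToArray();
    }

    // True when any keyword occurs in the user agent, ignoring case.
    public static bool IsListed(string userAgent, string[] bots)
    {
        if (string.IsNullOrEmpty(userAgent))
        {
            return false;
        }
        return bots.Any(b => userAgent.IndexOf(b, StringComparison.OrdinalIgnoreCase) >= 0);
    }
}
```

With a list parsed from `<root><bot>Baiduspider</bot><bot>majestic</bot></root>`, a user agent such as "Mozilla/5.0 (compatible; baiduspider/2.0)" would be reported as listed even though the casing differs.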

The main method is DoesBotExist, which takes the user agent string from the incoming HTTP request as its argument.

False = No entry in the cached array of trigger strings occurs anywhere in the user agent string.

True = An entry in the botlist.xml was found in the incoming UA string of the request. Depending on how diverse and distributed the list is, you could consider a host of responses.

To process and dump these requests fast, just return an HTTP status code and close the connection.

  • 401 - Unauthorized (but this can be confusing, as it suggests an alternative entry point exists)
  • 402 - Payment Required. My personal favorite. If you want to (ab)use my information for your own profit, then payment is required.
  • 410 - Gone. Maybe they will believe it and not return. In my experience that is anything but the case.
  • 418 - "I'm a teapot". I doubt this will have any effect, as most low-to-no cost hosters, wannabe cloud operators and their websites are just that! They are bound to appear on your radar as zombies in the near future, with far worse intent in mind.
  • Alternative - Redirect (301) to an 'unwelcome' page telling the client that you do not appreciate them connecting and extracting information from your resource.

How to use the BotChecker

Make a class in your project or website and insert the BotChecker code. Then in your operational environment (Global.asax for ASP websites) insert the following to activate the fast cached bot lookup method.

// At any place, like Application_BeginRequest in the Global.asax file for web sites, or InitializeComponent for applications
BotChecker BC = new BotChecker();

 

To test the user agent string against your list (for websites):

if (BC.DoesBotExist(Request.UserAgent))
{
    // Redirect or set header and close connection
}

The above snippet in the Global.asax will allow you to intercept and deal with requests before they hit your site. This is an inline process.
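As an illustration of what "set header and close connection" could look like, here is a minimal sketch for a classic ASP.NET Global.asax code-behind. It assumes the BotChecker class from this article is compiled into the same project, and the choice of 402 is just one of the status codes listed earlier. The class name Global is the usual default but is an assumption about your project.

```csharp
using System;
using System.Web;
using JeroenSteeman;

// Sketch of a Global.asax code-behind; requires a reference to System.Web.
public class Global : HttpApplication
{
    protected void Application_BeginRequest(object sender, EventArgs e)
    {
        // The User-Agent header can be missing, so fall back to an empty string.
        string userAgent = Request.UserAgent ?? string.Empty;

        BotChecker checker = new BotChecker();
        if (checker.DoesBotExist(userAgent))
        {
            // Answer with a bare status code and no body.
            Response.StatusCode = 402;
            Response.SuppressContent = true;

            // Skip the rest of the pipeline and jump straight to EndRequest.
            CompleteRequest();
        }
    }
}
```

CompleteRequest is used here instead of Response.End because it does not abort the thread; either should stop the page from being rendered for the blocked request.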

If all you want to do is log their activity, a better solution is to use a separate thread, so the request is not delayed while you do the check.

Implement the routine as a 'fire and forget' bot logger.

Don't waste time waiting for the logger to check and potentially log bot access. Just call the routine with the complete user agent string and pass control back to the web application. It will run in the background and do its work.

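// Include this namespace at the top of the file
using System.Threading;

// In Global.asax 'Application_BeginRequest'
BotChecker BC = new BotChecker();
// Read the user agent on the request thread; fall back to "" when the header is missing.
object MyParameters = Request.UserAgent ?? string.Empty;
// ParameterizedThreadStart needs a void method taking an object, so wrap DoesBotExist.
Thread MyThread = new Thread(new ParameterizedThreadStart(delegate(object ua)
{
    if (BC.DoesBotExist((string)ua))
    {
        // Write the result to a log file for later analysis.
    }
}));
MyThread.IsBackground = true; // do not keep the process alive for logging
MyThread.Start(MyParameters);
// You can pass in more parameters, like IP address for registration purposes.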

User Agent Tracking Considerations

Fact: bots have become smart. Bots embraced cloud computing long before mortals knew it even existed, and they hide in plain sight, trying to appear like normal human visitors.

With this additional noise and bots hiding behind 'human browser' user agents, black and white becomes gray. Blocking bad bots by their user agent string, although presently still rather effective, will become less so over time as they all try to look and act like anonymous web users.

A better way to track and trace activity (from a receiver's perspective) is to use behavior pattern technology. Bots can only go so far to hide what they are and what they are after, so it is rather easy to tell them apart from normal real user activity on port 80. In future articles I will go into how to detect these 'wolf bots' in 'anonymous user sheep's clothing'. They are very resourceful in their methods, but all fall short of many things that typically depict us as 'human users online'.

Fully functional VB.net 'SpiderWasp' implementation as an HTTP 402 (Payment Required) feeder for detecting and stopping known unwanted spiders from crawling your website. For PHP and C# versions - contact me.