Site Performance and Misbehaving Bots

Submitted by John on Thu, 01/16/2020 - 23:05
Image: A close-up image of a small, purple, plastic toy of a transformer robot. I just went digging through my kid's toy chest and snapped this picture on my kitchen counter. It's a bot, this article is about bots. Get it?

I’ve spent a lot of time optimizing Drupal to perform the best it can. If you search for Drupal performance, you will find a lot of articles that recommend the following, in more or fewer words:

  • Enable caching
  • Views cache
  • Turn on CSS/JS aggregation
  • Keep modules up to date
  • Use a CDN
  • Image Optimization
  • Clean up broken links to minimize 404s

These are all great things to implement to improve your site performance. I even posted an article on forumone.com in 2015 that basically covers this. But when working on a complex site serving an enterprise-level client, you might need to take some additional measures. To identify what these measures might be, you will want to review your site logs as well as any profiling/analytics data. I enjoy New Relic, which comes with a Pantheon subscription.
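As a starting point for that log review, here is a minimal sketch that tallies requests per user agent. It assumes access log lines that start like the common "combined" format (request, status, bytes, referer, then user agent); the function name tally_user_agents() and the sample lines are made up for illustration, and your host's log format may differ.

```php
<?php
/**
 * Tally requests per user agent from access log lines.
 *
 * Assumption: each line starts like the "combined" log format, so the
 * user agent is the third double-quoted field. tally_user_agents() is a
 * hypothetical helper name, not a Drupal or Pantheon API.
 */
function tally_user_agents(array $lines) {
  $pattern = '/^\S+ \S+ \S+ \[[^\]]+\] "[^"]*" \d+ \S+ "[^"]*" "([^"]*)"/';
  $counts = [];
  foreach ($lines as $line) {
    // Skip lines that do not look like the assumed format.
    if (preg_match($pattern, $line, $matches)) {
      $agent = $matches[1];
      $counts[$agent] = isset($counts[$agent]) ? $counts[$agent] + 1 : 1;
    }
  }
  // Busiest user agents first.
  arsort($counts);
  return $counts;
}

// Made-up sample lines, for illustration only.
$sample = [
  '203.0.113.5 - - [16/Jan/2020:23:05:00 +0000] "GET /news?f[0]=topic:1 HTTP/1.1" 200 512 "-" "ExampleBot/1.0"',
  '203.0.113.5 - - [16/Jan/2020:23:05:01 +0000] "GET /news?f[0]=topic:2 HTTP/1.1" 200 498 "-" "ExampleBot/1.0"',
  '198.51.100.7 - - [16/Jan/2020:23:05:02 +0000] "GET /news HTTP/1.1" 200 2048 "-" "Mozilla/5.0 (X11; Linux x86_64)"',
];

foreach (tally_user_agents($sample) as $agent => $count) {
  echo $count . "\t" . $agent . "\n";
}
```

A user agent that dominates the tally while mostly requesting faceted URLs is a good candidate for the kind of handling described below.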


Dealing with Misbehaving Bots

Google will start treating the “nofollow” link attribute as a hint rather than a directive for crawling in March, and a lot of companies have already followed Google’s lead and programmed their own crawlers not to respect it. The usual advice is to set directives in your robots.txt instead. One problem here arises if your URLs follow a logical “content-type-listing-page/individual-content-page” pattern. A directive in robots.txt that disallows everything under that listing page via a wildcard, “content-type-listing-page/*”, should prevent bots from crawling your faceted search page, but it will also prevent them from crawling the actual content under that path, which you do want crawled. To specifically prevent bots from crawling facet links, you could instead add these entries to robots.txt:

Disallow: /*?f[*
Disallow: /*&f[*
Disallow: /*?f%5B*
Disallow: /*&f%5B*

These entries specifically target the facet URL parameter pattern you see in a typical Drupal search. Even with this, you may still see some bots that don’t respect robots.txt, and we will get into how to handle those at the end of this post.
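To see why the facet-specific entries are safer than a blanket wildcard, it can help to try the patterns against a few URLs. The sketch below approximates Google-style robots.txt wildcard matching (“*” matches any run of characters, a trailing “$” anchors the end). The function name robots_pattern_matches() and the “/news” paths are hypothetical, and individual crawlers may interpret patterns differently.

```php
<?php
/**
 * Rough approximation of robots.txt wildcard matching.
 *
 * Hypothetical helper for experimenting with Disallow patterns; it is
 * not part of Drupal and does not reproduce any one crawler exactly.
 */
function robots_pattern_matches($pattern, $path) {
  // Escape everything, then turn the escaped "*" back into a wildcard.
  $regex = preg_quote($pattern, '#');
  $regex = str_replace('\*', '.*', $regex);
  // A trailing "$" in the pattern anchors the match at the end of the path.
  if (substr($regex, -2) === '\$') {
    $regex = substr($regex, 0, -2) . '$';
  }
  // Patterns always match from the start of the path.
  return preg_match('#^' . $regex . '#', $path) === 1;
}

// A blanket wildcard under the listing page also catches real content.
var_dump(robots_pattern_matches('/news/*', '/news/my-article'));

// The facet entries catch facet links but leave content pages alone.
var_dump(robots_pattern_matches('/*?f[*', '/news?f[0]=topic:1'));
var_dump(robots_pattern_matches('/*&f[*', '/news?keys=drupal&f[0]=topic:1'));
var_dump(robots_pattern_matches('/*?f[*', '/news/my-article'));
```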

It’s not just crawlers, though. A large enterprise might have some cruft lying around, such as old applications that continue to make requests to your domain. In one particular case, we identified a lot of POST requests hitting the path “/autodiscover/autodiscover.xml”. These were coming from Microsoft Office applications, mostly Outlook, but also other applications like Word and PowerPoint (for some reason?...). They are part of the autodiscovery process for configuring these clients against a Microsoft Exchange server. Problem is, the organization shut down its Exchange server many years ago and has since migrated to Exchange Online. Since these were POST requests, they couldn’t be cached, which was dragging down the overall performance of the site due to the time it took to generate an uncached 404. So to address this, we just detect these types of requests early in the bootstrap process and issue a 403 instead. We were also seeing some requests coming in from McAfee products looking for updates at “/EPO5-UNC$/SiteStat.xml”, which get the same treatment.

Here is some code we came up with to address bad bots, errant autodiscovers and more. I added this to the top of Drupal's settings.php file.

// Send a bare 403 and stop before Drupal does any real work.
function settings_quick_forbidden() {
  header('HTTP/1.1 403 Forbidden');
  exit();
}

// Check to see if this is a bot; they were crawling our facets quite egregiously.
if (isset($_SERVER['HTTP_USER_AGENT'], $_SERVER['REQUEST_URI'])
  && preg_match('/bot|crawl|slurp|spider|ahref|mediapartners|Mb2345Browser|LieBaoFast|HUAWEIFRD|UCBrowser|zh-CN|MicroMessenger|zh_CN|Kinza/i', $_SERVER['HTTP_USER_AGENT'])) {
  // If we see something in the request that looks like a facet, 403 this for the bots.
  foreach (['?f[', '&f[', '?f%5B', '&f%5B'] as $facet_marker) {
    if (strpos($_SERVER['REQUEST_URI'], $facet_marker) !== FALSE) {
      settings_quick_forbidden();
    }
  }
}

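// Autodiscover requests coming from Outlook/MS Office.
// The isset() checks avoid notices when REQUEST_URI is missing (e.g. on the command line).
if (isset($_SERVER['REQUEST_URI']) && strtolower($_SERVER['REQUEST_URI']) == '/autodiscover/autodiscover.xml') {
  settings_quick_forbidden();
}

// McAfee products hitting our domain for updates.
if (isset($_SERVER['REQUEST_URI']) && $_SERVER['REQUEST_URI'] == '/EPO5-UNC$/SiteStat.xml') {
  settings_quick_forbidden();
}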
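
One way to make these rules easier to test is to separate the decision from the side effect. The sketch below is an assumption about how that could look, not the code we actually ran: the function name should_quick_forbid() is hypothetical, and the user-agent pattern is shortened for illustration. It returns TRUE or FALSE instead of sending headers, so it can be exercised from the command line.

```php
<?php
/**
 * Decide whether a request should get the quick 403.
 *
 * Hypothetical helper: takes the request URI and user agent as plain
 * strings so the rules can be checked without a real HTTP request.
 */
function should_quick_forbid($uri, $user_agent = '') {
  // Known junk paths: Outlook/MS Office autodiscover and McAfee updates.
  if (strtolower($uri) === '/autodiscover/autodiscover.xml'
    || $uri === '/EPO5-UNC$/SiteStat.xml') {
    return TRUE;
  }

  // Shortened pattern for illustration; the full list is in the code above.
  if (!preg_match('/bot|crawl|slurp|spider/i', $user_agent)) {
    return FALSE;
  }

  // Bots only get blocked when the URI looks like a facet link.
  foreach (['?f[', '&f[', '?f%5B', '&f%5B'] as $facet_marker) {
    if (strpos($uri, $facet_marker) !== FALSE) {
      return TRUE;
    }
  }
  return FALSE;
}

// In settings.php this could be wired up roughly like so (skipped on the CLI).
if (PHP_SAPI !== 'cli' && isset($_SERVER['REQUEST_URI'])) {
  $agent = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';
  if (should_quick_forbid($_SERVER['REQUEST_URI'], $agent)) {
    header('HTTP/1.1 403 Forbidden');
    exit();
  }
}
```

Keeping the decision in a plain function means a new bad path or user agent can be checked against a handful of sample requests before it goes anywhere near production.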