Nutch - The open source web crawler

Submitted by John on Fri, 11/02/2018 - 13:22

Apache Nutch is an open source, highly extensible web crawler I sometimes use for various purposes. These include:

Cache pre-warming before a big launch

Have a major site launch coming up that you know will see a lot of traffic? Releasing a big Drupal site to the public for the first time can be a nail-biting experience. Drupal performs great when users are hitting the page cache, but on a brand-new launch most pages have never been requested, so they aren't cached yet. Apache Nutch is a great way to pre-warm those caches. Just set it to crawl a few iterations, and it should hit most of your publicly accessible pages.
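
To make "a few iterations" a bit more concrete, here is a minimal Python sketch of a round-based crawl. This is not Nutch code or its API, just an illustration of the idea under some assumptions: the function name crawl_rounds, the fetch_links callback, and the little in-memory SITE map are all hypothetical. In a real pre-warm, fetch_links would make an HTTP request to the page (which is what populates the cache) and return the links found on it.

```python
from typing import Callable, Iterable, List


def crawl_rounds(
    seeds: Iterable[str],
    fetch_links: Callable[[str], List[str]],
    rounds: int,
) -> List[str]:
    """Crawl breadth-first for a fixed number of rounds.

    Each round fetches every URL discovered in the previous round.
    Returns the URLs in the order they were fetched.
    """
    frontier = list(dict.fromkeys(seeds))  # de-duplicate seeds, keep order
    seen = set(frontier)
    fetched = []
    for _ in range(rounds):
        next_frontier = []
        for url in frontier:
            # Requesting the page is the step that would warm the cache.
            links = fetch_links(url)
            fetched.append(url)
            for link in links:
                if link not in seen:
                    seen.add(link)
                    next_frontier.append(link)
        frontier = next_frontier
        if not frontier:
            break  # nothing new was discovered, so stop early
    return fetched


# Hypothetical site structure standing in for real HTTP fetches.
SITE = {
    "/": ["/about", "/blog"],
    "/about": ["/"],
    "/blog": ["/blog/post-1", "/blog/post-2"],
    "/blog/post-1": ["/blog"],
    "/blog/post-2": ["/blog/post-1"],
}


def fake_fetch(url: str) -> List[str]:
    """Return the links on a page of the hypothetical site."""
    return SITE.get(url, [])


if __name__ == "__main__":
    for n in (1, 2, 3):
        print(n, crawl_rounds(["/"], fake_fetch, n))
```

Each extra round reaches pages one link further from the seed, which is why a handful of rounds tends to cover most of a site whose pages are reachable from the home page.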

Crawling a site behind a VPN

Sometimes you need to crawl a site that sits behind a VPN. Maybe you are assessing the structure of a protected legacy site while planning its replacement. In that case you can't use some of the existing services like Screaming Frog. You will need your own crawler that can use your VPN connection, and Apache Nutch is a good tool for this as well.


And now comes the real reason for this blog post. I am currently testing Apache Nutch and how it handles PDF files, so here is an example: dummy.pdf.