Last night, I received a notice that one of my projects was at 70% of its monthly utilization. A bit concerning, as we weren't even 70% of the way through the month, and I had already deployed my Facet Bot Blocker module to this site. I had not, however, deployed Bot Blocker itself, so of course that's what I went to do. But upon viewing the access logs, I'm seeing requests from this user agent string, over and over again:
Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/134.0.0.0 Safari/537.36
So, a relatively new Chrome version (139 is the latest). This is unfortunately a big hit to what had been a very successful tactic for blocking this traffic. I think I can still make a case for blocking Chrome 134 and below, but it's a harder case to make now that it falls within the range of versions that slower-to-update users may legitimately be running. What is also very odd is that it's this exact user agent string I'm seeing hit the site (and others). So you could, in theory, take the whole string, put it into Bot Blocker's substring feature, and just block that. How many real people could possibly be browsing on Catalina with Chrome 134?
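To illustrate the idea, here's a minimal sketch of substring-based user agent blocking. This is not the Bot Blocker module's actual code; the list name and the check are illustrative assumptions, with the exact string from the access logs as the blocked fragment.

```python
# Minimal sketch of substring-based user agent blocking (illustrative only;
# not the actual Bot Blocker implementation). The fragment below is the
# exact user agent string showing up in the access logs.
BLOCKED_UA_SUBSTRINGS = [
    "(Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/134.0.0.0 Safari/537.36",
]

def is_blocked(user_agent: str) -> bool:
    """Return True if the request's user agent contains a blocked substring."""
    return any(fragment in user_agent for fragment in BLOCKED_UA_SUBSTRINGS)

# Example: an incoming request with the suspicious user agent would be rejected.
ua = ("Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 "
      "(KHTML, like Gecko) Chrome/134.0.0.0 Safari/537.36")
if is_blocked(ua):
    print("403 Forbidden")  # in a real handler, return a 403 response here
```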
It may also be time to dive into adding other features to the Bot Blocker module. If whoever is running these bots wants to be this persistent, then let's make them work for it. But this is an inevitability I unfortunately saw coming when developing these bot blocking tools: we find a tactic that works, they fix it, and they move on. I have other ideas for ways to block traffic, but those may eventually be worked around as well. The trend I am seeing is that all of these IPs appear to be coming from China, so it might be time to start looking into Geo IP rate limiting.
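As a rough sketch of what Geo IP rate limiting could look like, here's an illustrative example using the geoip2 Python library with a local GeoLite2 country database. The per-country limit and the in-memory counter are assumptions for demonstration, not anything from Bot Blocker.

```python
# Rough sketch of GeoIP-based rate limiting. Assumes the geoip2 library is
# installed and a GeoLite2-Country.mmdb file is available locally. The limits
# and the in-memory counter are placeholders, not production code.
import time
from collections import defaultdict

import geoip2.database
import geoip2.errors

reader = geoip2.database.Reader("GeoLite2-Country.mmdb")

# Requests allowed per country per minute; countries not listed are not limited here.
PER_MINUTE_LIMITS = {"CN": 30}

window_counts = defaultdict(int)  # (country, minute) -> request count

def allow_request(ip: str) -> bool:
    """Return False once an IP's country exceeds its per-minute budget."""
    try:
        country = reader.country(ip).country.iso_code
    except geoip2.errors.AddressNotFoundError:
        return True  # unknown IPs are not rate limited in this sketch
    limit = PER_MINUTE_LIMITS.get(country)
    if limit is None:
        return True
    key = (country, int(time.time() // 60))
    window_counts[key] += 1
    return window_counts[key] <= limit
```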
One last thing that I will say again: people have been attributing this rise in bot traffic to AI/LLM training, but I'm not convinced. The major players in AI know how to configure their crawlers to respect things like robots.txt; I'm not having a problem with OpenAI or Claude, whose crawlers identify themselves in their user agents. And the impact of this traffic is running up hosting costs and wasting the time and energy of development teams. So I don't discount this being an attack on US infrastructure meant to waste the resources of the government and private companies. A death by a hundred million cuts.