Yesterday I started blocking crawlers via robots.txt. I was forced to because it started getting out of hand. I’ll observe traffic for a few days and will block them on level of HTTP server or fail2ban/iptables if it turns out that they don’t respect robots.txt, contrary to the claims of the biggest offenders.
This is mostly low-profile site. Me, my family and maybe some of my friends are doing at most 500-600 requests a day, a little more if I’m tinkering with server.
There are 2 main contributors to the increased traffic: MJ12 and SemRush. Above is part of goaccess’ report from last 14 days. It is clear that these 2 bots alone do over 11000 requests a day, which is insane for a site known by maybe 10 people on Earth. It’s nowhere near the capacity of the server, but it translates to 2 GB of transferred data every month. Wasted 2 GB, because those bots give me nothing. I don’t even think I’m serving that many pages in total, including some dynamically generated.
Cherry on top: I’m not the only one who hates these bots. Wikipedia hates MJ12 especially as well. And others too.
After few days it seems that robots.txt trick worked as things calmed down. I had to add robots.txt to every subdomain though, some disallowing all bots and some merely delaying requests (via non-standard Crawl-Delay directive). Still, there are more requests than I’d expect, but I have to evaluate those before saying anything else.