LLM scrapers are even hungry for Debian’s continuous integration (CI) data. Due to the ongoing abuse of the open web by LLM scrapers hammering its web server resources, the Debian CI infrastructure is restricting its publicly accessible data.
Paul Gevers, on behalf of the Debian CI team, laid out the steps they needed to take in order to survive all of the scraper traffic to ci.debian.net. First of all, the site is no longer publicly browsable unless you are an authenticated user. They have had to gate this information to help fend off the bot/scraper traffic, though direct links to test log files are still permitted for convenience.
The other change is a fail2ban-based firewall to address abusive traffic patterns. The rules had to be adjusted after some legitimate Debian contributors were initially blocked from the Debian CI portal. The team believes it has now struck a good balance: the fail2ban firewall keeps the LLM scrapers away without accidentally triggering on real users.
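The status update does not publish the team's actual fail2ban rules, but a typical setup of this kind pairs a jail with a custom filter to ban IPs that make an unusually high number of requests in a short window. The jail name, filter regex, log path, and thresholds below are all illustrative assumptions, not Debian's real configuration:

```ini
# /etc/fail2ban/jail.local -- hypothetical jail for rate-limiting scrapers
# (names and thresholds are assumptions for illustration only)
[ci-scraper-abuse]
enabled  = true
port     = http,https
filter   = ci-scraper-abuse
logpath  = /var/log/apache2/access.log
# Ban any IP making more than 300 requests in 60 seconds, for one hour:
maxretry = 300
findtime = 60
bantime  = 3600

# /etc/fail2ban/filter.d/ci-scraper-abuse.conf
# [Definition]
# failregex = ^<HOST> -.*"(GET|POST|HEAD).*HTTP.*"
# ignoreregex =
```

Tuning `maxretry` and `findtime` is exactly the balancing act described above: thresholds strict enough to catch high-volume scrapers, yet loose enough that a contributor clicking through many test logs is not banned.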
More details on these recent headaches for the Debian CI team, caused by LLM scrapers going wild on the open web, can be found via this team status update.
