Web crawling bots that “work” for AI systems are causing no small amount of trouble for open source developers and website owners alike. The former suffer most because, by the nature of their work, they tend to share more content on their pages and repositories, and the damage is compounded by the fact that they also have fewer infrastructure resources than companies dedicated to developing proprietary products and services.
That is why more and more open source developers are fed up with these bots which, in search of content for training AI systems, hammer their pages and sometimes bring them down with the sheer volume of requests they make. They also want to protect their work.
Quite a few are taking action, according to Ars Technica, and not just by indicating in their robots.txt files which pages they do not want crawled, because that turns out to be useless. Those files were created for search engine bots and usually list the pages their owners do not want indexed, but the bots that crawl the web to collect content for AI tend to ignore the instructions they contain. They simply take everything.
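For reference, a robots.txt file is just a plain-text set of directives at the root of a site. The sketch below uses illustrative paths and bot names; compliant search engine crawlers honor these rules, while many AI crawlers reportedly do not.

```
# Illustrative robots.txt (paths and bot names are examples)
User-agent: *
Disallow: /private/      # keep all compliant crawlers out of this section

User-agent: GPTBot
Disallow: /              # ask this particular crawler to stay away entirely
```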
Open source developers against AI crawlers
One of the free software developers who has taken action is Xe Iaso. Last January, Iaso described how Amazonbot hammered the website of a git server Iaso runs, over and over again, to the point of knocking it offline as if it were the target of a distributed denial-of-service (DDoS) attack. Servers of this kind usually host a multitude of projects under open source licenses, so that anyone who wishes can download the code or contribute to it.
The bot ignored Iaso’s robots.txt file, hid behind various IP addresses and pretended to be other kinds of user. That, according to Iaso, is what these bots usually do: they lie, change their identity and use residential IP addresses as proxies.
Once on a site, they scrape its content until the pages collapse under the avalanche of requests, and even then they do not stop. They follow every link over and over and open the same pages endlessly. Some are so powerful and insistent that they hit the same link several times in a single second.
Iaso therefore decided to counterattack, and to do so built a tool called Anubis: a reverse proxy that checks, before passing a request on to a git server, that there is no bot behind it. If the request does not pass the checks it is subjected to, it never reaches the server.
In other words, it blocks the bots but lets requests made by humans through. If a web request passes the test and is judged to come from a human, an anime image appears announcing success. If it is a bot, the request is denied and it cannot access the contents of the page.
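Anubis’s published approach is a browser-side proof-of-work challenge: the visitor has to spend a little CPU time finding a hash that meets a target before the proxy forwards the request. The sketch below is a minimal illustration of that general idea in Python, with an invented difficulty value; it is not the project’s actual implementation.

```python
import hashlib
import secrets

DIFFICULTY = 4  # leading zero hex digits required; an illustrative value


def issue_challenge() -> str:
    """Proxy side: hand the visitor a random challenge string."""
    return secrets.token_hex(16)


def verify(challenge: str, nonce: int) -> bool:
    """Proxy side: one cheap hash confirms the work was actually done."""
    digest = hashlib.sha256(f"{challenge}{nonce}".encode()).hexdigest()
    return digest.startswith("0" * DIFFICULTY)


def solve(challenge: str) -> int:
    """Visitor side: brute-force a nonce whose hash meets the target."""
    nonce = 0
    while not verify(challenge, nonce):
        nonce += 1
    return nonce


if __name__ == "__main__":
    ch = issue_challenge()
    n = solve(ch)         # an honest browser pays this cost once per visit
    assert verify(ch, n)  # only requests carrying a valid proof get through
```

The asymmetry is the point of the design: a human visitor pays the cost once, while a crawler that rotates IP addresses and discards cookies ends up paying it over and over.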
Iaso shared the Anubis project on GitHub a few days ago, and it has spread like wildfire through the open source community. It already has more than 2,000 stars, 20 contributors and around 40 forks.
Growing problems for developers
Iaso’s problems with these bots are far from unique; there are dozens of similar cases. For example, Drew DeVault, CEO of SourceHut, explains that he spends between 20% and 100% of his working week mitigating the effects of large language model crawlers operating at scale. He also says that, for this reason, he deals with dozens of brief service outages on his websites every week.
Dennis Schubert, who maintains infrastructure for the Diaspora social network, described the situation these bots are creating as «a denial-of-service attack on the internet» after discovering that AI companies accounted for 70% of all web requests to his services.
The cost of the bots’ voracity is both technical and financial. According to Schubert, blocking the AI crawlers made his traffic drop by 75%, from 800 GB to 200 GB per day, which saves his project approximately $1,500 per month in bandwidth costs.
And that is not all. Besides consuming bandwidth, the crawlers often hit expensive endpoints, such as git log pages and repository commit histories, adding stress to already limited resources. On top of that, starting in December 2023 other open source projects began receiving AI-generated bug reports that look legitimate at first glance but describe invented vulnerabilities, forcing developers to waste a great deal of time checking reports that lead nowhere.
Martin Owens, of the Inkscape project, pointed out on Mastodon that his problems were not caused by the DDoS attacks the project occasionally suffers, but by «several companies that have decided to ignore our crawl configuration and have started lying about their browser information». That led him to build a block list for these bots, which he describes as prodigious and which means that «if you work for a large AI company, you may not be able to access our website».
Developer Jonathan Corbet, who also runs a news website, points out on Mastodon that his site frequently slows down because of traffic from AI crawlers, whose level of activity he compares to that of a denial-of-service attack. Kevin Fenzi, systems administrator of the Fedora Linux project, says the AI crawler bots have become so aggressive that he has had to block access from the whole of Brazil.
There are many other similar cases. At one point another developer even had to block every IP address in China. The problem is of such magnitude that, as we have seen, some developers have had to cut entire countries off from their repositories to escape the effects of these bots.
The best defense, a good offense
To avoid all this, many choose to defend themselves with tools like Anubis, but some developers think it is better to go on the attack. A few months ago a Hacker News user named Xyzal suggested filling the pages disallowed in robots.txt files with junk content, such as articles on the benefits of drinking bleach: search engine bots would respect the rules and never see them, but the AI bots would not. In other words, developers have started setting traps for AI bots.
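A minimal sketch of that kind of trap, with an invented path: the disallowed URL is never linked for human visitors, so anything requesting it has ignored robots.txt and can safely be fed junk or blocked.

```
# Illustrative honeypot entry (the /trap/ path is made up)
User-agent: *
Disallow: /trap/
# Compliant crawlers never fetch /trap/; requests for it reveal
# bots that ignore this file and can be served nonsense pages.
```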
So much so that there are already tools devoted specifically to this. In January, another anonymous developer, known as Aaron, released Nepenthes, a piece of software named after a carnivorous plant that does exactly that: it traps crawlers in a maze of fake content.
Developers are not the only ones getting to work to drive off AI bots: Cloudflare already offers several tools to fight them. The main one is AI Labyrinth, designed to slow down and confuse crawlers and to waste the resources of these and other bots that do not respect the no-crawl directives found in robots.txt files.
Crawlers that do not behave respectfully towards page creators are thus fed irrelevant content, so that they do not get hold of the legitimate content of the websites they intend to copy.
The community is also developing collaborative tools to help protect sites from these crawlers, such as the ai.robots.txt project, which offers an open list of web crawlers associated with AI companies, along with ready-made robots.txt files that implement the Robots Exclusion Protocol. It also provides .htaccess files that return error pages when they detect crawling requests from AI bots.
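As an illustration of the .htaccess approach, an Apache configuration can match known AI crawler user agents and answer with an error; the rules and the user-agent list below are examples rather than the project’s actual files.

```
# Illustrative Apache .htaccess rules (requires mod_rewrite):
# refuse any request whose User-Agent matches these AI crawler names.
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} (GPTBot|ClaudeBot|CCBot|Bytespider) [NC]
RewriteRule .* - [F,L]
```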
Why do AI crawlers do this?
The crawlers’ behavior suggests there may be several reasons for acting this way. Some may be collecting data to train or fine-tune large language models. Others may be running real-time searches, triggered by users’ questions to AI assistants, to find information.
The frequency of these crawls is particularly troubling, because the bots do not simply crawl a page once and move on: they come back and crawl the same page again a few hours later. This pattern suggests that much of the activity is ongoing data collection rather than one-off training runs, a possible sign that companies are using these crawls to keep their models’ information up to date.
Some AI companies, it is true, are more aggressive than others. KDE’s systems administration team has reported that crawlers coming from Alibaba IP ranges temporarily brought down its GitLab repository. Iaso’s problems, as we have seen, came from Amazon’s crawler.
According to a KDE team member, western operators of large language models, such as OpenAI and Anthropic, were adopting more respectful settings for their bots, which in theory allow websites to block them, whereas some Chinese AI companies were not as careful. Whatever the case, the bots’ activity is not only bringing down the pages and repositories of developers around the world, but also costing them money, and they are having to take measures to deal with it.