That academic articles are being written by AI has been demonstrated before; the open question is how widespread the practice is. To gauge its magnitude, a group of researchers reviewed millions of abstracts of papers published in PubMed, and they found something interesting: there is one word AI loves, and the reason it likes it so much is rather murky.
Delve. Its use multiplied by 28 between 2022 and 2024, a period that coincides with the boom of ChatGPT and large language models. Other words such as ‘underscore’ or ‘showcasing’ are also cited, with frequency increases of x13.8 and x10.7 respectively. None of them is a noun tied to the papers’ subject matter; they have to do with writing style, and they are very characteristic of the flowery language LLMs tend to use.
Flowery language. Does this mean that if we see one of these words in a paper, it was written with AI? Not necessarily, but the jump is enormous. The researchers compared the rise of ‘delve’ to that of other keywords, such as ‘pandemic’, which had a huge peak in 2020 and began to decline in 2021. The increase in the frequency of ‘delve’ is far more pronounced than any of the others.
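The kind of comparison the researchers ran can be illustrated with a minimal sketch: count each word's relative frequency in abstracts from two periods and take the ratio. The function names and the toy corpora below are invented for illustration; they are not the study's actual code or data.

```python
from collections import Counter
import re

def word_frequencies(abstracts):
    """Relative frequency (per 1,000 words) of each word across a corpus of abstracts."""
    counts = Counter()
    total = 0
    for text in abstracts:
        words = re.findall(r"[a-z]+", text.lower())
        counts.update(words)
        total += len(words)
    return {w: 1000 * c / total for w, c in counts.items()}

def frequency_multiplier(word, abstracts_before, abstracts_after):
    """Ratio of a word's relative frequency in the later corpus vs. the earlier one."""
    before = word_frequencies(abstracts_before).get(word, 0)
    after = word_frequencies(abstracts_after).get(word, 0)
    return after / before if before else float("inf")

# Toy stand-ins for pre- and post-ChatGPT PubMed abstracts (made up).
pre_llm = ["We study the effect of the pandemic on sleep quality."]
post_llm = ["We delve into sleep quality, underscoring key results and showcasing trends."]

print(frequency_multiplier("delve", pre_llm, post_llm))
print(frequency_multiplier("pandemic", pre_llm, post_llm))
```

On real data, a multiplier of 28 for ‘delve’ against a flat or declining curve for content words like ‘pandemic’ is exactly the asymmetry the study flags.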
It’s not coincidental. There is a stage in the process of creating a chatbot like ChatGPT that requires human intervention to fine-tune the responses; this is what is known as reinforcement learning from human feedback (RLHF). It turns out that most of the workers dedicated to this refinement work are in African countries such as Nigeria. Guess where the use of these words in formal English is quite common. Exactly: in Nigeria.
African style. ‘Delve’ is a fairly common word in business English in Africa, especially in Nigeria, and it is not the only one. Others, like ‘leverage’, ‘explore’ or ‘tapestry’, are also more common in African English. According to 311institute, although human feedback is tiny compared to the enormous volume of training data, it has an outsized impact, since it is what defines the tone the model adopts when responding to us.
Data labeling. It is a key step in training large language models, and it requires humans behind it. The problem is that most of the workers who do it are from impoverished countries such as Nigeria, Kenya or India, among others. As if the endless hours and meager salaries were not enough, workers often have to review violent and highly explicit images, all without any kind of psychological support.
In WorldOfSoftware | Being a porn moderator is not fun at all: he was exposed to “extreme, violent, graphic and sexually explicit content”
Image | National Institute of Allergy and Infectious Diseases on Unsplash
