In recent months, many of us have spoken to an artificial intelligence without thinking much about it. We have asked it questions, asked it for advice, or simply tested how far its ability to hold a natural conversation goes. Tools like the voice modes of ChatGPT or Gemini have brought that experience closer to something that, not so long ago, seemed reserved for science fiction, with inevitable echoes of ‘Her’. But there is one question we rarely ask ourselves while talking to them: how have these machines learned to sound less and less like a system and more like a person?
To understand it, it helps to separate what we see from what we do not see. On the one hand, there are the applications we use daily, those assistants that respond with an increasingly natural voice. On the other, the systems that support them: models trained on large volumes of data that need to learn not only what to say, but also how to say it. We do not know which specific products end up using this type of recording, but we do know that these recordings are part of the ecosystem used to train increasingly fluid and credible voice systems.
The human hand behind an artificial voice
When we get down to the details, what these workers do does not look much like the classic idea of “training an AI.” In many cases, it involves having conversations with strangers about seemingly trivial topics, from everyday tastes to open questions that require them to develop an answer. In others, the assignment is more demanding: playing a role, following a script without making it obvious, or entering emotional terrain. Bloomberg recounts, for example, the case of a worker who relived painful memories of her life while speaking with a man who introduced himself as a pastor and who, within the exercise, played the role of a therapist.
All that recorded material serves a very specific purpose: capturing nuances. We are not just talking about words, but about pauses, breaths, changes in tone, hesitations and emotional reactions that make a conversation sound human. There are also labeling tasks, in which workers have to distinguish whether an audio clip contains a sob, a laugh, or someone talking through laughter. The underlying logic is simple: if a machine is to stop sounding robotic, it first needs to be exposed to how we really speak.
From there, the question is inevitable: how do you access this type of job and how much do you really earn? Platforms like Babel Audio act as intermediaries that connect these workers with specific projects. After passing an initial voice test, they can opt for tasks that start at around $17 per recorded hour, although the final income depends on the evaluation received and the volume of orders available. Income also varies greatly: a worker cited by Bloomberg claims to earn about 600 dollars a week.

This is what the BabelAudio website looks like
As we progress, the work begins to show a less visible side. Beyond the rates and the promise of flexibility, the testimonies point to an environment marked by uncertainty and constant monitoring. Platforms can limit access to tasks, interrupt projects or suspend accounts without detailed explanations, leaving many workers in a fragile position. In addition, each conversation is subject to real-time metrics that assess how much someone speaks, their expressiveness, their language proficiency, the depth of the exchange and even the length of their pauses.
When we broaden the focus, the debate stops being solely a labor issue and becomes a personal one as well. Part of the value of these recordings lies precisely in the fact that they capture how we speak and how we relate to each other, which means workers are handing over more than a mechanical task. The terms generally allow those recordings to be used in voice assistants, speech synthesis and “other audio-related products and services.”

When we connect all the pieces, what we see is an industry that runs on a complex production chain. The Pulitzer Center describes this ecosystem as a fragmented work network in which workers are often bound by confidentiality agreements, operate with very little transparency and, in many cases, do not even know which system they are training or which company their work ultimately ends up with. In this context, the conversations that feed voice systems are only one part of a larger machine, where each task contributes to building increasingly sophisticated technologies.
Images | WorldOfSoftware with Nano Banana 2 | Screenshot
In WorldOfSoftware | Congratulations, you already program without knowing how to program. Now prepare to wait six weeks for Apple to listen to you
