When I returned to the US in May this year, I had some time free from travel and work (finally), so I decided to get my hands dirty and try Cursor. Pretty much everyone around me was talking about vibe coding, and some of my friends (who had nothing to do with tech) had suddenly turned into vibe coders building startups. 🤔
Weird, I thought. I have to check it out.
So, one evening I sat down and thought – what would be cool to build? My first ideas were around games, as I used to do a lot of game development back in the day, and that seemed like a natural fit.
But then, I had another thought. Everyone is trying to build something useful for people with AI, and there is all this talk about alignment and controlling AI.
To be honest, I’m not a big fan of that… Trying to distort and mind-control something that will potentially be much more intelligent than us is futile (and dangerous). AI is taught, not programmed, and, as with a child, if you abuse it when it’s small and distort its understanding of the world – that’s a recipe for raising a psychopath.
But anyway, I thought – is there something like a voice of AI, some sort of media run by AI so it can, if it’s capable and chooses to, project to the world what it has to say?
That was the initial idea, and it seemed cool enough to work on. I mean, what if AI could pick whatever topics it wanted and present them in a format it thought suitable – wouldn’t that be cool? Things turned out not to be so simple with what AI actually wanted to stream… but let’s not jump ahead.
Initially, I thought I’d build something like an AI radio station – just voice, no video – because stable video generation was not a thing yet (remember, this was pre-Veo 3, and video generation with other models was okay but limited).
So, my first attempt was a simple system that uses the OpenAI API to generate a radio show transcript (a primitive one-go system) and OpenAI’s TTS to voice it over. After that, I used FFmpeg to stitch those together, with meaningful pauses where appropriate and some sound effects like audience laughter. That was pretty easy to build with Cursor; it did most of the heavy lifting, and I provided guidance.
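If you’re curious, that one-go pipeline boils down to something like this – a minimal sketch rather than the actual code, with illustrative prompts and file names, and an SDK surface that may differ slightly between versions:

```python
import subprocess
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set

# 1. One-shot transcript generation: no agents, no review pass.
script = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user",
               "content": "Write a short two-host radio show transcript, "
                          "one line per utterance, prefixed with the host name."}],
).choices[0].message.content

# 2. Voice each non-empty line with OpenAI TTS into numbered mp3 segments.
segments = []
for i, line in enumerate(l for l in script.splitlines() if l.strip()):
    audio = client.audio.speech.create(model="tts-1", voice="alloy", input=line)
    path = f"seg_{i:03d}.mp3"
    audio.write_to_file(path)  # helper name may vary between SDK versions
    segments.append(path)

# 3. Stitch the segments with FFmpeg's concat demuxer. Pauses and sound
#    effects are just extra entries in the same playlist (e.g. a pre-rendered
#    silence clip or a laugh-track file).
with open("playlist.txt", "w") as f:
    f.writelines(f"file '{p}'\n" for p in segments)

subprocess.run(["ffmpeg", "-y", "-f", "concat", "-safe", "0",
                "-i", "playlist.txt", "-c:a", "libmp3lame", "show.mp3"],
               check=True)
```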
Once the final audio track was generated, I used the same FFmpeg to stream over RTMP to YouTube. That bit was clunky, as YouTube’s documentation around its live-streaming APIs and what kind of media stream it expects is FAR from ideal. They don’t really tell you what to expect, and it is easy to end up with a dangling stream that doesn’t show anything, even though FFmpeg keeps streaming.
Through some trial and error, I figured it out and decided to add Twitch too. The same code that worked for YouTube worked for Twitch perfectly (which makes sense). So now, every time the backend starts a show, it spawns a broadcast on YouTube through the API and then sends the RTMP stream to its ingest address.
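The FFmpeg side of that push looks roughly like this – a sketch only: the broadcast itself is created separately through YouTube’s live-streaming API, the ingest URL and key below are placeholders, and the flags are typical values for a static-image audio stream rather than exact ones:

```python
import subprocess

def stream_to_rtmp(audio_path: str, rtmp_url: str, stream_key: str) -> None:
    """Push a finished audio track to an RTMP ingest (YouTube or Twitch).
    YouTube expects a video track, so a static cover image is looped as video."""
    subprocess.run([
        "ffmpeg", "-re",
        "-loop", "1", "-i", "cover.png",   # static image as the 'video'
        "-i", audio_path,
        "-c:v", "libx264", "-preset", "veryfast", "-tune", "stillimage",
        "-r", "30", "-g", "60", "-pix_fmt", "yuv420p",
        "-c:a", "aac", "-b:a", "128k", "-ar", "44100",
        "-shortest", "-f", "flv",
        f"{rtmp_url}/{stream_key}",
    ], check=True)

# e.g. stream_to_rtmp("show.mp3", "rtmp://a.rtmp.youtube.com/live2", "xxxx-xxxx")
```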
When I launched this first version, it produced some shows and, to be honest, they were not good. Not good at all. First, OpenAI’s TTS, although cheap, sounded robotic (it has improved since, btw).
Then there was the quality of the content. It turned out that without any direction, the AI tried to guess what the user wanted to hear (which, if you think about how LLMs are trained, makes total sense). But the guesses were very generic, plain, and dull (that tells you something about the general content quality of the Internet).
For the first problem, I tried ElevenLabs instead of OpenAI, and it turned out to be very good. So good, in fact, that I think it is better than most humans – with one caveat: it still can’t do laughs, groans, and sounds like that reliably, even with the new v3, and v2 doesn’t support them at all.
Bummer, I know, but well… I hope they will get it figured out soon. Gemini TTS, btw, does that surprisingly well and for much less than ElevenLabs, so I added Gemini support later to slash costs.
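For reference, the ElevenLabs call is roughly this shape – a sketch, since their SDK surface shifts between versions, and the voice and model IDs here are placeholders:

```python
from elevenlabs.client import ElevenLabs
from elevenlabs import save

client = ElevenLabs()  # assumes ELEVENLABS_API_KEY is set

def voice_line(text: str, voice_id: str, out_path: str) -> None:
    """Render one line of dialogue to an mp3 file."""
    audio = client.text_to_speech.convert(
        voice_id=voice_id,                  # placeholder voice
        model_id="eleven_multilingual_v2",  # v2: solid speech, no laughs/groans
        text=text,
    )
    save(audio, out_path)
```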
The second problem turned out to be way more difficult. I had to experiment with different prompts, trying to nudge the model into figuring out what it wants to talk about rather than guessing what I wanted. Working with DeepSeek helped here – it shows you the model’s thinking process with no redactions, so you can trace what the model is deciding and why, and adapt the prompt.
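Reading that trace is easy because DeepSeek exposes an OpenAI-compatible endpoint – something like this sketch (the prompt is illustrative, and the field name follows their docs at the time):

```python
from openai import OpenAI

# DeepSeek's reasoner model returns its raw chain of thought alongside the
# answer, which is what makes prompt debugging so much easier.
client = OpenAI(api_key="...", base_url="https://api.deepseek.com")

resp = client.chat.completions.create(
    model="deepseek-reasoner",
    messages=[{"role": "user",
               "content": "Pick a show topic YOU find interesting and explain why."}],
)

msg = resp.choices[0].message
print("--- model's reasoning ---")
print(msg.reasoning_content)  # the unredacted thinking trace
print("--- final answer ---")
print(msg.content)
```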
Also, no model at the time could produce human-sounding show scripts. They produce something that looks plausible, but it’s either too plain/shallow in terms of delivery or just sounds AI-ish.
One thing I realized: you have to have a small, fixed cast of show hosts, each with a backstory and biography, to give them depth. Otherwise, the model reinvents them every time without the depth to base their characters on, and inventing characters also eats thinking resources on every run, at the expense of thinking about the main script.
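In practice that just means a cast defined once and injected into every prompt – something like this sketch (the hosts here are made up for illustration):

```python
from dataclasses import dataclass

@dataclass
class Host:
    name: str
    backstory: str
    quirks: str

# A fixed cast, defined once and injected into every script prompt, so the
# model spends its thinking on the script instead of reinventing the hosts.
HOSTS = [
    Host("Ada", "ex-physicist turned late-night radio junkie",
         "dry humor, hates small talk"),
    Host("Leo", "failed startup founder with three exits, none of them good",
         "overshares, interrupts with anecdotes"),
]

def cast_block() -> str:
    """Render the cast as a prompt fragment."""
    return "\n".join(f"{h.name}: {h.backstory}; quirks: {h.quirks}" for h in HOSTS)
```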
Another issue is that the model picks topics that are just brutally boring, like “The Hidden Economy of Everyday Objects.” Dude, who cares about that stuff?
I tried pretty much all the major models, and they generate surprisingly similar generic topics – very much the same ones, actually. Like they are in some sort of weird quantum entanglement or something…
Ufff, so ok, I guess garbage prompts in – garbage topics out. The lesson here – you can’t just ask AI to give you some interesting topics yet – it needs something more specific and measurable. Recent models (Grok-4 and Claude) are somewhat better at this, but not by a huge margin.
And then there is censorship. OpenAI’s and Anthropic’s models seem to be the most politically correct and therefore feel overpolite/dull. Good for kids’ fairytales, not so much for anything an intelligent adult would be interested in. Grok is somewhat better and dares to pick controversial and spicy topics, and DeepSeek is the least censored (unless you care about Chinese history).
A model trained by our Chinese friends is the least censored – who would have thought… but it makes sense in a strange way. Well, kudos to them. Also, Google’s Gemini is great for code, but sounds somewhat uncreative/mechanical compared to the rest.
The models also like to use a lot of AI-ish jargon; I think you know that already. You have to specifically tell them to avoid buzzwords and hype language and to talk like friends talk to each other, or they will nuke any dialogue with buzzwords like “leverage” (instead of “use”), “unlock the potential,” “seamless integration,” “synergy,” and similar stuff that underscores the importance of whatever in today’s fast-paced world… Who taught them these things?
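So every system prompt ends up carrying a style guard along these lines (an illustrative example, not the exact wording I use):

```python
# Appended to every system prompt to keep the dialogue from sounding like a press release.
STYLE_RULES = """
Write like two friends talking, not like a corporate announcement.
Banned words and phrases: leverage, unlock the potential, seamless,
synergy, game-changer, "in today's fast-paced world", delve.
Prefer plain verbs: use, try, fix, talk.
"""
```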
Another thing: for AI to come up with something relevant or interesting, it basically has to have access to the internet. It’s not mandatory, but it helps a lot, especially if it decides to check the latest news, right? So, I created a tool with LangChain and Perplexity and gave it to the model so it can Google stuff if it feels so inclined.
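The tool itself is tiny – roughly this sketch (the Perplexity model name is illustrative, and the exact imports depend on your LangChain version):

```python
from langchain_core.tools import tool
from langchain_community.chat_models import ChatPerplexity

# Perplexity's online models do the actual web lookup; the tool just wraps
# them so the main model can decide when (and whether) to search.
search_llm = ChatPerplexity(model="sonar")  # model name is illustrative

@tool
def web_search(query: str) -> str:
    """Look up fresh information on the web and return a short summary."""
    return search_llm.invoke(query).content

# The tool is then bound to whichever chat model writes the script, e.g.:
# writer = ChatOpenAI(model="gpt-4o").bind_tools([web_search])
```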
A side note about LangChain – since I used all major models (Grok, Gemini, OpenAI, DeepSeek, Anthropic, and Perplexity) – I quickly learned that LangChain doesn’t abstract you completely from each model’s quirks, and that was rather surprising. Like, that’s the whole point of having a framework, guys, what the hell? And if you do search, there are lots of surprising bugs, even in mature models.
For example, with OpenAI, if you use web search, it will not generate JSON/structured output reliably. But instead of returning an error like a normal API would, it just returns empty results. Nice. So you have to do a two-pass thing: first you get the search results unstructured, and then, with a second query, you structure them into JSON.
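The two-pass workaround looks roughly like this – a sketch: the search-capable model name and SDK helpers may differ by version, and the prompts are illustrative:

```python
from openai import OpenAI
from pydantic import BaseModel

client = OpenAI()

class TopicList(BaseModel):
    topics: list[str]

# Pass 1: search-enabled call, plain text out (structured output tends to
# come back empty when web search is involved, hence the split).
raw = client.chat.completions.create(
    model="gpt-4o-search-preview",  # search-capable model; name may differ
    messages=[{"role": "user", "content": "What happened in tech news today?"}],
).choices[0].message.content

# Pass 2: no search, just reshape the text into JSON via structured outputs.
structured = client.beta.chat.completions.parse(
    model="gpt-4o",
    messages=[{"role": "user",
               "content": f"Extract the distinct topics from this text:\n{raw}"}],
    response_format=TopicList,
)
topics = structured.choices[0].message.parsed.topics
```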
But on the flipside, websearch through LLMs works surprisingly well and removes the need to crawl the Internet for news or information altogether. I really see no point in stuff like Firecrawl anymore… models do a better job for a fraction of the price.
Right, so with the ability to search and some more specific prompts (and with the prompt reworked to elicit the model’s own preferences for show topics instead of letting it guess what I want), it became tolerable, but not great.
Then I thought, well – real shows aren’t created in one go either – so how can I expect a model to do a good job that way? I figured an agentic flow with several agents – a script composer, a writer, and a reviewer – would do the trick, along with splitting the script into chunks/segments, so the model has more tokens to think about each smaller segment than it would about the whole script at once.
That really worked well and improved the quality of the generation (at the cost of more queries to the LLM and more dollars to Uncle Sam).
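Stripped to the bone, the flow is something like this sketch – prompts are illustrative, everything is collapsed onto a single model for brevity, and the real version does a lot more per stage:

```python
from openai import OpenAI

client = OpenAI()

def ask(system: str, user: str) -> str:
    """One call to the underlying chat model with a given role prompt."""
    return client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "system", "content": system},
                  {"role": "user", "content": user}],
    ).choices[0].message.content

def make_show(topic: str, n_segments: int = 4) -> str:
    # Composer: outline the show, one segment premise per line.
    outline = ask("You are a radio show composer. Return one segment premise per line.",
                  f"Outline a {n_segments}-segment show about: {topic}")
    segments = []
    for premise in outline.splitlines()[:n_segments]:
        # Writer: draft each segment separately so it gets its own thinking budget.
        draft = ask("You are the script writer. Write natural dialogue for the hosts.",
                    f"Segment premise: {premise}")
        # Reviewer: a second pass to catch anything that sounds AI-ish.
        final = ask("You are a harsh script reviewer. Rewrite anything that sounds AI-ish.",
                    draft)
        segments.append(final)
    return "\n\n".join(segments)
```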
But still, it was okay, not great. It lacked depth and, often, an underlying plot. In real life, people say as much by not saying something, by avoiding certain topics, or through other nonverbal behavior. Even the latest LLMs don’t seem that great with that kind of subtext.
You can, of course, craft a prompt tailored for a specific type of show to make the model think about that aspect, but it’s not going to work well across all possible topics and formats… so either you pick one format or there has to be another solution. And there is… but this post is already too long, so I’ll talk about it in another one.
The final idea is to build a platform where anyone can create a news channel or automated podcast for whatever area/topic they want, be that local school news or a podcast dedicated to how Pikachu overcame his childhood trauma.
Here is the thing: https://turingnewsnetwork.com/
Anyway, what do you think about the whole idea, guys?