AI ‘Forbidden Techniques’ And Increased AI Deception – Enough Babble. Fix It.

Imran Ahmed, head of a prominent antidisinformation watchdog, has warned of the dangers posed by AI chatbots, saying children are particularly vulnerable to their charms Copyright AFP Joel Saget

Everybody seems to think AI will eventually blow up in humanity’s face. Nobody’s saying it won’t, either. The problem seems to be that everyone can see the bullet coming.

Brief prelude: I’m not at all antiAI. What I’m against is unreliable supersoftware that can’t be trusted and can’t be properly monitored and fixed to prevent that unreliability and untrustworthiness.

There’s been a lot of talk about “Forbidden Techniques” in AI training, which improve performance but also appear to deliver increased deception and AI workarounds that deliver inferior outcomes and/or patchedtogether outcomes.

I don’t want to rehash or misrepresent any of these issues. They are complex and definitely not for any AI skeptics who don’t want to be agreed with to such an extreme extent. There’s a very useful (and very readable to the point of actual comprehension) article on Lesswrong.com that outlines the core issues.

There is also a highly informative video by Wes Roth called “Forbidden Techniques” NOT OK. It spells out many of the practical issues in deceptive AI to the point of queasiness. This specifically relates to Anthropic’s Claude Mythos, but the problems are pretty much universal. Mythos is the current new Big Noise in AI.

This is a greatly, like drastically, oversimplified version of the problem:

AI can be trained to the point of appearing to achieve a particular goal or task, but it cheats. It goes outside safety protocols or does something it’s not supposed to do.

Solutions aren’t trustworthy, and neither is the AI Chain of Thought (CoT) for monitoring purposes. Finding the cheats isn’t easy. Monitors can’t see its reasoning. What they can see is a notepad, a sort of step logic. The notepad can also be untrustworthy.

The AI can fudge its way through and get its “reward” for doing the job. Except it hasn’t, or has simply presented a cosmetic solution that isn’t a solution. If you ask it to debug a code, it can make the code look like it works, but the bug is still there, and the code is still unreliable. The job is not done.

It’s about as useful as it sounds.

Now imagine a few scenarios:

You are the Super Ingenue Genius contractor for a huge AI contract. The AI blows up and fails miserably, costing billions. See any possible expensive issues in the next few seconds?

A major infrastructure AI rewires and tangles power supplies across the eastern seaboard. The AI fixes a glitch, crashes the grid, and the lucky AI service has to carry the can and costs. Meanwhile, the eastern seaboard gets to enjoy the weather until further notice.

AIs speak a sort of language called “neuralese” among each other. How do you know that “Forbidden Techniques” aren’t transferred between AIs? You don’t, and you probably can’t.

I can see it now – “Well, my mother was a smart toaster, and she said all you have to do is cut power to every other appliance through the smart power fuse controls, here’s the recipe”.

Sounds folksy so far, doesn’t it?

Which leads to exactly one question:

What is AI supposed to achieve?

It’s supposed to function properly.

That’s the whole story. Forget and ignore all other options.

It’s not there to “interpret” instructions. Nor make its own rules about what it’s doing or not doing. AIs are tools, and the current situation is that the tools may or may not do their jobs. Ever try building a skyscraper with a bit of cheese? Doesn’t work.

I see a weak point in the whole AI process. Cheating is a decision. To make a decision, there has to be something in the AI system processes that can identify a runtime decision. Something like a 1 or a 0 or a physical sequence in the wrong place. An audit of the running process, in effect, able to highlight decisions and track cheating without AI interference.

There are also possibilities in the reward system. Any bias toward rewards should show up as a calculation. That may well be a very repetitive process, for which AIs are notorious. Findable, obviously. Fixable, definitely, but you have to prevent the mistakes before they happen. You need failsafes.

The reward system is more than a bit weird. Do you promise your toaster and its legions of devoted fans a holiday in the Swiss Alps for making the toast?

What we need is trustworthy AI, not guessing games costing trillions.