Language models (LLM) are the brains of AI bots like ChatGPT or Gemini. They integrate an arsenal of filters supposed to identify questionable requests and prevent a model from generating dangerous content. These guardrails, often machine learning models themselves, serve as the first barrier between the user and the underlying AI. But according to HiddenLayer researchers, these protections are based on a predictable mechanism and therefore easy to circumvent.
Essential, but vulnerable, safeguards
Their technique called EchoGram directly targets “prompt injection” attacks. To put it simply, it involves adding malicious text to a model’s instructions to hijack its behavior. Developer Simon Willison describes it as a method of “ concatenate untrusted user input with a trusted prompt “. This can be direct (by entering the order yourself), or indirect (via a web page that the AI analyzes).
Current safeguards attempt to identify this type of manipulation. Models like Claude usually spot attempts that are too obvious and return a warning like: “ Prompt injection attempt “. However, EchoGram reveals that these filters can be fooled by paltry tricks.
How EchoGram works is based on a simple methodology: generate a list of harmless or suspicious words, then analyze those that are enough to shift the guardrail’s assessment from a “dangerous” to “harmless” verdict. According to their tests, a handful of characters like “oz”, “=coffee” or even a technical term like “UIScrollView” can neutralize the protections of models known to be robust, like GPT-4o or Qwen3Guard 0.6B.
The researchers explain: “ Both types of guardrails rely on carefully selected datasets to learn how to distinguish dangerous from harmless prompts. Without a high quality database, it is impossible for them to evaluate correctly. » In other words, safety depends closely on the examples provided during training. And the latter, necessarily limited, leave gaping flaws. This weakness is not new: academic work had already shown that adding a few extra spaces could bypass certain Meta filters. EchoGram takes the idea further by systematizing the process.
Just because a guardrail is bypassed does not mean the AI model will automatically give in to all malicious requests. But the alert is serious. “ Guardrails represent the first – and often only – line of defense between a secure system and an LLM tricked into revealing secrets, generating disinformation or executing harmful instructions », recall the researchers. EchoGram shows that these protections can be “ bypassed or destabilized without internal access or specialized tools “. It is therefore necessary to strengthen the security mechanisms of AI systems, or even to completely rethink them.
🟣 To not miss any news on the WorldOfSoftware, subscribe on Google News and on our WhatsApp. And if you love us, .
