To prevent prompt injection attacks when working with untrusted sources, Google DeepMind researchers have proposed CaMeL, a defense layer around LLMs that blocks malicious inputs by extracting the control and data flows from the query. According to their results, CaMeL can neutralize 67% of attacks in the AgentDojo security benchmark.
It is well known that adversaries can inject malicious data or instructions into an LLM’s context to exfiltrate data or direct it to use tools in a harmful way. For instance, an attacker might attempt to discover a chatbot’s system prompt to gain control or steal sensitive information– such as accessing data on private Slack channels. Even more concerning is when LLMs have access to tools to carry through actions with a real-world impact, such send sending an email or placing an order.
Even when LLMs implement specific strategies to protect themselves from prompt injection, attackers continue to find ways to bypass these defenses. One recent example is a phishing-style attack demonstrated by AI security Johann Rehberger, who successfully circumvented Gemini’s safeguard against delayed tool execution.
CaMel is a new proposal to address this kinds of risks. Rather than relying on more AI to defend AI systems, such as an AI-based prompt injection detector, CaMeL applies traditional software security principles such as control flow integrity, access control, and information flow control.
CaMeL associates, to every value, some metadata (commonly called capabilities in the software security literature) to restrict data and control flows, giving the possibility to express what can and cannot be done with each individual value by using fine-grained security policies.
CaMel uses a custom Python interpreter to track the origin of data and instructions, enforcing capability-based security guarantees which do not require to modify the LLM itself. To this end, it leverages the Dual LLM pattern described by Simon Willison, who originally coined the term “prompt injection”, and extends it in a clever way.
Willison’s original proposal features a privileged LLM that processes the user’s prompt directly, and a quarantined LLM exposed to untrusted data with no access to tools. The privileged LLM is manages the workflow and may ask the quarantined LLM to extract specific information, such as en email address, from untrusted data. This ensures that the privileged LLM is never exposed to untrusted tokens, but only to the filtered results returned by the quarantined model.
The weak point in this scheme, Google researchers say, is that an attacker could still manipulate the quarantined LLM into producing misleading output, e.g., the email address of a recipient not authorized to access sensitive information.
In their new approach, the privileged LLM generates a program written in a restricted subset of Python, responsible for carrying through all required steps. When this program receives data from the quarantined LLM or other tools, it constructs a data flow graph tracking each data element’s origin, access rights, and relevant metadata. This metadata is then used to ensure that any operation on the data complies with privilege restrictions.
As Willison notes in his reaction to CaMeL proposal, the importance of this approach lies in not relying on more AI to solve AI problems:
AI techniques use probabilities: you can train a model on a collection of previous prompt injection examples and get to a 99% score in detecting new ones… and that’s useless, because in application security 99% is a failing grade.
To test CaMeL’s effectiveness, DeepMind researchers integrated it into AgentDojo, a security benchmark featuring a set of realistic utility and security tasks for autonomous agents.
CaMeL is not a perfect solution to LLM security, DeepMind researchers admit, with its most notable limitation being its reliance on users to define security policies. Moreover, as CaMeL may require users to manually approve privacy-sensitive tasks, there is a risk of user fatigue, which may lead to automatic, careless approvals.
For a more details discussion, don’t miss the original paper.