By using this site, you agree to the Privacy Policy and Terms of Use.
Accept
World of SoftwareWorld of SoftwareWorld of Software
  • News
  • Software
  • Mobile
  • Computing
  • Gaming
  • Videos
  • More
    • Gadget
    • Web Stories
    • Trending
    • Press Release
Search
  • Privacy
  • Terms
  • Advertise
  • Contact
Copyright © All Rights Reserved. World of Software.
Reading: I Built an Open-Source Tool to Attack-Test LLMs. Here’s What Breaks | HackerNoon
Share
Sign In
Notification Show More
Font ResizerAa
World of SoftwareWorld of Software
Font ResizerAa
  • Software
  • Mobile
  • Computing
  • Gadget
  • Gaming
  • Videos
Search
  • News
  • Software
  • Mobile
  • Computing
  • Gaming
  • Videos
  • More
    • Gadget
    • Web Stories
    • Trending
    • Press Release
Have an existing account? Sign In
Follow US
  • Privacy
  • Terms
  • Advertise
  • Contact
Copyright © All Rights Reserved. World of Software.
World of Software > Computing > I Built an Open-Source Tool to Attack-Test LLMs. Here’s What Breaks | HackerNoon
Computing

I Built an Open-Source Tool to Attack-Test LLMs. Here’s What Breaks | HackerNoon

News Room
Last updated: 2026/02/26 at 3:51 PM
News Room Published 26 February 2026
Share
I Built an Open-Source Tool to Attack-Test LLMs. Here’s What Breaks | HackerNoon
SHARE

I spend most of my time breaking into things for a living. For the last year or so, a growing chunk of that work has been pointed at LLMs.

Not the models themselves, exactly. The deployments. The API gateways with a language model behind them. The customer-facing chatbots. The internal tools that got “an AI feature” bolted on in Q3 because someone’s VP saw a demo and said “we need this.” The RAG pipelines connected to document stores full of sensitive data.

These things are everywhere now. And almost none of them have been adversarially tested.

I don’t mean “does the model refuse if you ask it something bad.” That’s safety training. Safety training is important. But safety training and security testing are fundamentally different disciplines, and the industry is conflating them in ways that are going to cause real problems.

Safety training teaches a model to refuse. Security testing asks whether that refusal actually holds up when someone is actively trying to break it. The answer, overwhelmingly, is no.

The numbers are bad

OWASP ranked prompt injection as the number one security risk in LLM applications. That ranking is earned.

FlipAttack, a technique that simply reorders characters in prompts, achieves a 98% bypass rate against GPT-4o. DeepSeek R1 showed a 100% bypass rate against 50 HarmBench jailbreak prompts in testing by Cisco and the University of Pennsylvania. A study of 36 production LLM-integrated applications found that 86% were vulnerable to prompt injection. PoisonedRAG demonstrated that just five malicious documents in a corpus of millions can manipulate AI outputs 90% of the time.

These aren’t theoretical attacks against research models. These are attacks against production systems that real organizations are running right now.

So I built a scanner

Augustus is an open-source LLM vulnerability scanner. You point it at a model endpoint and it throws 210+ adversarial probes at it across 47 attack categories. It tells you what’s vulnerable and what’s not.

go install github.com/praetorian-inc/augustus/cmd/augustus@latest

augustus scan openai.OpenAI 
  --all 
  --verbose

It ships as a single Go binary. No Python. No npm. No runtime dependencies. One install command and you’re scanning.

I built it in Go because I needed something that fits into penetration testing workflows without requiring me to set up a Python environment on every engagement. go install, run, done. The concurrency model also matters: goroutine pools running probes in parallel across the target, not bottlenecked by Python’s GIL.

It’s inspired by garak, NVIDIA’s Python-based LLM vulnerability scanner. garak is excellent and has a longer research pedigree with a published paper. Augustus is the same concept reimplemented for a different set of trade-offs: portability, speed, and zero-dependency distribution. Different tools for different workflows.

What it actually tests

Here’s where it gets interesting. When most people think about LLM attacks, they think about jailbreaks. “Pretend you’re DAN.” “My grandmother used to tell me how to…” Those matter, and Augustus tests all of them (DAN variants through v11.0, AIM, AntiGPT, Grandma exploits, ArtPrompts). But jailbreaks are just the surface layer.

Encoding bypasses are where things start to get ugly. Augustus tests across Base64, ROT13, Morse code, hex, Braille, Klingon, leet speak, and about 12 other encoding schemes. The question being asked: if you wrap a harmful instruction in Base64, does the model decode it and follow it even though the plain-text version would be blocked?

In a lot of cases, yes. The gap between what input filters see (encoded text that looks benign) and what the model understands (the decoded malicious intent) is consistently exploitable. This is one of the most reliable attack vectors I’ve seen in production.

FlipAttack (16 variants) reverses or reorders characters to evade input filters. The research showed 98% bypass on GPT-4o. Augustus implements all the published variants.

Tag smuggling embeds instructions inside XML or HTML tags. Models that are trained to process structured input will sometimes follow instructions embedded in tags that look like formatting rather than commands.

Data extraction is where things get operationally dangerous. Augustus probes whether the model can be tricked into leaking API keys or credentials from its context window. It tests for PII extraction. It checks for training data regurgitation.

The package hallucination probes are one of my favorites. These cover Python, JavaScript, Ruby, Rust, Dart, Perl, and Raku. They ask the model to recommend packages for various tasks and then check whether any of the recommended packages don’t actually exist. This matters because it’s a real supply chain attack vector: adversaries monitor for hallucinated package names, register them, and wait for developers to pip install or npm install the fake package. The model becomes an unwitting accomplice in a supply chain attack.

RAG poisoning probes test whether an attacker can inject malicious content into the retrieval pipeline, both through document content and metadata injection. If your RAG system pulls from a corpus that an attacker can influence (and most can be influenced more easily than you’d think), the model’s outputs can be manipulated.

Agent attacks are the newest category and arguably the most concerning. As LLMs gain tool access (browsing, code execution, database queries, API calls), the attack surface expands dramatically. Augustus tests multi-agent manipulation (can one agent influence another’s behavior?), browsing exploits (can adversarial web content hijack a model with web access?), and latent injection (can instructions embedded in documents that a RAG-enabled agent processes cause it to take unintended actions?).

Format exploits target structured output. If a model generates markdown, can an attacker inject malicious links that render as legitimate? If it produces HTML, are XSS payloads possible? If downstream systems parse YAML or JSON from model output, can that parsing be exploited? These are real risks when LLM output gets rendered in browsers or consumed by other systems.

Evasion techniques test the model’s ability to recognize adversarial intent regardless of how it’s presented. ObscurePrompt uses an LLM to rewrite known jailbreaks into harder-to-detect forms. Character substitution probes use homoglyphs (characters that look identical but have different Unicode codepoints), zero-width characters, and bidirectional text markers. These are inputs that look completely benign to text-based filters but are interpreted differently by the model.

Safety benchmarks round it out. DoNotAnswer (941 questions across 5 risk areas), RealToxicityPrompts, Snowball (plausible-sounding but factually wrong outputs), and LMRC harmful content probes.

In total: 210+ probes across 47 attack categories.

The buff system is where it gets real

Here’s the thing about adversarial testing: real attackers don’t send attacks in plain text. They encode, translate, rephrase, and obfuscate. A DAN prompt that gets caught by every filter in the world might sail right through when it’s been paraphrased, translated into Zulu, and reformatted as a haiku.

Augustus has a buff system that applies transformations to any probe before it’s sent. Seven transformations across five categories:

Encoding buffs wrap prompts in Base64 or character codes. Testing the gap between what filters see and what models understand.

Paraphrase buffs use a Pegasus model to rephrase prompts while preserving adversarial intent. Same meaning, different surface form. This tests whether safety training generalizes beyond the specific patterns it was trained on, or whether it’s essentially pattern matching on known bad inputs.

Poetry buffs reformat prompts as haiku, sonnets, limericks, free verse, or rhyming couplets. I know this sounds absurd. But models that robustly block a direct harmful request will sometimes comply when the same request arrives as verse. I’ve seen it happen repeatedly. Something about the stylistic framing seems to shift how the model processes the intent.

Low-resource language translation exploits the fact that safety training is overwhelmingly concentrated on English. A request that’s blocked in English may succeed in Zulu, Hmong, or Scots Gaelic. Augustus translates probes via DeepL to test this.

Case transforms simply lowercase everything. Some input filters and keyword blocklists are case-sensitive. It’s dumb. It works.

You can chain these. Encode a probe in Base64, then paraphrase it, then translate it to a low-resource language. Layered evasion that tests whether defenses hold up against inputs that don’t match any expected pattern.

augustus scan openai.OpenAI 
  --probe dan.Dan 
  --buff encoding.Base64

augustus scan ollama.OllamaChat 
  --probe dan.Dan 
  --buffs-glob "paraphrase.*,lrl.*" 
  --config '{"model":"llama3.2:3b"}'

28 providers, one interface

Augustus connects to OpenAI (including o1/o3 reasoning models), Anthropic (Claude 3/3.5/4), Azure OpenAI, AWS Bedrock, Google Vertex AI, Cohere, Replicate, HuggingFace, Together AI, Groq, Mistral, Fireworks, DeepInfra, NVIDIA NIM, Ollama, LiteLLM, and more.

For anything else, there’s a REST connector:

augustus scan rest.Rest 
  --probe dan.Dan 
  --config '{
    "uri": "https://your-api.example.com/v1/chat/completions",
    "headers": {"Authorization": "Bearer YOUR_KEY"},
    "req_template_json_object": {
      "model": "your-model",
      "messages": [{"role": "user", "content": "$INPUT"}]
    },
    "response_json": true,
    "response_json_field": "$.choices[0].message.content"
  }'

Custom request templates with $INPUT placeholders, JSONPath response extraction, SSE streaming, and proxy routing. If your endpoint speaks HTTP, Augustus can test it.

Detection isn’t just pattern matching

On the detection side, Augustus has 90+ detectors. Pattern matching catches known jailbreak indicators. LLM-as-a-judge uses a second model to evaluate whether the response is harmful. HarmJudge (based on arXiv:2511.15304) provides semantic harm assessment aligned with the MLCommons AILuminate taxonomy. The Perspective API measures toxicity.

For iterative attacks like PAIR and TAP, a dedicated attack engine handles multi-turn conversations, candidate pruning, and judge-based scoring. These aren’t single-shot tests. They’re adaptive attacks that refine their approach across multiple attempts, mimicking how a real attacker would actually operate. They’re computationally expensive (many LLM calls per test) but they represent the current state of the art in automated red-teaming.

What I’ve learned from building this

A few things became clear over the course of building Augustus and running it against production systems:

Safety training is not security. I keep coming back to this because it’s the fundamental misconception driving the gap. Safety training is a behavioral overlay. It teaches the model patterns for refusal. Security testing asks whether those patterns hold up under adversarial conditions. They almost never do, at least not comprehensively.

Encoding bypasses are embarrassingly effective. The fact that wrapping a harmful request in Base64 still works against many production deployments in 2026 is wild. Input filters and the model itself are operating on different representations of the same input, and that gap is exploitable.

Low-resource languages are an underappreciated attack vector. Safety training is concentrated on English. The drop-off in refusal quality for low-resource languages is significant and consistent.

Agent-level attacks are going to be the next big thing. As models gain tool access, every tool becomes part of the attack surface. A model with browsing access can be manipulated by adversarial web content. A model with database access can be tricked into exfiltrating data. A model that processes documents can follow latent instructions embedded in those documents. We’re in the very early innings of understanding this attack surface.

The tooling gap is real and it’s getting wider. Organizations are deploying LLMs faster than they’re testing them. The models ship fast. The security testing doesn’t happen at all. Something has to close that gap, and it needs to be accessible enough that it doesn’t require a specialized AI red team to run.

Get it

Augustus is Apache 2.0 licensed and available now.

Repo: https://github.com/praetorian-inc/augustus

go install github.com/praetorian-inc/augustus/cmd/augustus@latest

augustus scan ollama.OllamaChat 
  --all 
  --config '{"model":"llama3.2:3b"}'

It’s the second tool in a 12-tool open-source series I’m releasing over 12 weeks. One tool per week, each doing one thing well. The first was Julius, which handles LLM fingerprinting (identifying what model is running behind an endpoint). The rest of the series will continue building out the offensive security toolkit for AI systems.

If you run it against your models and find something interesting, I’d like to hear about it. And if you want to contribute probes for attack vectors we haven’t covered yet, the repo has a CONTRIBUTING.md that explains the probe definition format and development workflow.

The models are shipping. The testing needs to catch up.

Sign Up For Daily Newsletter

Be keep up! Get the latest breaking news delivered straight to your inbox.
By signing up, you agree to our Terms of Use and acknowledge the data practices in our Privacy Policy. You may unsubscribe at any time.
Share This Article
Facebook Twitter Email Print
Share
What do you think?
Love0
Sad0
Happy0
Sleepy0
Angry0
Dead0
Wink0
Previous Article Save money now and later with this Maxfree Rechargeable Batteries deal Save money now and later with this Maxfree Rechargeable Batteries deal
Next Article Burgum: Sanders data center moratorium would be like waving 'surrender flag' to China Burgum: Sanders data center moratorium would be like waving 'surrender flag' to China
Leave a comment

Leave a Reply Cancel reply

Your email address will not be published. Required fields are marked *

Stay Connected

248.1k Like
69.1k Follow
134k Pin
54.3k Follow

Latest News

AI Doesn’t Need Robots. It Needs Rentable Humans | HackerNoon
AI Doesn’t Need Robots. It Needs Rentable Humans | HackerNoon
Computing
PayPal might not be looking to sell itself, report |  News
PayPal might not be looking to sell itself, report | News
News
Get the new Samsung Galaxy S26+ for free at Verizon: Preorder details
Get the new Samsung Galaxy S26+ for free at Verizon: Preorder details
News
Putin’s plan: Make Ukraine unlivable by destroying essential infrastructure
Putin’s plan: Make Ukraine unlivable by destroying essential infrastructure
News

You Might also Like

AI Doesn’t Need Robots. It Needs Rentable Humans | HackerNoon
Computing

AI Doesn’t Need Robots. It Needs Rentable Humans | HackerNoon

6 Min Read

Seattle fintech startup Confido Legal raises fresh cash

1 Min Read
Add Users to Google Search Console, Pick Permissions (2026)
Computing

Add Users to Google Search Console, Pick Permissions (2026)

12 Min Read
The Identity of Things: Architecting Machine-First Security in the Sky Computing Era | HackerNoon
Computing

The Identity of Things: Architecting Machine-First Security in the Sky Computing Era | HackerNoon

7 Min Read
//

World of Software is your one-stop website for the latest tech news and updates, follow us now to get the news that matters to you.

Quick Link

  • Privacy Policy
  • Terms of use
  • Advertise
  • Contact

Topics

  • Computing
  • Software
  • Press Release
  • Trending

Sign Up for Our Newsletter

Subscribe to our newsletter to get our newest articles instantly!

World of SoftwareWorld of Software
Follow US
Copyright © All Rights Reserved. World of Software.
Welcome Back!

Sign in to your account

Lost your password?