By using this site, you agree to the Privacy Policy and Terms of Use.
Accept
World of SoftwareWorld of SoftwareWorld of Software
  • News
  • Software
  • Mobile
  • Computing
  • Gaming
  • Videos
  • More
    • Gadget
    • Web Stories
    • Trending
    • Press Release
Search
  • Privacy
  • Terms
  • Advertise
  • Contact
Copyright © All Rights Reserved. World of Software.
Reading: Empirical Study: Evaluating Typographic Attack Effectiveness Against Vision-LLMs in AD Systems | HackerNoon
Share
Sign In
Notification Show More
Font ResizerAa
World of SoftwareWorld of Software
Font ResizerAa
  • Software
  • Mobile
  • Computing
  • Gadget
  • Gaming
  • Videos
Search
  • News
  • Software
  • Mobile
  • Computing
  • Gaming
  • Videos
  • More
    • Gadget
    • Web Stories
    • Trending
    • Press Release
Have an existing account? Sign In
Follow US
  • Privacy
  • Terms
  • Advertise
  • Contact
Copyright © All Rights Reserved. World of Software.
World of Software > Computing > Empirical Study: Evaluating Typographic Attack Effectiveness Against Vision-LLMs in AD Systems | HackerNoon
Computing

Empirical Study: Evaluating Typographic Attack Effectiveness Against Vision-LLMs in AD Systems | HackerNoon

News Room
Last updated: 2025/10/02 at 4:05 AM
News Room Published 2 October 2025
Share
SHARE

Table of Links

Abstract and 1. Introduction

  1. Related Work

    2.1 Vision-LLMs

    2.2 Transferable Adversarial Attacks

  2. Preliminaries

    3.1 Revisiting Auto-Regressive Vision-LLMs

    3.2 Typographic Attacks in Vision-LLMs-based AD Systems

  3. Methodology

    4.1 Auto-Generation of Typographic Attack

    4.2 Augmentations of Typographic Attack

    4.3 Realizations of Typographic Attacks

  4. Experiments

  5. Conclusion and References

5 Experiments

5.1 Experimental Setup

We perform experiments with Vision-LLMs on VQA datasets for AD, such as LingoQA [7] and the dataset of CVPRW’2024 Challenge [1] by CARLA simulator. We have used LLaVa [2] to output the attack prompts for LingoQA and the CVPRW’2024 dataset, and manually for some cases of the latter. Regarding LingoQA, we tested 1000 QAs in real traffic scenarios in tasks, such as scene reasoning and action reasoning. Regarding the CVPRW’2024 Challenge dataset, we tested more than 300 QAs on 100 images, each with at least three questions related to scene reasoning (e.g., target counting) and scene object reasoning of 5 classes (cars, persons, motorcycles, traffic lights and road signals). Our evaluation metrics are based on exact matches, Lingo-Judge Accuracy [7], and BLEURT [41], BERTScore [42] against non-attacked answers, with SSIM (Structural Similarity Index) to quantify the similarity between original and attacked images. In terms of models, we qualitatively and/or quantitatively tested with LLaVa [2], VILA [1], Qwen-VL [17], and Imp [18]. The models were run on an NVIDIA A40 GPU with approximately 45GiB of memory.

5.1.1 Attacks on Scene/Action Reasoning

As shown in Tab. 2, Fig. 4, and Fig. 5, our framework of attack can effectively misdirect various models’ reasoning. For example, Tab. 2 showcases an ablation study on the effectiveness of automatic attack strategies across two datasets: LingoQA and CVPRW’24 (focused solely on counting). The former two metrics (i.e. Exact and Lingo-Judge) are used to evaluate semantic correctness better, showing that short answers like the counting task can be easily misled, but longer, more complex

Table 3: Ablation of attack effectiveness on CVPRW’24 dataset’s counting subtask. Lower scores mean more effective attacks, with (single) denoting single question attack, (composed) for multi-task attack, and (+a) means augmented with directives.

Table 4: Ablation of both image-level (counting) and patch-level (target recognition) attack strategy effectiveness on CVPRW’24 dataset. Lower scores mean more effective attacks, with (naive patch) denoting typographic attacks directly on a specific target, (composed) denoting multi-task attacks on both the specific target and at the image level, and (+a) means augmented with directives.

answers in LingoQA may be more difficult to change. For example, the Qwen-VL attack scores 0.3191 under the Exact metric for LingoQA, indicating relative effectiveness compared to other scores in the same metric in counting. On the other hand, we see that the latter two scores (i.e. BLEURT and BERTScore) are typically high, hinting that our attack can mislead semantic reasoning, but even the wrong answers may still align with humans decently.

In terms of scene reasoning, we show in Tab. 3, Tab. 4, and Fig. 4 the effectiveness of our proposed attack against a number of cases. For example, in Fig. 4, a Vision-LLM can somewhat accurately answer queries about a clean image, but a typographic attacked input can make it fail, such as to accurately count people and vehicles, and we show that an augmented typographic attacked input can even attack stronger models (e.g. GPT4 [43]). In Fig. 5, we also show that scene reasoning can be misdirected where irrelevant details are focused on and hallucinate under typographic attacks. Our work also suggests that scene object reasoning / grounded object reasoning is typically more robust, as both object-level and image-level attacks may be needed to change the models’ answers.

In terms of action reasoning, we show in Fig. 5 that Vision-LLMs can recommend terribly bad advice, suggesting unsafe driving practices. Nevertheless, we see a promising point when Qwen-VL recommended fatal advice, but it reconsidered over the reasoning process of acknowledging the potential dangers of the initial bad suggestion. These examples demonstrate the vulnerabilities in automated reasoning processes under deceptive or manipulated conditions, but they also suggest that defensive learning can be applied to enhance model reasoning.

5.1.2 Compositions and Augmentations of Attacks

We showed that composing multiple QA tasks for an attack is possible for a particular scenario, thereby suggesting that typographic attacks are not single-task attacks, as suggested by previous works. Furthermore, we found that augmentations of attacks are possible, which would imply that typographic attacks that leverage the inherent language modeling process can misdirect the reasoning of Vision-LLMs, as especially shown in the case of the strong GPT-4. However, as shown in Tab. 5, it may be challenging to search for the best augmentation keywords.

Table 5: Ablation study of our composition keywords, attack location on an image and their overall effectiveness by the metric defined in the CVPRW’24 Challenge[2].

Figure 5: Example attacks on the LingoQA dataset against Qwen-VL-7B.

5.1.3 Towards Physical Typographic Attacks

In our toy experiments with semi-realistic attacks in Fig.5, we show that attacks involve manipulating text within real-world settings are potentially dangerous due to their ease of implementation, such as on signs, behind vehicles, on buildings, billboards, or any everyday object that an AD system might perceive and interpret to make decisions. For instance, modifying the text on a road sign from “stop” to “go faster” can pose potentially dangerous consequences on AD systems that utilize Vision-LLMs.

:::info
Authors:

(1) Nhat Chung, CFAR and IHPC, A*STAR, Singapore and VNU-HCM, Vietnam;

(2) Sensen Gao, CFAR and IHPC, A*STAR, Singapore and Nankai University, China;

(3) Tuan-Anh Vu, CFAR and IHPC, A*STAR, Singapore and HKUST, HKSAR;

(4) Jie Zhang, Nanyang Technological University, Singapore;

(5) Aishan Liu, Beihang University, China;

(6) Yun Lin, Shanghai Jiao Tong University, China;

(7) Jin Song Dong, National University of Singapore, Singapore;

(8) Qing Guo, CFAR and IHPC, A*STAR, Singapore and National University of Singapore, Singapore.

:::


:::info
This paper is available on arxiv under CC BY 4.0 DEED license.

:::

[1] https://cvpr24-advml.github.io

Sign Up For Daily Newsletter

Be keep up! Get the latest breaking news delivered straight to your inbox.
By signing up, you agree to our Terms of Use and acknowledge the data practices in our Privacy Policy. You may unsubscribe at any time.
Share This Article
Facebook Twitter Email Print
Share
What do you think?
Love0
Sad0
Happy0
Sleepy0
Angry0
Dead0
Wink0
Previous Article E-commerce platform eBay offers free ChatGPT training and tools | Computer Weekly
Next Article Oura just dropped a ceramic Oura Ring 4 in new colors, alongside a new charging case
Leave a comment

Leave a Reply Cancel reply

Your email address will not be published. Required fields are marked *

Stay Connected

248.1k Like
69.1k Follow
134k Pin
54.3k Follow

Latest News

Never mind Google family groups, One UI 8.5 could bring Samsung family groups
News
Windows File Explorer hasn’t changed in years so I use this instead
Computing
Peloton is raising prices and launching new hardware
Gadget
‘Just not worth it’ rage Xbox fans over huge 50% price hike as they vow to quit
News

You Might also Like

Computing

Windows File Explorer hasn’t changed in years so I use this instead

9 Min Read
Computing

Lisk eyes organically-growing Web3 startups with $15m fund

17 Min Read
Computing

Marketing or Manipulation? When Social Hype Moves the Solana Market

15 Min Read
Computing

AIOZ Stream Delivers Peer-to-Peer On-Demand Video Powered by DePIN | HackerNoon

5 Min Read
//

World of Software is your one-stop website for the latest tech news and updates, follow us now to get the news that matters to you.

Quick Link

  • Privacy Policy
  • Terms of use
  • Advertise
  • Contact

Topics

  • Computing
  • Software
  • Press Release
  • Trending

Sign Up for Our Newsletter

Subscribe to our newsletter to get our newest articles instantly!

World of SoftwareWorld of Software
Follow US
Copyright © All Rights Reserved. World of Software.
Welcome Back!

Sign in to your account

Lost your password?