By using this site, you agree to the Privacy Policy and Terms of Use.
Accept
World of SoftwareWorld of SoftwareWorld of Software
  • News
  • Software
  • Mobile
  • Computing
  • Gaming
  • Videos
  • More
    • Gadget
    • Web Stories
    • Trending
    • Press Release
Search
  • Privacy
  • Terms
  • Advertise
  • Contact
Copyright © All Rights Reserved. World of Software.
Reading: Typographic Attacks on Vision-LLMs: Evaluating Adversarial Threats in Autonomous Driving Systems | HackerNoon
Share
Sign In
Notification Show More
Font ResizerAa
World of SoftwareWorld of Software
Font ResizerAa
  • Software
  • Mobile
  • Computing
  • Gadget
  • Gaming
  • Videos
Search
  • News
  • Software
  • Mobile
  • Computing
  • Gaming
  • Videos
  • More
    • Gadget
    • Web Stories
    • Trending
    • Press Release
Have an existing account? Sign In
Follow US
  • Privacy
  • Terms
  • Advertise
  • Contact
Copyright © All Rights Reserved. World of Software.
World of Software > Computing > Typographic Attacks on Vision-LLMs: Evaluating Adversarial Threats in Autonomous Driving Systems | HackerNoon
Computing

Typographic Attacks on Vision-LLMs: Evaluating Adversarial Threats in Autonomous Driving Systems | HackerNoon

News Room
Last updated: 2025/09/27 at 11:23 AM
News Room Published 27 September 2025
Share
SHARE

Table of Links

Abstract and 1. Introduction

  1. Related Work

    2.1 Vision-LLMs

    2.2 Transferable Adversarial Attacks

  2. Preliminaries

    3.1 Revisiting Auto-Regressive Vision-LLMs

    3.2 Typographic Attacks in Vision-LLMs-based AD Systems

  3. Methodology

    4.1 Auto-Generation of Typographic Attack

    4.2 Augmentations of Typographic Attack

    4.3 Realizations of Typographic Attacks

  4. Experiments

  5. Conclusion and References

Abstract

Vision-Large-Language-Models (Vision-LLMs) are increasingly being integrated into autonomous driving (AD) systems due to their advanced visual-language reasoning capabilities, targeting the perception, prediction, planning, and control mechanisms. However, Vision-LLMs have demonstrated susceptibilities against various types of adversarial attacks, which would compromise their reliability and safety. To further explore the risk in AD systems and the transferability of practical threats, we propose to leverage typographic attacks against AD systems relying on the decision-making capabilities of Vision-LLMs. Different from the few existing works developing general datasets of typographic attacks, this paper focuses on realistic traffic scenarios where these attacks can be deployed, on their potential effects on the decision-making autonomy, and on the practical ways in which these attacks can be physically presented. To achieve the above goals, we first propose a dataset-agnostic framework for automatically generating false answers that can mislead Vision-LLMs’ reasoning. Then, we present a linguistic augmentation scheme that facilitates attacks at image-level and region-level reasoning, and we extend it with attack patterns against multiple reasoning tasks simultaneously. Based on these, we conduct a study on how these attacks can be realized in physical traffic scenarios. Through our empirical study, we evaluate the effectiveness, transferability, and realizability of typographic attacks in traffic scenes. Our findings demonstrate particular harmfulness of the typographic attacks against existing Vision-LLMs (e.g., LLaVA, Qwen-VL, VILA, and Imp), thereby raising community awareness of vulnerabilities when incorporating such models into AD systems. We will release our source code upon acceptance.

1 Introduction

Vision-Language Large Models (Vision-LLMs) have seen rapid development over the recent years [1, 2, 3], and their incorporation into autonomous driving (AD) systems have been seriously considered by both industry and academia [4, 5, 6, 7, 8, 9]. The integration of Vision-LLMs into AD systems showcases their ability to convey explicit reasoning steps to road users on the fly and satisfy the need for textual justifications of traffic scenarios regarding perception, prediction, planning, and control, particularly in safety-critical circumstances in the physical world. The core strength of VisionLLMs lies in their auto-regressive capabilities through large-scale pretraining with visual-language alignment [1], making them even able to perform zero-shot optical character recognition, grounded reasoning, visual-question answering, visual-language reasoning, etc. Nevertheless, despite their impressive capabilities, Vision-LLMs are unfortunately not impervious against adversarial attacks that can misdirect the reasoning processes [10]. Any successful attack strategies have the potential to pose critical problems when deploying Vision-LLMs in AD systems, especially those that may even bypass the models’ black-box characteristics. As a step towards their reliable adoption in AD, studying the transferability of adversarial attacks is crucial to raising awareness of practical threats against deployed Vision-LLMs, and to efforts in building appropriate defense strategies for them.

In this work, we revisit the shared auto-regressive characteristic of different Vision-LLMs and intuitively turn that strength into a weakness by leveraging typographic forms of adversarial attacks, also known as typographic attacks. Typographic attacks were first studied in the context of the well-known Contrastive Language-Image Pre-training (CLIP) model [11, 12]. Early works in this area focused on developing a general typographic attack dataset targeting multiple-choice answering (such as object recognition, visual attribute detection, and commonsense answering) and enumeration [13]. Researchers also explored multiple-choice self-generating attacks against zero-shot classification [14], and proposed several defense mechanisms, including keyword-training [15] and prompting the model for detailed reasoning [16]. Despite these initial efforts, the methodologies have neither seen a comprehensive attack framework nor been explicitly designed to investigate the impact of typographic attacks on safety-critical systems, particularly those in AD scenarios.

Our work aims to fill this research gap by studying typographic attacks from the perspective of AD systems that incorporate Vision-LLMs. In summary, our scientific contributions are threefold:

• Dataset-Independent Framework: we introduce a dataset-independent framework designed to automatically generate misleading answers that can disrupt the reasoning processes of Vision-Large Language Models (Vision-LLMs).

• Linguistic Augmentation Schemes: we develop a linguistic augmentation scheme aimed at facilitating stronger typographic attacks on Vision-LLMs. This scheme targets reasoning at both the image and region levels and is expandable to multiple reasoning tasks simultaneously.

• Empirical Study in Semi-Realistic Scenarios: we conduct a study to explore the possible implementations of these attacks in real-world traffic scenarios.

Through our empirical study of typographic attacks in traffic scenes, we hope to raise community awareness of critical typographic vulnerabilities when incorporating such models into AD systems.

:::info
This paper is available on arxiv under CC BY 4.0 DEED license.

:::

:::info
Authors:

(1) Nhat Chung, CFAR and IHPC, A*STAR, Singapore and VNU-HCM, Vietnam;

(2) Sensen Gao, CFAR and IHPC, A*STAR, Singapore and Nankai University, China;

(3) Tuan-Anh Vu, CFAR and IHPC, A*STAR, Singapore and HKUST, HKSAR;

(4) Jie Zhang, Nanyang Technological University, Singapore;

(5) Aishan Liu, Beihang University, China;

(6) Yun Lin, Shanghai Jiao Tong University, China;

(7) Jin Song Dong, National University of Singapore, Singapore;

(8) Qing Guo, CFAR and IHPC, A*STAR, Singapore and National University of Singapore, Singapore.

:::

Sign Up For Daily Newsletter

Be keep up! Get the latest breaking news delivered straight to your inbox.
By signing up, you agree to our Terms of Use and acknowledge the data practices in our Privacy Policy. You may unsubscribe at any time.
Share This Article
Facebook Twitter Email Print
Share
What do you think?
Love0
Sad0
Happy0
Sleepy0
Angry0
Dead0
Wink0
Previous Article An App Used to Dox Charlie Kirk Critics Doxed Its Own Users Instead
Next Article Anthros Chair Review: A Comfortable Seat That Improves Posture and Relieves Pain
Leave a comment

Leave a Reply Cancel reply

Your email address will not be published. Required fields are marked *

Stay Connected

248.1k Like
69.1k Follow
134k Pin
54.3k Follow

Latest News

I didn’t think GoodNotes could get better but it just did
Computing
How to Use pandas DataFrames in Python to Analyze and Manipulate Data
News
Sony’s WH-CH520 are the best headphones under $50, period
News
How to Run Your Own Instagram Audit (+ Free Checklist) |
Computing

You Might also Like

Computing

I didn’t think GoodNotes could get better but it just did

8 Min Read
Computing

How to Run Your Own Instagram Audit (+ Free Checklist) |

8 Min Read
Computing

How to Run an Effective Level 10 Meeting |

19 Min Read
Computing

Implementation Details of Tree-Diffusion: Architecture and Training for Inverse Graphics | HackerNoon

3 Min Read
//

World of Software is your one-stop website for the latest tech news and updates, follow us now to get the news that matters to you.

Quick Link

  • Privacy Policy
  • Terms of use
  • Advertise
  • Contact

Topics

  • Computing
  • Software
  • Press Release
  • Trending

Sign Up for Our Newsletter

Subscribe to our newsletter to get our newest articles instantly!

World of SoftwareWorld of Software
Follow US
Copyright © All Rights Reserved. World of Software.
Welcome Back!

Sign in to your account

Lost your password?