What Is Text Sanitization? Definitions, Privacy Laws, and NLP Approaches | HackerNoon

News Room
Last updated: 2025/04/28 at 9:29 PM
News Room Published 28 April 2025

Authors:

(1) Anthi Papadopoulou, Language Technology Group, University of Oslo, Gaustadalleen 23B, 0373 Oslo, Norway (corresponding author);

(2) Pierre Lison, Norwegian Computing Center, Gaustadalleen 23A, 0373 Oslo, Norway;

(3) Mark Anderson, Norwegian Computing Center, Gaustadalleen 23A, 0373 Oslo, Norway;

(4) Lilja Øvrelid, Language Technology Group, University of Oslo, Gaustadalleen 23B, 0373 Oslo, Norway;

(5) Ildiko Pilan, Language Technology Group, University of Oslo, Gaustadalleen 23B, 0373 Oslo, Norway.

Table of Links

Abstract and 1 Introduction

2 Background

2.1 Definitions

2.2 NLP Approaches

2.3 Privacy-Preserving Data Publishing

2.4 Differential Privacy

3 Datasets and 3.1 Text Anonymization Benchmark (TAB)

3.2 Wikipedia Biographies

4 Privacy-oriented Entity Recognizer

4.1 Wikidata Properties

4.2 Silver Corpus and Model Fine-tuning

4.3 Evaluation

4.4 Label Disagreement

4.5 MISC Semantic Type

5 Privacy Risk Indicators

5.1 LLM Probabilities

5.2 Span Classification

5.3 Perturbations

5.4 Sequence Labelling and 5.5 Web Search

6 Analysis of Privacy Risk Indicators and 6.1 Evaluation Metrics

6.2 Experimental Results and 6.3 Discussion

6.4 Combination of Risk Indicators

7 Conclusions and Future Work

Declarations

References

Appendices

A. Human properties from Wikidata

B. Training parameters of entity recognizer

C. Label Agreement

D. LLM probabilities: base models

E. Training size and performance

F. Perturbation thresholds

2.1 Definitions

The right to privacy is a fundamental human right, as evidenced by its inclusion in the Universal Declaration of Human Rights and the European Convention on Human Rights. In the digital sphere, data privacy is enforced through multiple national and international regulations, such as the General Data Protection Regulation (GDPR) in Europe, the California Consumer Privacy Act (CCPA) in the United States or China’s Personal Information Protection Law (PIPL). Although those regulations differ in both scope and interpretation, their common principle is that individuals should remain in control of their own data. In particular, the processing of personal data must have a legal ground, and the data cannot be shared with third parties without the explicit and informed consent of the person(s) the data refers to.

One alternative strategy is to anonymize data to ensure the data is no longer personal, and therefore out of the scope of privacy regulations. Anonymization, according to the GDPR, refers to the complete and irrevocable removal of all information that may directly or indirectly lead to re-identification. However, as shown by Weitzenboeck et al. (2022), transforming the data to make it completely anonymous is almost impossible to achieve in practice for unstructured data such as text, unless the content of the text is radically altered, or the original source of the document is deleted.

Although complete anonymization is hard to attain, text sanitization is a crucial tool for complying with the general requirement of data minimization, which is enshrined in GDPR and most privacy regulations (Goldsteen et al., 2021). The principle of data minimization states that one should only collect and retain the personal data that is strictly necessary to fulfill a given purpose.

The process of editing text documents to conceal the identity of a person has a somewhat confusing terminology (Lison et al., 2021; Pilán et al., 2022). The GDPR makes use of the term pseudonymization to denote a process of transforming data to conceal at least some personal identifiers, but in a way that does not amount to complete anonymization. The term de-identification is also common (Chevrier et al., 2019; Johnson et al., 2020), especially for work on medical patient records. De-identification approaches are typically restricted to the recognition of predefined entities, such as the categories of HIPAA (2004). In contrast, we define text sanitization as the process of detecting and masking any type of personal information in a text document that can lead to identification of the individual whose identity we wish to protect.
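To make the "detecting and masking" definition concrete, the following is a minimal sketch of the masking step only, assuming some detector has already produced character spans. The `Span` type, the `mask_spans` function, the label names and the `[LABEL]` placeholder format are all illustrative assumptions and do not come from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    start: int  # inclusive character offset
    end: int    # exclusive character offset
    label: str  # hypothetical category name, e.g. "PERSON"


def mask_spans(text: str, spans: list[Span]) -> str:
    """Replace each detected span with a bracketed placeholder such as [PERSON]."""
    pieces = []
    cursor = 0
    for span in sorted(spans, key=lambda s: s.start):
        if span.start < cursor:
            raise ValueError("overlapping spans are not supported in this sketch")
        pieces.append(text[cursor:span.start])  # unchanged text before the span
        pieces.append(f"[{span.label}]")        # placeholder instead of the span
        cursor = span.end
    pieces.append(text[cursor:])                # unchanged tail of the document
    return "".join(pieces)


# Hand-written spans standing in for the output of a real detector.
text = "Jane Doe was born in Oslo in 1980."
spans = [Span(0, 8, "PERSON"), Span(21, 25, "LOC"), Span(29, 33, "DATE")]
masked = mask_spans(text, spans)
print(masked)
```

Note that the example masks a place and a year as well as the name: under the definition above, any span that can contribute to identification is a candidate for masking, not only entries from a fixed list of entity types.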

Text sanitization is a topic of investigation in several research fields, notably in natural language processing (NLP) and in privacy-preserving data publishing (PPDP). Approaches to text rewriting based on differential privacy have also been proposed. We review those approaches below.

2.2 NLP Approaches

NLP approaches to text sanitization have mainly focused on sequence labelling, inspired by the large body of work on Named Entity Recognition. Such approaches aim at the detection of text spans containing personal identifiers (Chiu and Nichols, 2016; Lample et al., 2016). Most research in this field to date has focused on the medical domain, where the Health Insurance Portability and Accountability Act of 1996 (HIPAA, 2004) offers concrete rules that allow for the standardization of this task. HIPAA defines a set of Protected Health Information (PHI) data types that encompass direct identifiers (such as names or social security numbers) as well as domain-specific demographic attributes including treatments received and health conditions. A wide variety of NLP methods have been developed for this task, including rule-based, machine learning-based and hybrid approaches (Sweeney, 1996; Neamatullah et al., 2008; Yang and Garibaldi, 2015; Yogarajan et al., 2018). Character-based recurrent neural networks (Dernoncourt et al., 2017; Liu et al., 2017) and transformer architectures have also been investigated for this purpose (Johnson et al., 2020). A recent initiative focused on replacing sensitive information is INCOGNITUS (Ribeiro et al., 2023), a clinical note de-identification tool. The system allows for redacting documents with either a NER-based method or an embedding-based approach that substitutes each token with a semantically related one. Recent large language models from the GPT family have also been explored. Liu et al. (2023) proposed DeID-GPT for masking PHI categories and showed that, with zero-shot in-context learning that explicitly incorporates HIPAA requirements in the prompts, GPT-4 outperformed fine-tuned transformer models on the same annotated medical texts.
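As a rough illustration of the sequence labelling formulation, the sketch below decodes per-token BIO tags into spans and replaces each span with a placeholder. The tags are hard-coded here; in practice they would come from a trained tagger. The function names and the `NAME` and `DATE` labels are assumptions made for this example and are not an official HIPAA tag set.

```python
def bio_to_spans(tags):
    """Group BIO tags into (label, start, end) token spans; end is exclusive."""
    spans = []
    start, label = None, None
    for i, tag in enumerate(tags):
        continues = tag.startswith("I-") and label == tag[2:]
        if continues:
            continue
        if label is not None:
            spans.append((label, start, i))  # close the currently open span
            start, label = None, None
        if tag.startswith(("B-", "I-")):
            # A stray I- tag is treated as the start of a new span.
            start, label = i, tag[2:]
    if label is not None:
        spans.append((label, start, len(tags)))
    return spans


def mask_tokens(tokens, spans):
    """Replace every span of tokens with a single [LABEL] placeholder token."""
    out, cursor = [], 0
    for label, start, end in spans:
        out.extend(tokens[cursor:start])
        out.append(f"[{label}]")
        cursor = end
    out.extend(tokens[cursor:])
    return out


tokens = ["Patient", "John", "Smith", "was", "admitted", "on", "12", "March", "."]
tags = ["O", "B-NAME", "I-NAME", "O", "O", "O", "B-DATE", "I-DATE", "O"]

found = bio_to_spans(tags)
masked_tokens = mask_tokens(tokens, found)
print(found)
print(" ".join(masked_tokens))
```

The decoding step is independent of how the tags were produced, so the same code would apply whether the tagger is rule-based, a recurrent network or a transformer.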

Text sanitization outside the medical domain includes approaches such as Juez-Hernandez et al. (2023), who propose AGORA, a document de-identification system combined with geoparsing (automatic location extraction from text) using LSTMs and CRFs and trained on Spanish law enforcement data. The authors focus on offering a complete pipeline and location information, while demographic attributes are not part of the information to de-identify. Yermilov et al. (2023) compared three systems for detecting and pseudonymizing PII: (1) a NER-based one relying on Wikidata; (2) a single-step sequence-to-sequence model trained on a parallel corpus; and (3) a large language model where named entities are first detected using a 1-shot prompt to GPT-3 and then pseudonymized with 1-shot prompts using ChatGPT (GPT-3.5). The authors find that the NER-based approach is best for preserving privacy, while LLMs best preserve utility for text classification and summarization tasks. Finally, Papadopoulou et al. (2022) present an approach to text sanitization, from detection of personal information to privacy risk estimation through the use of language model probabilities, web queries, and a classifier trained on manually labeled data. The present paper builds upon this work.
