Authors:
(1) Anthi Papadopoulou, Language Technology Group, University of Oslo, Gaustadalleen 23B, 0373 Oslo, Norway and Corresponding author ([email protected]);
(2) Pierre Lison, Norwegian Computing Center, Gaustadalleen 23A, 0373 Oslo, Norway;
(3) Mark Anderson, Norwegian Computing Center, Gaustadalleen 23A, 0373 Oslo, Norway;
(4) Lilja Øvrelid, Language Technology Group, University of Oslo, Gaustadalleen 23B, 0373 Oslo, Norway;
(5) Ildiko Pilan, Language Technology Group, University of Oslo, Gaustadalleen 23B, 0373 Oslo, Norway.
Table of Links
Abstract and 1 Introduction
2 Background
2.1 Definitions
2.2 NLP Approaches
2.3 Privacy-Preserving Data Publishing
2.4 Differential Privacy
3 Datasets and 3.1 Text Anonymization Benchmark (TAB)
3.2 Wikipedia Biographies
4 Privacy-oriented Entity Recognizer
4.1 Wikidata Properties
4.2 Silver Corpus and Model Fine-tuning
4.3 Evaluation
4.4 Label Disagreement
4.5 MISC Semantic Type
5 Privacy Risk Indicators
5.1 LLM Probabilities
5.2 Span Classification
5.3 Perturbations
5.4 Sequence Labelling and 5.5 Web Search
6 Analysis of Privacy Risk Indicators and 6.1 Evaluation Metrics
6.2 Experimental Results and 6.3 Discussion
6.4 Combination of Risk Indicators
7 Conclusions and Future Work
Declarations
References
Appendices
A. Human properties from Wikidata
B. Training parameters of entity recognizer
C. Label Agreement
D. LLM probabilities: base models
E. Training size and performance
F. Perturbation thresholds
2.3 Privacy-Preserving Data Publishing
PPDP approaches to text sanitization rely on a privacy model specifying formal conditions that must be fulfilled to ensure the data can be shared without harm to the privacy of the individuals it records. The most prominent privacy model is k-anonymity (Samarati and Sweeney, 1998), which requires that an individual/entity be indistinguishable from at least k−1 other individuals/entities. This model was subsequently adapted to text data by approaches such as k-safety (Chakaravarthy et al., 2008) and k-confusability (Cumby and Ghani, 2011).
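For illustration, a minimal check of k-anonymity over structured records might look as follows; the record layout and attribute names are hypothetical, not taken from the cited works:

```python
from collections import Counter

def satisfies_k_anonymity(records, quasi_identifiers, k):
    """Check that every combination of quasi-identifier values
    is shared by at least k records in the dataset."""
    combos = Counter(
        tuple(record[attr] for attr in quasi_identifiers)
        for record in records
    )
    return all(count >= k for count in combos.values())

# Hypothetical example: with k=2, each (age range, city) combination
# must appear in at least two records.
records = [
    {"age": "30-40", "city": "Oslo",   "diagnosis": "A"},
    {"age": "30-40", "city": "Oslo",   "diagnosis": "B"},
    {"age": "40-50", "city": "Bergen", "diagnosis": "C"},
]
print(satisfies_k_anonymity(records, ["age", "city"], k=2))  # False
```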
t-plausibility (Anandan et al., 2012) follows a similar approach: it takes already-detected personal information and generalizes it sufficiently to ensure that at least t documents could plausibly map to the edited text. Sanchez and Batet (2016) presented C-sanitized, an information-theoretic approach that computes the point-wise mutual information (using co-occurrence counts from web data) between the person or entity to protect and the terms of the document; terms whose mutual information exceeds a given threshold are then masked.
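The core computation behind such mutual-information-based masking can be sketched as follows, assuming the relevant (co-)occurrence counts have already been collected (e.g. from web hit counts). The function names and data layout are illustrative, not the authors' implementation:

```python
import math

def pmi(count_entity_term, count_entity, count_term, total):
    """Point-wise mutual information between the protected entity and a
    term, estimated from (web-derived) occurrence counts."""
    p_joint = count_entity_term / total
    p_entity = count_entity / total
    p_term = count_term / total
    return math.log(p_joint / (p_entity * p_term))

def mask_terms(terms, counts, count_entity, total, threshold):
    """Mask terms whose PMI with the entity exceeds the threshold.
    `counts` maps each term to (co-occurrence count, term count);
    this data layout is a hypothetical stand-in."""
    sanitized = []
    for term in terms:
        co, ct = counts[term]
        score = pmi(co, count_entity, ct, total) if co > 0 else float("-inf")
        sanitized.append("***" if score > threshold else term)
    return sanitized
```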
Papadopoulou et al. (2022) also employed k-anonymity in combination with NLP-based approaches: given an assumption about the attacker's background knowledge, they search for the optimal set of masking decisions that ensures k-anonymity.
Finally, Manzanares-Salor et al. (2022) proposed an approach to evaluating disclosure risk that relies on training a text classifier to assess how difficult it is to infer the identity of the individual in question from the sanitized text.
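The general recipe can be sketched with off-the-shelf tooling: train a classifier to predict the protected individual from sanitized texts, and read the disclosure risk off its confidence. The pipeline below is a hypothetical illustration, not the authors' implementation:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Illustrative data: sanitized texts paired with the identity they describe.
texts = ["*** was born in *** and studied law",
         "*** played for *** until 2010"]
identities = ["person_A", "person_B"]

clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
clf.fit(texts, identities)

# The higher the classifier's confidence in the true identity, the higher
# the estimated disclosure risk of the sanitized document.
risk = clf.predict_proba(["*** studied law in ***"]).max()
```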
2.4 Differential Privacy
Differential privacy (DP) is a framework for ensuring the privacy of individuals in datasets (Dwork et al., 2006). It essentially operates by producing randomized responses to queries: the level of artificial noise introduced in each response is calibrated to guarantee that the amount of information that can be learned about any single individual remains below a given threshold.
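A textbook instance is the Laplace mechanism, which perturbs a numeric query answer with noise scaled to the query's sensitivity and the privacy budget ε. The sketch below is generic and not tied to any system discussed here:

```python
import numpy as np

def laplace_mechanism(true_answer, sensitivity, epsilon, rng=None):
    """Return an epsilon-differentially-private answer to a numeric query.
    The noise scale sensitivity/epsilon is the standard calibration for
    the Laplace mechanism (Dwork et al., 2006)."""
    rng = rng or np.random.default_rng()
    return true_answer + rng.laplace(loc=0.0, scale=sensitivity / epsilon)

# A counting query ("how many records satisfy X?") has sensitivity 1:
# adding or removing one individual changes the count by at most 1.
private_count = laplace_mechanism(true_answer=42, sensitivity=1.0, epsilon=0.5)
```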
Fernandes et al. (2019) applied DP to text data in combination with machine learning techniques, adding noise to the model's word embeddings. Their work focused on removing stylistic cues from the text, so that its author could not be identified from it. Feyisetan et al. (2019) also applied noise to word embeddings, in a setting where the geolocation data of an individual is to be protected.
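The general recipe shared by these embedding-level approaches can be sketched as follows: perturb a word's embedding with noise, then map the result back to the nearest word in the vocabulary. The actual mechanisms calibrate the noise distribution more carefully than this simplified illustration, and all names below are placeholders:

```python
import numpy as np

def privatize_word(word, vocab, embeddings, epsilon, rng=None):
    """Replace a word by adding noise to its embedding and projecting
    back to the nearest vocabulary item. A simplified sketch of the
    recipe in Feyisetan et al. (2019); their mechanism uses a
    differently calibrated noise distribution."""
    rng = rng or np.random.default_rng()
    vec = embeddings[vocab.index(word)]
    # Noise magnitude shrinks as the privacy budget epsilon grows.
    noisy = vec + rng.normal(scale=1.0 / epsilon, size=vec.shape)
    # Project back to the vocabulary via nearest neighbour.
    distances = np.linalg.norm(embeddings - noisy, axis=1)
    return vocab[int(np.argmin(distances))]
```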
More recently, Sasada et al. (2021) addressed the utility loss caused by the noise DP requires: by first creating duplicates of the text and only then adding noise, they reduce the amount of noise needed. Krishna et al. (2021) tackled the same issue with an auto-encoder-based algorithm that transforms text without losing data utility. Finally, Igamberdiev and Habernal (2023) introduced DP-BART, a DP rewriting system based on a pre-trained BART model that seeks to reduce the amount of artificial noise needed to reach a given privacy guarantee.
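The recipe these auto-encoder systems share can be summarized schematically: encode the text into a latent vector, clip its norm to bound sensitivity, add calibrated noise, and decode. The sketch below uses placeholder encode/decode functions and an indicative noise scale; it is not the DP-BART implementation, and the correct noise calibration depends on the norm used for clipping:

```python
import numpy as np

def dp_rewrite(text, encode, decode, clip_norm, epsilon, rng=None):
    """Schematic DP text rewriting. `encode` and `decode` stand in for a
    pre-trained auto-encoder (e.g. a BART-style model); clipping bounds
    the sensitivity of the latent representation before noise is added."""
    rng = rng or np.random.default_rng()
    z = encode(text)                                  # latent vector
    z = z * min(1.0, clip_norm / np.linalg.norm(z))   # bound sensitivity
    z = z + rng.laplace(scale=2 * clip_norm / epsilon, size=z.shape)
    return decode(z)                                  # privatized rewrite
```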
DP-oriented approaches generally lead to complete transformations of the text, at least for reasonable values of the privacy threshold. They are therefore well suited to generating synthetic texts, in particular as training data for machine learning models. However, they are difficult to apply to text sanitization, which is expected to retain the core content of the text and only edit out the personal identifiers. This is particularly the case for court judgments and medical records, where sanitization should not alter the wording and semantic content conveyed in the text.