Authors:
(1) Anthi Papadopoulou, Language Technology Group, University of Oslo, Gaustadalleen 23B, 0373 Oslo, Norway and Corresponding author ([email protected]);
(2) Pierre Lison, Norwegian Computing Center, Gaustadalleen 23A, 0373 Oslo, Norway;
(3) Mark Anderson, Norwegian Computing Center, Gaustadalleen 23A, 0373 Oslo, Norway;
(4) Lilja Øvrelid, Language Technology Group, University of Oslo, Gaustadalleen 23B, 0373 Oslo, Norway;
(5) Ildikó Pilán, Language Technology Group, University of Oslo, Gaustadalleen 23B, 0373 Oslo, Norway.
Table of Links
Abstract and 1 Introduction
2 Background
2.1 Definitions
2.2 NLP Approaches
2.3 Privacy-Preserving Data Publishing
2.4 Differential Privacy
3 Datasets and 3.1 Text Anonymization Benchmark (TAB)
3.2 Wikipedia Biographies
4 Privacy-oriented Entity Recognizer
4.1 Wikidata Properties
4.2 Silver Corpus and Model Fine-tuning
4.3 Evaluation
4.4 Label Disagreement
4.5 MISC Semantic Type
5 Privacy Risk Indicators
5.1 LLM Probabilities
5.2 Span Classification
5.3 Perturbations
5.4 Sequence Labelling and 5.5 Web Search
6 Analysis of Privacy Risk Indicators and 6.1 Evaluation Metrics
6.2 Experimental Results and 6.3 Discussion
6.4 Combination of Risk Indicators
7 Conclusions and Future Work
Declarations
References
Appendices
A. Human properties from Wikidata
B. Training parameters of entity recognizer
C. Label Agreement
D. LLM probabilities: base models
E. Training size and performance
F. Perturbation thresholds
Abstract
Text sanitization is the task of redacting a document to mask all occurrences of (direct or indirect) personal identifiers, with the goal of concealing the identity of the individual(s) referred to in it. In this paper, we consider a two-step approach to text sanitization and provide a detailed analysis of its empirical performance on two recently published datasets: the Text Anonymization Benchmark (Pilán et al., 2022) and a collection of Wikipedia biographies (Papadopoulou et al., 2022). The text sanitization process starts with a privacy-oriented entity recognizer that seeks to determine the text spans expressing identifiable personal information. This privacy-oriented entity recognizer is trained by combining a standard named entity recognition model with a gazetteer populated by person-related terms extracted from Wikidata. The second step of the text sanitization process consists of assessing the privacy risk associated with each detected text span, either in isolation or in combination with other text spans. We present five distinct indicators of the re-identification risk, respectively based on language model probabilities, text span classification, sequence labelling, perturbations, and web search. We provide a contrastive analysis of each privacy indicator and highlight their benefits and limitations, notably in relation to the available labelled data.
1 Introduction
The volume of text data available online is continuously increasing and constitutes an essential resource for the development of large language models (LLMs). This trend has important privacy implications. Most text documents do, indeed, contain personally identifiable information (PII) in one form or another – that is, information that can directly or indirectly lead to the identification of a human individual. PII can be split into two main categories (Elliot et al., 2016):
Direct identifiers, i.e. information that can directly and unequivocally identify an individual. This includes, for example, a person’s name, passport number, email address, phone number, username or home address.
Quasi-identifiers, also called indirect identifiers, which are not per se sufficient to single out an individual, but may do so when combined with other quasi-identifiers. Examples of quasi-identifiers include the person’s nationality, occupation, gender, place of work or date of birth. A well-known illustration of the privacy risks posed by quasi-identifiers comes from Golle (2006), who showed that between 63% and 78% of the US population could be uniquely identified by merely combining their gender, date of birth and zip code.
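The uniqueness argument behind Golle (2006) can be made concrete with a small sketch: count how many records in a dataset are singled out by a given combination of quasi-identifiers. The data and helper function below are purely illustrative, not part of the paper's method.

```python
from collections import Counter

def uniqueness_rate(records, quasi_ids):
    """Fraction of records made unique by a given combination of quasi-identifiers."""
    combos = Counter(tuple(r[q] for q in quasi_ids) for r in records)
    unique = sum(1 for r in records if combos[tuple(r[q] for q in quasi_ids)] == 1)
    return unique / len(records)

# Toy population: two people share gender and date of birth but differ in zip
# code, so adding the third quasi-identifier singles out every record.
people = [
    {"gender": "F", "dob": "1980-01-02", "zip": "0373"},
    {"gender": "F", "dob": "1980-01-02", "zip": "0374"},
    {"gender": "M", "dob": "1975-06-30", "zip": "0373"},
]

print(uniqueness_rate(people, ["gender", "dob"]))         # only 1 of 3 records is unique
print(uniqueness_rate(people, ["gender", "dob", "zip"]))  # all records become unique
```

Scaled up to census-sized data, this is exactly the computation that yields the 63%–78% figure: each additional quasi-identifier shrinks the equivalence classes until most of them contain a single person.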
Although some PII may be benign (if the text conveys e.g. trivial or public information), it is far from always being the case. Documents such as court judgments, medical records or interactions with social services are often highly sensitive, and their disclosure may have catastrophic consequences for the individuals involved. Editing those documents to conceal the identity of those individuals is thus often desirable from both an ethical and legal perspective. This is precisely the goal of text sanitization (Chakaravarthy et al., 2008; Sanchez and Batet, 2016; Lison et al., 2021).
In this paper, we present a two-step approach to the problem of text sanitization and provide a detailed analysis of its performance, based on two recently released datasets (Pilán et al., 2022; Papadopoulou et al., 2022). The approach first seeks to detect text spans that convey personal information, using a neural sequence labelling model that combines named entity recognition (NER) with a gazetteer. The gazetteer is built by inferring from Wikidata a set of properties typically used to characterize human individuals, such as “position held” or “manner of death”, and traversing the knowledge graph to extract all possible values for those properties. In the second step, the privacy risk of the detected text spans is assessed according to various indicators. Finally, the text spans that are deemed to constitute an unacceptable risk according to those indicators are masked. An overview of the approach is illustrated in Figure 1.
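To give a rough feel for the gazetteer component, here is a minimal token-level phrase matcher over a set of gazetteer entries. This is a hypothetical sketch; the actual system feeds such lexical knowledge into a neural sequence labelling model rather than doing plain dictionary lookup.

```python
def gazetteer_spans(tokens, gazetteer, max_len=4):
    """Greedy longest-match lookup of gazetteer phrases over a token sequence.

    Returns (start, end) token offsets, end-exclusive."""
    phrases = {tuple(p.lower().split()) for p in gazetteer}
    spans, i = [], 0
    while i < len(tokens):
        match = None
        # Try the longest candidate phrase first, then shorter ones.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            cand = tuple(t.lower() for t in tokens[i:i + n])
            if cand in phrases:
                match = (i, i + n)
                break
        if match:
            spans.append(match)
            i = match[1]
        else:
            i += 1
    return spans

# Hypothetical gazetteer entries, e.g. values extracted from the Wikidata
# properties "position held" and "manner of death".
gazetteer = {"prime minister", "heart attack"}
tokens = "The former prime minister died of a heart attack".split()
print(gazetteer_spans(tokens, gazetteer))  # [(2, 4), (7, 9)]
```

Both matched spans here are quasi-identifiers that a standard NER model, trained only on classic entity types, would typically miss.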
The paper makes the following four contributions:

1. A privacy-oriented entity recognizer making it possible to detect PII beyond named entities, using large lists of person-related terms extracted from Wikidata, extending previous work by Papadopoulou et al. (2022).

2. A quantitative and qualitative evaluation of the performance of this privacy-oriented entity recognizer for various types of PII.

3. The development of five privacy risk indicators, based on LLM probabilities, span classification, perturbations, sequence labelling and web search.

4. A comparative analysis of those privacy risk indicators, both in isolation and in combination, on the two aforementioned datasets.
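The intuition behind the first of these indicators is that text spans a language model finds improbable tend to be rarer, and thus more identifying. The sketch below illustrates this with a crude unigram surrogate over a toy corpus; the actual indicator (Section 5.1) relies on probabilities from a large language model, and all names and data here are hypothetical.

```python
import math
from collections import Counter

# Toy background corpus standing in for the LLM's training distribution.
corpus = "the person lived in the city and worked in the city".split()
counts = Counter(corpus)
total = sum(counts.values())
vocab = len(counts) + 1  # +1 slot for unseen tokens (add-one smoothing)

def surprisal(span):
    """Average negative log-probability of a span's tokens under the unigram model."""
    toks = span.lower().split()
    return sum(-math.log((counts[t] + 1) / (total + vocab)) for t in toks) / len(toks)

# A frequent, generic span is less surprising (lower risk) than a rare,
# specific one such as a street name.
print(surprisal("the city") < surprisal("gaustadalleen"))  # True
```

Thresholding such a score gives a simple masking decision rule: spans whose surprisal exceeds the threshold are deemed too identifying and are redacted.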
The rest of the paper is structured as follows. Section 2 provides general background on data privacy and text sanitization. Section 3 describes the two annotated text sanitization datasets employed for the empirical analysis of this paper. Section 4 presents, evaluates and discusses a privacy-oriented entity recognizer for the task of detecting PII in text documents. Section 5 introduces the set of five privacy risk indicators, whose empirical performance on the two datasets is then analysed in Section 6. Section 7 concludes and outlines future work.