5 Ways to Measure Privacy Risks in Text Data | HackerNoon

Authors:

(1) Anthi Papadopoulou, Language Technology Group, University of Oslo, Gaustadalleen 23B, 0373 Oslo, Norway and Corresponding author ([email protected]);

(2) Pierre Lison, Norwegian Computing Center, Gaustadalleen 23A, 0373 Oslo, Norway;

(3) Mark Anderson, Norwegian Computing Center, Gaustadalleen 23A, 0373 Oslo, Norway;

(4) Lilja Øvrelid, Language Technology Group, University of Oslo, Gaustadalleen 23B, 0373 Oslo, Norway;

(5) Ildiko Pilan, Language Technology Group, University of Oslo, Gaustadalleen 23B, 0373 Oslo, Norway.

Table of Links

Abstract and 1 Introduction

2 Background

2.1 Definitions

2.2 NLP Approaches

2.3 Privacy-Preserving Data Publishing

2.4 Differential Privacy

3 Datasets and 3.1 Text Anonymization Benchmark (TAB)

3.2 Wikipedia Biographies

4 Privacy-oriented Entity Recognizer

4.1 Wikidata Properties

4.2 Silver Corpus and Model Fine-tuning

4.3 Evaluation

4.4 Label Disagreement

4.5 MISC Semantic Type

5 Privacy Risk Indicators

5.1 LLM Probabilities

5.2 Span Classification

5.3 Perturbations

5.4 Sequence Labelling and 5.5 Web Search

6 Analysis of Privacy Risk Indicators and 6.1 Evaluation Metrics

6.2 Experimental Results and 6.3 Discussion

6.4 Combination of Risk Indicators

7 Conclusions and Future Work

Declarations

References

Appendices

A. Human properties from Wikidata

B. Training parameters of entity recognizer

C. Label Agreement

D. LLM probabilities: base models

E. Training size and performance

F. Perturbation thresholds

5 Privacy Risk Indicators

Many text sanitization approaches simply operate by masking all detected PII spans. This may, however, lead to over-masking, as the actual risk of re-identification may vary greatly from one span to another. In many documents, a substantial fraction of the detected text spans can be kept in clear text without notably increasing the re-identification risk. For instance, in the TAB corpus, only 4.4 % of the entities were marked by the annotators as direct identifiers and 64.4 % as quasi-identifiers, thus leaving 31.2 % of the entities in clear text. To assess which text spans should be masked, we need to design privacy risk indicators able to determine which text spans (or combinations of text spans) actually constitute a re-identification risk.

We present five possible approaches for inferring the re-identification risk associated with text spans in a document. These five indicators are respectively based on:

  1. LLM probabilities,

  2. Span classification,

  3. Perturbations,

  4. Sequence labelling,

  5. Web search.

The web search approach can be applied in a zero-shot manner without any fine-tuning. The two methods respectively based on LLM probabilities and perturbations require a small number of labeled examples to adjust a classification threshold or fit a simple binary classification model. Finally, the span classification and sequence labelling approaches operate by fine-tuning an existing language model, and are thus the methods most dependent on having a sufficient amount of labeled training data to reach peak performance. This training data typically takes the form of human decisions to either mask a given text span or keep it in clear text.

We present each method in turn and provide an evaluation and discussion of their relative benefits and limitations in Section 6.

5.1 LLM Probabilities

The probability of a span as predicted by a language model is inversely correlated with its informativeness: a text span that is harder to predict is more informative/surprising than one that the language model can easily infer from the context (Zarcone et al., 2016). The underlying intuition is that text spans that are highly informative/surprising are also associated with a high re-identification risk, as they often correspond to specific names, dates or codes that cannot be predicted from the context.

Concretely, we calculate the probability of each detected PII span given the entire context of the text: we mask all the (sub)word tokens of the span and collect the log-probability of each masked token (one per token) as calculated by a large, bidirectional language model, in our case BERT (large, cased) (Devlin et al., 2019). Those log-probabilities are then aggregated and employed as features for a binary classifier that outputs the probability of the text span being masked by a human annotator.
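
The sketch below illustrates this step under some assumptions of our own: it uses the HuggingFace transformers library, takes the detected PII mention as a character-level span, and masks the whole span at once before reading off one log-probability per token. It is not the authors' implementation.

```python
# Minimal sketch: per-token log-probabilities of a masked PII span under BERT (large, cased).
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("bert-large-cased")
model = AutoModelForMaskedLM.from_pretrained("bert-large-cased")
model.eval()

def span_log_probs(text: str, span_start: int, span_end: int) -> list[float]:
    """Return one log-probability per (sub)word token of text[span_start:span_end]."""
    enc = tokenizer(text, return_offsets_mapping=True, return_tensors="pt", truncation=True)
    offsets = enc.pop("offset_mapping")[0]
    input_ids = enc["input_ids"][0]

    # Indices of the tokens overlapping the character span of the PII mention
    # (special tokens such as [CLS]/[SEP] have empty offsets and are excluded).
    span_idx = [i for i, (s, e) in enumerate(offsets.tolist())
                if s < span_end and e > span_start and e > s]

    # Mask every token of the span simultaneously.
    masked = input_ids.clone()
    masked[span_idx] = tokenizer.mask_token_id

    with torch.no_grad():
        logits = model(input_ids=masked.unsqueeze(0),
                       attention_mask=enc["attention_mask"]).logits[0]
    log_probs = torch.log_softmax(logits, dim=-1)

    # Log-probability of each original token at its masked position.
    return [log_probs[i, input_ids[i]].item() for i in span_idx]

# Example: the date is the detected PII span.
text = "The defendant was born on 14 March 1982 in Oslo."
start = text.index("14 March 1982")
print(span_log_probs(text, start, start + len("14 March 1982")))
```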

The list of log-probabilities for each span has a different length depending on the number of tokens in the span. We therefore aggregate this list into 5 features, namely the minimum, maximum, median and mean log-probability, as well as the sum of log-probabilities in the list. In addition, we also include as a feature an encoding of the PII type assigned to the span by the human annotators (see Table 1).
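
A small helper of this kind (not taken from the paper) could compute the fixed-length feature vector; the label set and the one-hot encoding of the PII type below are assumptions on our part.

```python
# Minimal sketch: reduce a variable-length list of span log-probabilities
# to the 5 aggregated features, plus a one-hot encoding of the PII type.
import statistics

# Assumed label set (TAB-style semantic types); the paper's Table 1 is authoritative.
PII_TYPES = ["PERSON", "CODE", "LOC", "ORG", "DEM", "DATETIME", "QUANTITY", "MISC"]

def span_features(log_probs: list[float], pii_type: str) -> list[float]:
    base = [min(log_probs), max(log_probs),
            statistics.median(log_probs), statistics.mean(log_probs),
            sum(log_probs)]
    one_hot = [1.0 if pii_type == t else 0.0 for t in PII_TYPES]
    return base + one_hot
```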

For the classifier itself, we conduct experiments using a simple logistic regression model as well as a more advanced classification framework based on AutoML. AutoML (He et al., 2021) provides a principled approach to hyper-parameter tuning and model selection, efficiently searching for the combination of hyper-parameters and models that yields the best performance. We use the AutoGluon toolkit (Erickson et al., 2020) in our experiments, and more specifically its Tabular predictor, which consists of an ordered sequence of base models. After each base model is trained and its hyper-parameters optimized, a weighted ensemble model is trained on top of them using a stacking technique. A list of all base models of the Tabular predictor is shown in Table 11 in Appendix D. We use the training splits of the Wikipedia biographies and the TAB corpus to fit the classifier.
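
The following sketch shows how the two classifiers described above could be fit on the aggregated span features; the file name, column names and binary "masked" label are hypothetical placeholders, not the authors' setup.

```python
# Minimal sketch: fitting a logistic regression baseline and an AutoGluon
# TabularPredictor on per-span features, predicting whether a span should be masked.
import pandas as pd
from sklearn.linear_model import LogisticRegression
from autogluon.tabular import TabularPredictor

# One row per PII span: 5 aggregated log-probability features, one-hot PII type
# columns, and a binary "masked" label reflecting the human masking decision.
train_df = pd.read_csv("span_features_train.csv")  # hypothetical file

# Baseline: logistic regression on the numeric features.
X = train_df.drop(columns=["masked"])
logreg = LogisticRegression(max_iter=1000).fit(X, train_df["masked"])

# AutoML: AutoGluon trains a sequence of base models, tunes them, and stacks
# them into a weighted ensemble.
predictor = TabularPredictor(label="masked", eval_metric="roc_auc").fit(train_df)
risk_scores = predictor.predict_proba(X)  # probability of the span being masked
```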
