
How to Develop a Privacy-First Entity Recognition System | HackerNoon

News Room
Last updated: 2025/04/28 at 8:40 PM
News Room Published 28 April 2025

Authors:

(1) Anthi Papadopoulou, Language Technology Group, University of Oslo, Gaustadalleen 23B, 0373 Oslo, Norway and Corresponding author ([email protected]);

(2) Pierre Lison, Norwegian Computing Center, Gaustadalleen 23A, 0373 Oslo, Norway;

(3) Mark Anderson, Norwegian Computing Center, Gaustadalleen 23A, 0373 Oslo, Norway;

(4) Lilja Øvrelid, Language Technology Group, University of Oslo, Gaustadalleen 23B, 0373 Oslo, Norway;

(5) Ildiko Pilan, Language Technology Group, University of Oslo, Gaustadalleen 23B, 0373 Oslo, Norway.

Table of Links

Abstract and 1 Introduction

2 Background

2.1 Definitions

2.2 NLP Approaches

2.3 Privacy-Preserving Data Publishing

2.4 Differential Privacy

3 Datasets and 3.1 Text Anonymization Benchmark (TAB)

3.2 Wikipedia Biographies

4 Privacy-oriented Entity Recognizer

4.1 Wikidata Properties

4.2 Silver Corpus and Model Fine-tuning

4.3 Evaluation

4.4 Label Disagreement

4.5 MISC Semantic Type

5 Privacy Risk Indicators

5.1 LLM Probabilities

5.2 Span Classification

5.3 Perturbations

5.4 Sequence Labelling and 5.5 Web Search

6 Analysis of Privacy Risk Indicators and 6.1 Evaluation Metrics

6.2 Experimental Results and 6.3 Discussion

6.4 Combination of Risk Indicators

7 Conclusions and Future Work

Declarations

References

Appendices

A. Human properties from Wikidata

B. Training parameters of entity recognizer

C. Label Agreement

D. LLM probabilities: base models

E. Training size and performance

F. Perturbation thresholds

4 Privacy-oriented Entity Recognizer

Identifying PII spans is the first step in text sanitization. Although many methods rely on some variant of NER, they fail to detect PII spans that are not named entities but are nevertheless (quasi-)identifying.

We detail here our approach to detecting text spans expressing personal information. The approach uses knowledge graphs such as Wikidata to create gazetteers for specific PII types. Those gazetteers are then combined with a NER model to create a domain-specific silver corpus, which is in turn employed to fine-tune a neural sequence labelling model. This approach to developing a “privacy-oriented entity recognizer” builds upon earlier work by Papadopoulou et al. (2022), and provides additional details on various aspects of the gazetteer construction process, model training and empirical evaluation.

Table 2: Selected examples of Wikidata properties of type DEM or MISC.

4.1 Wikidata Properties

NER models are, as the term indicates, focused on named entities. However, many instances of the DEM and MISC[1] categories described in the previous section are not named entities. Examples include someone’s occupation, educational background, part of their physical appearance, the manner of their death or an object that is tied to their identity.

We extract a list of possible values for these two PII categories based on knowledge graphs. In particular, Wikidata[2] is a structured knowledge graph containing information in property-value pairs, with a large number of values being adjectives, nouns, or noun phrases. We proceeded by retrieving all instances of humans from the Wikidata dump file and inspecting Wikidata properties[3] to select those that seem to express either DEM or MISC PII, based on their descriptions and examples.

After filtering, we end up with 44 DEM properties and 196 MISC properties. Selected examples of each semantic type are shown in Table 2, while a detailed table can be found in Appendix A. Some properties were left out, either because they might have introduced a high number of false positives (e.g. blood type (P1853)) or because they mostly contained named entities that would already be detected by a generic NER model.

We then use these properties to traverse the Wikidata instances and save all values into two gazetteers, one for DEM entities[4] and one for MISC entities.
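The traversal step can be illustrated with a minimal Python sketch. It is not the authors' implementation: the record layout (`claims` mapping property IDs to already-resolved string labels), the function name `build_gazetteers`, the lowercasing of values and the assignment of the two example properties to DEM and MISC are all assumptions made for illustration. Real Wikidata dump entries store most values as item IDs that first need to be resolved to labels.

```python
from collections import defaultdict

# Hypothetical property-to-category mapping. The paper selects 44 DEM and
# 196 MISC properties (Appendix A); the two assignments here are illustrative.
PROPERTY_CATEGORY = {
    "P106": "DEM",    # occupation
    "P1196": "MISC",  # manner of death
}


def build_gazetteers(entities, property_category):
    """Collect the values of the selected properties into one set per category."""
    gazetteers = defaultdict(set)
    for entity in entities:
        for prop, values in entity.get("claims", {}).items():
            category = property_category.get(prop)
            if category is None:
                continue  # property was not selected (e.g. P1853, blood type)
            for value in values:
                value = value.strip()
                if value:
                    # Lowercasing is an assumption to make matching case-insensitive.
                    gazetteers[category].add(value.lower())
    return dict(gazetteers)


# Toy records standing in for human instances from the Wikidata dump.
toy_entities = [
    {"id": "Q_example_1", "claims": {"P106": ["Physicist"], "P1853": ["A"]}},
    {"id": "Q_example_2", "claims": {"P106": ["novelist"], "P1196": ["Natural causes"]}},
]

gazetteers = build_gazetteers(toy_entities, PROPERTY_CATEGORY)
```

Keeping one set per category makes the later lookup a constant-time membership test, and excluded properties such as blood type are skipped simply by being absent from the mapping.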

4.2 Silver Corpus and Model Fine-tuning

A silver corpus of 5000 documents is then compiled. In our experiments with the datasets of Section 3, it consists of 2500 European Court of Human Rights cases and 2500 Wikipedia summaries (Lebret et al., 2016). To automatically label the documents, we first run a generic NER model[5] to detect named entities. We then apply the DEM and MISC gazetteers and tag each match with its respective label. In case of overlap, we keep the longest span, e.g. we keep “Bachelor in Computer Science” rather than “Bachelor” and “Computer Science” as two separate spans.
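The labelling rule above, gazetteer lookup followed by longest-span overlap resolution, might be sketched as follows. This is a simplified illustration under several assumptions: whitespace tokenization, a lowercased gazetteer, a greedy longest-first strategy, and the hypothetical helper names `find_gazetteer_matches` and `keep_longest`. The NER span is hard-coded here in place of the output of an actual NER model.

```python
def find_gazetteer_matches(tokens, gazetteer, label, max_len=6):
    """Return (start, end, label) for every token n-gram found in the gazetteer."""
    matches = []
    for start in range(len(tokens)):
        last_end = min(start + max_len, len(tokens))
        for end in range(start + 1, last_end + 1):
            phrase = " ".join(tokens[start:end]).lower()
            if phrase in gazetteer:
                matches.append((start, end, label))
    return matches


def keep_longest(spans):
    """Resolve overlapping spans greedily, preferring longer spans."""
    selected = []
    # Longest spans first; ties are broken by the earliest start offset.
    for span in sorted(spans, key=lambda s: (-(s[1] - s[0]), s[0])):
        start, end, _ = span
        if all(end <= s or start >= e for s, e, _ in selected):
            selected.append(span)
    return sorted(selected)


tokens = "She holds a Bachelor in Computer Science from Oslo".split()
dem_gazetteer = {"bachelor", "computer science", "bachelor in computer science"}

ner_spans = [(8, 9, "LOC")]  # stand-in for the output of a generic NER model
dem_spans = find_gazetteer_matches(tokens, dem_gazetteer, "DEM")
silver = keep_longest(ner_spans + dem_spans)
```

In this example the three overlapping gazetteer matches collapse into the single span covering “Bachelor in Computer Science”, while the non-overlapping NER span is kept unchanged.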

Table 3: Token-level precision (P), recall (R) and F1 score per semantic type on the test sets of the Wikipedia biographies and TAB corpus. We also report micro-averaged performance scores under two conditions: one where we require exact matches on the predicted label, and one where we only distinguish between PII-tokens and non-PII-tokens (thus conflating all PII types into one group).
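To make the two micro-averaged conditions in the caption concrete, here is a small Python sketch of token-level scoring. It is one plausible reading of the setup rather than the authors' evaluation script: the function name `token_prf`, the use of `"O"` for non-PII tokens and the toy label sequences are assumptions.

```python
def token_prf(gold, predicted, binary=False):
    """Micro-averaged token-level precision, recall and F1.

    gold / predicted: one label per token, with "O" marking non-PII tokens.
    binary=True conflates all PII labels into a single "PII" class.
    """
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    if binary:
        gold = ["PII" if g != "O" else "O" for g in gold]
        predicted = ["PII" if p != "O" else "O" for p in predicted]

    # A true positive is a predicted PII token whose label matches the gold label.
    tp = sum(1 for g, p in zip(gold, predicted) if p != "O" and g == p)
    n_pred = sum(1 for p in predicted if p != "O")
    n_gold = sum(1 for g in gold if g != "O")

    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_gold if n_gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


gold = ["O", "DEM", "DEM", "O", "MISC", "O"]
pred = ["O", "DEM", "MISC", "O", "MISC", "DEM"]

exact = token_prf(gold, pred)                 # exact label match required
pii_only = token_prf(gold, pred, binary=True) # PII vs. non-PII only
```

On this toy input the token labelled MISC instead of DEM counts as an error under the exact-match condition but as correct under the PII versus non-PII condition, which is why the second condition yields scores at least as high as the first.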

We then employ this silver corpus to fine-tune a RoBERTa (Liu et al., 2019) model, thus creating a privacy-oriented entity recognizer. Detailed training parameters can be found in Table 10 in Appendix B.


[1] It should be noted that the MISC category employed in this paper does not equate to the MISC category from the CoNLL-2003 shared task (Tjong Kim Sang and De Meulder, 2003), which is characterized as a named entity (denoted with a proper name) that is neither a person, an organization nor a place.

[2] https://www.wikidata.org

[3] https://www.wikidata.org/wiki/Wikidata:Database_reports/List_of_properties/all

[4] We also manually add country names and nationalities into the DEM gazetteer to account for cases when the NER failed to detect those and the gazetteer lacked this information.

[5] We used here a RoBERTa model fine-tuned on the Ontonotes v5 corpus using spaCy’s implementation.
