World of Software

What 100 GitHub Projects Reveal About Personal Data in Modern Software | HackerNoon

News Room
Last updated: 2026/01/22 at 2:35 PM
News Room Published 22 January 2026

Table Of Links

Abstract

1 Introduction

2 Background

3 Privacy-Relevant Methods

4 Identifying API Privacy-relevant Methods

5 Labels for Personal Data Processing

6 Process of Identifying Personal Data

7 Data-based Ranking of Privacy-relevant Methods

8 Application to Privacy Code Review

9 Related Work

Conclusion, Future Work, Acknowledgement And References

8 Application to Privacy Code Review

This section outlines how our approach can be applied to privacy code reviews across a diverse set of 100 open-source applications. We then delve into detailed case studies of two popular software applications to illustrate the utility of our approach.

Table 7. Top classes in Java for personal data processing with example privacy-relevant methods

8.1 Large-scale Analysis

To understand the prevalence and types of personal data processing in real-world applications, we analyzed 100 open-source applications. These were equally divided between Java and JavaScript/TypeScript and were selected from GitHub’s daily top-starred repositories list.³ We selected applications that are popular (top-starred), non-trivial (over 300K lines of code), and predominantly written in Java or JavaScript/TypeScript (constituting over 60% of the codebase).

Additionally, we ensured these applications differed from the 30 popular libraries analyzed previously and that their primary documentation language was English for easier identification of functionalities. This selection process resulted in a dataset that is representative of real-world software applications and suitable for our analysis of personal data processing practices.
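The selection criteria above can be read as a simple filter. The following is a minimal sketch under stated assumptions: the `Repo` record, its field names, and the `target_group` helper are hypothetical and are not taken from the paper's tooling; only the thresholds (over 300K lines, over 60% language share, English documentation, not one of the previously analyzed libraries) come from the text.

```python
from dataclasses import dataclass
from typing import Dict, Optional

# Thresholds stated in the text.
MIN_LINES = 300_000
MIN_SHARE = 0.60

# Hypothetical grouping of languages into the two halves of the dataset.
TARGET_GROUPS = {
    "Java": {"Java"},
    "JS/TS": {"JavaScript", "TypeScript"},
}


@dataclass(frozen=True)
class Repo:
    """Hypothetical record describing one candidate repository."""
    name: str
    lines_of_code: int
    language_share: Dict[str, float]  # fraction of the codebase per language
    docs_language: str
    previously_analyzed_library: bool


def target_group(repo: Repo) -> Optional[str]:
    """Return the dataset group a repository qualifies for, or None."""
    if repo.lines_of_code <= MIN_LINES:
        return None  # not "non-trivial" under the stated threshold
    if repo.docs_language != "English":
        return None
    if repo.previously_analyzed_library:
        return None  # must differ from the libraries analyzed earlier
    for group, languages in TARGET_GROUPS.items():
        share = sum(repo.language_share.get(lang, 0.0) for lang in languages)
        if share > MIN_SHARE:
            return group
    return None
```

Popularity (membership in the top-starred list) is assumed to be handled upstream, when the candidate list is collected.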

We then examined the proportion of methods in these applications that invoke privacy-relevant methods and are involved in the flow of personal data and Personally Identifiable Information (PII). The resulting statistics are listed in Table 8.

Table 8. Percentage of application methods invoking privacy-relevant methods and processing personal data and PII
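To make the measured quantity concrete, the sketch below computes the share of application methods that invoke at least one privacy-relevant method. It covers only the counting step and assumes the call information has already been extracted by a static analyzer. The function name, the `call_map` structure, and the method names in the example are illustrative assumptions, not the paper's data.

```python
from typing import Dict, Set


def privacy_relevant_share(call_map: Dict[str, Set[str]],
                           privacy_relevant: Set[str]) -> float:
    """Percentage of application methods that call a privacy-relevant method.

    call_map maps each application method to the names of the methods it
    invokes; privacy_relevant is the set of flagged library methods.
    """
    if not call_map:
        return 0.0
    hits = sum(1 for callees in call_map.values() if callees & privacy_relevant)
    return 100.0 * hits / len(call_map)


# Illustrative input only: four application methods, one of which
# calls a flagged method.
example_calls = {
    "registerUser": {"db.saveUser", "log.info"},
    "renderCalendar": {"ui.draw"},
    "formatDate": set(),
    "retry": {"log.warn"},
}
example_flagged = {"db.saveUser", "mail.send"}
share = privacy_relevant_share(example_calls, example_flagged)
```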

Our findings indicate that our approach can make the privacy code review process more efficient. By identifying methods that are critical for personal data and PII processing, we help reviewers focus their efforts, enabling a more targeted review.

8.2 In-Depth Case Studies

We validate the effectiveness of our approach on two open-source projects: Signal Desktop⁴ and Cal.com⁵. Each offers unique insights for privacy code review. Both projects were chosen for their popularity, sensitivity, and public availability. Their open codebases ensure transparency and reproducibility, making them ideal candidates to validate our approach.

By applying our approach to these carefully selected real-world projects, we provide concrete examples that demonstrate practical value in identifying key areas to focus on during privacy code reviews.

Signal Desktop. Signal Desktop is a well-known end-to-end encrypted messaging application, primarily written in TypeScript (79.5%) and JavaScript (15.6%), comprising about 360K lines of code. Its reputation for strong security and privacy features lets us showcase the depth of our approach. Although the application makes limited use of popular libraries, our approach flagged a small number of privacy-relevant method invocations (48, approximately 0.5% of total methods) from our selected APIs and native libraries that are potentially linked to personal data processing.

In our analysis, Signal stands out for using its own encryption protocol (Signal Protocol) and message transmission services, relying only minimally on external libraries. This underscores Signal’s commitment to end-to-end encryption. Our categorization highlights the primary areas of Data Processing and Transformation (DPT), Network Communication (NC), and Data Encryption and Cryptography (DEC), with most encryption methods used for local encryption of profiles and group data. Signal’s own protocol, used for encrypting chats and attachments, falls outside our analysis scope.

Our findings show that Signal rarely transmits PII directly to the internet. Instead, encrypted system data or anonymized IDs are mainly used, reflecting Signal’s dedication to user privacy. For privacy code reviewers examining Signal Desktop, our approach underscores Signal’s limited use of popular libraries for PII processing, aligning with its privacy-focused design philosophy. This categorization helps reviewers understand how Signal handles personal data, aiding in a more streamlined review process.

Cal.com. Cal.com is a scheduling application designed to give users comprehensive control over their schedules. Written entirely in TypeScript, it spans about 126K lines of code. Our method identified 371 privacy-relevant methods (approximately 3.8% of total methods) that might engage in personal data processing.

Applications such as Cal.com often employ diverse frameworks for specific functionalities. For instance, Cal.com’s utilization of the popular ORM framework, Prisma, for handling user profiles and credentials, aligns with our library list. In terms of categories, Data Processing and Transformation (DPT) topped the list at 26%, followed by Identity and Access Management (IAM) at 17%, and Network Communication (NC) at 15%. Unlike Signal Desktop, Cal.com heavily leverages libraries like Prisma, next-auth, and nodemailer for processing personal data, mirroring its primary functions of user registration, email interaction, and scheduling.
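The category breakdown reported above amounts to tallying one label per privacy-relevant invocation. The following is a minimal sketch, assuming each invocation has already been assigned a single category label such as DPT, IAM, NC, or DEC. The `category_shares` helper and the sample labels are hypothetical and do not reproduce the Cal.com data.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def category_shares(labels: Iterable[str]) -> List[Tuple[str, float]]:
    """Return (category, percentage) pairs, most frequent category first."""
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        return []
    return [(category, 100.0 * n / total)
            for category, n in counts.most_common()]


# Illustrative labels only: one label per flagged invocation.
sample_labels = ["DPT", "IAM", "DPT", "NC"]
ranking = category_shares(sample_labels)
```

A reviewer could read such a ranking top to bottom to decide which kind of processing to inspect first.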

Approximately 97% of the privacy-relevant methods invoked by Cal.com handle PII. This attests to the capability of our method to identify PII processing methods and thereby guide code reviewers efficiently. Our approach highlights the extensive use of specific libraries in applications like Cal.com, which aligns with their core features. This correlation boosts reviewers’ confidence and precision. By categorizing processing activities, our approach provides an overview of how the application handles personal data, helping reviewers prioritize effectively and making the review process both time-efficient and thorough.

8.3 Threats to Validity

Our study’s validity may be affected by several factors. Selecting projects based on GitHub trends could bias the dataset towards popular topics, potentially overlooking a broader range of applications. The use of Semgrep for static analysis, though efficient, has not been thoroughly validated for precision, which could affect the accuracy of our results. Reliance on regular expression matching to identify personal data risks introducing both false positives and false negatives, which affects the reliability of the results.
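The regular expression risk can be illustrated with a toy matcher. This is a sketch under stated assumptions: the pattern and the identifier names are hypothetical and are not the expressions used in the study. It only shows how keyword matching on identifiers can both over-match and under-match.

```python
import re
from typing import Iterable, List

# Hypothetical keyword pattern for identifiers that may hold personal data.
PERSONAL_DATA_PATTERN = re.compile(r"email|name|phone|address", re.IGNORECASE)


def flag_identifiers(identifiers: Iterable[str]) -> List[str]:
    """Return the identifiers whose name matches the keyword pattern."""
    return [ident for ident in identifiers
            if PERSONAL_DATA_PATTERN.search(ident)]


identifiers = [
    "userEmail",   # flagged, and plausibly personal data
    "fileName",    # flagged because it contains "name": a false positive
    "dob",         # date of birth, not flagged: a false negative
    "retryCount",  # not flagged, correctly
]
flagged = flag_identifiers(identifiers)
```

Tightening the pattern reduces false positives but tends to increase false negatives, which is why manual validation matters for this kind of matching.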

Additionally, the absence of manual validation for each instance of personal data processing identified might lead to inaccuracies. Furthermore, focusing only on the top 25 libraries for Java and JavaScript due to resource constraints limits the generalizability of our findings, as other privacy-relevant methods in lesser-known libraries may have been missed.

:::info
Authors:

  1. Feiyang Tang
  2. Bjarte M. Østvold

:::

:::info
This paper is available on arXiv under the CC BY-NC-SA 4.0 license.

:::
