Meta has recently introduced data logs as part of their Download Your Information (DYI) tool, enabling users to access additional data about their product usage. This development was aimed at enhancing transparency and user control over personal data.
A blog post on Meta’s engineering blog summarized the journey. The implementation of data logs presented challenges due to the scale of Meta’s operations and the limitations of their data warehouse system, Hive. The primary challenge was the inefficiency of querying Hive tables, which are partitioned by date and time, requiring a scan of every row in every partition to retrieve data for a specific user. With over 3 billion monthly active users, this approach would process an enormous amount of irrelevant data for each query.
Meta developed a system that amortizes the cost of expensive full table scans by batching individual users’ requests into a single scan. This method provides sufficiently predictable performance characteristics to make the feature feasible, at the cost of processing some irrelevant data.
The current design utilizes Meta’s internal task-scheduling service to organize recent requests for users’ data logs into batches. These batches are submitted to a system built on the Core Workflow Service (CWS), which ensures reliable execution of long-running tasks. The process involves copying user IDs into a new Hive table, initiating worker tasks for each data logs table, and executing jobs in Dataswarm, Meta’s data pipeline system.
The jobs perform an INNER JOIN between the table containing requesters’ IDs and each data logs table, matching on the column that identifies the data owner. This operation produces an intermediate Hive table containing the combined data logs for all users in the current batch. PySpark then splits this output into individual files for each user’s data in a given partition.
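The batched scan can be sketched in plain Python. This is a minimal illustration of the idea, not Meta’s actual implementation: one pass over a partition serves every requester in the batch, instead of one full scan per user, and all row and field names here are invented for the example.

```python
from collections import defaultdict

def batched_scan(partition_rows, requester_ids):
    """Single scan: keep rows owned by anyone in the batch, split per user.

    The membership check mirrors the INNER JOIN against the requesters
    table; the grouping mirrors the per-user split performed in PySpark.
    """
    batch = set(requester_ids)             # the copied-in requester IDs
    per_user = defaultdict(list)           # one bucket per user's output
    for row in partition_rows:             # the single, amortized scan
        if row["owner_id"] in batch:       # the join condition
            per_user[row["owner_id"]].append(row)
    return dict(per_user)

# Illustrative data: four log rows, two of them for requesting users.
rows = [
    {"owner_id": 1, "event": "login"},
    {"owner_id": 2, "event": "post"},
    {"owner_id": 1, "event": "like"},
    {"owner_id": 3, "event": "comment"},
]
result = batched_scan(rows, requester_ids=[1, 3])
```

However many users are in the batch, the scan cost is paid once, which is what makes the per-query cost predictable.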
The resulting raw data logs are further processed using Meta’s Hack language to apply privacy rules and filters, rendering the data into meaningful, well-explained HTML files. Finally, the results are aggregated into a ZIP file and made available through the DYI tool.
Source: Data logs: The latest evolution in Meta’s access tools
Throughout the development process, Meta learned some important lessons. They found it important to implement robust checkpointing mechanisms to enable incremental progress and resilience against errors and temporary failures. This increased overall system throughput by allowing work to resume piecemeal after issues such as job timeouts or memory-related failures.
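The checkpointing pattern can be sketched as follows. This is a hypothetical illustration, assuming a simple JSON file records which partitions are done, so a retried job resumes where it left off; none of the names are Meta’s actual APIs.

```python
import json
import os
import tempfile

def run_with_checkpoints(partitions, process, checkpoint_path):
    """Process each partition once, persisting progress after each one."""
    done = set()
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            done = set(json.load(f))          # resume from a prior attempt
    for partition in partitions:
        if partition in done:
            continue                          # already processed: skip
        process(partition)                    # may raise and abort the run
        done.add(partition)
        with open(checkpoint_path, "w") as f:
            json.dump(sorted(done), f)        # durable incremental progress

# Simulate a job that times out mid-run, then is retried.
calls = []
def flaky(partition):
    calls.append(partition)
    if partition == "2024-01-02" and calls.count(partition) == 1:
        raise RuntimeError("simulated job timeout")

ckpt = os.path.join(tempfile.mkdtemp(), "progress.json")
days = ["2024-01-01", "2024-01-02", "2024-01-03"]
try:
    run_with_checkpoints(days, flaky, ckpt)
except RuntimeError:
    pass                                      # first attempt dies mid-run
run_with_checkpoints(days, flaky, ckpt)       # retry resumes, not restarts
```

The retry skips the partition that already succeeded, which is the piecemeal resumption the article describes.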
When ensuring data correctness, Meta encountered a Spark concurrency bug that could have led to data being returned to the wrong user. To address this, they implemented a verification step in the post-processing stage to ensure that the user ID in the data matches the identifier of the user whose logs are being generated.
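A verification step of this kind might look like the following sketch. The field names are assumptions for illustration; the point is that the check turns a silent cross-user mix-up, such as the one the Spark concurrency bug could have caused, into a loud failure.

```python
def verify_ownership(rows, expected_user_id):
    """Raise if any row's owner does not match the requesting user."""
    for row in rows:
        if row["owner_id"] != expected_user_id:
            raise ValueError(
                f"row owned by {row['owner_id']} found in logs "
                f"for user {expected_user_id}"
            )
    return rows
```

Placing the check in post-processing means it guards against bugs anywhere upstream in the pipeline, not just the one that prompted it.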
The complexity of the data workflows required advanced tools and the ability to iterate on code changes quickly. Meta then built an experimentation platform that allows for running modified versions of workflows and independently executing phases of the process to create faster cycles of testing and development.
Hardik Khandelwal, Software Engineer III at Google, appreciated the engineering principles behind data logs, mentioning:
What stands out to me is how solid software engineering principles enabled this at scale:
– Batching requests to efficiently query massive datasets.
– Checkpointing to ensure incremental progress and fault tolerance.
– Security checks to enforce privacy rules and prevent data leakage.
This system was a massive engineering challenge—querying petabytes of data from Hive without overwhelming infrastructure.
Meta was also in the news recently as it announced the Automated Compliance Hardening (ACH) tool and open-sourced the Large Concept Model (LCM). The ACH tool is a mutation-guided, LLM-based test generation system, while LCM is a language model designed to operate at a higher abstraction level than tokens.
Meta also emphasized the importance of making the data consistently understandable and explainable to end-users. This involves collaboration between access experts and specialist teams to review data tables, ensuring that sensitive information is not exposed and that internal technical jargon is translated into user-friendly terms.
Finally, the processed content is implemented in code using renderers that transform raw values into user-friendly representations. This includes converting numeric IDs into meaningful entity references, converting enum values into descriptive text, and removing technical terms.
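A renderer of this kind can be sketched as a simple mapping layer. The lookup tables and field names below are invented for illustration and are not Meta’s actual data model.

```python
# Hypothetical lookups: numeric IDs and enum values mapped to plain language.
ENTITY_NAMES = {101: "your profile picture"}      # entity ID -> description
REACTION_LABELS = {1: "Like", 2: "Love"}          # enum value -> label

def render_row(raw):
    """Turn one raw log row into a user-facing sentence."""
    entity = ENTITY_NAMES.get(raw["entity_id"], "an item")
    label = REACTION_LABELS.get(raw["reaction"], "a reaction")
    return f"You reacted with {label} to {entity}."

sentence = render_row({"entity_id": 101, "reaction": 2})
```

Keeping the mappings as data rather than code is one way the access-expert and specialist teams could review and adjust the wording without touching the pipeline itself.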