Today, the National Institute of Standards and Technology (NIST) published its finalized Guidelines for Evaluating 'Differential Privacy' Guarantees to De-Identify Data (NIST Special Publication 800-226), an important publication in the field of privacy-preserving machine learning (PPML). See: https://lnkd.in/gkiv-eCQ

The Guidelines aim to help organizations make the most of differential privacy, a technology increasingly used to protect individual privacy while still allowing valuable insights to be drawn from large datasets. They cover:

I. Introduction to Differential Privacy (DP):
- De-Identification and Re-Identification: Discusses how DP helps prevent the identification of individuals from aggregated datasets.
- Unique Elements of DP: Explains what sets DP apart from other privacy-enhancing technologies.
- Differential Privacy in the U.S. Federal Regulatory Landscape: Reviews how DP interacts with existing U.S. data protection laws.

II. Core Concepts of Differential Privacy:
- Differential Privacy Guarantee: Describes the foundational promise of DP: a quantifiable level of privacy achieved by adding statistical noise to data.
- Mathematics and Properties of Differential Privacy: Outlines the mathematical underpinnings and key properties that ensure privacy.
- Privacy Parameter ε (Epsilon): Explains the role of the privacy parameter in controlling the trade-off between privacy and data usability.
- Variants and Units of Privacy: Discusses different forms of DP and how privacy is measured and applied to data units.

III. Implementation and Practical Considerations:
- Differentially Private Algorithms: Covers basic mechanisms such as noise addition and the common elements used in building differentially private data queries.
- Utility and Accuracy: Discusses the trade-off between maintaining data usefulness and ensuring privacy.
- Bias: Addresses potential biases that can arise in differentially private data processing.
- Types of Data Queries: Details how different types of data queries (counting, summation, average, min/max) are handled under DP.

IV. Advanced Topics and Deployment:
- Machine Learning and Synthetic Data: Explores how DP is applied in ML and in the generation of synthetic data.
- Unstructured Data: Discusses challenges and strategies for applying DP to unstructured data.
- Deploying Differential Privacy: Provides guidance on different models of trust and query handling, as well as potential implementation challenges.
- Data Security and Access Control: Offers strategies for securing data and controlling access when implementing DP.

V. Auditing and Empirical Measures:
- Evaluating Differential Privacy: Details how organizations can audit and measure the effectiveness and real-world impact of DP implementations.

Authors: Joseph Near, David Darais, Naomi Lefkovitz, Gary Howarth, PhD
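To make the "noise addition" idea in Section III concrete, here is a minimal sketch (mine, not taken from SP 800-226) of a differentially private counting query using the Laplace mechanism. The record fields, the predicate, and the epsilon value are illustrative assumptions.

```python
# Minimal sketch: a differentially private counting query via the Laplace
# mechanism. A counting query changes by at most 1 when one person's record
# is added or removed, so its sensitivity is 1 and the noise scale is 1/epsilon.
import math
import random


def laplace_noise(scale: float) -> float:
    """Sample from a zero-mean Laplace distribution via the inverse CDF."""
    u = random.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))


def dp_count(records, predicate, epsilon: float) -> float:
    """Return a noisy count satisfying epsilon-differential privacy."""
    true_count = sum(1 for r in records if predicate(r))
    return true_count + laplace_noise(1.0 / epsilon)


# Illustrative use: count records reporting a condition, with epsilon = 0.5.
records = [{"age": 34, "condition": True}, {"age": 51, "condition": False}]
print(dp_count(records, lambda r: r["condition"], epsilon=0.5))
```

A smaller epsilon means more noise and stronger privacy; a larger epsilon means less noise and more accurate counts, which is the utility trade-off the Guidelines discuss.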
Protecting Individuals in Public Datasets
Summary
Protecting individuals in public datasets means using strategies and technologies that keep people's identities and personal details confidential when sharing or analyzing large collections of information. This involves techniques like anonymisation and differential privacy, which help prevent others from figuring out who is represented in the data, even when multiple sources are combined.
- Implement privacy safeguards: Use methods such as removing identifiers, adding statistical noise, or creating synthetic data to reduce the risk of someone being identified from a dataset.
- Control data access: Limit who can view, use, or combine datasets by setting strict permissions and monitoring data usage to prevent unauthorized re-identification.
- Educate and audit: Teach stakeholders about privacy risks and regularly check datasets for potential vulnerabilities or new ways someone might uncover personal information.
-
In a panel discussion some months ago, I said that anonymising personal data (PD) was a promising means of facilitating cross-border data flows, because once you can no longer identify an individual from the data, it should no longer fall within many countries' definition of PD, and would not be subject to export requirements. Someone asked if it was worth talking about anonymisation since it is reversible. I wasn't sure if it was a rhetorical question or a philosophical one (if anonymisation is reversible, has there really been anonymisation?), or maybe we ascribed different meanings to the term, as do many jurisdictions. But in one of my LI posts I said that I would write more about it. This is what this post is about.

When I use the term "anonymisation", I'm referring to the PDPC's definition of it, i.e. the process of converting PD into data that cannot identify any particular individual, which can be reversible or irreversible. And to me, anonymisation is not like flipping a light switch such that you get light or no light, 1 or 0. There are factors that can affect re-identification of a person, e.g.:
- the number of direct and indirect identifiers in the dataset
- what other data the organisation has access to (including publicly accessible info), which, when combined with the dataset, could re-identify the individual
- what measures the organisation takes to reduce the possibility of re-identification, e.g. restricting access to / disclosure of the dataset.

So I think it is not so useful to think about anonymisation in binary terms. What I suggest we should think about is the possibility of re-identification. Think of a dimmer switch instead. When you see the cleartext dataset, that's when the light is on. When you start turning the dial down - the more direct and indirect identifiers you remove, the more safeguards you implement vis-a-vis the dataset - the dimmer the light gets, and the lower the possibility of re-identification.

If you (a) remove all direct and indirect identifiers from a dataset, (b) encrypt it, and (c) only give need-to-know employees read access, I think the light's going to be pretty dim. It might no longer be PD, meaning that you can export it without being subject to export requirements. So yes, I think that anonymisation is a promising means of facilitating CBDF.

Do note that you should also apply the "motivated intruder" test and consider whether someone who has the motive to attempt re-identification, is reasonably competent, has access to appropriate resources e.g. the internet, and uses investigative techniques could re-identify the individual (see the ICO's excellent draft guidance https://lnkd.in/gmRfXj-W, and The Guardian's 2019 article https://lnkd.in/gZXfkHVC).
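One way to put a rough number on the "dimmer switch" idea is to check how small the groups of records sharing the same indirect identifiers are (k-anonymity). The sketch below is illustrative only, not PDPC or ICO guidance, and the column names are assumptions.

```python
# Minimal sketch: estimate re-identification risk by computing k-anonymity
# over a chosen set of quasi-identifiers. Column names are assumptions.
from collections import Counter


def smallest_group_size(rows, quasi_identifiers):
    """Return k: the size of the smallest group of rows sharing the same
    combination of quasi-identifier values. A small k means some individuals
    are nearly unique in the dataset and easier to re-identify."""
    groups = Counter(tuple(row[q] for q in quasi_identifiers) for row in rows)
    return min(groups.values())


rows = [
    {"postcode": "2000", "birth_year": 1980, "sex": "F", "diagnosis": "flu"},
    {"postcode": "2000", "birth_year": 1980, "sex": "F", "diagnosis": "asthma"},
    {"postcode": "3000", "birth_year": 1975, "sex": "M", "diagnosis": "flu"},
]
k = smallest_group_size(rows, ["postcode", "birth_year", "sex"])
print(f"k-anonymity of this release: {k}")  # k = 1 here: one person stands out
```

Removing or generalising indirect identifiers pushes k up and the light down; a motivated intruder with auxiliary data is exactly what a low k invites.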
-
🚨 BREAKING: Expert Report on LLMs

The report by Isabel Barberá and Murielle Popa-Fabre analyses the risks to privacy and data protection posed by LLMs. It applies the Council of Europe's Convention 108+ for the Protection of Individuals with regard to Automatic Processing of Personal Data (Con. 108+).

🚨 Findings: 'privacy risks in LLM-based systems cannot be adequately addressed through ad-hoc organisational practices or existing compliance tools alone'; instead, a method to assess and mitigate risks must be deployed throughout the entire life-cycle of an #LLM. Risk mitigation focuses on:
❌ LLM architecture: reduce size/context, deduplicate the training dataset - less effective strategies
✅ Life-cycle: takes into account data-related and output risks, implements cybersecurity at all levels - and it's in line with international standards!

🎙️ In breaking down LLM tech, three data-usage phases can be identified:
1️⃣ Web-scraping and pretraining
2️⃣ Fine-tuning
3️⃣ Optimisation through data augmentation (RAG), agentic workflows
👉🏼 Best practices can be successfully implemented in Phase 2 - so that LLMs are privacy-fit when entering Phase 3, which involves vetting customer intentions and forming a working memory

🎙️ The report breaks down risks:
👉🏼 At #model level: LLMs define the relationship between a data subject and personal data by the proximity of one data vector to the source vector for that data. Awareness of such a relation is not implied, but statistical: vector proximity depends on how multiple features relatable to a vector are aggregated in LLM training
‼️ Risks include: LLM pretraining on personal data scraped off the internet (no legal basis), data regurgitation, hallucinations, bias amplification
👉🏼 At #system level: depending on how LLMs interact with their environment, risks go beyond privacy and impact autonomy and identity. Without human oversight, LLM-automated decisions defy Art. 9 of Con. 108+, while the likelihood of accurate profiling, also addressed in Art. 9, becomes a threat given the amount of information that LLMs are able to collect due to their increasingly multimodal application
‼️ Risk management also takes into account user interference in interaction and post-deployment adaptations

Risk mitigation evaluation framework:
📌 Reflect real-world deployment conditions
📌 Multiple re-assessments (ISO 42005)
📌 Address emergent and interactive risks - not just performance metrics
📌 Involve stakeholders
📌 Accessible evaluation reports
💡 The RMEF should be piloted in a multi-stakeholder collaboration whereby an LLM is built, deployed, interacted with, and assessed

🎙️ Recommendations to stakeholders:
👉🏼 Work on data protection AND data safety: the two don't equate
👉🏼 Implement privacy protection on day 0
👉🏼 Use PETs and implement data protection benchmarks
🚨 Regulators must issue clear guidance to help companies address these risks!

CC: Peter Hense 🇺🇦🇮🇱 Itxaso Domínguez de Olazábal, PhD.
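The report classes architecture-level mitigations such as training-data deduplication as less effective on their own, but since deduplication is often cited as a way to reduce memorisation and regurgitation of personal data, here is a minimal, illustrative sketch of exact deduplication by normalised hashing. The corpus and function names are assumptions, not from the report.

```python
# Minimal sketch: exact deduplication of training text by normalised hash.
# Real pipelines usually add near-duplicate detection (e.g. MinHash) on top.
import hashlib


def dedupe_exact(documents):
    """Keep only the first occurrence of each normalised document."""
    seen, kept = set(), []
    for doc in documents:
        normalised = " ".join(doc.lower().split())   # collapse case and whitespace
        digest = hashlib.sha256(normalised.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(doc)
    return kept


corpus = [
    "Jane Doe lives at 1 Main St.",
    "jane doe lives at 1 main st.",   # duplicate after normalisation
    "Unrelated text.",
]
print(dedupe_exact(corpus))  # the second, near-identical document is dropped
```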
-
The EDPB recently published a report on AI Privacy Risks and Mitigations in LLMs. This is one of the most practical and detailed resources I've seen from the EDPB, with extensive guidance for developers and deployers. The report walks through privacy risks associated with LLMs across the AI lifecycle, from data collection and training to deployment and retirement, and offers practical tips for identifying, measuring, and mitigating risks.

Here's a quick summary of some of the key mitigations mentioned in the report:

For providers:
• Fine-tune LLMs on curated, high-quality datasets and limit the scope of model outputs to relevant and up-to-date information.
• Use robust anonymisation techniques and automated tools to detect and remove personal data from training data.
• Apply input filters and user warnings during deployment to discourage users from entering personal data, as well as automated detection methods to flag or anonymise sensitive input data before it is processed.
• Clearly inform users about how their data will be processed through privacy policies, instructions, warnings or disclaimers in the user interface.
• Encrypt user inputs and outputs during transmission and storage to protect data from unauthorized access.
• Protect against prompt injection and jailbreaking by validating inputs, monitoring LLMs for abnormal input behaviour, and limiting the amount of text a user can input.
• Apply content filtering and human review processes to flag sensitive or inappropriate outputs.
• Limit data logging and provide configurable options to deployers regarding log retention.
• Offer easy-to-use opt-in/opt-out options for users whose feedback data might be used for retraining.

For deployers:
• Enforce strong authentication to restrict access to the input interface and protect session data.
• Mitigate adversarial attacks by adding a layer for input sanitization and filtering, and by monitoring and logging user queries to detect unusual patterns.
• Work with providers to ensure they do not retain or misuse sensitive input data.
• Guide users to avoid sharing unnecessary personal data through clear instructions, training and warnings.
• Educate employees and end users on proper usage, including the appropriate use of outputs and phishing techniques that could trick individuals into revealing sensitive information.
• Ensure employees and end users avoid overreliance on LLMs for critical or high-stakes decisions without verification, and ensure outputs are reviewed by humans before implementation or dissemination.
• Securely store outputs and restrict access to authorised personnel and systems.

This is a rare example where the EDPB strikes a good balance between practical safeguards and legal expectations. Link to the report included in the comments.

#AIprivacy #LLMs #dataprotection #AIgovernance #EDPB #privacybydesign #GDPR
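As an illustration of the "input filters / automated detection" mitigation above, here is a minimal regex-based sketch that flags and masks obvious personal data in a prompt before it reaches the model. In practice a trained PII detector would be used; the patterns, function name, and example prompt here are my own assumptions, not from the EDPB report.

```python
# Minimal sketch: flag and mask obvious personal data in user input before
# sending it to an LLM. Real deployments use trained PII detectors; these
# regex patterns are only illustrative and will miss many cases.
import re

PII_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "phone": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
}


def mask_personal_data(prompt: str) -> tuple[str, list[str]]:
    """Return the masked prompt and the list of PII types detected."""
    found = []
    for label, pattern in PII_PATTERNS.items():
        if pattern.search(prompt):
            found.append(label)
            prompt = pattern.sub(f"[{label.upper()} REMOVED]", prompt)
    return prompt, found


masked, hits = mask_personal_data("Contact me at jane.doe@example.com or +44 20 7946 0958.")
print(masked, hits)  # personal data masked; ['email', 'phone'] flagged for logging/warning
```

The same filter layer is where a deployer could add the input sanitization against prompt injection that the report recommends.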
-
Unveiling 𝗜𝗱𝗲𝗻𝘁𝗶𝗳𝗶𝗮𝗯𝗶𝗹𝗶𝘁𝘆: Ever encounter the LINDDUN framework? It's privacy threat modeling's gold standard, with 'I' signifying Identifiability - a threat that can strip away the veil of anonymity, laying bare our private lives.

A real-life instance: Latanya Sweeney re-identified a state governor's 'anonymous' medical records using public data and de-identified health records. Here, the supposed privacy fortress crumbled. Identifiability can compromise privacy, anonymity, and pseudonymity. A mere link between a name, face, or tag and data can divulge a trove of personal info.

So, what can go wrong? Almost everything. Designing a system or sharing a dataset? Embed privacy into the core. As a Data Privacy Engineer, consider these strategies:
1. Limit data collection.
2. Apply strong anonymization techniques.
3. Release pseudonymized datasets with legal protections.
4. Generate a synthetic dataset where applicable.
5. Audit regularly for re-identification vectors.
6. Educate stakeholders about risks and mitigation roles.

Striking a balance between data utility and privacy protection is tricky but crucial for maintaining trust in our digitized realm. Reflect on how you're handling 'Identifiability'. Are your strategies sufficient? Bolster your data privacy defenses now.
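For strategy 3 (releasing pseudonymized datasets), a common building block is replacing direct identifiers with a keyed hash so the pseudonyms cannot be recomputed without the secret. A minimal sketch with assumed field names and a placeholder key, offered as an illustration rather than a complete pseudonymization scheme:

```python
# Minimal sketch: pseudonymize direct identifiers with a keyed hash (HMAC).
# Keep the key out of the released dataset and manage it separately; without
# it, pseudonyms cannot be recomputed from names or emails.
import hashlib
import hmac

SECRET_KEY = b"store-and-rotate-this-in-a-key-vault"  # placeholder, not a real key


def pseudonymize(value: str) -> str:
    """Deterministic pseudonym: same input + same key -> same token."""
    return hmac.new(SECRET_KEY, value.encode("utf-8"), hashlib.sha256).hexdigest()[:16]


record = {"name": "Jane Doe", "email": "jane@example.com", "diagnosis": "asthma"}
released = {
    "patient_id": pseudonymize(record["email"]),  # stable token allows record linkage
    "diagnosis": record["diagnosis"],             # analytic value retained
}
print(released)
```

Note that indirect identifiers left in the release can still enable re-identification, which is why strategy 5 (regular audits) sits alongside it.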
-
This paper presents international consensus-based recommendations to address and mitigate potential harms caused by bias in data and algorithms within AI health technologies and health datasets. It emphasizes the need for diversity, inclusivity, and generalizability in these technologies to prevent the exacerbation of health inequities.

1️⃣ The paper emphasizes the importance of including a plain language summary in dataset documentation, detailing the data origin and explaining the reasons behind the dataset's creation, to help users assess its suitability for their needs.
2️⃣ Dataset documentation should provide a summary of the groups present, explain how these groups are categorized, and identify any groups that are missing or at risk of disparate health outcomes.
3️⃣ The paper addresses the need to identify and describe biases, errors, and limitations in datasets, including how missing data and biases in labels are handled, to improve the generalizability and applicability of the data.
4️⃣ Adherence to data protection laws, ethical governance, and the involvement of patient and public participation groups in dataset documentation are stressed to ensure the ethical use of health data.
5️⃣ Recommendations are provided for ensuring that datasets are used appropriately in the development of AI health technologies, including reporting on dataset limitations and evaluating the performance of AI technologies across different groups.

✍🏻 The STANDING Together collaboration. Recommendations for Diversity, Inclusivity, and Generalisability in Artificial Intelligence Health Technologies and Health Datasets. Version 1.0, published 30 October 2023. DOI: 10.5281/zenodo.10048356

Xiao Liu, MBChB PhD, Alastair Denniston, Joseph Alderman, Elinor Laws, Jaspret Gill, Neil Sebire, Marzyeh Ghassemi, Melissa McCradden, Melanie Calvert, Rubeta Matin, Dr Stephanie Kuku, MD, Jacqui Gath, Russell Pearson, Johan Ordish, Darren Treanor, Negar Rostamzadeh, Elizabeth Sapey, Stephen Pfohl, Heather Cole-Lewis, PhD, Francis Mckay, Alan Karthikesalingam MD PhD, Charlotte Summers, Lauren Oakden-Rayner, Bilal A Mateen, Katherine Heller, Maxine Mackintosh

✅ Sign up for my newsletter to stay updated on the most fascinating studies related to digital health and innovation: https://lnkd.in/eR7qichj
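Point 5️⃣ mentions evaluating AI performance across different groups. A minimal, illustrative sketch of per-group accuracy reporting (the groups, labels, and predictions below are made up, and real evaluations would use clinically relevant metrics and confidence intervals):

```python
# Minimal sketch: report a model's accuracy separately for each group so that
# performance gaps across groups become visible. Data here is made up.
from collections import defaultdict


def accuracy_by_group(y_true, y_pred, groups):
    """Return {group: accuracy} computed over the records in each group."""
    correct, total = defaultdict(int), defaultdict(int)
    for truth, pred, group in zip(y_true, y_pred, groups):
        total[group] += 1
        correct[group] += int(truth == pred)
    return {g: correct[g] / total[g] for g in total}


y_true = [1, 0, 1, 1, 0, 1]
y_pred = [1, 0, 0, 1, 0, 0]
groups = ["A", "A", "A", "B", "B", "B"]
print(accuracy_by_group(y_true, y_pred, groups))  # e.g. {'A': 0.67, 'B': 0.67}
```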
-
Privacy & data protection regulators from around the world have issued a joint statement on data scraping. This is a big deal...

When a company engages in data scraping, it automatically 'hoovers up' data from websites and uses it for its own purposes. Individuals whose personal information is gathered this way generally have no opportunity to consent or object.

As the Office of the Australian Information Commissioner said, data scraping technologies "raise significant privacy concerns as these technologies can be exploited for purposes including monetisation through reselling data to third-party websites, including to malicious actors, private analysis or intelligence gathering." For example, data scraping was at the heart of our privacy regulator's finding that facial recognition company Clearview AI breached Australian privacy law.

In their statement on data scraping, the regulators emphasised four key points:
1. Personal information on a public website is still subject to privacy law.
2. Social media companies and other website operators must protect personal information on their platforms from unlawful data scraping.
3. Mass data scraping incidents that harvest personal information can constitute reportable data breaches in many jurisdictions.
4. Individuals can also take steps to protect their personal information from data scraping, and social media companies have a role to play in enabling users to engage with their services in a privacy-protective manner.

The statement was endorsed by privacy regulators in Australia, Canada, the UK, Hong Kong, Mexico, New Zealand and elsewhere. https://lnkd.in/ey-5X9vw
-
HIPAA requires organizations handling sensitive health data to comply with its Privacy Rule and properly de-identify datasets before use. There are two main methods for de-identification, each with its own challenges:

1. 🔐 Safe Harbor: Removes 18 specific identifiers, providing a quicker process but significantly limiting data utility and analysis potential. The strict requirements hinder the usefulness of the dataset.
2. 📊 Expert Determination: Offers a flexible, risk-based approach that maintains the data's analytical value while protecting privacy through various remediation techniques. However, the traditional process can be slow and resource-intensive, often taking weeks or months to complete.

At Integral, we've accommodated both standards as configurations in our software so we can help you navigate the complexities of HIPAA compliance. We work closely with you to fine-tune our solution based on your unique business needs, striking the right balance between compliance and data utility.
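To illustrate the Safe Harbor route, here is a minimal sketch that drops direct identifier fields and generalizes a date and ZIP code. Only a few of the 18 HIPAA identifier categories are shown, the field names are assumptions, and this is not Integral's implementation.

```python
# Minimal sketch: Safe Harbor-style de-identification by removing direct
# identifier fields and generalizing dates and ZIP codes. Only a handful of
# the 18 HIPAA identifier categories are illustrated here.
DIRECT_IDENTIFIERS = {"name", "email", "phone", "ssn", "mrn"}


def safe_harbor_deidentify(record: dict) -> dict:
    out = {}
    for field, value in record.items():
        if field in DIRECT_IDENTIFIERS:
            continue                                # drop direct identifiers entirely
        if field == "zip":
            out[field] = str(value)[:3] + "00"      # keep only the first 3 digits
        elif field == "date_of_birth":
            out["birth_year"] = str(value)[:4]      # keep the year only
        else:
            out[field] = value
    return out


patient = {"name": "Jane Doe", "ssn": "123-45-6789", "zip": "02139",
           "date_of_birth": "1980-07-04", "diagnosis": "asthma"}
print(safe_harbor_deidentify(patient))
```

Expert Determination, by contrast, keeps more fields and instead quantifies the residual re-identification risk of the specific release.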