Researchers accidentally leak UK Biobank medical records on GitHub, risking volunteer privacy

Sensitive health data from the UK Biobank, a vast repository of medical records from 500,000 British volunteers, has repeatedly been exposed on public platforms such as GitHub because of researchers' mistakes. A Guardian investigation uncovered dozens of such incidents, with UK Biobank issuing more than 80 legal takedown notices to GitHub between July and December 2025 alone, targeting hundreds of repositories worldwide. According to a privacy researcher who is tracking these events and wrote about them on Hacker News, the organization has filed 110 DMCA notices so far, affecting 197 code repositories belonging to 170 developers.

The leaks stem from researchers who were granted approved access to de-identified UK Biobank data, which includes genome sequences, scans, blood sample data, lifestyle details, and hospital records, to study diseases such as cancer, dementia, and diabetes. Until late 2024, scientists could download this data to their local computers for analysis using tools like R or Python. In their rush to share code on GitHub to meet journal and funder mandates for open science, many accidentally committed data files that should have stayed out of version control, such as CSV exports of hospital episode statistics. One prominent example was a dataset covering diagnoses and surgery dates for more than 413,000 participants, along with each participant's gender and month and year of birth, which remained online until it was taken down.

While no full names or addresses were exposed, the partial information raises serious re-identification risks. As reported by the Guardian, cross-referencing the leaked hospital diagnoses with publicly available details about a volunteer, such as surgery history and month and year of birth, allowed investigators to pinpoint specific records. This has raised alarm about privacy in large-scale health research, especially as the UK government recently expanded Biobank access to GP records. European universities, key users of the resource, now face heightened scrutiny over data handling ethics and training gaps during the shift to secure cloud platforms like UK Biobank's Research Analysis Platform.
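To see why a few ordinary attributes can be enough to single someone out, the short Python sketch below counts how many records in a table are unique on a chosen set of fields. It is only an illustration: the function name, the field names, and the four records are invented for this example, and nothing here reflects UK Biobank's actual data layout or the Guardian's method.

```python
from collections import Counter


def unique_record_share(records, quasi_identifiers):
    """Return the fraction of records that are the only one with their
    particular combination of values for the given fields."""
    keys = [tuple(r[q] for q in quasi_identifiers) for r in records]
    counts = Counter(keys)
    unique = sum(1 for k in keys if counts[k] == 1)
    return unique / len(keys) if keys else 0.0


# Entirely synthetic records, made up for illustration only.
records = [
    {"gender": "F", "birth": "1950-03", "surgery_date": "2012-05-01"},
    {"gender": "F", "birth": "1950-03", "surgery_date": "2014-09-17"},
    {"gender": "M", "birth": "1948-11", "surgery_date": "2012-05-01"},
    {"gender": "M", "birth": "1948-11", "surgery_date": "2012-05-01"},
]

# Gender and birth month alone: every record shares its values with another.
share_basic = unique_record_share(records, ["gender", "birth"])

# Adding one surgery date makes half of the toy records unique.
share_with_date = unique_record_share(
    records, ["gender", "birth", "surgery_date"]
)
```

In this toy table no record is unique on gender and birth month alone, but two of the four become unique once a surgery date is added. Real datasets with many dated hospital events per person would be expected to show a much stronger effect.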

UK Biobank, founded in 2003 by the Department of Health and research charities, strictly prohibits sharing participant data outside its systems and requires researchers to sign confidentiality agreements. In response to the leaks, the organization has stepped up its measures: issuing takedown requests, launching a Git Audit Tool to scan repositories for exposures, publishing best-practice guides for GitHub use, and providing extra training. A BBC report also confirmed a related incident in which data from 500,000 people was listed for sale in China, though no personally identifiable information was released.

The incidents show how difficult it is to balance open research with data security. The persistence of leaks, even after hundreds of removals, highlights how common lapses can undermine safeguards, such as an incomplete .gitignore file, the configuration that tells Git which files to leave out of a repository. Experts are calling for stronger anonymization standards, better enforcement, and more transparency to protect volunteers whose contributions fuel global medical breakthroughs.
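As a rough illustration of the kind of check that can catch such slips before code is published, the Python sketch below flags file paths whose extensions suggest data rather than code. It is a minimal sketch, not UK Biobank's Git Audit Tool: the extension list and function names are assumptions chosen for this example, and a real project would tune them to its own file types.

```python
import subprocess
from pathlib import PurePosixPath

# Hypothetical list of extensions that often hold tabular or genomic data.
DATA_SUFFIXES = {
    ".csv", ".tsv", ".parquet", ".rds", ".rdata",
    ".dta", ".sav", ".xlsx", ".vcf", ".bgen",
}


def flag_data_files(paths):
    """Return the paths whose extensions suggest a data file.

    Every suffix is checked, so compressed files such as
    'export.csv.gz' are flagged as well.
    """
    flagged = []
    for p in paths:
        suffixes = [s.lower() for s in PurePosixPath(p).suffixes]
        if any(s in DATA_SUFFIXES for s in suffixes):
            flagged.append(p)
    return flagged


def staged_paths():
    """List the files currently staged for commit in the working repository."""
    result = subprocess.run(
        ["git", "diff", "--cached", "--name-only"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.splitlines()
```

A researcher could call flag_data_files(staged_paths()) from a pre-commit hook and stop the commit whenever the result is not empty. A check like this complements a .gitignore file rather than replacing it, and it cannot detect data pasted into notebooks or hard-coded in scripts.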

Looking ahead, UK Biobank's ongoing audits and researcher education aim to curb future exposures, but the incidents expose broader tensions in an era of collaborative science. With data still appearing online despite these efforts, affected volunteers and the research community are waiting for robust fixes to restore trust in this critical resource.