How to Save More Lives and Avoid the Privacy Apocalypse

In the mid-1990s, the Group Insurance Commission (GIC), the body that insures Massachusetts government employees, released medical data to researchers describing millions of interactions between patients and the healthcare system. Such records can easily reveal highly sensitive information – psychiatric consultations, sexually transmitted infections, painkiller addictions, bedwetting – not to mention the exact timing of each treatment. So, naturally, the GIC removed names, addresses, and social security numbers from the records. Safely anonymized, they could be used to answer vital questions about which treatments work best and at what cost.

Latanya Sweeney didn’t see it that way. Then a graduate student, now a professor at Harvard University, Sweeney noticed that in a typical ZIP code of 25,000 people, most combinations of gender and date of birth (there are roughly 60,000 possible) are unique. The vast majority of people could therefore be identified by matching voter records with the anonymized medical records. For example, only one medical record showed the same date of birth, gender, and ZIP code as those of the then-governor of Massachusetts, William Weld. Sweeney made her point unmistakably by sending Weld a copy of his own supposedly anonymous medical records.
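Back-of-the-envelope arithmetic makes Sweeney’s point vivid: a typical ZIP code holds fewer people than there are plausible date-of-birth/gender combinations, so collisions are the exception, not the rule. A quick simulation shows this; it is an illustrative sketch under simplifying assumptions (birth dates spread uniformly over 80 years, two recorded genders), not Sweeney’s actual method:

```python
import random
from collections import Counter

random.seed(0)

DAYS = 365 * 80      # plausible birth dates over ~80 years (simplification)
COMBOS = DAYS * 2    # times two recorded genders: about 58,400 combinations
PEOPLE = 25_000      # population of a typical ZIP code in Sweeney's argument

# Assign each resident a random (date of birth, gender) combination.
population = [random.randrange(COMBOS) for _ in range(PEOPLE)]

# Count how many residents share their combination with nobody else.
counts = Counter(population)
unique = sum(1 for c in counts.values() if c == 1)
print(f"{unique / PEOPLE:.0%} of residents have a unique combination")
```

Even with these crude assumptions, well over half the simulated residents are uniquely pinned down by date of birth and gender alone; in real data, where birth dates cluster by age structure and records include more fields, the fraction is higher still.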

There are many such stories in data-science circles. Large datasets can easily be deanonymized; this fact is as obvious to data scientists as it is surprising to non-specialists. The more detailed the data, the easier and more reliable deanonymization becomes.

But this particular problem has an equal and opposite possibility: the better the data, the more useful it is for saving lives. Reliable data can be used to evaluate new treatments, identify emerging problems, improve quality, and work out who is most at risk of side effects. Yet seizing this opportunity without triggering a privacy apocalypse – and a justified backlash from patients – seems impossible.

Not so, says Professor Ben Goldacre, director of the Bennett Institute for Applied Data Science at the University of Oxford. Goldacre recently led a review of the use of UK health data for research which proposed a solution. “It’s almost unique,” he told me. “A real opportunity to have your cake and eat it.” The British government seems to agree, and has accepted Goldacre’s recommendations with apparent enthusiasm.

At the moment we have the worst of both worlds: researchers struggle to access data because the custodians of patient records are (fairly enough) hesitant to share them. Yet leaks are almost inevitable, because control over who holds what data, and when, is patchy.

What does the Goldacre review suggest? Instead of emailing millions of patient records to anyone who promises to behave, the records would be stored in a secure data vault. An approved research team wanting to understand, say, the severity of a new Covid variant in vaccinated, unvaccinated, and previously infected people would write analytic code and test it on dummy data until it was proven to work. When ready, the code would be submitted to the data vault and the results returned. The researchers would never see the original data. Meanwhile, the entire research community could see that the code had been run, and could review, share, reuse, and adapt it.
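The pattern described above can be sketched in a few lines of code. To be clear, the names below (`SecureVault`, `severity_by_vaccination`) are invented for illustration and are not the real OpenSAFELY API; the point is the shape of the workflow: code goes into the vault, aggregate results come out, and the raw records never leave.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class SecureVault:
    """Holds raw patient records; only aggregate results ever leave."""
    records: list

    def run(self, analysis: Callable) -> dict:
        # The researcher's code executes inside the vault; only its
        # aggregate output is returned, never the records themselves.
        return analysis(self.records)

# Researchers develop and debug their analysis against dummy data...
dummy = [{"vaccinated": True, "severe": False},
         {"vaccinated": False, "severe": True}]

def severity_by_vaccination(records):
    # Example analysis: rate of severe cases among vaccinated patients.
    vaxed = [r for r in records if r["vaccinated"]]
    return {"severe_rate_vaccinated": sum(r["severe"] for r in vaxed) / len(vaxed)}

print(severity_by_vaccination(dummy))  # sanity check on dummy data

# ...then submit the same, unchanged function to the vault of real data.
real_vault = SecureVault(records=[{"vaccinated": True, "severe": False},
                                  {"vaccinated": True, "severe": True}])
print(real_vault.run(severity_by_vaccination))
```

Because the same function runs on dummy and real data, it can be published, reviewed, and reused openly even though the real records stay locked away.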

This approach is called a Trusted Research Environment, or TRE. The concept is not new, says Ed Chalstry, a data scientist at the Alan Turing Institute. The Office for National Statistics has a TRE called the Secure Research Service that allows researchers to analyze census data safely. Goldacre and his colleagues developed another, called OpenSAFELY. What is new, says Chalstry, is the huge datasets now becoming available, including genomic data. Anonymizing such data is simply hopeless, yet the opportunity it offers is golden. The time seems ripe, then, for more widespread use of TREs.

The Goldacre review recommends that the UK build trusted research environments with four goals: to earn justified trust from patients; to let researchers analyze data without waiting years for permission; to make the validation and sharing of analytical tools something that happens by design; and to develop a community of data research professionals.

The National Health Service has an enviably extensive collection of patient records. But can it build TRE platforms? Or will the government simply outsource the project wholesale to some tech giant? Top-down outsourcing would do little for patient trust or for the sharing of open-source academic tools. Goldacre’s review states that “there is no single contract that can transfer responsibility to some external machine. Building great platforms should be seen as a core activity in its own right.”

Inspirational stuff, even if the history of government data projects isn’t exactly reassuring. But the opportunity is clear enough: a new kind of data infrastructure that would protect patients, speed up research, and help create a community of health data scientists the world would envy. If it works, people will be sending the health secretary letters of thanks, rather than copies of his own medical records.

Written for and first published in the Financial Times on 1 July 2022.

The paperback of The Next 50 Things That Created the Modern Economy is now out in the UK.

“Infinitely insightful and full of surprises, exactly what you would expect from Tim Harford.” – Bill Bryson

“Witty, informative, and endlessly entertaining, this is popular economics at its most engaging.” – The Daily Mail

I have set up a storefront on Bookshop in the United States and the United Kingdom – take a look at all my recommendations; Bookshop is set up to support local independent retailers. Links to Bookshop and Amazon may generate referral fees.