How Synthetic Data Aids in Healthcare

Using synthetic data means important analysis and innovation can be done without associating particular people with their medical records.

Artsiom Balabanau

August 1, 2021

Finance and insurance companies have been leveraging synthetic data for many years to improve their workflows while ensuring information confidentiality. With the COVID-19 pandemic, scientists who are striving to find ways to combat the virus have considered synthetic data. How can this technology be of use in healthcare, and how does it help to cope with the pandemic?

What Synthetic Data Is

Without going into convoluted definitions, synthetic data is artificially generated data. It is similar to real data but doesn’t copy it. Synthetic data is generated automatically by dedicated algorithms. It can take the form of text, video, images, audio or tabular data.

Synthetic data can be applied in various areas. Waymo uses it to train its driverless cars. American Express uses artificially generated financial information to improve its fraud detection system. Synthetic data helps companies calculate risk accurately while protecting real customers’ data. The OpenAI team has taught the language model GPT-3 to compose texts similar to those that a human would write. A program belonging to Nvidia creates photos of people based on images of real individuals.

In healthcare, using synthetic data means that important analysis can be done without associating particular people with their medical records. After the outbreak of the coronavirus pandemic, the need for applying such data in healthcare has increased.

Synthetic Data in Healthcare

Secure data exchange is one of the major concerns in healthcare. Under the General Data Protection Regulation (GDPR) and the Health Insurance Portability and Accountability Act (HIPAA), confidential information can’t be disclosed without the consent of the person it belongs to.

Information from patient records must be stored and transferred securely. Using the data without specifying the name of the patient is prohibited, too, as it is still possible to identify an individual from the other details in the data set.
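To illustrate why removing names is not enough, here is a minimal sketch in Python. The records, the field names and the choice of ZIP code, birth year and sex as identifying details are all invented for this example; they are assumptions, not taken from any real data set or regulation. The function simply flags records whose combination of those details appears only once, which is what makes a nameless record traceable to one person.

```python
from collections import Counter

# Invented records with names already removed (illustrative only).
records = [
    {"zip": "02139", "birth_year": 1954, "sex": "F", "diagnosis": "diabetes"},
    {"zip": "02139", "birth_year": 1954, "sex": "F", "diagnosis": "asthma"},
    {"zip": "02139", "birth_year": 1987, "sex": "M", "diagnosis": "hypertension"},
    {"zip": "02141", "birth_year": 1990, "sex": "F", "diagnosis": "migraine"},
]

# Assumption: these fields could be matched against outside sources.
QUASI_IDENTIFIERS = ("zip", "birth_year", "sex")


def unique_records(rows, keys=QUASI_IDENTIFIERS):
    """Return rows whose combination of quasi-identifiers occurs exactly once."""
    counts = Counter(tuple(row[k] for k in keys) for row in rows)
    return [row for row in rows if counts[tuple(row[k] for k in keys)] == 1]


at_risk = unique_records(records)
print(f"{len(at_risk)} of {len(records)} records are unique on {QUASI_IDENTIFIERS}")
```

In this toy data set, the two records that share a ZIP code, birth year and sex hide each other, while the other two stand alone and could in principle be linked back to a person.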

That’s why it is more lawful and secure for researchers to create synthetic data as they conduct studies crucial for humanity. Prototypes of machine learning software are trained on synthetic data so they can work with real patient data later. Developers don’t have access to the real information -- they can’t read it, extract it from software or use it in any other way.

See also: Avoiding Data Breaches in Healthcare

How Synthetic Data Is Generated

If the true patient is not at risk of being identified, information from real medical records can serve as the basis of synthetic data, though joint case records are much more commonly used. Another approach is the one Mitre offers via Synthea, an open-source tool that creates fictional patients from publicly available information: scientific research data, disease statistics, demographics and so on. Although the generated data set is not as reliable as “fakes” of real medical records, the platform continues to be improved under the auspices of the U.S. government.
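The idea of building fictional patients from published statistics can be sketched in a few lines of Python. This is not how Synthea works internally; it is a simplified illustration, and the age distribution, the prevalence rates and the function names below are all hypothetical values made up for the example.

```python
import random

# Hypothetical aggregate statistics, invented for illustration only.
AGE_BANDS = {"0-17": 0.22, "18-39": 0.30, "40-64": 0.31, "65+": 0.17}

# Assumed share of each age band that has a given condition.
CONDITION_RATES = {
    "0-17": {"asthma": 0.08, "diabetes": 0.01},
    "18-39": {"asthma": 0.07, "diabetes": 0.03},
    "40-64": {"asthma": 0.07, "diabetes": 0.12},
    "65+": {"asthma": 0.06, "diabetes": 0.25},
}


def generate_patient(rng, patient_id):
    """Draw one fictional patient from the aggregate statistics."""
    band = rng.choices(list(AGE_BANDS), weights=list(AGE_BANDS.values()))[0]
    conditions = [
        name for name, rate in CONDITION_RATES[band].items() if rng.random() < rate
    ]
    return {
        "id": f"synthetic-{patient_id:04d}",
        "age_band": band,
        "conditions": conditions,
    }


def generate_cohort(size, seed=0):
    """Generate a reproducible list of fictional patients."""
    rng = random.Random(seed)
    return [generate_patient(rng, i) for i in range(size)]


cohort = generate_cohort(1000, seed=42)
print(cohort[0])
```

No real record is read at any point: every patient is sampled from population-level numbers, so the output resembles a plausible population without corresponding to any actual person. Real generators model far more detail, such as how conditions develop over time.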

Although synthetic data is not suitable for studying real diseases and treatment methods, it can be the basis for the development of applications that allow for using real data without breaking the law.

Thus, synthetic data opens access to research and development of new technologies in healthcare.

Practical Applications of Synthetic Data

Soon after the pandemic outbreak, Israeli scientists began testing synthetic data technology based on EMRs from the last 20 years. Sheba Medical Center -- the country’s largest hospital -- used the MDClone platform to synthesize the data of its coronavirus patients.

The healthcare facility invited analysts who collected all the information about the virus from the data set. The result of the cooperation of medical researchers and software developers was an algorithm that helps the hospital staff decide when to prescribe medications or when inpatient treatment is needed.

The software allowed Sheba Medical Center to combine the data from its EMRs with the data belonging to another Israeli healthcare organization -- Maccabi HealthCare Services. This gave scientists a broad view of how the disease progresses in individual patients, helping estimate coronavirus outcomes for each person. Without synthetic data technology, the project would have taken much longer, as permission to use confidential information would have been required.

Of course, medical scientists can’t rely solely on synthetic data in their research, but the data lets them easily analyze an unlimited number of hypotheses that can lead to significant time savings during the approval of new drugs for real patients.

See also: Wake-Up Call on Ransomware

Although some data security experts doubt that synthetic data in healthcare can ensure patients’ anonymity, this data is extremely useful in prognostications, survival analysis, clinical trials, decision-making and more. Such technologies will accelerate innovation in healthcare while helping scientists comply with legislation.