Novel tool will create artificial datasets for rare disease research
SynthMD tool may aid in future study of diseases like AADC deficiency
Scientists have created a novel tool called SynthMD that can be used to generate artificial datasets for rare diseases, which could help accelerate the development of new software to advance rare disease research.
A paper describing this work, titled “Synthetic datasets for open software development in rare disease research,” was published in the Orphanet Journal of Rare Diseases.
Right now, researchers who want to study very rare disorders like aromatic L-amino acid decarboxylase (AADC) deficiency are often faced with a conundrum: Because these diseases are by definition rare, there’s typically very little data available for use in analyses.
To some degree, this can be dealt with by implementing infrastructure to collect and share patients’ data. But with very rare diseases, using information from actual patients raises concerns about privacy, as it can be easy to identify individuals based on these data. Disclosure of such data may lead to “societal stigma, discrimination, or harassment,” the researchers wrote.
Now, a trio of scientists in Germany has devised a new tool that uses machine learning to generate synthetic, or artificial, datasets. In machine learning, a computer is given a dataset along with a set of mathematical rules that it uses to train itself to find patterns within the data.
“The general idea is to use [machine learning] models trained on sensitive data to generate data that mirrors important statistical properties while not containing any real-world personal information,” the researchers wrote. “While these datasets are not suitable for generating new insights into rare diseases, they can be utilized for the development and evaluation of software for rare disease research.”
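The general idea the researchers describe can be illustrated with a very small sketch. This is not SynthMD's method, which the article does not detail; it is a deliberately simple stand-in that counts how often each value appears in a dataset and then draws new, artificial records from those frequencies. The function names and the toy records below are hypothetical and were invented for illustration only.

```python
import random
from collections import Counter


def fit_marginals(records):
    """Count how often each value appears in each column."""
    columns = records[0].keys()
    return {col: Counter(r[col] for r in records) for col in columns}


def sample_synthetic(marginals, n, seed=0):
    """Draw n artificial records, sampling each column independently."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n):
        row = {}
        for col, counts in marginals.items():
            values = list(counts.keys())
            weights = list(counts.values())
            row[col] = rng.choices(values, weights=weights, k=1)[0]
        synthetic.append(row)
    return synthetic


# Hypothetical toy records, invented for illustration (not real patient data)
real_records = [
    {"sex": "F", "age_group": "0-9"},
    {"sex": "M", "age_group": "10-19"},
    {"sex": "F", "age_group": "10-19"},
    {"sex": "M", "age_group": "0-9"},
    {"sex": "F", "age_group": "20-29"},
]

marginals = fit_marginals(real_records)
synthetic_records = sample_synthetic(marginals, n=1000)
```

Because each column is sampled independently, this sketch keeps only per-column frequencies and loses any relationship between columns, and it offers no formal privacy guarantee. A trained model of the kind the researchers describe would aim to capture more of the data's statistical structure.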
SynthMD tool can help protect patient privacy in rare disease research
As a proof-of-concept for this tool, which they dubbed SynthMD, the researchers created artificial databases mirroring data for three rare diseases among patients in the U.S. The team noted that they focused on U.S. data because “a lot of statistical information is available for US citizens and the population is quite diverse.”
The three diseases used were cystic fibrosis, Duchenne muscular dystrophy, and sickle cell disease.
While these diseases are rare, they aren’t as rare as AADC deficiency, which has fewer than 200 documented cases worldwide. Instead, these disorders each affect thousands of people in the U.S. alone. However, each meets U.S. criteria as a rare disease by affecting fewer than 200,000 people in the country.
Regardless of case numbers, the team said SynthMD can be employed to create artificial datasets for any disease that has a starting set of real-world data available.
The scientists have made their tool and artificial datasets available to the wider research community in hopes of accelerating investigation into rare disorders.
“By publishing these datasets for other researchers to use in their projects we hope to contribute to resolving the dilemma around data availability and the need to develop specific privacy-enhancing technologies for sharing rare disease data,” they concluded.