Johns Hopkins investigators have created AstroID, an open-source database that brings together lab results, imaging and genetic data with patient histories. The goal is to make it far easier for scientists worldwide to study cancer at scale and over time.
A new open-source database from Johns Hopkins researchers could help scientists around the world ask bigger, smarter questions about cancer — and get answers faster.
Investigators at the Johns Hopkins Kimmel Cancer Center and The Johns Hopkins University have developed a novel data structure, called AstroID, that pulls together many different types of cancer information in one place. The system is designed to handle everything from lab results and genetic sequencing to imaging data, all linked to a patient’s clinical journey over time.
The team has already deployed AstroID in their own labs for 16 different patient groups with multiple tumor types. They report that more than 1 billion cancer cells have been spatially mapped and tagged with clinical information, giving researchers a powerful way to connect what happens in the lab with what happens to patients.
At its core, AstroID is about solving a problem that has long slowed cancer research: data scattered across systems, studies and time.
In oncology, a single patient’s care often stretches over years and includes many visits, treatments, scans and blood draws. To discover and validate biomarkers — biological signals that can predict how a cancer will behave or respond to treatment — scientists need to link all of those clinical events to multiple tests and assessments, including blood-based lab values, tissue pathology, radiology, genomic studies and more.
Janis M. Taube, director of the Division of Dermatopathology and co-director of the Tumor Microenvironment Laboratory at the Bloomberg~Kimmel Institute for Cancer Immunotherapy, said the new structure changes what is possible with that information.
“What this structure does is allow me to ask questions across all of this data that’s already been gathered, and across tumor types, and combine it all together in the context of the longitudinal patient experience,” Taube said in a news release.
AstroID organizes information in six hierarchical tiers. At the top is deidentified patient information, followed by diagnosis details and clinical events such as treatments or blood draws. Below that are the specimens themselves — for example, tissue from a biopsy or blood for serology — and then the way those specimens are processed in the lab into tissue blocks, vials, individual slides or aliquots.
By mapping each step in this chain, the system makes it possible to trace any data point back through the patient’s experience and the physical samples it came from. That structure is built within REDCap, a widely used web-based application for managing research data, and can be scaled to accommodate thousands of patients and the spatial characterization of billions of cells.
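To make the idea of tracing a data point back through the tiers concrete, here is a minimal Python sketch. It is not AstroID's actual schema or code: the tier names paraphrase this article's description, the split of lab processing into two tiers is an assumption, and the `Record` class, `trace` function and all record IDs are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional

# Tier names paraphrase the article; the real AstroID field names may differ.
# Splitting lab processing into two tiers is an assumption made for this sketch.
TIERS = [
    "patient",
    "diagnosis",
    "clinical_event",
    "specimen",
    "processed_sample",   # e.g. tissue block or vial
    "slide_or_aliquot",
]


@dataclass(frozen=True)
class Record:
    """One entry in a tier, linked to its parent in the tier above (hypothetical)."""
    tier: str
    record_id: str
    parent: Optional["Record"] = None

    def __post_init__(self) -> None:
        if self.tier not in TIERS:
            raise ValueError(f"unknown tier: {self.tier}")
        depth = TIERS.index(self.tier)
        if depth == 0:
            if self.parent is not None:
                raise ValueError("a patient record has no parent")
        elif self.parent is None or self.parent.tier != TIERS[depth - 1]:
            raise ValueError(f"{self.tier} must hang off a {TIERS[depth - 1]} record")


def trace(record: Record) -> List[Record]:
    """Walk parent links upward and return the chain ordered from patient down."""
    chain = []
    node: Optional[Record] = record
    while node is not None:
        chain.append(node)
        node = node.parent
    return chain[::-1]


# Made-up identifiers, for illustration only.
patient = Record("patient", "P-001")
diagnosis = Record("diagnosis", "DX-01", patient)
event = Record("clinical_event", "EV-03", diagnosis)
specimen = Record("specimen", "SP-07", event)
block = Record("processed_sample", "BLK-02", specimen)
slide = Record("slide_or_aliquot", "SL-11", block)

lineage = [r.record_id for r in trace(slide)]
print(" -> ".join(lineage))
```

Because every record points to exactly one parent in the tier above, any measurement attached to a slide or aliquot can be walked back to the deidentified patient it came from.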
Before AstroID, Taube and her colleagues often had to rebuild data sets from scratch when they wanted to ask new questions.
Her lab, for example, frequently studies patients with melanoma. If they ran a study a decade ago focused on age at diagnosis and therapies received, and later wanted to examine survival in the same group, they might have had to reassemble the cohort, re-collect information on treatments, specimens and outcomes, and reconcile overlapping efforts with other teams.
“Investigators across the whole institution are also trying to tap into these patients and collect this information,” Taube added. “There were really huge inefficiencies across how we were working, and duplicating efforts.”
Those inefficiencies also limited the size of many studies. Manually entering and curating data is painstaking, so researchers often focused on relatively small patient cohorts. That can make it harder to detect subtle patterns, especially in complex diseases like cancer.
The goal was to break through those limits, according to Alexander Szalay, Bloomberg Distinguished Professor and professor in the Department of Computer Science at Johns Hopkins and director of the Institute for Data Intensive Engineering and Science.
“What we are trying to do is to scale out so we can handle patients on the order of hundreds or thousands of patients in a study,” Szalay said in the news release.
To do that, the team needed a structure that could keep clinical and specimen data organized, flexible and ready for large-scale analysis.
Szalay credited early-career researchers with a key insight.
“One of our postdoctoral students, Elizabeth Will, in partnership with graduate student Benjamin Green, came up with this wonderful idea of how to organize all the medical and specimen data into multiple hierarchical tiers, which then can be easily translated to a query-oriented platform based on a large relational database,” he said.
That design allows AstroID to serve as a bridge between the day-to-day work of clinicians and lab scientists and the powerful tools of modern data science. Once data are exported, they can be explored on their own for research on clinical outcomes, or merged and queried alongside a wide range of scientific correlates, such as molecular markers or imaging features.
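As a rough illustration of what merging exported clinical data with a scientific correlate could look like in a relational database, the sketch below uses Python's built-in sqlite3 module. The table names, columns, the "CD8_density" marker and every row of data are invented for this example; the article does not describe AstroID's export format or which database engine the team uses.

```python
import sqlite3

# In-memory database with a simplified, hypothetical subset of the tiers
# plus one correlate table. None of these names come from AstroID itself.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE patient (
    patient_id TEXT PRIMARY KEY,
    tumor_type TEXT
);
CREATE TABLE clinical_event (
    event_id   TEXT PRIMARY KEY,
    patient_id TEXT REFERENCES patient(patient_id),
    event_type TEXT
);
CREATE TABLE specimen (
    specimen_id TEXT PRIMARY KEY,
    event_id    TEXT REFERENCES clinical_event(event_id)
);
CREATE TABLE marker_result (
    specimen_id TEXT REFERENCES specimen(specimen_id),
    marker      TEXT,
    value       REAL
);
""")

# Fabricated rows, for illustration only.
conn.executemany("INSERT INTO patient VALUES (?, ?)",
                 [("P1", "melanoma"), ("P2", "melanoma"), ("P3", "lung")])
conn.executemany("INSERT INTO clinical_event VALUES (?, ?, ?)",
                 [("E1", "P1", "biopsy"), ("E2", "P2", "biopsy"), ("E3", "P3", "biopsy")])
conn.executemany("INSERT INTO specimen VALUES (?, ?)",
                 [("S1", "E1"), ("S2", "E2"), ("S3", "E3")])
conn.executemany("INSERT INTO marker_result VALUES (?, ?, ?)",
                 [("S1", "CD8_density", 10.0),
                  ("S2", "CD8_density", 20.0),
                  ("S3", "CD8_density", 40.0)])


def mean_marker_by_tumor_type(db, marker):
    """Join the correlate back up through specimen and event to the patient tier."""
    query = """
        SELECT p.tumor_type, AVG(m.value)
        FROM marker_result AS m
        JOIN specimen       AS s ON s.specimen_id = m.specimen_id
        JOIN clinical_event AS e ON e.event_id    = s.event_id
        JOIN patient        AS p ON p.patient_id  = e.patient_id
        WHERE m.marker = ?
        GROUP BY p.tumor_type
        ORDER BY p.tumor_type
    """
    return db.execute(query, (marker,)).fetchall()


result = mean_marker_by_tumor_type(conn, "CD8_density")
print(result)
```

The point of the sketch is the join path: because each specimen is tied to a clinical event and each event to a patient, a lab measurement can be grouped by any patient-level attribute without rebuilding the cohort by hand.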
Although the Johns Hopkins team is currently using AstroID for cancer studies, they note that the same structure could be adapted for any disease that involves tracking biospecimens and clinical events over time. Chronic conditions like autoimmune diseases, neurodegenerative disorders or long COVID could all benefit from a similar approach to organizing and linking data.
AstroID is also intentionally open. The code is publicly available on GitHub, and additional documentation has been released so other institutions can adopt or adapt the system for their own research programs. By sharing the framework, the Johns Hopkins team hopes to accelerate discovery beyond their own campus and encourage more standardized, interoperable data practices across the field.
The work behind AstroID, published in the Journal for Immunotherapy of Cancer, brought together clinicians, pathologists, computer scientists and biostatisticians, reflecting the increasingly interdisciplinary nature of modern cancer research.
As more centers begin to generate massive imaging and sequencing data sets, tools like AstroID may become essential infrastructure. Instead of letting valuable information sit in separate silos, researchers can connect it, re-use it and build on it, study after study.
For patients, the hope is that this kind of data integration will translate into more precise biomarkers, better predictions of who will benefit from which therapies and, ultimately, more personalized and effective cancer care.
Source: Johns Hopkins Medicine
