More, Faster, Cheaper Eureka Moments: Using Time-Series Data to Transform Clinical Trials
Many of the barriers to faster, cheaper, more effective clinical trials can be broken down by using time-series vector data.
By Nataraj Dasgupta, VP of Advanced Analytics, Syneos Health®, with Mark Palmer
Running clinical trials on behalf of pharmaceutical companies is a non-trivial task. As of May 2023, there were more than 450,000 clinical trials registered globally on clinicaltrials.gov, with more than 64,000 studies actively recruiting. Trials are more globally distributed than ever, and the average number of eligibility criteria to participate in a trial has grown to 50.
In the US, only approximately 3 to 5 percent of physicians and patients take part in clinical trial research. This is one reason why 9 out of 10 trials double their original timeline in order to meet enrollment goals, and why 11 percent of research sites fail to activate; that is, to enroll a single patient. These barriers to faster clinical trials are costly to patients in need as well as to the bottom line. Delays can cost trial sponsors between $600,000 and $8 million per day.
Making a meaningful difference in trial enrollment requires more than a few tweaks to the edges of current, manually intensive models. In the case of Syneos Health®, a leading fully integrated biopharmaceutical solutions organization, an AI-driven application is providing new ways to imagine what’s possible in the search for ways to make clinical trials cheaper and faster.
The art of clinical trial selection: good guesses
Syneos Health is one of the world’s largest Clinical Research Organizations with 29,000 employees. When Syneos Health acquired RxDataScience, it was in part to help the firm identify the best possible subjects for the trials in which their sponsors – pharma and biotech companies – were enrolling. Avoiding poorly structured trials can save millions in patient recruitment costs, patient retention expenses, trial protocol revisions, and costly trial timelines.
The scale of the data we faced was enormous: 200 billion HIPAA-compliant, deidentified patient-level medical and pharmacy claims records from almost 300 million patients. The International Classification of Disease (ICD) codes, Procedure codes, and National Drug Codes that track patient progress through all stages of their health journey within the US government and private healthcare systems are the basis of this data. It’s a huge corpus of information we can use to gain insights into the health of the broader population.
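To make the shape of this data concrete, here is a minimal sketch of a deidentified claims record. The field names and the example codes are illustrative assumptions, not the production schema; real claims feeds carry many more fields.

```python
from dataclasses import dataclass
from datetime import date

# Hypothetical, simplified shape of a deidentified claims record.
@dataclass
class ClaimRecord:
    patient_id: str      # deidentified patient token
    service_date: date   # when the encounter occurred
    icd_code: str        # ICD diagnosis code, e.g. "U07.1" (COVID-19)
    procedure_code: str  # procedure code for the encounter
    ndc_code: str        # National Drug Code for any dispensed drug
    provider_id: str     # treating provider
    payer_id: str        # insurer

rec = ClaimRecord("p001", date(2021, 3, 14), "U07.1", "87635", "", "prov42", "pay7")
print(rec.icd_code)  # "U07.1"
```

Every patient's journey is the time-ordered sequence of such records, which is what makes the dataset amenable to time-series treatment.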
Extreme agility: three months to final product
Our data team had marching orders to develop an easy-to-use suite of tools that provide insights into patient journeys and can be used to pinpoint highly specific cohorts of patients for trials. Unlike traditional practices for identifying trial sites and recruiting patients (most based on a combination of trial manager experience and trial and error), our goals were to:
Create ad-hoc patient cohorts based on complex inclusion and exclusion criteria
Enrich the data with provider and payer information for commercial insights
Add clinical trial site-level information to support predictive analytics:
Ascertain the best countries to conduct trials
Evaluate the best clinical trial sites in these countries
Assess site feasibility, activation time, and non-enrollment probability
Conduct trial-patient matching to optimize activation time
Ensure clinical trial sites had a good mix of diverse patients
The FDA guidance on increasing Diversity in Clinical Trials made the last point even more compelling. But there was only one wrinkle. We had three months to build the entire system.
The special sauce: time series data
Based on my background in high-frequency trading, we chose a time-series vector database to serve as the core of the solution.
Why a time-series vector database? Time-series are an ideal way to organize trial data. Fundamentally, the questions we need to ask are based on patient journeys, which happen in sequence. We need to ask questions like: what happened in this trial first, then next, then next? Why did patients drop out of this trial versus this other, nearly identical one? Does this kind of protocol create too much patient burden? A foundational understanding of time, order, and sequence is essential.
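The "first, then next" style of question can be sketched with a toy event log. This is plain Python standing in for what kdb+ does natively; the event names and data are illustrative assumptions.

```python
from collections import defaultdict
from datetime import date

# Toy event log: (patient_id, event_date, event). In a time-series database
# these rows would already be stored in time order per patient.
events = [
    ("p1", date(2021, 1, 5),  "diagnosis"),
    ("p1", date(2021, 1, 20), "prescription"),
    ("p1", date(2021, 3, 1),  "dropout"),
    ("p2", date(2021, 2, 2),  "diagnosis"),
    ("p2", date(2021, 2, 9),  "prescription"),
]

# Group each patient's events and sort by time to recover the journey.
journeys = defaultdict(list)
for pid, d, ev in sorted(events, key=lambda r: (r[0], r[1])):
    journeys[pid].append(ev)

# Sequence questions become simple scans over the ordered journey,
# e.g. "who dropped out after receiving a prescription?"
dropped_after_rx = [p for p, j in journeys.items()
                    if "prescription" in j
                    and "dropout" in j[j.index("prescription"):]]
print(journeys["p1"])    # ['diagnosis', 'prescription', 'dropout']
print(dropped_after_rx)  # ['p1']
```

Once order is a first-class property of storage, these questions need no expensive sorting or joins at query time.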
Why vector? Because representing data as columnar vectors stored in individual files delivers much higher query performance than data stored in a row-based, tuple-oriented format.
A hybrid vector-and-time-series data solution was critical for us. Time is both a vector and a characteristic of any data object, so a time-series vector database was a perfect fit. Vectors index and store structured and unstructured data (providers, facilities, drug types, ICD codes, insurers, etc.) as numerical, enumerated values that can be searched in columnar form.
It worked. Vector processing sped up our time-to-insight, in many cases by a factor of 100 compared to contemporary and traditional databases, thanks to its foundational understanding of order.
In addition to vectors, we needed an in-memory compute engine, a real time streaming processor, and an expressive query and programming language. With an ordinary database solution, it would be natural to anticipate a large data team, a cluster of many servers, and a lot of big data tools to carry it off. Instead, we needed just one server, with one big data platform that included one database.
There was one more challenge left to be solved – a front-end application with kdb+ integration. Fortunately, at RxDataScience, we had previously developed a unique JavaScript low-code framework for web-based applications featuring native support for kdb+. Using this framework solved one of the more time-consuming parts of an enterprise app development cycle, streamlining our entire operation.
Fast cohort discovery, by example
Two months later, our solution was operational. Our users can now choose which healthcare networks to search, select precise patient eligibility criteria, and get nearly instant insights that can help them design more effective trials. And we can do it 96 percent faster at a tiny fraction of the cost.
For example, we could create a simple cohort of patients…
In North Carolina,
who had contracted COVID-19 over a one-year period,
within 30 days of contact with another COVID-19 patient,
and who did not lose their sense of taste or smell.
We could carve out specific providers and facilities that treated these patients, the drugs they were prescribed and which insurers they used, and all the patient demographics, all with point-and-click functionality.
And we can gain this insight in seconds from 200 billion records.
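A minimal sketch of that cohort query, in plain Python on toy data: the row layout, dates, and the flattening of "contact with another COVID-19 patient" into a single contact date are illustrative assumptions, not the production logic.

```python
from datetime import date, timedelta

# Toy rows: (patient_id, state, covid_date, contact_date, lost_taste_smell)
rows = [
    ("p1", "NC", date(2021, 4, 1), date(2021, 3, 10), False),
    ("p2", "NC", date(2021, 5, 1), date(2021, 2, 1),  False),  # contact too early
    ("p3", "NC", date(2021, 6, 1), date(2021, 5, 15), True),   # lost taste/smell
    ("p4", "NY", date(2021, 7, 1), date(2021, 6, 20), False),  # wrong state
]

start, end = date(2021, 1, 1), date(2021, 12, 31)  # one-year window

cohort = [pid for pid, st, covid, contact, anosmia in rows
          if st == "NC"                                          # inclusion: state
          and start <= covid <= end                              # inclusion: window
          and timedelta(0) <= covid - contact <= timedelta(30)   # contact within 30 days
          and not anosmia]                                       # exclusion criterion
print(cohort)  # ['p1']
```

The production system evaluates the same kind of predicate as columnar scans over ordered vectors, which is what makes the interactive response times possible.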
Based on our analysis of contemporary database technologies in use across the industry, the time and cost savings were profound. These queries used to cost $50 per question and take more than two hours to execute. The kdb+-based solution takes 5 minutes, at a cost of a few cents. It is a powerhouse: an enterprise-grade EHR cohort analysis platform, no less feature-rich than popular EHR/EMR solutions, a multi-billion-dollar industry pervasive in pharma – built in-house at 1/10th the cost, in record time, yet delivering 10x the performance.
I recall one of our first inter-department conference calls to demonstrate the new system. After the session, a senior leader remarked with a mix of resignation and awe, “I thought this was a call to discuss the project plan… It looks like the project’s all done?!”
At first, some didn’t believe the metrics – the performance, financial, and agility figures appeared too good to be true. But after a year of enterprise-wide use, the proof is in the numbers: 1,000+ users explore clinical trials every day, and we design 8,000+ cohorts every year, at a cost that is orders of magnitude lower than the traditional approach adopted by other global EHR and EMR platform and data providers.
Time-series data warehousing
For applications where time is a major factor, it simply doesn’t make sense to build around a database model designed for rows and columns. Vectors are the right organizational construct for such healthcare data.
Because of this, our vector-based infrastructure has moved from the lab to be our primary system for data warehousing. We use it not only to plan new trials, but to understand the historical context of factors that impact trials, predict the best conditions for new trials, and to minimize risk and avoid errors in the context of geographic, cultural, or regulatory factors.
If you’re looking for a way to enable differentiating insights in your organization while helping users ask questions quickly in the way they think about them, consider a time-series vector database. They’re not just for quants anymore.
Vector Database Central is a reader-supported publication sponsored by KX.