Which words are most often associated with intended use statements?

When I was working on my master’s degree in data science, one of the projects I did was to create an algorithm that would take a medical device claim and predict whether the FDA would require a clinical trial. It worked quite well, with about 95% accuracy.

That algorithm is dynamic: the user enters an intended use claim and receives a prediction of the FDA’s decision. This month I wanted to try a related but static task: create a word cloud showing which words are most associated with intended use claims for which the FDA required a clinical trial. In theory, at least, this static representation could give you a sense of which words in an intended use claim are more likely to propel your device toward a clinical trial.

Results

In this word cloud, the size of a word reflects how much more frequently it occurs in intended use claims for devices with a clinical trial than in intended use claims for devices without one.

Methodology

I started with the FDA 510(k) database, and my first task was to extract the intended use statements from the PDFs on the FDA website. This is a lot of work, but I won’t bore you with the details here. It involves first reading all the PDF summaries of the 510(k) available on the FDA website, and then extracting the intended use statements from those PDFs. I hope that one day the FDA will introduce a little structure to the 510(k) summaries. Even a little structure would be a great help to researchers. It would also introduce some consistency to the intended use statements, where regulatory compliance varies widely.

The next task was to identify those 510(k)s where a clinical trial had been conducted. This is both easy and difficult. The easy part is to look through the FDA database on its website (this information is not in openFDA) to identify all those 510(k)s that list an NCT number, which reflects a clinical trial registered on clinicaltrials.gov. I wish the FDA would include this information in openFDA. The openFDA databases are years behind the databases on the FDA website.
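As a rough illustration of the easy part, here is a minimal sketch of pulling NCT numbers out of a text field. It relies on the fact that ClinicalTrials.gov identifiers are "NCT" followed by eight digits. The function name and the sample record text are hypothetical, and this is not necessarily how the data was actually pulled from the FDA website.

```python
import re

# ClinicalTrials.gov identifiers are "NCT" followed by eight digits.
NCT_PATTERN = re.compile(r"\bNCT\d{8}\b")


def find_nct_numbers(text):
    """Return the distinct NCT identifiers in text, in order of first appearance."""
    found = []
    for match in NCT_PATTERN.findall(text):
        if match not in found:
            found.append(match)
    return found


# Hypothetical record text, for illustration only.
record = "Clinical data: see NCT01234567. Follow-up also under NCT01234567."
print(find_nct_numbers(record))
```

A 510(k) whose record yields at least one identifier would go on the list of submissions with a registered trial.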

I was tempted to leave it at that, but I know full well that many companies conduct clinical trials to submit with their 510(k)s that are not registered on clinicaltrials.gov, in part because not all trials are required to be registered. As a result, my next step was to review the 510(k) summaries and, by searching for specific keywords that are almost exclusively associated with clinical trials, identify additional 510(k)s in which clinical trials were part of the mix.
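The keyword step could look something like the sketch below. The keyword list here is purely hypothetical, since I have not listed the actual terms I searched for, and a real list would need tuning to avoid false positives.

```python
import re

# Hypothetical keywords for illustration; not the actual list used.
TRIAL_KEYWORDS = ["clinical trial", "clinical study", "prospective", "enrolled", "randomized"]

# One case-insensitive pattern that matches any of the keywords.
TRIAL_PATTERN = re.compile(
    "|".join(re.escape(keyword) for keyword in TRIAL_KEYWORDS),
    re.IGNORECASE,
)


def mentions_trial(summary_text):
    """Return True if the 510(k) summary text contains any trial keyword."""
    return TRIAL_PATTERN.search(summary_text) is not None


print(mentions_trial("A prospective, multi-center study enrolled 120 subjects."))
print(mentions_trial("Bench testing demonstrated substantial equivalence."))
```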

At this point, I had a list of all 510(k)s, a list of all 510(k)s that have a clinical trial, and the intended use statements for all 510(k)s. In round numbers, doing this in April 2024 and using a data set that starts in January 2001, I had about 46,000 510(k)s that didn’t have a clinical trial and about 4,000 that did. In other words, in that data set, only about 8% of 510(k)s were associated with clinical trials. I suspect that’s an underestimate, but I also know that, particularly in the earlier years, full clinical trials weren’t as common for 510(k)s.

My goal, as I said above, is to identify the words most frequently associated with devices that involved a clinical trial as part of the 510(k) process. I decided that the best way to represent this mathematically is to calculate 1) the frequency of each word used in intended use claims that involve a clinical trial and subtract 2) the frequency of the same word in intended use claims that do not involve a clinical trial. If a word occurs more often in intended use claims that involve a clinical trial than in claims that do not, the result is a positive number. A negative number indicates a word associated with intended use claims that are more likely not to involve a clinical trial. I was only interested in the positive numbers.
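To make the subtraction concrete, here is a minimal sketch. It assumes "frequency" means relative frequency (a word's share of all tokens in its group), which keeps the much larger no-trial group from swamping the comparison. That normalization and the function names are assumptions for illustration.

```python
from collections import Counter


def relative_frequencies(token_lists):
    """Map each word to its share of all tokens across the given documents."""
    counts = Counter(token for tokens in token_lists for token in tokens)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {word: count / total for word, count in counts.items()}


def frequency_difference(trial_docs, no_trial_docs):
    """Frequency in trial claims minus frequency in no-trial claims, per word."""
    trial = relative_frequencies(trial_docs)
    other = relative_frequencies(no_trial_docs)
    words = set(trial) | set(other)
    return {word: trial.get(word, 0.0) - other.get(word, 0.0) for word in words}


def positive_words(differences):
    """Keep only words more common in trial claims, largest difference first."""
    kept = [(word, diff) for word, diff in differences.items() if diff > 0]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


# Toy tokenized claims, for illustration only.
trial_docs = [["assay", "detection", "assay"], ["assay", "serum"]]
no_trial_docs = [["catheter", "assay"], ["catheter", "suture"]]
print(positive_words(frequency_difference(trial_docs, no_trial_docs)))
```

In the toy data, "assay" appears in both groups but takes a larger share of the trial group, so its difference is positive, while "catheter" comes out negative and is dropped.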

Mechanically, I used the nltk library to tokenize and then calculate the frequency. I focused on the 1000 most frequent words. I removed many stop words because I consider them uninformative. In addition to the common stop words, I removed words like “intended”, “human”, “patients”, “results”, “clinical”, “healthcare”, “device”, and “indicated”.
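The sketch below shows the general shape of that step. To keep it runnable without extra downloads, it swaps in a plain regular expression tokenizer and `collections.Counter` from the standard library in place of nltk. The short list of common stop words is an illustrative stand-in for a full stop-word list, and the function names are hypothetical.

```python
import re
from collections import Counter

# Small illustrative stand-in for a full list of common English stop words.
COMMON_STOP_WORDS = {
    "the", "a", "an", "of", "for", "in", "is", "to", "and", "or", "with", "as", "by", "be",
}
# Domain words named in the post as uninformative.
DOMAIN_STOP_WORDS = {
    "intended", "human", "patients", "results", "clinical", "healthcare", "device", "indicated",
}
STOP_WORDS = COMMON_STOP_WORDS | DOMAIN_STOP_WORDS


def tokenize(text):
    """Lowercase the text, split it into alphabetic tokens, and drop stop words."""
    return [token for token in re.findall(r"[a-z]+", text.lower()) if token not in STOP_WORDS]


def top_words(texts, n=1000):
    """Return the n most frequent remaining words as (word, count) pairs."""
    counts = Counter(token for text in texts for token in tokenize(text))
    return counts.most_common(n)


print(tokenize("The device is intended for the detection of influenza."))
```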

So that’s where I ended up with the words for the word cloud. The representation shows the degree to which the words are more common in intended use claims for devices with clinical trials, compared to intended use claims for devices that do not include clinical trials.
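For a sense of how scores could turn into word sizes, here is a minimal sketch that scales positive scores linearly into a font size range. The linear scaling, the size limits, and the function name are assumptions for illustration. Word cloud tools apply their own scaling rules.

```python
def font_sizes(scores, min_size=10, max_size=80):
    """Scale positive scores linearly so the top word gets max_size."""
    positive = {word: score for word, score in scores.items() if score > 0}
    if not positive:
        return {}
    top = max(positive.values())
    return {
        word: round(min_size + (max_size - min_size) * score / top)
        for word, score in positive.items()
    }


# Toy scores, for illustration only.
print(font_sizes({"assay": 0.5, "serum": 0.25, "catheter": -0.5}))
```

Words with negative scores never appear, which matches the decision to keep only the positive differences.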

Context

It’s hard to interpret a word cloud without a bit of context. So for context, and also to honestly assess how well I did at identifying 510(k)s that included clinical trials, I thought I’d outline the therapeutic areas where clinical trials are most common. The chart below shows the frequency of clinical trials by therapeutic area, using the techniques described above for selecting 510(k)s that included clinical trials.

This chart aligns with my understanding of FDA expectations for clinical trials based on the submissions I have seen, but let me know if it differs from your experience.

Interpretation

When I look at the word cloud, what strikes me right between the eyes is the predominance of words commonly associated with in vitro diagnostics (IVDs). In fact, nearly all of the most prominent words appear to be related to IVDs.

While it is certainly true that IVDs often require clinical trials, and that this is probably a macro trend overall, this is at odds with the high frequency of clinical trials for cardiovascular devices. The explanation I can think of is that there is a common vocabulary for a wide range of IVDs, while individual cardiovascular devices often have their own specialized words. While almost any IVD could be called a test or analysis, cardiovascular devices such as pacemakers and defibrillators have their own special labels.

It is also true that many medical specialty categories fall under IVDs, including microbiology, clinical chemistry, immunology, clinical toxicology, and hematology. If we add up all of these IVD categories, together they form the area with the largest share of clinical trials.

I see some terms related to cardiovascular devices, such as EKG, and some radiology terms, such as images and x-rays, but again, they are not as common as the vocabulary related to in vitro diagnostic devices.

Overall, this analysis indicates that if you use words that describe your product as a laboratory test, it is more likely to be linked to a clinical trial in a 510(k) filing.

Application

Of course, there are no magic words that automatically mean a clinical trial is required. But if you use words that describe a laboratory test, it is disproportionately likely that a clinical trial will be involved. The most common words are probably the least interesting, simply because we could all have predicted them. I find some of the mid-range words more interesting. For example, I am fascinated by the frequency of the word “software.” It is also interesting that the word “flu” is so prominent. I would not have guessed that such a common disease would be associated with devices that require a clinical trial. I am also surprised to see the word “monitoring” so prominent, because we often think of monitoring as a fairly low-risk task. The word “respiratory” also surprises me, simply because I have not personally seen that many clinical trials involving the respiratory system. I hope you will find your own insights in some of the mid-range words that may not be obvious.