Below, a Guest Post by Dr. Nortin M. Hadler, Professor of Medicine and Microbiology/Immunology, University of North Carolina at Chapel Hill and Dr. Robert A. McNutt, Professor of Medicine, Chief, Section on Medical Informatics and Patient Safety, Rush University Medical Center, Chicago. Their argument that comparative effectiveness research (CER) needs an “anchor”—one treatment with known efficacy—is a good one, and gave me a new perspective on CER. In their analysis of randomized controlled trials, they highlight the crucial question: how high should we set the bar to consider the results of the trial compelling?
“Comparative effectiveness research” is now legislated as a priority for translational research. The goal is to inform decision making by assessing relative effectiveness in practice. An impressive effort has been mobilized to target efforts and establish a methodological framework. We argue that any such exercise requires a comparator with known and meaningful efficacy; there must be at least one anchoring group or subset for which a particular intervention has reproducible and meaningful benefit in randomized controlled trials. Without such an anchor, the effort is likely to degenerate into comparative ineffectiveness research.
As charged in the American Recovery and Reinvestment Act, the Institute of Medicine defined comparative effectiveness research (CER) as “ …the generation and synthesis of evidence that compares the benefits and harms of alternative methods to prevent, diagnose, treat and monitor a clinical condition, or to improve the delivery of care… at both the individual and population levels.”
However, you can’t compare treatments for effectiveness from observational data unless you are certain that one of the comparators is efficacious. There must be at least one group of patients for whom the treatment has unequivocal efficacy. Otherwise, CER may merely discern differences in relative ineffectiveness. We argue that CER cannot succeed as the primary mechanism to assure the provision of rational health care.
The difference between efficacy and effectiveness
The science of efficacy tests the hypothesis that a particular intervention works in a particular group of patients. CER asks whether an intervention works better than other interventions in practice, where patients are more heterogeneous than those recruited and accepted in a trial. The gold standard of efficacy research is the randomized controlled trial (RCT). RCTs usually monitor defined, albeit sizeable, populations for surrogate outcomes in order to detect a difference in the short term. Modern biostatistics has probed every nuance of the RCT paradigm, resulting in a highly sophisticated understanding of its limitations. A particularly vexing limitation is that the RCT fails to test hypotheses broadly enough; that is, RCTs limit the variability of patients, making it difficult to generalize the value of treatments to those not studied.
CER to the rescue?
The methodology employed for CER is not constrained by limits on patient variability as in RCTs. CER uses real-world data sets to deduce benefit/harm in a range of patients, including those who might reasonably be excluded from an RCT. This entails large clinical and administrative networks to provide data. Datasets must be large enough to capture the individual differences that affect estimates of benefit/harm across the gamut of insurance, age, co-morbidities, and lifestyle. This inclusivity is paramount. For example, when we buy a book at Amazon.com, we are given a list of “other books bought by those who bought your book.” A data-mining program in the background links characteristics of the book you bought to characteristics of similar books and to the characteristics of buyers. A different list of book recommendations results based on variations in buyer characteristics, like age, gender, and purchase history. This is a perfect analogy to what CER promises.
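The kind of co-purchase association mining the Amazon example gestures at can be sketched in a few lines. This is purely illustrative (Amazon’s actual recommendation system is proprietary); the purchase histories below are invented:

```python
from collections import Counter
from itertools import combinations

# Invented purchase histories -- each set is one buyer's basket.
baskets = [
    {"Overtreated", "Worried Sick"},
    {"Overtreated", "Worried Sick", "The Last Well Person"},
    {"Overtreated", "The Last Well Person"},
    {"Worried Sick", "Stabbed in the Back"},
]

# Count how often each pair of books appears in the same basket.
pair_counts = Counter()
for basket in baskets:
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

def also_bought(title):
    """Books most often co-purchased with `title`, best match first."""
    scores = Counter()
    for (a, b), n in pair_counts.items():
        if a == title:
            scores[b] += n
        elif b == title:
            scores[a] += n
    return [t for t, _ in scores.most_common()]

print(also_bought("Overtreated"))
```

The essay’s point survives the sketch: the “outcome” here (a book purchase) is a single, unambiguous event, which is precisely what health care outcomes are not.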
But there are fundamental differences between book buying and health care provision. In book buying there is a defined, homogeneous outcome: the book. Health care outcomes are not homogeneous and are often subjective (life, function, jobs, fancier hospitals, etc.). It is hard to imagine the messy list an “Amazon” of health would produce based on whatever we chose as a goal of health care; it is easy to imagine how readily the list would be perturbed by introducing nuances in outcome. One of the fundamental problems with attempts to rationalize health care is that we still don’t agree on how to measure either health or rational care.
Furthermore, Amazon is not the sole vendor of books. The associations at Amazon may not reflect the totality of characteristics (books and people) across all places books can be bought. Hence, any book list suggested solely by Amazon may be incomplete or flawed. For CER to be a valid “Amazon” for health care, it has to define and capture the nuances of health care outcomes and provision across all sites of care (including the home).
Clearly, any inference regarding relative benefits and harms from the analysis of large datasets is suspect. Shortcomings relating to benefits, harms, and provision of care are lurking. Any statistical modeling would require assumptions and compromises. Hence, the validity of interpreting observational data will depend on the degree to which diagnosis, clinical course, interventions, coincident diseases, personal characteristics, or outcomes are assumed rather than quantified. No matter how compulsively this is done, CER demands judgments about the importance of each of these variables. Therefore, CER cannot be the engine of health care decision making.
As an example, total knee replacement (TKR) has at present escaped efficacy testing. How would we learn from observational research if TKR works? Some of the relevant variables to assess efficacy can be parsed from observational data such as patient demographics, type of hardware, co-morbidities, and the like. However, some variables are very difficult to parse in the best of circumstances – such as a definition of benefit; or surgical experience; or, more elusive, surgical skillfulness.
Efficacy research is the horse; CER is the cart.
There are two alternative ways forward other than the present plans for CER. First, we could design efficacy trials that are efficient in providing gold standards across a wider range of patient characteristics. We would have to expand trials to larger populations. For the sake of validity, we would have to measure only a single clinically meaningful outcome, even if that took a great deal of time. And we would have to forswear all shortcuts that trade reliability for efficiency (such as “tack-on” questions for post-hoc analysis).
There is a second, more straightforward approach. We can design elegant RCTs seeking a sufficiently large, clinically meaningful effect in highly selected patient populations. If none is detected, we can either abandon the intervention or choose another highly selected population to study. If a clinically meaningful difference is detected, the result can serve as the anchoring comparator for CER.
However, to design such a straightforward RCT, we must also deal with a philosophical challenge in the design of efficacy trials: the notion of “clinically significant.” How high should we set the bar for the absolute difference in outcome between the treated and control groups to consider the results of the trial compelling?
One way to think about this is to convert the absolute difference into a more intuitive measure, the Number Needed to Treat (NNT): the number of patients who must be treated for one to benefit. If the outcome is easily measured, such as death or stroke, we might find an intervention valuable if we had to treat 20 patients to spare one. Few students of efficacy would be persuaded if we had to treat more than 50 to spare one. An NNT between 20 and 50 marks the zone of debate; smaller effects are ephemeral and subject to false-positive assertions. For an outcome that is more difficult to measure, such as symptoms or quality of life, we would argue for a more stringent bar. If we framed the problem of RCT design like this, we might be able to engage the nation in a debate on just how high the bar should be set for each clinical malady.
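The arithmetic behind the NNT is simple: it is the reciprocal of the absolute risk reduction (ARR) between the control and treated groups. A minimal sketch, with hypothetical event rates chosen to land at the thresholds discussed above:

```python
def number_needed_to_treat(control_rate, treated_rate):
    """NNT = 1 / ARR, where ARR is the absolute risk reduction:
    the control group's event rate minus the treated group's."""
    arr = control_rate - treated_rate
    if arr <= 0:
        raise ValueError("no absolute benefit; NNT is undefined")
    return 1.0 / arr

# Hypothetical stroke rates: 10% of controls vs. 5% of treated patients.
# ARR = 0.05, so 20 patients must be treated to spare one stroke --
# the favorable end of the 20-to-50 range discussed above.
print(number_needed_to_treat(0.10, 0.05))   # 20.0

# A 2% vs. 1% trial has the same *relative* risk reduction (50%),
# but an NNT of 100 -- well past the point of persuasion.
print(number_needed_to_treat(0.02, 0.01))   # 100.0
```

The second case illustrates why the bar must be set on the absolute difference: relative risk reduction alone can make a tiny, debatable benefit look dramatic.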
If we then applied this stringency to future RCT design, trials would be more efficient and reliable, and we would eliminate trials aimed at testing equivalency. Then, armed with clinically meaningful RCT results in some subset of patients, we would be in a position to turn to CER. CER would help us seek out other subsets of patients who benefit at least as much, and identify subsets who are harmed. We feel it would not be in the best interest of our public and personal health to prematurely seek answers in flawed datasets, forgoing the best evidence that better RCT designs would provide.
CONFLICT OF INTEREST DISCLOSURES: The authors have none to report.