Measures used to evaluate the quality of health care services are commonly tied to compensation, reimbursement, and reputation and are used to incentivize care quality and to hold service providers accountable (1, 2). The results of evaluations that use quality measures are often publicly reported so that patients and purchasers can decide where to seek or buy health care (3). Thus, the measures on which evaluations of quality and performance are based serve a central role in policy making, health care administration and delivery, and the quality of health care that patients receive.
A reasonable question is whether there are too few or too many quality measures of mental and substance use disorder care (4). Such measures are numerous and vary in how rigorously their measurement properties (e.g., reliability and validity) have been evaluated, in the strength of evidence supporting their core concepts or use to improve quality, and in how much they overlap with similar measures (relevant for harmonization efforts) (5–8). In 2015, Patel et al. (7) found in a systematic environmental scan of 510 mental health quality measures that 10% had received National Quality Forum (NQF) endorsement and that 5% were used in national quality reporting initiatives in the United States. In 2016, Goldman and colleagues (9) found in a systematic environmental scan of 730 quality measures of integrated medical-behavioral health care a heavy focus on care during and after hospitalization and an inadequate representation of the full range of evidence-based treatment options (e.g., in outpatient settings). The field appears to have both too many quality-of-care measures overall and not enough good measures in specific areas. For the field to move forward, it is essential to identify the measures with the most potential to promote implementation of evidence-based mental health services, to pinpoint where developers should focus on filling measurement gaps, and to identify redundant measures or measures with undesirable characteristics that should no longer be used (10).
The landscape and definitions of quality measures change rapidly, and measure inventories have changed since the publication of previous measure reviews. The National Quality Measures Clearinghouse (NQMC), the primary catalog curator of measures in the United States for 17 years, was defunded in 2018 (11). Past studies also did not include measures developed at the Veterans Health Administration (VHA), the largest integrated health care system in the United States and a leader in mental health performance management (12). Therefore, an updated and restructured study of the landscape of clinical quality measures for mental and substance use disorder treatments is warranted. The primary purpose of this study was to assess measurement gaps (e.g., in measure characteristics and clinical focus) and redundancies to inform quality-of-care measure development and retirement efforts. Because no single authoritative source currently exists for stakeholders to consult on the large number of available measures, we also aimed to provide a catalog containing a snapshot of quality measures that can be sorted by key measure attributes. Finally, we make recommendations for strategies to enhance and curate the catalog in the future.
Methods
We searched public information on quality measures from six organizations involved in nationally standardized evaluation and comparison of mental or substance use disorder care quality. Measures were identified and cataloged from March 1, 2019, to October 31, 2020. We scanned the entire inventory of quality measures from the Office of Mental Health and Suicide Prevention and Office of Performance Measurement of the U.S. Department of Veterans Affairs (VA) (13–15), the Measure Inventory Tool of the Centers for Medicare and Medicaid Services (CMS) (16), the NQF Quality Positioning System (17), the 2019–2020 Healthcare Effectiveness Data and Information Set of the National Committee for Quality Assurance (NCQA) (18), the National Healthcare Quality and Disparities Report of the Agency for Healthcare Research and Quality (19), and the strategic plan for fiscal years 2019–2023 of the Substance Abuse and Mental Health Services Administration (SAMHSA) (20).
Measures were included in our catalog if they had been defined through the use of symptoms or diagnoses for substance-related and addictive disorders (excluding tobacco use disorder), depressive disorders, trauma and stressor-related disorders, anxiety disorders, schizophrenia spectrum and other psychotic disorders, bipolar and related disorders, or suicide and related behaviors (21). Measures not defined by disorders were included if they were defined with the use of data elements relevant to psychotropic medications or to any care delivery setting for managing specialty mental or substance use disorders, or if the measure was in use by a national mental or substance use disorder quality improvement initiative (e.g., the CMS Inpatient Psychiatric Facilities Quality Reporting program [22]). We included measures that evaluated care for adults ages ≥18 years; measures that evaluated only pediatric services were excluded.
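For illustration, this screening logic can be expressed as a simple predicate over a measure record. The following is a minimal sketch, not the study’s actual abstraction tool; every field name (e.g., pediatric_only, diagnoses, national_mh_sud_initiatives) is a hypothetical stand-in rather than a field from any agency inventory’s schema.

```python
# Minimal sketch of the inclusion screen described above.
# All field names are hypothetical, not actual inventory schema fields.

INCLUDED_DISORDER_GROUPS = {
    "substance-related and addictive disorders",  # tobacco use disorder excluded
    "depressive disorders",
    "trauma and stressor-related disorders",
    "anxiety disorders",
    "schizophrenia spectrum and other psychotic disorders",
    "bipolar and related disorders",
    "suicide and related behaviors",
}

def meets_inclusion_criteria(measure: dict) -> bool:
    """Screen one measure record against the study's inclusion criteria."""
    if measure.get("pediatric_only", False):
        return False  # only measures evaluating care for adults (ages >=18)
    defined_by_disorder = bool(
        set(measure.get("diagnoses", ())) & INCLUDED_DISORDER_GROUPS
    )
    psychotropic_or_setting = (
        measure.get("uses_psychotropic_data_elements", False)
        or measure.get("specialty_mh_sud_care_setting", False)
    )
    in_national_initiative = bool(measure.get("national_mh_sud_initiatives", ()))
    return defined_by_disorder or psychotropic_or_setting or in_national_initiative

# Example: a measure not defined by disorder but used by a national initiative.
example = {"diagnoses": [], "national_mh_sud_initiatives": ["CMS IPFQR"]}
assert meets_inclusion_criteria(example)
```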
Measures deemed to tap the same quality construct, or with an identical definition, were reconciled to create a set of unique constructs of quality measures to evaluate care for patients with mental or substance use disorders. We abstracted information about attributes of each measure construct that might be of interest to evaluators, purchasers, and other stakeholders who must choose among available measures (Box 1). Measures defined via the use of disorder symptoms or diagnoses were coded according to specific mental or substance use disorders or medical disorders (multiple disorders when applicable), because evaluations commonly focus on specific disorders. We coded each measure’s NQF endorsement status as an indicator of critical, independent, and transparent review of measure properties (e.g., reliability and validity) (23). We coded each measure’s “current use” in national quality improvement initiatives as an indicator of its potential for cross-system comparisons; for example, for CMS measures, current use was defined as having an “active” status associated with any of the agency’s programs listed in the CMS Measure Inventory Tool (16). Additionally, because measures are used by different types of health care organizations (provider vs. payer) and for different goals (internal quality improvement vs. pay-for-performance), we categorized measures by attributes of interest to stakeholders: modality of treatment, type of quality measure, domain of quality, level of analysis, and the data source from which a measure can be calculated (23–29). See Box 1 and the online supplement to this article for detailed definitions of coded measure characteristics.
We summarized the landscape of measure constructs by their attributes, presented as counts and percentages of unique measure constructs. The complete measure catalog is available in the online supplement. This study was approved by the Stanford University Human Subjects Research and Institutional Review Board.
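As a sketch of the reconciliation and summary steps, the following pandas snippet assumes a hypothetical flat file of abstracted measures (abstracted_measures.csv) in which an illustrative construct_id column marks records judged to tap the same construct; the file and column names are assumptions for illustration, not the study’s actual data set.

```python
import pandas as pd

# Hypothetical flat file of abstracted measures; "construct_id" marks records
# judged to tap the same quality construct, and attribute columns mirror the
# coded characteristics in Box 1. File and column names are illustrative.
measures = pd.read_csv("abstracted_measures.csv")

# Reconcile duplicates: keep one row per unique measure construct.
constructs = measures.drop_duplicates(subset="construct_id")

# Summarize the landscape as counts and percentages of unique constructs,
# here by measure type (e.g., process vs. outcome).
summary = (
    constructs["measure_type"]
    .value_counts()
    .to_frame(name="n")
    .assign(pct=lambda d: (100 * d["n"] / len(constructs)).round(1))
)
print(summary)
```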
Results
Of 4,420 measures reviewed, 14% (N=635) met our inclusion criteria. After we had reconciled measures with the same definition or construct, we included 376 unique constructs of measures of quality of care for mental or substance use disorders in our catalog (Figure 1). Among these cataloged measures, symptoms and diagnoses of specific disorders were used most often in defining the measures (46%, N=172). Among the 54% (N=204 of 376) of measures not defined by disorders, experience-of-care measures were most common (27%, N=54), followed by measures defined by using inpatient psychiatric stays (18%, N=37) (Table 1).
Table 2 cross-classifies whether a measure was disorder based, NQF endorsed, or used in a national quality improvement initiative with other measure attributes. Note that many of these classifications were not mutually exclusive. Ninety-five measures (25%) were endorsed by the NQF, indicating that only one in four met independent review criteria (e.g., reliability and validity). The CMS and NCQA inventories overlapped more with NQF’s inventory of endorsed measures (81% and 28%, respectively) than did the VA or SAMHSA inventories (10% and 1%, respectively). We identified 319 quality measures actively in use for national evaluations, our indicator of potential for use in cross-system comparisons. The VA inventory had the most measures used in national quality improvement efforts (N=193), followed by the CMS (N=102) and NQF (N=98) inventories. Among measures in use for national quality improvement efforts, process measures were more common (57%) than outcome measures (30%).
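The cross-classification itself is a standard contingency tabulation. The self-contained toy example below shows the shape of such a table; the rows and boolean columns (nqf_endorsed, in_national_use) are invented for illustration and are not actual catalog contents.

```python
import pandas as pd

# Toy rows standing in for unique measure constructs; values are invented
# solely to illustrate the cross-classification, not actual catalog contents.
constructs = pd.DataFrame({
    "measure_type": ["process", "outcome", "process", "structure", "outcome"],
    "nqf_endorsed": [True, False, True, False, True],
    "in_national_use": [True, True, False, True, True],
})

# Cross-classify measure type by endorsement and national use, with totals,
# in the spirit of Table 2.
xtab = pd.crosstab(
    constructs["measure_type"],
    [constructs["nqf_endorsed"], constructs["in_national_use"]],
    margins=True,
)
print(xtab)
```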
Regarding other measure attributes, we observed substantial overlap (or lack of specificity) in treatment modalities; two-thirds of the measures (67%, N=252) each tapped multiple treatment modalities. Pharmacotherapy was the modality included most often among the treatment modalities we characterized (58%, N=185) (Table 2). In terms of quality domains, measures assessing clinical effectiveness were well represented, with a focus on other quality domains, such as patient-centeredness and communication, more prominent among measures not defined by a disorder. Whether categorized by CMS Meaningful Measures Area (N=26) or Institute of Medicine (IOM) Quality Domain (N=24), few measures were represented in domains such as “make care safer,” “work with communities,” “make care affordable,” “timely,” “efficient,” and “equitable.” Regarding analysis level, 77 measures (20%) were specified for more than one analysis level, and two measures (1%) were specified for all six analysis levels. We identified 261 measures (69%) that could be calculated from routinely collected electronic data sources, such as electronic medical records (EMRs) or standardized electronic clinical data.
Figure 2 displays a more detailed analysis cross-classifying measures by disorder, measure type, and treatment modality. Small black circles in the figure represent the number of measures defined with symptoms or diagnoses of only a single disorder; larger gray circles represent the number of measures defined with any of multiple disorders. Of 72 quality-of-care measures for substance use disorder treatment, 51 would evaluate quality of care for all substance use disorders combined; the other 21 measures would evaluate symptom screening, treatment planning, or pharmacotherapy specifically for alcohol or opioid use disorders. Depressive disorders were represented most often (60%, N=73) among the 121 mental disorder quality measures. Measures for schizophrenia- and bipolar-related disorders (50%, N=60) were often grouped into a “serious mental illness” construct; only 13 of these measures would evaluate care for one of these disorders separately. Fewer measures have been defined to evaluate care quality for trauma- and stressor-related disorders (12%, N=15) or anxiety disorders (1%, N=1) independent of other mental disorders (Figure 2).
Discussion
The landscape of quality-of-care measures to evaluate mental or substance use disorder treatments is vast and rapidly changing. Comprehensive inventories such as the NQMC have been defunded or have not been recently updated (11, 30). In the absence of a centralized curator, measures are siloed within disparate agency repositories. We sought to identify gaps and redundancies in the current landscape of quality-of-care measures for mental and substance use disorder treatments among adults, culled from six different organizations, and to make available a snapshot of measures organized by their attributes. We hope that measure developers can use this information to focus on current gaps rather than add to the already significant redundancies in this corpus. We also hope that evaluators, purchasers, and other stakeholders can use the results of this study, along with data available from agency repositories, to select and sort measures by important attributes: clinical focus (e.g., disorders or treatment modalities), units to be analyzed (e.g., provider or payer), NQF endorsement status, data source, and potential for cross-system comparison (31). Ideally, measure selection would also be informed by evidence of a measure’s psychometric properties and its likelihood of improving patient outcomes. However, we found no such systematically graded evidence in the inventories we scanned, a long-standing issue (6, 32) that all stakeholders should work to remedy.
Major Measure Gaps
Any new measure should complement the already vast and expanding landscape of quality-of-care measures for mental or substance use disorder treatments (6–9). We found more outcome measures than did previous studies (7, 9), of which experience-of-care measures were most numerous. Nevertheless, we found no outcome measures based on symptom improvement or remission for most disorders other than depressive disorders and few options to evaluate outcomes such as functioning or health-related quality of life. Process measures still dominate the landscape. Researchers have called for developing measures of evidence-based psychosocial treatments (vs. generic codes) (6), a gap that persists. For example, we cataloged 161 measures that evaluate provision of treatments including psychotherapy (33), but only four of those measures would enable a distinct evaluation of psychotherapy. Psychotherapy is part of 40%–50% of outpatient treatments for mental health or substance use disorders and is the only treatment modality for 10%–16% of patients who receive outpatient treatment (34). Devoting four of the 376 total measures in the landscape to distinct evaluations of psychotherapy quality is not representative of the types of mental health care provided to patients.
Quality domains underrepresented in previous studies (equity, safety, and patient and family engagement in care) (7, 9, 35) appeared to be slightly better represented in our catalog, but substantial room remains to increase representation within these domains. Other measure features that lacked coverage were those assessing structural aspects of quality (e.g., night and weekend hours of services and waivered buprenorphine prescribers), population identification and access, and efficiency. We found more measures calculated from EMR and patient-reported data than were reported in previous studies (7), suggesting a promising trend toward use of the data sources that may be required to address gaps in available measures. However, measures specified and validated for use at multiple levels of analysis remain scarce, potentially limiting efforts to align quality improvement across stakeholders (clinicians, purchasers, and others). Evidence of a measure’s reliability and validity at one level of analysis does not necessarily generalize to other levels, so developers should analyze cross-level applications of their measures.
The Need for Independent Evaluation of Measures Before Use
Before measures are used, they should be independently and transparently evaluated. We note that 81% of CMS measures were NQF endorsed. Although CMS is not required to subject measures to NQF evaluation, it typically does so to ensure that only the highest-quality measures are used in CMS programs. In NQF’s review process, each measure is evaluated for importance (evidence and gap in performance); scientific acceptability (reliability and validity); feasibility; usability and use; and harmonization (36). One of the most important gaps revealed by our analysis was the relative rarity of NQF-endorsed measures or of other independent and transparent evaluations, especially for measures in the VA and SAMHSA inventories. A policy intervention that could improve the landscape of quality-of-care measures for mental and substance use disorder treatments would be to more strongly encourage or require independent and transparent evaluation of measures before they are used in federal or high-stakes programs (e.g., national pay-for-performance initiatives).
Lump or Split?
Important trade-offs exist between measures that combine diagnoses (e.g., all substance use disorders) and measures that target specific diagnoses. Combining diagnoses into a single measure is more efficient, reduces measurement burden, and has face validity where care delivery processes or patient outcomes are transdiagnostically relevant (e.g., access to care). However, combined measures risk missing clinically important details about specific conditions and masking poor performance for low-prevalence diagnostic groups. Patel et al. (7) reported that 32% of measures were defined with more than one disorder; in our study, more than half of the measures defined by disorder were defined with more than one disorder. Combining diagnostic groups might reflect efforts to harmonize measures and reduce measurement burden (6–9). However, our results provided no clear evidence of greater overall harmonization since the publication of previous systematic scans of measures used to evaluate the quality of mental and substance use disorder treatment. The development of more diagnostically focused measures may be warranted where measures are lacking (e.g., anxiety disorders and drug use disorders other than alcohol and opioids). Before developing such measures, it would be informative to analyze how well performance on combined measures reflects performance for diagnostic subgroups and how tightly diagnosis-specific processes and outcomes are linked with more general measures.
Essential Investment in Curation
A significant limitation of this study is that it will soon be out of date. No research program can sustain the continual curation needed, and the status of measures will need to be verified against agency repositories. Given the resources the federal government spends on developing and evaluating new measures, and the significant potential for waste from developing redundant measures and failing to identify measurement gaps, investment in a national curation program is essential. CMS alone spent $1.3 billion between 2008 and 2018 to develop nearly 2,300 measures, 35% of which are being used in CMS programs (37). This massive investment in infrastructure is just one among the measurement enterprises of federal, state, and private organizations, each of which is largely siloed from the others. The NQMC was a national resource (11) and a strong curation program that should be revived. It is time to fulfill IOM committee recommendations to establish a comprehensive and dynamic curation system (32, 38), perhaps supported by a consortium of federal (e.g., VA and CMS) and state partners who would directly benefit from measure curation. Such a consortium could focus on overall measure harmonization and specification, systematically grade evidence about measures, and extend essential work already in motion, such as the independent reviews conducted by the NQF.
Reducing and Harmonizing the Measure Landscape
In this study, we distilled 635 individual measures into 376 unique measure constructs. For example, we found seven nearly identical versions of “follow-up after hospitalization for mental illness,” fewer than the 25 versions of this measure construct reported by Patel et al. (7). Implementing our recommendation to independently evaluate measures, including their overlap with existing measures, could further reduce the overall number of measures. Furthermore, our recommendation to revive and fund a centralized curator of quality measures could facilitate awareness of existing measures so that developers can explicitly and routinely assess measure harmonization and retirement.
Limitations
A limitation of this study was that we did not include measures related to certain public health issues and disorders (e.g., tobacco use and cognitive disorders) or measures used in pediatric care. Nonetheless, our study reflects core services commonly evaluated in national quality initiatives aimed at improving treatments for mental and substance use disorders. Our study also was not designed to systematically review the underlying evidence base across the hundreds of quality measures cataloged, nor was information on graded evidence available for most measures. Although we could not verify whether a sufficient evidence base existed for each measure, we cataloged whether each measure was endorsed by the NQF, an indicator that a measure’s properties, including its underlying evidence, have been independently and critically reviewed. Another possible limitation of our study was our reliance on public inventories; we therefore may have missed internal measures used within organizations. Nonetheless, we reviewed >4,400 measures housed in public inventories. Moreover, this is the first study to report on measures developed at the VA, rather than measures defined by other organizations for evaluating VA care (39). Including VA measures is important because they are used to evaluate the largest integrated health care system in the United States (12).