Depression is a common behavioral health problem that is a leading cause of disability, lower quality of life, diminished productivity, and reduced employment rates globally (1). In the United States, a recent study estimated that the total economic burden of major depressive disorder has reached at least $210.5 billion per year, a 21.5% increase from 2005 (2).
Although a variety of evidence-based practices have been validated for the treatment of depression in diverse settings, it remains challenging to accurately and reliably measure outcome progress at the organization level. This challenge occurs for a variety of reasons: depression symptoms often improve or worsen regardless of intervention (3); the literature is divided on how to define treatment response and remission, with various treatment success definitions used in research and clinical practice (4–10); and exactly how changes in rating scale scores affect the outcomes most important to patients (e.g., quality of life and social connectivity) remains unclear. Nevertheless, outcome ascertainment for depression treatment is becoming increasingly important as measurement-based care and value-based payment become priorities for health organizations nationwide. There is a consequent growing need for pragmatic and scalable ways to assess treatment progress.
Of the numerous validated rating scales for depression, the Patient Health Questionnaire–9 (PHQ-9) (7) has consistently been one of the most used and validated in primary care, specialty behavioral health, and research settings (11). As a result, it is the instrument of choice for this investigation. The PHQ-9 has nine items that are each scored from 0 to 3, for a maximum score of 27; higher scores indicate greater severity of depression symptoms.
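To make the scoring concrete, the nine-item structure can be sketched in code. This is an illustrative sketch only: the function names are ours, and the severity bands shown are the commonly cited interpretive cut points for the PHQ-9 rather than anything defined in this report.

```python
def phq9_total(item_scores):
    """Sum the nine PHQ-9 items, each scored 0-3 (total range 0-27)."""
    if len(item_scores) != 9 or any(s not in (0, 1, 2, 3) for s in item_scores):
        raise ValueError("PHQ-9 requires nine items scored 0-3")
    return sum(item_scores)

def severity_band(total):
    """Map a total score to the commonly cited severity bands
    (cut points assumed from the standard PHQ-9 literature)."""
    if total < 5:
        return "minimal"
    elif total < 10:
        return "mild"
    elif total < 15:
        return "moderate"
    elif total < 20:
        return "moderately severe"
    return "severe"

score = phq9_total([2, 1, 3, 2, 1, 0, 2, 1, 1])  # total of 13, "moderate"
```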
Often, the PHQ-9 is used as a way to ascertain the severity of baseline depression symptoms and to track patients’ progress over time with treatment. In the literature, treatment success is usually quantified by using the terms “response” (or “partial response”) and “remission.” Although its definition is not universal, remission (on the PHQ-9 scale) is often defined as achieving a score <5 (for a patient with a previous score in a category suggestive of depression symptoms) (12). The literature is more divided on treatment response, with different definitions described across studies (4–10). Some definitions include a single criterion, whereas others include multiple components. Importantly, some definitions specify a minimum baseline PHQ-9 score, whereas others do not (8).
Organizations incorporating measurement-based care are tasked with choosing metrics, often one each for response and remission (although sometimes only one total metric is used). These decisions may be influenced by research studies, standardized organization-based recommendations (e.g., those from the National Committee for Quality Assurance’s Healthcare Effectiveness Data and Information Set), or the perceived frequency of metric use (with ≥50% change and score <5 most commonly used for response and remission, respectively). Organizations could take advantage of the lack of standardization nationwide and choose metrics that are easier to achieve, thereby making their clinical programs appear more successful, although this has never been formally demonstrated.
Table 1 includes eight PHQ-9 depression treatment success criteria (i.e., metrics) that have been described in the literature or identified in this study. Metric 5 (≥50% decrease from baseline and score <10) was originally proposed by Kroenke and colleagues (9) as the PHQ-9 metric for “clinically significant improvement” (i.e., response) because these criteria would be consistent with the established Hamilton Depression Rating Scale metric. The same study established metric 8 (score <5) as the remission metric (although this specific term was not used) by defining scores <5 as “nondepressed” (9). Additionally, metric 2 (absolute decrease of ≥5 points) was based on previous literature demonstrating that the minimal clinically important difference for the PHQ-9 is between 2.59 and 4.78 (5). The other metrics, however, have little to no supporting empirical evidence.
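The metric definitions made explicit in the text (metrics 1, 2, 4, 5, and 8; the remaining Table 1 metrics are not spelled out here) can be expressed as simple predicates on baseline and follow-up PHQ-9 scores. A minimal sketch, with function names of our own choosing:

```python
def metric_1(baseline, followup):
    """Response: >=50% decrease from baseline."""
    return followup <= 0.5 * baseline

def metric_2(baseline, followup):
    """Response: absolute decrease of >=5 points."""
    return baseline - followup >= 5

def metric_4(baseline, followup):
    """Response: >=50% decrease from baseline OR follow-up score <10."""
    return metric_1(baseline, followup) or followup < 10

def metric_5(baseline, followup):
    """Response: >=50% decrease from baseline AND follow-up score <10."""
    return metric_1(baseline, followup) and followup < 10

def metric_8(baseline, followup):
    """Remission: follow-up score <5."""
    return followup < 5
```

For a patient moving from 12 to 7, for example, metric 2 is met (a 5-point drop) but metric 1 is not (7 > 6), illustrating how absolute and multiplicative definitions can diverge for patients with lower baseline scores.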
One study used data from a 114-person collaborative care randomized controlled trial to compare outcomes based on metric 5 with structured interviews and three other depression metrics. In general, all measured metrics were found to have good agreement (κ>0.60) (8). The authors also reported that metrics combining multiplicative terms (≥50% change) or absolute terms (≥5-point change) with the requirement of a score <10 tended to classify the same patients as improved or not improved (8). However, unlike metrics defined by multiplicative terms, those predicated on absolute score changes do not “penalize” organizations with higher average baseline PHQ-9 scores.
In this investigation, we leveraged 10 years of longitudinal PHQ-9 data from the University of Washington’s Advancing Integrated Mental Health Solutions (AIMS) Center to analyze the extent to which different depression response and remission metrics influence organization-level performance. We then discuss the implications of these findings for measurement-based care, health systems, and research.
Methods
For years, the AIMS Center has supported practices implementing the collaborative care model (CoCM), an evidence-based practice for the treatment of common behavioral health problems in medical settings. Part of this support has included development and dissemination of the Care Management Tracking System (CMTS) (13), a specialized treatment registry that records contact details with patients and facilitates measurement-based care (e.g., tracks PHQ-9 depression scores over time). With the written consent of participating organizations, we compiled a data set of 36,887 adult patients with depressive symptoms who were treated in one of 145 primary care clinics (across 33 organizations) and who had depression outcomes tracked using CMTS between 2008 and 2018. Health care organizations and clinics were located across nine states; approximately 83% (N=120) of the clinics were in urban areas (as defined by the Federal Office of Rural Health Policy), and 64% (N=93) were federally qualified health centers (FQHCs). Analysis of this deidentified data set was granted exemption status by the University of Washington Institutional Review Board (ID STUDY00005907).
Our analysis, which was conducted at the organization and clinic levels, included all patients ages ≥18 years who had at least two documented PHQ-9 scores: one at baseline and one or more within the following 12 months. PHQ-9 scores closest to and within 30 days of 3, 6, and 12 months from baseline were extracted and noted as the scores for those respective time points. Incorporating these criteria, we created three overlapping, nonexclusive time cohorts (3, 6, and 12 months), each including patients who had a follow-up score at that time point. These were not a single cohort followed longitudinally over time: for example, an included patient could have a baseline PHQ-9 score and follow-up scores at both 3 and 12 months; such a patient would be included in the 3- and 12-month time cohorts but not the 6-month cohort.
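The follow-up extraction rule described above (the score closest to each target time point and within 30 days of it) can be sketched as follows. Day-offset targets of roughly 90, 180, and 365 days for the 3-, 6-, and 12-month time points are our assumption, as are all names:

```python
def followup_for_timepoint(score_days, target_day, window=30):
    """Return (day, score) of the PHQ-9 score closest to target_day and
    within `window` days of it, or None if no score qualifies.
    `score_days` is a list of (day_offset_from_baseline, score) pairs."""
    eligible = [(abs(day - target_day), day, score)
                for day, score in score_days
                if abs(day - target_day) <= window]
    if not eligible:
        return None
    _, day, score = min(eligible)  # smallest distance to the target wins
    return (day, score)

# A patient with follow-ups at days 85 and 350 lands in the 3- and
# 12-month cohorts (targets ~90 and ~365 days) but not the 6-month one.
scores = [(85, 12), (350, 6)]
```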
Further inclusion and exclusion criteria were applied at the clinic level. In the 3-, 6-, and 12-month time cohorts, a clinic’s data were included if at least one of its patients had a recorded PHQ-9 score at that time point. At 3, 6, and 12 months, 135, 130, and 113 clinics met inclusion criteria, respectively, corresponding to 33, 33, and 32 organizations and 19,862, 11,303, and 3,308 patients. Missing race and gender data were imputed at the clinic level. (For baseline characteristics of organizations and clinics in each of the three time cohorts, see the online supplement to this report.)
First, mean improvement rates defined by the eight depression response and remission metrics (and weighted by clinic size) were calculated for the 3-, 6-, and 12-month cohorts. Next, to analyze the impact of metric choice on comparative organization-level performance, all 33 organizations in the 6-month time cohort were ranked according to their improvement rates across the eight metrics. We calculated ranks by using empirical Bayes predictions from a random-intercept logistic regression model with reliability adjustment, a strategy that has been used with other health outcome rankings (14). We chose to calculate rankings using this random-effects, model-based approach (as opposed to using raw values or direct sample means) for two reasons: it reduced the impact of chance-driven uncertainty from small samples in certain groups because the between-groups variability was estimated by using data from all groups, and it made the rankings more reproducible over time (14). Finally, Spearman’s rank-order correlation coefficients of predicted ranks from different metrics for the 6-month time cohort were calculated.
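As an illustration of why a model-based ranking is preferred, the sketch below shrinks each organization's raw improvement rate toward the pooled rate in proportion to sample size and then compares rank orders with Spearman's coefficient. This is a simplified stand-in for the empirical Bayes predictions of the random-intercept logistic model actually used in the study; the `prior_weight` pseudo-count and all names are our assumptions.

```python
def shrunken_rates(successes, totals, prior_weight=20.0):
    """Pull each group's raw rate toward the pooled rate; small groups
    (low totals) are shrunk the most, mimicking reliability adjustment."""
    pooled = sum(successes) / sum(totals)
    return [(s + prior_weight * pooled) / (n + prior_weight)
            for s, n in zip(successes, totals)]

def spearman_rho(x, y):
    """Spearman's rank-order correlation (assumes no tied values)."""
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0] * len(v)
        for rank, i in enumerate(order, start=1):
            r[i] = rank
        return r
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

# Three hypothetical organizations scored under two hypothetical metrics:
rates_a = shrunken_rates([30, 55, 12], [60, 100, 40])
rates_b = shrunken_rates([25, 60, 10], [60, 100, 40])
```

Here both metrics rank the middle organization highest and the third lowest, so the rank-order correlation is perfect; real metrics yield correlations below 1, as in the results that follow.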
Results
Across all eight metrics, the 12-month cohort had higher rates of treatment success than the 3- and 6-month cohorts. Additionally, rates within time cohorts varied substantially by response or remission definition. In the 3-month cohort, for example, depression response rates ranged from 32% (metric 7) to 51% (metric 4), whereas the remission rate was 22% (metric 8). Additionally, response rates appeared to form two clusters: metrics 2, 3, and 4 were similar (ranging from 48% to 51% in the 3-month cohort), as were metrics 1, 5, 6, and 7 (ranging from 32% to 34% in the 3-month cohort). Similar ranges were observed for the 6- and 12-month time cohorts. All treatment response and remission rates, in addition to mean baseline and follow-up PHQ-9 scores for each time cohort, are presented in Table 1. (For the 6-month time cohort matrix of Spearman’s rank-order correlation coefficients, see the online supplement.) All pairwise rank-order correlation coefficients were positive (mean=0.86), and only 3 of the 28 were <0.75. Metric 2 was the least correlated with the others. These results broadly demonstrate that across metrics, organization-level rank orders were highly correlated.
Discussion
In this analysis of PHQ-9 scores from 33 organizations and three time cohorts across nine states, we found that the choice of depression treatment success metric led to markedly different rates of improvement. Response rates for the 3-month time cohort ranged from 32% to 51% depending on the metric, whereas the remission rate was 22%. At the same time, we found that organization-level rank orders defined by performance on eight different metrics were uniformly positively correlated (and largely >0.75). These findings lead to two primary conclusions.
First, it is of paramount importance for organizations to be compared with benchmarks or with one another using the same depression response or remission metric. For example, if organization A uses metric 4 (≥50% decrease from baseline or score <10) and organization B uses metric 1 (≥50% decrease from baseline), their improvement rates cannot be meaningfully compared. We would expect organization A, with a compound metric including an “or” logical operator, to have the higher response rate even if the true rates of improvement were equivalent. This finding also has similar implications for research, in which consistent PHQ-9 definitions for depression treatment response, in particular, are lacking.
Second, our Spearman’s rank-order correlation findings suggest that the eight metrics largely tell the same story with regard to depression treatment response and remission. Of note, metric 2, although still positively correlated with the other metrics, was correlated to a lesser extent. One possible explanation is that it is the only assessed metric defined solely by an absolute change in PHQ-9 score over time. Depending on the initial PHQ-9 score, treatment success metrics defined in this way can be more or less challenging to achieve relative to those defined multiplicatively (i.e., ≥50% decrease from baseline). Citing similar reasoning, a recent study advocated for the use of multiplicative PHQ-9 metrics over commonly used threshold metrics (i.e., score <5) (15).
Regardless of which metric was chosen, however, the same organizations in this study tended to be ranked favorably and deemed high performing. This finding suggests that inter- and intraorganization consistency of metric use may be more important than which specific metric is chosen. In other words, organization-level depression outcome measurement and research efforts should, above all else, strive to compare the metric equivalent of “apples with apples.”
At the same time, our results show that differently defined response rates tend to cluster across metrics: metrics 2, 3, and 4 were similar, as were metrics 1, 5, 6, and 7. This finding suggests that when organizations are compared using different metrics within one of these clusters, metric choice is of less concern. Furthermore, given that the eight metrics in this investigation are highly correlated, one possible application of our findings could be to provide a theoretical “conversion factor” between metrics. For example, on the basis of the 3-month cohort data in Table 1, one could expect a response rate defined by metric 2 (48%) to be roughly 1.5 times that of metric 5 (32%), with no true differences among patients’ clinical statuses.
The findings in this investigation are limited by the real-world nature of the AIMS Center data set, which had missing information, including follow-up PHQ-9 scores and patient demographic characteristics. This limitation is evidenced by the comparative sizes of the time cohorts, with the 3-, 6-, and 12-month cohorts including roughly 50%, 30%, and 10% of the full sample, respectively. Although these findings were not surprising, they highlight the challenges associated with consistent longitudinal outcome ascertainment in real-world outpatient settings. Of note, we were able to reduce the impact of missing gender and race data through imputation. Additionally, our sample of 33 organizations across multiple states remains one of the largest and most diverse real-world CoCM implementation data sets to date: approximately half of included patients were persons of color, and two-thirds of clinics were FQHCs. Finally, our relative lack of patient-level exclusion criteria supports the external validity and generalizability of our results.
Conclusions
Our findings demonstrate that the choice of PHQ-9 response or remission metric substantially affects observed treatment improvement rates. Organization-level rankings for depression response and remission also vary by metric choice, but these rankings are highly correlated. We therefore conclude that in organization-level PHQ-9 response or remission rate comparisons, metric consistency is imperative, whereas the specific metric chosen is of secondary importance.
Acknowledgments
The authors thank the 33 organizations that agreed to share their deidentified Care Management Tracking System registry data for this work. Through their generosity, a collaborative care implementation data set of more than 35,000 adult patients was created, providing a tremendous opportunity to help study and improve care for depression.
The authors report no financial relationships with commercial interests.