We demonstrated that response in MDD can be predicted using pretreatment and early-treatment multimodal MRI and clinical data. Validation on the placebo arm and sertraline-treated placebo nonresponders suggested specificity for sertraline treatment compared with placebo treatment. Our results indicated that predictors with strong scientific evidence are the primary drivers of model performance. Finally, multimodal prediction outperformed most unimodal approaches, and models using ASL predictors showed the best unimodal predictions, even with pretreatment data only.
Multimodal Modeling
Compared with other multimodal studies, our results confirm the positive conclusions of Patel et al. (
34) and Leaver et al. (
35). These studies’ small sample sizes of 44 and 19 patients, respectively, raise the possibility that their findings are affected by performance bias (
36). More recently, Sajjadian et al. (
11) presented an approach using data from the Canadian Biomarker Integration Network in Depression (CAN-BIND-1) study. Compared with that study, our results are an improvement both in terms of bAcc and AUROC on corresponding analyses (pretreatment prediction of response: bAcc, 65% vs. 60%; AUROC, 0.71 vs. 0.60; early-treatment prediction of response: bAcc, 68% vs. 66%; AUROC, 0.73 vs. 0.70). This improvement is larger for pretreatment prediction than for early-treatment prediction. Since CAN-BIND-1 follow-up clinical data were acquired at week 2 instead of week 1 as in EMBARC, we suspect that the CAN-BIND-1 clinical data might present a larger early-treatment effect. Since the two approaches are methodologically similar, another explanation for our improved results is the availability of ASL in EMBARC but not in CAN-BIND-1. Our unimodal results substantiate this hypothesis, indicating significant predictive power for ASL alone.
We performed three post hoc analyses. First, our findings held under leave-site-out cross-validation (see Table S8 in the online supplement), suggesting that site effects are limited. Second, comparing alternative classifiers with the XGBoost classifier showed that they, too, can predict treatment response, albeit at lower performance levels (see Table S9 in the online supplement). Finally, we found that regression models can predict continuous outcome scores (see Table S10 and Figure S1 in the online supplement).
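The leave-site-out scheme referenced above can be sketched with scikit-learn's `LeaveOneGroupOut`; the data, site labels, and the generic gradient-boosting classifier below are synthetic placeholders, not the study's actual pipeline.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import balanced_accuracy_score
from sklearn.model_selection import LeaveOneGroupOut

rng = np.random.default_rng(0)
# Synthetic stand-in for multimodal predictors and response labels
X, y = make_classification(n_samples=160, n_features=20, random_state=0)
sites = rng.integers(0, 4, size=160)  # four hypothetical acquisition sites

logo = LeaveOneGroupOut()
scores = []
for train_idx, test_idx in logo.split(X, y, groups=sites):
    clf = GradientBoostingClassifier(random_state=0)
    clf.fit(X[train_idx], y[train_idx])
    scores.append(balanced_accuracy_score(y[test_idx], clf.predict(X[test_idx])))

print(f"leave-site-out bAcc per held-out site: {np.round(scores, 2)}")
print(f"mean bAcc: {np.mean(scores):.2f}")
```

Each fold holds out all patients from one site, so the estimate reflects transfer to an unseen acquisition site rather than to unseen patients from a known site.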
When compared with the predictive performance reported by two recent meta-analyses of treatment response prediction based on MRI for pharmacotherapy, our approach outperformed the weighted estimate of the natural logarithm of the diagnostic odds ratio (2.46 [SD=1.10] and 2.11 [SD=0.56], respectively) (
10), as well as the mean bAcc for adequate-quality studies (68% [SD=10] and 63% [SD=7], respectively) (
36).
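For reference, the two summary metrics compared above, balanced accuracy and the natural logarithm of the diagnostic odds ratio, can be computed from a confusion matrix as follows; the counts are illustrative, not drawn from any cited study.

```python
import math

# Illustrative confusion-matrix counts (hypothetical, not from any cited study)
tp, fn = 40, 15   # responders correctly / incorrectly classified
tn, fp = 35, 20   # nonresponders correctly / incorrectly classified

sensitivity = tp / (tp + fn)
specificity = tn / (tn + fp)
balanced_accuracy = (sensitivity + specificity) / 2

# Diagnostic odds ratio and its natural log, the effect size
# summarized in the cited meta-analyses
dor = (tp * tn) / (fp * fn)
ln_dor = math.log(dor)

print(f"bAcc = {balanced_accuracy:.2f}, ln(DOR) = {ln_dor:.2f}")
# → bAcc = 0.68, ln(DOR) = 1.54
```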
Several considerations should be borne in mind when qualitatively evaluating the utility of the model. First, because we report balanced accuracy, the raw accuracy will be higher whenever the response rate exceeds 50%. An AUROC >0.7 can be considered good, depending on the intended use; in our case, the model outperforms the current trial-and-error standard used in clinical practice. Thus, the benefits of a treatment planning support tool that predicts treatment efficacy early on might quickly outweigh the costs of MRI scanning. Additionally, our work outperforms studies of similar scope. However, the clinical effectiveness of treatment decision support tools should be determined in independent prospective randomized clinical trials.
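The relation between balanced and raw accuracy noted above can be illustrated numerically. Assuming sensitivity exceeds specificity (an assumption for this sketch, not a reported result), raw accuracy exceeds balanced accuracy whenever the response rate is above 50%:

```python
# Hypothetical operating point; numbers are illustrative only
sensitivity, specificity = 0.75, 0.60  # assume sensitivity > specificity
b_acc = (sensitivity + specificity) / 2

# Raw accuracy = prevalence-weighted mix of sensitivity and specificity
for response_rate in (0.4, 0.5, 0.6, 0.7):
    raw_acc = response_rate * sensitivity + (1 - response_rate) * specificity
    print(f"response rate {response_rate:.0%}: raw accuracy {raw_acc:.3f} "
          f"(bAcc fixed at {b_acc:.3f})")
```

At a response rate of exactly 50% the two metrics coincide; above it, raw accuracy pulls ahead.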
In clinical practice, it might also be desirable to increase the detection rate of nonresponders at the cost of decreased sensitivity (i.e., labeling some eventual responders as nonresponders). This would allow true nonresponders to be switched to another treatment earlier, at a higher but acceptable rate of false "nonresponder" assignments among patients who might still respond to a second selective serotonin reuptake inhibitor (SSRI) or a serotonin-norepinephrine reuptake inhibitor (37). Finally, the lower bound on predictive performance at which MRI-based modeling balances patient burden and cost remains to be investigated.
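The trade-off described in this paragraph amounts to raising the decision threshold on predicted response probabilities; a minimal sketch on synthetic data:

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical predicted response probabilities for 200 patients (synthetic)
p_response = np.clip(rng.normal(0.55, 0.2, 200), 0.0, 1.0)
truth = rng.random(200) < p_response  # True = eventual responder

results = {}
for threshold in (0.5, 0.65):
    pred_responder = p_response >= threshold
    sens = (pred_responder & truth).sum() / truth.sum()
    spec = (~pred_responder & ~truth).sum() / (~truth).sum()
    results[threshold] = (sens, spec)
    print(f"threshold {threshold}: sensitivity {sens:.2f}, specificity {spec:.2f}")
```

Raising the threshold flags more patients as nonresponders, so specificity (detection of true nonresponders) rises while sensitivity falls.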
Unimodal Versus Multimodal Modeling
Our results indicate consistently and significantly higher performance for multimodal than for unimodal models. This supports the claim that integrating neuroimaging data with clinical data can improve treatment response prediction. However, two unimodal model classes did perform significantly better than chance: pretreatment models using ASL only and early-treatment models using clinical assessment data only. These two data types are primary candidates for a lean treatment response model. As an exploratory analysis, we report the performance of these models in Table S13 in the
online supplement. Nevertheless, our findings corroborate the call for multimodal data integration to improve predictive performance (
9).
Although differences in study design limit comparisons with other studies, the performance of our unimodal models was similar to that reported in previous unimodal studies (values given below as our study vs. theirs). This corroborates previous results and suggests that our multimodal results may hold in other study populations. Bartlett et al. (
39) predicted remission based on T1-weighted MRI (bAcc, 53% vs. 51%, and AUROC, 0.55 vs. 0.59, respectively). Korgaonkar et al. (
40) predicted remission using baseline diffusion-weighted imaging (bAcc, 52% vs. 54%). Finally, Chekroud et al. (
41) used early-treatment clinical variables for predictions of remission (bAcc, 65% vs. 63%, and AUROC, 0.69 vs. 0.70, respectively).
Interpretability
Balancing prediction performance with interpretability is an ongoing challenge in neuroimaging-based modeling. However, given that even the best-performing neuroimaging-based models show modest prediction performance, more emphasis on interpretability may be warranted (
42). Therefore, we performed a post hoc exploration of feature importance, as shown in Figure S2 and Table S11 in the
online supplement. Early-treatment response and perfusion in the ACC consistently contributed to our model’s performance. ACC perfusion is a well-replicated biomarker for treatment response (
8), and this finding aligns with previous work in MDD. A recent meta-analysis found that the ACC is crucial to treatment outcome because it acts as a hub in the interplay between the ventral emotion-generating and dorsal cognitive-controlling systems; SSRIs are thought to improve this ventral-dorsal control (
43). After 1 week, we observed a high importance of symptom severity reduction, a known predictor (
44). Unexpectedly, the hippocampal volume—the most replicated finding concerning treatment response in MDD (
8)—did not consistently contribute. A possible explanation is that predictors were selected individually, while a strong effect of left lateralization of hippocampal volume in treatment prediction has been noted (
45). Similarly, the connectivity in the default mode network measured with rs-fMRI did not contribute consistently to our models. For more information on predictor importance, see Figure S2 and Table S11 in the
online supplement.
Strengths and Limitations
The design of the EMBARC trial allowed us to increase the generalizability of our results in three ways. First, EMBARC is the largest multimodal neuroimaging MDD data set available, and larger sample sizes have been shown to reduce bias in performance estimation (
9,
36). Second, data were acquired at four sites using 3-T scanners from different manufacturers. We preprocessed and harmonized the data using standard software packages to increase generalizability to new populations (
46) and tested our results using leave-site-out cross-validation. Third, we performed validation in a separate study arm, instead of just internal cross-validation as in most studies (
10). Performance was not significantly reduced in the separate set of sertraline-treated patients, which is a positive indicator of the generalizability of our method. An additional strength is that the hypotheses and methods in this work were preregistered prior to analysis. Preregistration avoids overestimation of performance due to overfitting, which occurs when algorithms are tailored to the training data during analysis. Although preregistering predictors by evidence tier (
8) inhibits the exploration of other combinations of predictors, it increases the reproducibility and validity of our results since the prespecified predictors (tier 1) were backed by strong scientific evidence (systematic reviews or meta-analyses).
While the study design provides valuable strengths, several limitations should be acknowledged. First, our results are limited to sertraline treatment from data acquired in the EMBARC study. In the EMBARC study, information about response to previous antidepressant treatment was not available, which might be valuable to add to future predictive tools (
47). Thus, further validation of our approach is required on even larger collections of external data, for different antidepressants, and in populations with more clinical heterogeneity than in a randomized clinical trial. Until then, our results should be interpreted with caution. To our knowledge, no existing data set has collected the same multimodal data at the same time points as the EMBARC trial, which currently precludes such external validation. Second, although excluding patients did not significantly affect population characteristics, excluding patients who did not complete the study’s first phase may have favored prediction results in the direction of adherent participants. Third, the EMBARC study’s design allowed us to validate our models on the same patients treated with placebo (subgroup B) and sertraline (subgroup C). However, subgroup C lacked patients from subgroup B who had responded to placebo treatment in the first phase of the study. This selection can be expected to introduce a bias in our results. Since response is known to be partially driven by placebo response (
38), lowered treatment response and decreased treatment response prediction performance in subgroup C could have been expected. However, the response rate in subgroup C was not negatively affected by this selection compared with subgroup A (
Table 1). Our results also show that performance in subgroup C was not significantly reduced compared with subgroup A. Thus, we can assume that this type of selection bias did not affect our conclusions. Finally, although task-based fMRI might provide promising information to add to multimodal modeling (
48–
50) and likewise provide important insights into context-dependent neural mechanisms, we excluded this option, given its limited scalability and replicability in clinical practice (
51). Therefore, we could not assess its potential benefits in treatment response prediction (
8).
We found no significant difference in performance between pretreatment and early-treatment prediction. If externally validated, early-treatment response prediction would likely not require a second MRI session, reducing cost and lowering patient burden. Our results show a performance drop for unimodal ASL models at early treatment but not at pretreatment; because we decided a priori to use relative changes at early treatment, which are sensitive to physiological variability, we suspect this choice was suboptimal for ASL predictors. To overcome this limitation, we suggest combining absolute ASL predictors with early-treatment clinical predictors to improve performance. Other options for improvement include integrating predictors from other sources and utilizing novel analytic methods, such as normative modeling (
52).
In summary, our findings on a multimodal machine-learning-based method applied to data from the EMBARC trial show that pretreatment and early-treatment prediction of sertraline treatment response in MDD patients is feasible using brain MRI and clinical data and significantly outperforms chance and most unimodal models. Our results also suggest the specificity of our models for sertraline compared with placebo treatment. We found that ASL was the best unimodal predictor. With additional external validation, these findings will contribute toward the use of predictive modeling in individualizing clinical sertraline treatment of patients with MDD.