In the present issue of the American Journal of Psychiatry, Smucny et al. suggest that predictive algorithms for psychosis using machine learning (ML) methods may already achieve a clinically useful level of accuracy (1). In support of this perspective, these authors report on the results of an analysis using the North American Prodrome Longitudinal Study, Phase 3 (NAPLS3) data set (2), which they accessed through the National Institute of Mental Health Data Archive (NDAR). This is a large multisite study of youth at clinical high risk for psychosis followed up on multiple occasions with clinical, cognitive, and biomarker assessments. Several ML approaches were compared with each other and with Cox (time-to-event) and logistic regression using the clinical, neurocognitive, and demographic features from the NAPLS2 individualized risk calculator (3), with salivary cortisol also tested as an add-on biomarker. When these variables were analyzed using Cox and logistic regression, the model applied to the NAPLS3 cohort attained a level of predictive accuracy comparable to that observed in the original NAPLS2 cohort (overall accuracy in the 66%–68% range). However, several ML algorithms produced nominally better results, with a random forest model performing best (overall accuracy in the 90% range). Although a predictive algorithm with 90% or higher accuracy would clearly have greater clinical utility than one with substantially lower accuracy, several issues remain to be resolved before it can be determined whether ML methods have attained this utility “threshold.”
First and foremost, an ML model’s expected real-world performance can only be ascertained when the model is tested in an independent sample or data set that it has never before encountered. ML methods are very adept at finding apparent structure in data that predicts an outcome, but if that structure is idiosyncratic to the training data set, the model will fail to generalize to other contexts and thus not be useful, a problem known as “overfitting” (4). Internal cross-validation methods are not sufficient to overcome this problem, since the model “sees” all of the training data at some point in the process, even if a portion is left out on any particular iteration (5). Overfitting is indicated by a substantial drop-off in model accuracy when moving from the original internally cross-validated training data set to an external, independent validation test. Smucny et al. (1) acknowledge the need for an external replication test before the utility of the ML models they evaluated using only internal cross-validation methods can be fully appreciated.
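To make the distinction concrete, the following minimal sketch (simulated data and an arbitrary scikit-learn classifier, not the authors’ models; cohort sizes and feature counts are hypothetical) contrasts internally cross-validated performance with performance on an external cohort the model never touches during training.

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import balanced_accuracy_score
    from sklearn.model_selection import cross_val_score, train_test_split

    # Simulated stand-in for a CHR cohort: roughly 15% "converters," a handful of features.
    X, y = make_classification(n_samples=800, n_features=8, n_informative=4,
                               weights=[0.85, 0.15], random_state=0)
    # Treat half the data as a separately collected external cohort.
    X_train, X_ext, y_train, y_ext = train_test_split(X, y, test_size=0.5,
                                                      stratify=y, random_state=0)

    model = RandomForestClassifier(n_estimators=500, random_state=0)

    # Internal cross-validation: every observation informs the estimate at some point.
    internal = cross_val_score(model, X_train, y_train, cv=5,
                               scoring="balanced_accuracy").mean()

    # External validation: the model is fit once and scored on cases it never saw.
    model.fit(X_train, y_train)
    external = balanced_accuracy_score(y_ext, model.predict(X_ext))

    print(f"internal CV balanced accuracy: {internal:.2f}")
    print(f"external balanced accuracy:    {external:.2f}")
    # A large gap between these two numbers is the signature of overfitting.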
Is there likely to be a substantial drop-off in accuracy of the ML models reported by Smucny et al. (1) when such an external validation test is performed? On one hand, they limited consideration to a small number of features that have previously been shown to predict psychosis in numerous independent samples (i.e., the variables in the NAPLS2 risk calculator [3]). This aspect mitigates the overfitting issue to some extent, because the features used in model building have already been filtered (based on prior work) to be highly likely to predict conversion to psychosis, both individually and when combined in a regression model. On the other hand, the ML models employed in the study use various approaches to find higher-order interactive and nonlinear combinations of this set of feature variables that maximally discriminate outcome groups. This aspect increases the risk of overfitting, given that a very large number of such higher-order interactive effects are assessed in model building, with relatively few subjects available to represent each unique permutation, a problem known as the “curse of dimensionality” (6). Tree-based methods such as the random forest model that performed best in the NAPLS3 data set are not immune to this problem and, in fact, are particularly vulnerable to it when applied to data sets with relatively small numbers of individuals with the outcome of interest (7).
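A toy illustration of this vulnerability (entirely hypothetical data, not NAPLS3): a tree ensemble given many candidate features and only a small number of “converters” can memorize idiosyncratic structure, achieving near-perfect apparent accuracy on pure noise while its cross-validated performance falls back to chance.

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_score

    rng = np.random.RandomState(1)
    n, p, n_converters = 200, 30, 25          # few positives, many candidate features
    X = rng.normal(size=(n, p))               # noise features: no true signal
    y = np.zeros(n, dtype=int)
    y[:n_converters] = 1

    rf = RandomForestClassifier(n_estimators=500, random_state=1)
    rf.fit(X, y)
    print("apparent (training) accuracy:", rf.score(X, y))                    # close to 1.0
    print("cross-validated AUC:",
          cross_val_score(rf, X, y, cv=5, scoring="roc_auc").mean())          # close to 0.5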
The relatively low base rate of conversion to psychosis (i.e., 10%–15%), even in a sample selected to be at elevated risk, as in NAPLS3, creates another problem for ML methods; namely, such models can achieve high levels of predictive accuracy in the training data set simply by guessing that each case is a nonconverter. Smucny et al. (1) attempt to overcome this issue using a synthetic approach that effectively upsamples the minority class (in this case, converters to psychosis) to the point that it has 50% representation in the synthetic sample (8). Although this approach is very helpful in preventing ML models from defaulting to prediction of the majority class, its use in computing cross-validation performance metrics is likely to be highly misleading, given that real-world application of the model is not likely to occur in a context in which there is a 50:50 ratio of future converters to nonconverters. Rather, the model will be applied in circumstances in which new clinical high risk (CHR) individuals’ likelihoods of conversion are computed, and those CHR individuals will derive from a population in which the base rate of conversion is ∼15%. It is now well established that the same predictive model will result in different risk distributions (and, thereby, different thresholds in model-predicted risk for making binary predictions) in samples that vary in base rates of conversion to psychosis (9). Given this, a 90% predictive accuracy of an ML algorithm in a synthetically derived sample in which the base rate of psychosis conversion is artificially set to 50% is highly unlikely to generalize to an independent, real-world CHR sample, at least as ascertained using current approaches.
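The sketch below (simulated data and an off-the-shelf SMOTE implementation, used purely for illustration; the authors’ exact pipeline may differ) contrasts the two ways synthetic oversampling can enter an analysis: balancing the sample before cross-validation, so that performance is computed on an artificially 50:50 data set, versus oversampling only within each training fold, so that performance is still evaluated at the real-world base rate.

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_score
    from imblearn.over_sampling import SMOTE
    from imblearn.pipeline import Pipeline

    # Simulated cohort with a ~15% "converter" rate.
    X, y = make_classification(n_samples=600, n_features=8, n_informative=3,
                               weights=[0.85, 0.15], random_state=0)
    rf = RandomForestClassifier(n_estimators=300, random_state=0)

    # Misleading: oversample first, then cross-validate on the balanced data
    # (synthetic minority cases derived from the same individuals leak across folds).
    X_bal, y_bal = SMOTE(random_state=0).fit_resample(X, y)
    print("CV on pre-balanced data:",
          cross_val_score(rf, X_bal, y_bal, cv=5, scoring="balanced_accuracy").mean())

    # Preferable: oversample only the training folds; test folds keep the ~15% base rate.
    pipe = Pipeline([("smote", SMOTE(random_state=0)), ("rf", rf)])
    print("CV with in-fold SMOTE:",
          cross_val_score(pipe, X, y, cv=5, scoring="balanced_accuracy").mean())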
When developing the NAPLS2 risk calculator, the investigators made purposeful decisions to allow the resulting algorithm to be applied validly in scaling the risk of newly ascertained CHR individuals (3). Key among these decisions was to avoid using the NAPLS2 data set to test different possible models, which would then necessitate an external validation test. Rather, a small number of predictor variables was chosen based on their empirical associations with conversion to psychosis in prior studies, and Cox regression was employed to generate an additive multivariate model of predicted risk (i.e., no interactive or nonlinear combinations of the variables were included). As a result, the ratio of converters to predictor variables was 10:1 (helping to create adequate representation of the scale values of each predictor in the minority class), and there was no need to use a synthetic sampling approach, given that Cox regression is well suited for prediction of low base rate outcomes. The predictor variables chosen for inclusion are ones that are easily ascertained in standard clinical settings and have a high level of acceptability (face validity) for use in clinical decision making. It is important to note that the NAPLS2 model has been shown to replicate (in terms of area under the curve or concordance index) when applied to multiple external independent data sets (10).
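As a sketch of what such an additive time-to-event model looks like in code (the variable names, simulated data, and use of the lifelines package below are hypothetical placeholders, not the actual NAPLS2 predictors or coefficients):

    import numpy as np
    import pandas as pd
    from lifelines import CoxPHFitter

    rng = np.random.default_rng(0)
    n = 400
    df = pd.DataFrame({
        "time_to_event_months": rng.exponential(24, n).clip(1, 24),  # follow-up time
        "converted": rng.binomial(1, 0.15, n),                        # ~15% base rate
        "unusual_thought_content": rng.normal(size=n),                # illustrative predictors
        "verbal_learning": rng.normal(size=n),
        "processing_speed": rng.normal(size=n),
        "age": rng.normal(19, 3, n),
    })

    # Additive Cox model: main effects only, no interaction or nonlinear terms.
    cph = CoxPHFitter()
    cph.fit(df, duration_col="time_to_event_months", event_col="converted")
    cph.print_summary()

    # Predicted 2-year conversion risk for a newly ascertained individual.
    new_case = df.drop(columns=["time_to_event_months", "converted"]).iloc[[0]]
    risk_24mo = 1 - cph.predict_survival_function(new_case, times=[24]).iloc[0, 0]
    print(f"predicted 24-month conversion risk: {risk_24mo:.2f}")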
Nevertheless, two issues continue to limit the utility of the NAPLS2 risk calculator. One is that it will generate differently shaped risk distributions in samples that vary in conversion risk and in the distributions of the individual predictor variables, making it problematic to apply the same threshold of predicted risk for binary predictions across samples that differ in these ways (9, 11). However, it appears possible to derive comparable prediction metrics across samples with differing conversion risks when considering the relative recency of onset or worsening of attenuated positive symptoms at the baseline assessment (11). A more recent onset or worsening of attenuated positive symptoms defines a subgroup of CHR individuals with a higher average predicted risk and a higher overall transition rate, and one in whom particular putative illness mechanisms, in this case an accelerated rate of cortical thinning (12), appear to be differentially relevant (11).
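The first issue can be illustrated with a small simulation (entirely hypothetical numbers, a single illustrative predictor, and scikit-learn’s logistic regression standing in for the actual calculator): when the same fitted model is applied to a cohort whose predictor distribution is shifted upward, the distribution of predicted risk shifts with it, and the cutoff needed to achieve a given sensitivity changes accordingly.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(0)

    def simulate_cohort(n, shift):
        # One illustrative risk feature; conversion probability rises with the feature.
        x = rng.normal(loc=shift, size=(n, 1))
        p = 1 / (1 + np.exp(-(-2.0 + 1.2 * x[:, 0])))
        y = rng.binomial(1, p)
        return x, y

    X_dev, y_dev = simulate_cohort(2000, shift=0.0)   # development cohort, ~15% convert
    X_new, y_new = simulate_cohort(2000, shift=0.8)   # enriched cohort, higher base rate

    model = LogisticRegression().fit(X_dev, y_dev)    # stands in for the risk calculator

    for label, X, y in [("development", X_dev, y_dev), ("enriched", X_new, y_new)]:
        risk = model.predict_proba(X)[:, 1]
        cutoff_80sens = np.percentile(risk[y == 1], 20)  # cutoff giving ~80% sensitivity
        print(f"{label}: median predicted risk {np.median(risk):.2f}, "
              f"80%-sensitivity cutoff {cutoff_80sens:.2f}")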
The second rate-limiting issue for the utility of the NAPLS2 risk calculator is that its performance in terms of sensitivity, specificity, and balanced accuracy, even when accounting for recency of onset of symptoms, is still in the 65%–75% range. Although ML methods represent one approach that, if externally validated, could conceivably yield predictive models at the 90% or higher level of accuracy, such models would continue to have the disadvantage of being relatively opaque (“black box”) with respect to how the underlying predictor variables aggregate in defining risk, and for that reason may not be adopted as readily in clinical practice. Alternatively, it may be possible to rely on more transparent analytic approaches to achieve the needed level of accuracy. It has recently been demonstrated that integrating information on short-term (baseline to 2-month follow-up) change on a single clinical variable (e.g., deterioration in odd behavior/appearance) improves the performance of the NAPLS2 risk calculator to >90% levels of sensitivity, specificity, and balanced accuracy, i.e., a range that would support its use in clinical trial design and clinical decision making (13). Importantly, although the Cox regression component of this algorithm has been externally validated, the incorporation of short-term clinical change (via mixed-effects growth modeling) still requires replication in an external data set.
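A minimal sketch of the general idea behind mixed-effects growth modeling of short-term symptom change (the variable names, visit schedule, and use of statsmodels below are illustrative assumptions, not the published method): person-specific slopes estimated from the earliest assessments could then be carried forward as an additional predictor alongside the baseline risk-calculator variables.

    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf

    rng = np.random.default_rng(0)
    n = 300

    # Simulate three early assessments (baseline, 1 month, 2 months) of one
    # illustrative symptom rating, with person-specific baselines and slopes.
    person_intercept = rng.normal(2.5, 0.8, n)
    person_slope = rng.normal(0.1, 0.3, n)
    long = pd.DataFrame({
        "id": np.repeat(np.arange(n), 3),
        "months": np.tile([0.0, 1.0, 2.0], n),
    })
    ids = long["id"].to_numpy()
    long["odd_behavior"] = (person_intercept[ids]
                            + person_slope[ids] * long["months"]
                            + rng.normal(0, 0.4, len(long)))

    # Mixed-effects growth model: random intercept and random slope for time within person.
    result = smf.mixedlm("odd_behavior ~ months", long, groups=long["id"],
                         re_formula="~months").fit()
    print(result.summary())

    # Person-specific slope deviations capture short-term change and could be
    # added as a covariate to the baseline risk model.
    change_slopes = pd.Series({gid: re.iloc[-1]
                               for gid, re in result.random_effects.items()},
                              name="change_slope")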
Smucny et al. (1) are to be congratulated on a well-motivated and well-executed analysis of the NAPLS3 data set. It is heartening to see such creative uses of this unique shared resource for our field bear fruit, reinforcing the value of open science. As we move toward a time when prediction models of psychosis and related outcomes have utility for clinical decision making, whether those models rely on machine learning methods or more traditional approaches, it will be crucial to insist on external validation of results before deciding that we are, in fact, “there.”