The ability to accurately predict conversion to psychosis from clinical and other measurable features of an at-risk state is critically important to clinicians. To demonstrate clinical utility, these models should predict conversion with at least 80% sensitivity and specificity (1) and concurrently high positive and negative predictive values (PPV/NPV). Over the past decade, substantial progress has been made in this area with the development of “risk calculators,” which consider various demographic, clinical, and neurocognitive factors in addition to family history to predict future conversion (e.g., 2, 3). The most well-studied of these, the calculator based on the second North American Prodrome Longitudinal Study (NAPLS-2), achieved a model concordance index (analogous to the area under a receiver operating characteristic curve) of 0.71 (2).
These encouraging results helped motivate the NAPLS-3 study (4), which includes longitudinal measurements from 710 individuals at clinical high risk for psychosis and 96 age- and sex-matched healthy control participants (4). To our knowledge, the ability of the features specified in the NAPLS-2 calculator to predict conversion in the NAPLS-3 sample has not yet been evaluated. We thus examined the ability of these features, as well as cortisol (assessed at baseline), to predict conversion in clinical high-risk individuals using various linear (e.g., Cox proportional hazards regression, logistic regression, support vector machine) and nonlinear (e.g., random forest) machine learning algorithms. We hypothesized that these features would predict conversion with performance in line with models from other data sets, with some variability depending on the machine learning algorithm. We also hypothesized that nonlinear machine learning methods would perform qualitatively better than linear methods because of their ability to model complex nonlinear relationships.
Participants
The NAPLS-3 is an NIMH-funded study conducted at nine sites. All participants provided written informed consent, including parental consent for minors. The study was approved by all sites’ institutional review boards.
A detailed description of NAPLS-3 participants (including exclusion criteria) is provided in Addington et al. (4). Briefly, 710 clinical high-risk individuals and 96 healthy control individuals were recruited and followed for up to 2 years, with some longer exceptions (see Results). Participants were between 12 and 30 years old. Predictors included those used by the NAPLS-2 calculator (riskcalc.org/napls; see Table 1 for the full list). Because a recent study found that salivary cortisol improved prediction in the NAPLS-2 (5), we examined models both with and without cortisol as a predictor. Healthy control participants and clinical high-risk participants who lacked follow-up data were not included in the machine learning models.
Consistent with prior work (6), conversion to psychosis was defined as meeting the Presence of Psychotic Symptoms criteria: at least one of the five positive symptoms on the SIPS Scale of Psychosis-Risk Symptoms must reach a psychotic level of intensity (a rating of 6) for ≥1 hour per day, 4 days per week, during the past month, and/or these symptoms must seriously impact the individual’s functioning.
Analyses
First, as performed previously (2), a Cox proportional hazards regression analysis was conducted with these predictors (SAS, version 9.4) to examine consistency with prior NAPLS-2 findings.
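For orientation, a minimal sketch of this step in Python (using the lifelines package in place of SAS) is given below; the file name and the column names days_to_event and converted are hypothetical placeholders for the NAPLS-3 variables, not the study’s actual code.

```python
# Minimal sketch of the Cox proportional hazards step, assuming a data
# frame with one row per clinical high-risk participant. File and column
# names are hypothetical; lifelines stands in for SAS v.9.4.
import pandas as pd
from lifelines import CoxPHFitter

# Each row: the NAPLS-2 calculator predictors plus follow-up time and outcome.
df = pd.read_csv("napls3_chr_predictors.csv")

cph = CoxPHFitter()
cph.fit(
    df,
    duration_col="days_to_event",  # days from baseline to conversion or censoring
    event_col="converted",         # 1 = converted to psychosis, 0 = censored
)
cph.print_summary()  # hazard ratios, p-values, and Harrell's concordance index
```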
For machine learning, standard algorithms were employed using Weka software (University of Waikato, New Zealand): logistic regression, naive Bayes, a support vector machine (three kernels), KStar, a J48 decision tree, random forest, decision stump (with 100 iterations of AdaBoost), and a multilayer perceptron. Classifier accuracies were calculated by averaging performance across 100 random splits of 90% training data and 10% test data for each algorithm. Individuals with missing data were excluded from analysis. Because of class imbalance (see Results), training data for the minority (converter) class were upsampled prior to model fitting using the Synthetic Minority Oversampling Technique (SMOTE) (7); the minority class was oversampled by 400%, with k (the number of nearest neighbors) set to 5. We also determined feature importance rankings for the best-performing classifier based on each feature’s contribution to the area under the receiver operating characteristic curve. A sketch of this evaluation scheme is shown below.
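The Weka workflow itself is interactive, but the evaluation scheme can be illustrated in Python with scikit-learn and imbalanced-learn. This is a sketch under assumed file and column names, not the original pipeline; note that SMOTE is applied only to the training portion of each split, as in the study.

```python
# Illustrative sketch of the evaluation scheme: 100 random 90/10
# train/test splits, with 400% SMOTE oversampling (k=5) of the converter
# class applied to training data only. File/column names are hypothetical.
import numpy as np
import pandas as pd
from imblearn.over_sampling import SMOTE
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

df = pd.read_csv("napls3_chr_predictors.csv").dropna()  # exclude missing data
X = df.drop(columns=["converted", "days_to_event"]).to_numpy()
y = df["converted"].to_numpy()  # 1 = converter, 0 = non-converter

accuracies = []
for seed in range(100):  # 100 random 90%/10% assortments
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.10, stratify=y, random_state=seed
    )
    # 400% oversampling: grow the converter class to 5x its original size
    # (the originals plus four synthetic examples per original).
    n_minority = int(y_tr.sum())
    smote = SMOTE(
        sampling_strategy={1: n_minority * 5}, k_neighbors=5, random_state=seed
    )
    X_tr, y_tr = smote.fit_resample(X_tr, y_tr)
    clf = RandomForestClassifier(random_state=seed).fit(X_tr, y_tr)
    accuracies.append(accuracy_score(y_te, clf.predict(X_te)))

print(f"mean test accuracy across splits: {np.mean(accuracies):.3f}")
```

Feature rankings analogous to the AUC-based rankings reported here could be approximated with, for example, scikit-learn’s permutation_importance, although the exact procedure used in Weka may differ.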
Results
Demographic and clinical information for participants (including the healthy control group) is provided in Tables 1 and 2. As previously reported (8), relative to the healthy control group, clinical high-risk participants had lower Brief Assessment of Cognition in Schizophrenia (BACS) symbol coding and Hopkins Verbal Learning Test (HVLT) scores, more trauma, a greater decrease in social functioning over the past year (i.e., prior to baseline), more undesirable life events, higher SIPS delusions plus suspiciousness scores, and higher salivary cortisol. A higher percentage of clinical high-risk participants also had a first-degree relative with psychosis.
Of the 598 clinical high-risk participants with complete data, 62 converted to psychosis over the course of the follow-up period and 536 did not. The average time from baseline to conversion was 278 days (range, 4–1,361 days). Four clinical high-risk individuals converted more than 2 years after their baseline assessment.
The Cox regression model without cortisol was statistically significant (likelihood ratio χ2=26.04, p=0.001; Harrell’s concordance index=0.70 [SE=0.04]; mean specificity [across time]=0.67; mean sensitivity=0.62; mean PPV=0.15; mean NPV=0.95). Including cortisol did not substantially improve the model (likelihood ratio χ2=24.93, p=0.003; Harrell’s concordance index=0.70 [SE=0.03]; mean specificity=0.54; mean sensitivity=0.75; mean PPV=0.14; mean NPV=0.96).
Performance metrics for each machine learning algorithm are provided in Table 3. Briefly, all models performed significantly above chance. The algorithm with the best overall performance was random forest. Including cortisol as a predictor did not appreciably alter the performance metrics of most algorithms. For the random forest algorithm, the features in descending order of importance were: baseline SIPS P1 plus P2 score (delusions plus suspiciousness), HVLT raw score, number of undesirable life events, number of trauma types, BACS symbol coding raw score, decrease in global social functioning over the past year, age, having a first-degree relative with psychosis, and cortisol.
Discussion
As expected and previously reported (8), clinical high-risk participants in the NAPLS-3 were more likely than healthy control participants to have a first-degree relative with psychosis and had worse neurocognition, more trauma and deleterious life events, a greater decrease in social functioning prior to baseline, and higher levels of psychotic symptoms. Clinical high-risk participants also had higher cortisol, possibly indicative of greater chronic stress relative to the healthy control group. Cox regression performance was comparable to previous clinical high-risk studies (2, 9). Logistic regression performance (66%–68% accuracy, depending on inclusion of cortisol) was in line with prior studies (2, 3, 5, 9–13). All machine learning algorithms performed above chance, with accuracies of 65% or higher. As hypothesized, linear methods (Cox regression, logistic regression, support vector machine) performed worse than most nonlinear methods (e.g., random forest). Furthermore, the highest-performing algorithm (random forest, with or without cortisol) achieved ∼90% accuracy while maintaining >75% sensitivity and >85% specificity, PPV, and NPV. Baseline SIPS delusions plus suspiciousness score was the most important predictor.
Although it was expected that all algorithms would perform better than chance at predicting conversion to psychosis in clinical high-risk individuals, it was somewhat surprising that the best algorithm (random forest) performed at such a high level, given that previous studies suggest these features predict conversion with accuracies (or metrics related to accuracy, e.g., concordance indices) between ∼70% and 80% (2, 3, 5, 9–13). Notably, however, the majority of these studies used regression-based modeling to predict conversion (logistic regression performed worse than most other methods in this study), and none used the random forest algorithm. What aspects of the random forest may have enhanced performance to this degree? First, unlike most classifiers, a random forest is an “ensemble” classifier, in which the predictions (converter or non-converter) of several decision trees are tallied and the majority vote is used to make the overall prediction (14).
The individual trees are built from random combinations of features, such that each tree casts its vote independent of, and decorrelated from, all the others. The decision boundary induced by a random forest is therefore highly nonlinear compared with some other methods (e.g., logistic regression). Because not all features are used in each tree, the random forest is relatively immune to the “curse of dimensionality,” in which increasing the number of features causes overfitting unless the sample size is increased exponentially in parallel. Averaging the votes of the decision trees also helps reduce overall variance. As the generalizability of this performance enhancement is unclear, an interesting future direction would be to apply the random forest algorithm to predict conversion in clinical high-risk individuals using other data sets (e.g., the NAPLS-2). A conceptual sketch of the ensemble mechanism appears below.
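To make the voting mechanism concrete, the toy implementation below (a simplified illustration, not the Weka implementation) trains each tree on a bootstrap sample and a random subset of features and predicts by majority vote. Production implementations typically re-randomize the candidate features at every split rather than once per tree, and the class ToyRandomForest and its parameters are hypothetical. Labels are assumed binary (converter=1, non-converter=0).

```python
# Conceptual sketch of a random forest: decorrelated trees, majority vote.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

class ToyRandomForest:
    def __init__(self, n_trees=100, n_features=3, seed=0):
        self.n_trees, self.n_features = n_trees, n_features
        self.rng = np.random.default_rng(seed)
        self.trees, self.feat_idx = [], []

    def fit(self, X, y):
        n, p = X.shape
        for _ in range(self.n_trees):
            rows = self.rng.integers(0, n, size=n)               # bootstrap sample
            cols = self.rng.choice(p, self.n_features, replace=False)  # feature subset
            self.trees.append(DecisionTreeClassifier().fit(X[rows][:, cols], y[rows]))
            self.feat_idx.append(cols)
        return self

    def predict(self, X):
        # Tally each tree's vote (converter=1 / non-converter=0) and
        # return the majority decision for each participant.
        votes = np.stack([t.predict(X[:, c])
                          for t, c in zip(self.trees, self.feat_idx)])
        return (votes.mean(axis=0) > 0.5).astype(int)
```

Because each tree sees only a few features, adding predictors does not force every tree into a higher-dimensional space, which is one intuition for the method’s resistance to the curse of dimensionality noted above.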
Limitations of the present analyses include the small sample size (particularly for converters) and the heterogeneity of sample outcomes (time to conversion ranged from 4 to 1,361 days). The imbalanced data set also necessitated a minority class oversampling procedure (SMOTE) to prevent models from defaulting to predicting the majority class (results without SMOTE showed poor sensitivity and PPV [data not shown]). The converter/non-converter distribution used for training models in this study may therefore not be representative of the general clinical high-risk population. Our results also require replication in an independent data set to determine whether overfitting occurred during machine learning as a result of SMOTE. Overall, however, the relatively high performance of random forest and other methods suggests that when features selected from previous, independent studies are combined with modern machine learning methods, clinical outcome prediction may approach the performance standards needed for a predictive biomarker that provides early identification of individuals likely to transition to psychosis. Provided these results can be replicated in other clinical high-risk data sets, researchers can thus begin searching for the primary causes of this transition while preparing for delivery of palliative care. In the context of study limitations, when asking “are we there yet?” with regard to the development of predictive biomarkers for psychiatric practice, the answer may be, “We’re on the way, but we need more data.”