Evaluating the Machine Learning Literature: A Primer and User’s Guide for Psychiatrists
Machine Learning
Developing a Predictive Machine Learning Model
Data Collection and Cleaning
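As a concrete but purely illustrative example of the basic cleaning that typically precedes model development, the sketch below uses pandas to deduplicate records, flag implausible values, and audit missingness. The column names and plausibility rules are hypothetical.

```python
# A minimal data-cleaning sketch using pandas. Column names and range
# rules are hypothetical; real studies define cohort-specific checks.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "patient_id": [101, 101, 102, 103],
    "age": [34.0, 34.0, 151.0, 29.0],        # 151 is an implausible value
    "baseline_qids": [16.0, 16.0, 21.0, np.nan],
})

df = df.drop_duplicates(subset="patient_id")        # remove duplicate records
df.loc[~df["age"].between(18, 90), "age"] = np.nan  # mark implausible ages missing

# Quantify missingness per column before deciding on imputation or exclusion.
print(df.isna().mean())
```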
Algorithm Selection
Classifier | Description | Strengths | Weaknesses |
---|---|---|---|
Regression (logistic) | Estimates the probability of a binary outcome (yes/no or class 1/class 0) for each observation by passing a linear combination of the predictors through the logistic (or sigmoid) function. If the decision threshold is set to 0.5, observations with predicted probability >0.5 are labeled class 1, while those with probability <0.5 are labeled class 0. | Easy to implement. Input features do not need to be scaled (no assumptions as to feature distribution). Regularization provides built-in feature selection (to allow fitting with limited training data). Model coefficients are informative as to the relevance of a feature and direction of association (some inference). | Sensitive to outliers and multicollinearity (poor capture of complex relationships between features). Prone to overfitting with a large number of features. Requires a large amount of data. |
Support vector machine | Maps each observation into an n-dimensional feature space. The decision boundary (or hyperplane) separates the outcome classes (0 or 1) while maximizing the margin, the distance between the hyperplane and the support vectors (the observations that lie closest to the hyperplane and are therefore the most difficult to classify). | Less prone to overfitting than logistic regression. Able to detect nonlinear relationships between features. Performs well on semistructured and unstructured data. Regularized feature selection allows good performance on limited training data. Stable to changes in training data. | Sensitive to noise (overlapping outcome classes). Computationally expensive for large data sets. Requires transformation of categorical features to binary dummies (increases dimensionality of the data set) and feature scaling. May require selection of a kernel function; poor choices can greatly alter results. |
Decision tree | Constructs a decision algorithm based on a series of greater-than/less-than comparisons. Starting at the root, the data are sequentially split on whichever feature gives the highest information gain (the largest change in class probabilities between the “greater” and “lesser” answers). The splitting process continues until it reaches a leaf, which contains only one class of outcome labels (class 0 or 1). | Model is easily interpreted. Input does not need to be transformed (e.g., scaling, normalization). Tolerant of missing data (no imputation needed). Automatic feature selection (top branches = most informative features). | Prone to overfitting, especially with a large number of features. Computationally expensive (increased training time and memory). Unstable to even small changes in training data; may not generalize well to new data. |
Random forest | Multiple decision trees are created, and the outcome class (0 or 1) is determined by a majority vote of the generated decision trees. Each tree is built from a random subset of the data, reducing the influence of any one feature or data point on the outcome. | Less prone to overfitting than decision trees. Random subsets (ensemble training) make the resulting classifier more generalizable. | Computationally more expensive than decision trees. Often very difficult to interpret because features appear at different levels in individual trees. |
k-nearest neighbor | Prediction of outcome class (0 or 1) is based on whether the majority of points that are “near” the new example (in a derived feature space) are from class 0 or 1. k is the number of neighbors considered in the majority vote. | Not sensitive to noise or missing data. No underlying assumption about data distribution. | Computationally expensive with increasing number of features. All features given equal importance. Sensitive to outliers. May require feature transformations that distort data. Selection of k can greatly influence results. May generalize poorly to new data. |
Gaussian naive Bayes | Assumes all predictors are independent and equally important to prediction of the outcome class (0 or 1). Outcome class is determined by the highest posterior probability, a function of the prior probability of a class (the class distribution) and the likelihood, that is, the probability of a feature value given a class (modeled as Gaussian). | Scales well to large data sets. Requires less training data. Robust to outliers. Ignores missing values. | Dependency between attributes negatively affects performance. Assumes all features are Gaussian (normally) distributed, which is often not true for clinical variables. |
Artificial neural network | Each node (or neuron) receives inputs (or signals) from other nodes in the network, processes the summed inputs through a mathematical scaling function, and transmits the output to the other neurons. During the training process, the strength of the input signal is adjusted (weighted) at each connection to minimize error in predicting the outcome class (0 or 1). | Tolerates nonlinear relationships between features and can use these relationships to improve performance. Can employ multiple learning algorithms. Can represent history-sensitive situations where a predictor’s importance depends on what came just before. | Computationally expensive to train, although can be made efficient to apply. Can be very sensitive to how features are preprocessed and extracted; computational expense makes it difficult to explore many options. Largely a “black box,” very difficult to understand what influences predictions. Can be very vulnerable to small, even apparently meaningless changes in input data. |
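To make the table concrete, the sketch below trains each of these classifiers with scikit-learn on a synthetic binary-outcome data set and compares their cross-validated AUCs. The data set, parameter choices, and preprocessing are illustrative assumptions, not recommendations drawn from the studies discussed later.

```python
# Illustrative comparison of the classifiers in the table above on a
# synthetic data set; parameters are arbitrary demonstration values.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for a clinical sample: 500 patients, 20 features.
X, y = make_classification(n_samples=500, n_features=20, n_informative=8,
                           random_state=42)

classifiers = {
    "Logistic regression": LogisticRegression(max_iter=1000),
    "Support vector machine": make_pipeline(StandardScaler(), SVC()),
    "Decision tree": DecisionTreeClassifier(random_state=42),
    "Random forest": RandomForestClassifier(n_estimators=200, random_state=42),
    "k-nearest neighbor": make_pipeline(StandardScaler(),
                                        KNeighborsClassifier(n_neighbors=5)),
    "Gaussian naive Bayes": GaussianNB(),
    "Neural network (MLP)": make_pipeline(StandardScaler(),
                                          MLPClassifier(max_iter=2000,
                                                        random_state=42)),
}

# Five-fold cross-validated AUC for each classifier.
for name, clf in classifiers.items():
    aucs = cross_val_score(clf, X, y, cv=5, scoring="roc_auc")
    print(f"{name:24s} AUC = {aucs.mean():.3f} +/- {aucs.std():.3f}")
```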
Data-Set Splitting
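A minimal sketch of the two most common splitting strategies, a stratified hold-out split and k-fold cross-validation, follows; the synthetic data set stands in for a real clinical sample.

```python
# A minimal data-splitting sketch (X = features, y = binary outcome).
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, train_test_split

X, y = make_classification(n_samples=500, n_features=20, random_state=42)

# Hold-out split: reserve 30% of observations as an untouched test set,
# stratified so both sets keep the same outcome proportions.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=42)

# Stratified 10-fold cross-validation on the training set only; the test
# set is not touched until the final evaluation, to avoid data leakage.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
for fold, (tr_idx, val_idx) in enumerate(cv.split(X_train, y_train)):
    print(f"Fold {fold}: {len(tr_idx)} train, {len(val_idx)} validation")
```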
Preprocessing
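The sketch below illustrates one common preprocessing pattern: imputation, scaling, and categorical encoding are fit on the training data only and then applied unchanged to the test data, so that no test-set information leaks into training. The column names are hypothetical.

```python
# A minimal preprocessing sketch with scikit-learn; column names are
# hypothetical stand-ins for clinical variables.
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

train_df = pd.DataFrame({
    "age": [34.0, 51.0, np.nan, 29.0],
    "baseline_qids": [16.0, np.nan, 21.0, 18.0],
    "sex": ["F", "M", "F", np.nan],
})

preprocess = ColumnTransformer([
    ("num", Pipeline([("impute", SimpleImputer(strategy="median")),
                      ("scale", StandardScaler())]),
     ["age", "baseline_qids"]),
    ("cat", Pipeline([("impute", SimpleImputer(strategy="most_frequent")),
                      ("onehot", OneHotEncoder(handle_unknown="ignore"))]),
     ["sex"]),
])

# Fit on training rows only; imputing or scaling with statistics from the
# full data set would leak test-set information into training.
X_train = preprocess.fit_transform(train_df)
# X_test = preprocess.transform(test_df)  # reuse the fitted statistics
print(X_train)
```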
Model Selection
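In practice, model selection and hyperparameter tuning are often combined in a cross-validated grid search that maximizes a chosen metric, as sketched below. The grid values are arbitrary illustrations, and in a real study the search would run on the training split only.

```python
# A minimal model-selection sketch: grid search over hypothetical
# hyperparameter values, scored by cross-validated AUC. Accuracy can
# mislead when classes are imbalanced, which is common clinically.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=500, n_features=20, random_state=42)

grid = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid={"n_estimators": [100, 300], "max_depth": [3, 5, None]},
    scoring="roc_auc",   # the selection metric determines which model "wins"
    cv=5,
)
grid.fit(X, y)
print(grid.best_params_, round(grid.best_score_, 3))
```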
Performance Evaluation
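The sketch below computes the metrics most often reported in this literature (AUC, sensitivity, specificity, and a confusion matrix) on a held-out test set with deliberate class imbalance. Note that AUC is threshold-free, whereas the confusion matrix depends on the chosen decision threshold.

```python
# A minimal sketch of held-out performance evaluation on a deliberately
# imbalanced synthetic data set (roughly 80% class 0).
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=20, weights=[0.8],
                           random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=42)

model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
pred = model.predict(X_te)                # uses the default 0.5 threshold
proba = model.predict_proba(X_te)[:, 1]   # threshold-free risk scores

print("AUC:", round(roc_auc_score(y_te, proba), 3))
print(confusion_matrix(y_te, pred))
# In the report, recall for class 1 is sensitivity and recall for class 0
# is specificity.
print(classification_report(y_te, pred))
```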
External Validation
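External validation means freezing the developed model and scoring a separately collected cohort with no further tuning. The sketch below exaggerates the typical performance drop by using two unrelated synthetic cohorts; real external cohorts usually share far more structure with the development data.

```python
# A minimal external-validation sketch: the second synthetic cohort is a
# stand-in for truly external data and is unrelated to the first, so the
# external AUC collapses toward chance (0.5), an exaggerated version of
# the drop often seen in practice.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

X_dev, y_dev = make_classification(n_samples=500, n_features=20, random_state=0)
X_ext, y_ext = make_classification(n_samples=300, n_features=20, random_state=1)

# Train on the development cohort only; no refitting on external data.
model = LogisticRegression(max_iter=1000).fit(X_dev, y_dev)

print("Development AUC:",
      round(roc_auc_score(y_dev, model.predict_proba(X_dev)[:, 1]), 3))
print("External AUC:",
      round(roc_auc_score(y_ext, model.predict_proba(X_ext)[:, 1]), 3))
```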
Natural Language Processing
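The study comparison below includes two text-featurization strategies: term frequency-inverse document frequency (TF-IDF) weighting and latent Dirichlet allocation (LDA) topics. The sketch that follows shows both on toy note snippets; it illustrates the general techniques, not the pipeline of any cited study.

```python
# A minimal sketch of TF-IDF and LDA-topic featurization on toy
# clinical-note snippets (hypothetical text).
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

notes = [
    "patient reports low mood poor sleep and passive suicidal ideation",
    "follow up visit mood improved sleep improved denies suicidal ideation",
    "admitted for depression with psychotic features started on medication",
]

# TF-IDF: weight each word by how specific it is to a given document.
tfidf = TfidfVectorizer(max_features=1000)
X_tfidf = tfidf.fit_transform(notes)

# LDA: learn a small number of word-co-occurrence "topics" and represent
# each note as a mixture of them.
counts = CountVectorizer().fit_transform(notes)
lda = LatentDirichletAllocation(n_components=3, random_state=42)
X_topics = lda.fit_transform(counts)
print(X_tfidf.shape, X_topics.shape)
```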
Model features | Considerations | Chekroud et al. (66) | Kessler et al. (67) | Rumshisky et al. (68) |
---|---|---|---|---|
Prediction | Does the prediction have clinical utility? How can the results be used in practice? | Remission of depressive symptoms in response to 12 weeks of citalopram treatment (final Quick Inventory of Depressive Symptomatology score ≤5) | Suicide death in 12 months following outpatient mental health visit | 30-day psychiatric readmission |
Data set | Single or multisite recruitment? Any data collection considerations (e.g., equipment, differing measures)? | Sequenced Treatment Alternatives to Relieve Depression Study, across six primary care sites and nine psychiatric care sites | Historical administrative data system of the Army Study to Assess Risk and Resilience in Servicemembers, 2004–2009 | Partners HealthCare electronic health records, including academic and community hospital and clinics in New England, 1994–2012 |
Subjects | Is this a representative patient population? Are there adequate data for the proposed analysis? Inclusion/exclusion criteria of subjects? | N=1,949; 18- to 75-year-old outpatients with nonpsychotic major depressive disorder and score ≥14 on the Hamilton Depression Rating Scale, 2001–2004 | N=975,057; male, nondeployed regular U.S. Army soldiers | N=4,687; patients ≥18 years old with inpatient discharge summaries and a diagnosis of major depressive disorder; no additional exclusion criteria |
Class balance | Is class imbalance present? How is this handled in the analysis? | No class imbalance reported; 51.3% of subjects were nonresponders | 569 deaths by suicide with >8,000 control visits per suicide; probability sample of control visits used | 470 patients were readmitted within 30 days; no class imbalance correction |
Input features | Do feature extraction methods appropriately capture the desired signal? Are included features easily obtained in routine practice? Are the features appropriate to the prediction? Any sources of data leakage? | 164 features, including sociodemographic features, DSM-IV-based diagnostic items, depressive severity checklists, eating disorder diagnoses, prior antidepressant history, the number and age at onset of previous major depressive episodes, and first 100 items of the Psychiatric Diagnostic Symptom Questionnaire | Nearly 1,000 features, including outpatient visit clinical factors, prior clinical factors, Army career, prior crime, and contextual factors | Baseline clinical features: age, gender, use of public insurance, and age-adjusted Charlson comorbidity index score; 75 topics extracted by latent Dirichlet allocation from full corpus; top 1,000 words extracted by term frequency-inverse document frequency for each patient |
Algorithm | Was the use of a particular algorithm (or algorithms) over others justified? Were other algorithms evaluated and reported? Is the algorithm appropriate for the data and/or problem? | Gradient-boosting machine (ensemble of decision trees); no other algorithms were reported | Naive Bayes, random forest, support vector regression, and elastic net penalized regression were tested | Support vector machine; no other algorithms were reported |
Data splitting and resampling | Were cross-validation or other resampling methods used? Were these performed appropriately? Any sources of data leakage? | Ten-fold cross-validation | Cross-validation (type not reported); separate models for suicides with and without prior psychiatric hospitalization | Data set randomly split into training (70%) and testing (30%) data sets; balanced by clinical features; separate models for baseline clinical features, baseline plus 1,000 words, baseline plus 75 latent Dirichlet allocation topics |
Imputation | How is missing data handled? | Complete cases (patients with missing data dropped) | Missing data corrected by nearest neighbor or rational imputation | NA |
Feature selection | How were features selected? How many features survived? | Elastic net regularization to select top 25 features prior to model building | Univariate association of predictor of suicide compared with other death; significant univariate predictors plus 20 sociodemographic variables and 27 Army-career variables passed to machine learning classifiers; penalized regression for selection in final models | None |
Model selection | What metric was used to determine optimal performance (accuracy, AUC, custom metric)? Could this metric bias model selection? | Maximization of AUC | Maximized cross-validated sensitivity in the 5% of visits with the highest predicted suicide risk | Maximization of AUC |
Hyperparameter optimization | Any hyperparameters? What metric was used for their evaluation? Was a separate data subset used for hyperparameter optimization? | Same criterion as for model selection | Same criterion as for model selection | Threefold cross-validation on the training data |
Performance | Any evidence of overfitting (are the results “too good to be true”)? Are the results and proposed model believable? How portable is the model to other contexts? Were any attempts made at model simplification? | AUC=0.70 | Elastic net classifier with 10–14 predictors optimized sensitivity; AUC=0.72 (prior hospitalization), 0.61 (no prior hospitalization), and 0.66 (combined) within 26 weeks after visit | Baseline clinical features (AUC=0.618), baseline clinical features plus 1,000 words (AUC=0.682), and baseline clinical features plus 75 latent Dirichlet allocation topics (AUC=0.784) |
External validation | Was the model externally validated? Did performance drop significantly in application to new data? If so, is the model still clinically useful? Were reasons for the change in performance explained? Were there any potential hidden confounders or time effects affecting model performance? | Yes; validated in the escitalopram treatment group (N=151) of the Combining Medications to Enhance Depression Outcomes trial (accuracy, 59.6%) | Yes; validated by using 2004–2007 data to predict 2008–2009 deaths by suicide; combined AUC (those with and without prior hospitalization) was 0.67–0.72 for windows of 5–26 weeks after the visit | No |
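As an illustration of the "regularized feature selection followed by a final classifier" pattern summarized in the table (e.g., elastic net selection of roughly 25 features before a gradient-boosting machine), the sketch below places selection inside a cross-validation pipeline so that it is refit within each fold, avoiding the data leakage the table asks readers to watch for. All parameter values are assumptions, not those of the cited studies.

```python
# An illustrative feature-selection-plus-classifier pipeline; the 164
# candidate features echo the table above, but the data are synthetic.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

X, y = make_classification(n_samples=800, n_features=164, n_informative=25,
                           random_state=42)

pipe = Pipeline([
    # Elastic net penalized logistic regression shrinks weak coefficients;
    # keep at most the 25 features with the largest absolute coefficients.
    ("select", SelectFromModel(
        LogisticRegression(penalty="elasticnet", solver="saga", l1_ratio=0.5,
                           C=0.1, max_iter=5000),
        threshold=-np.inf, max_features=25)),
    ("gbm", GradientBoostingClassifier(random_state=42)),
])

# Selection is refit inside each fold, so no test-fold information leaks
# into feature selection.
aucs = cross_val_score(pipe, X, y, cv=10, scoring="roc_auc")
print("10-fold CV AUC:", round(aucs.mean(), 3))
```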
Pitfalls
Revisiting the Clinical Scenario
Conclusions