The global dementia epidemic imposes a widespread emotional and financial burden on patients’ families, caregivers, and society (1). Currently, dementia of the Alzheimer’s type is the sixth leading cause of death in the United States, yet it is the only disease among the top 10 causes of death that cannot be prevented or cured (2). To date, clinical trials for Alzheimer’s disease therapeutics have been universally disappointing.
One significant factor in the slow progress is the lack of powerful methods for early detection of cognitive impairment. Alzheimer’s disease is characterized by the deposition of beta amyloid (Aβ) and hyperphosphorylated tau, resulting in plaques and neurofibrillary tangles, respectively. One hypothetical biomarker model describes the temporal order of disease stages as follows: Aβ plaque accumulation; neuronal injury; brain structure atrophy; memory loss; and general cognitive decline (3). Clinical trials may fail because these neuropathological changes precede the manifestation of cognitive deficits by several decades (4–8). Consequently, irreversible brain damage may have already occurred by the time treatment begins. Thus, identifying quantifiable biomarkers for early cognitive impairment is of profound public health importance. Early detection may allow earlier pharmacological interventions, when patients may be more responsive to treatment. In addition, early detection would allow patients to make informed decisions about their personal and property affairs if their underlying disease progresses to dementia. However, early detection of cognitive impairment remains challenging.
Multiple studies have used structural magnetic resonance (MR) imaging to predict Alzheimer’s disease (9–13). Several studies found that local hippocampal and total brain volume are significantly reduced in Alzheimer’s disease and mild cognitive impairment compared with healthy elderly individuals (14–23). The hippocampus is affected early, and generally severely, in the Alzheimer’s disease pathological process (24). Hippocampal volume is the most studied structural biomarker of Alzheimer’s disease and is used in the criteria for its diagnosis (25). In addition, prediction of conversion from mild cognitive impairment to Alzheimer’s disease has been correlated with the rate and amount of hippocampal, medial temporal lobe, and total brain atrophy (26–31).
Biomedical texture analysis aims to quantitatively describe pixel/voxel intensity distributions and the interrelations of pixel intensities across multiple spatial scales. Texture analysis has been used previously in the context of Alzheimer’s disease (14, 28, 32–35). Radiomics is an emerging approach to image analysis and refers to high-throughput extraction of quantitative features from radiological images in order to convert images into structured and mineable data (36–38). Radiomics pipelines often employ a variety of texture analysis methods to provide a holistic representation of texture-based information of the image or regions of interest in the image. Radiomics-based models have revealed predictive and prognostic associations between images and clinical outcomes (36–38). These models offer the potential of capturing often overlooked or hidden information on underlying disease dynamics. Our group has developed a radiomics texture analysis platform that has been used to characterize gene expression patterns of brain cancer (39, 40), to aid in the diagnosis of head and neck cancers (41, 42) and breast cancer (43).
The aim of the present study was to differentiate between three cognitive groups (cognitively normal individuals, individuals with mild cognitive impairment, and individuals with Alzheimer’s disease) and scores on the Clinical Dementia Rating (CDR) scale using MRI-based texture and volume measurements from the hippocampus. We hypothesize that changes in neuropsychological function related to cognitive impairment have a radiological counterpart, detectable via structural MRI. We also hypothesize that texture analysis will be sensitive enough to identify early MRI structural hippocampal changes related to the early Alzheimer’s disease pathophysiologic process, which will be correlated with cognitive groups and CDR scores. Specifically, our objectives are twofold: to use MR radiomics features to differentiate between cognitive groups (cognitively normal, mild cognitive impairment, Alzheimer’s disease) and to predict neuropsychological performance, quantified via CDR scores. The contributions of this study are: identification of MR-derived features that could be used in detecting early cognitive impairment; assessing the use of a granular measure of cognition assessment (such as CDR scores) compared with generic grouping for predictive modeling; and comparing the utilities of volume and texture features in this task.
Methods
ADNI Data Set
Data used in the preparation of this article were obtained from the Alzheimer’s Disease Neuroimaging Initiative (ADNI) database (adni.loni.usc.edu). The ADNI was launched in 2003 as a public–private partnership with the primary goal of testing whether serial MRI, positron emission tomography (PET), other biological markers, and clinical and neuropsychological assessment can be combined to measure the progression of mild cognitive impairment and early Alzheimer’s disease. We selected cases from the shared image collection ADNI-1, a 5-year study with a cohort of 200 cognitively normal individuals, 400 individuals with mild cognitive impairment, and 200 individuals with Alzheimer’s disease (44). Participants underwent 3-T imaging at the following time points: baseline, 6, 12, 18 (mild cognitive impairment only), and 24 months. We categorized participants into three cognitive groups as assigned by ADNI-1: cognitively normal, mild cognitive impairment, and Alzheimer’s disease. Group-specific inclusion criteria are available on ADNI’s website under the General Procedures Manual or under Study Design, Background and Rationale (45, 46). Briefly, cognitively normal participants have Mini-Mental State Exam (MMSE) scores between 24 and 30 (inclusive) and a CDR of 0, and are non-depressed, without mild cognitive impairment, and non-demented (45). Participants with mild cognitive impairment have MMSE scores between 24 and 30 (inclusive), a memory complaint, objective memory loss measured by education-adjusted scores on Wechsler Memory Scale Logical Memory II, a CDR of 0.5, absence of significant levels of impairment in other cognitive domains, essentially preserved activities of daily living, and an absence of dementia (45). Alzheimer’s disease participants have MMSE scores between 20 and 26 (inclusive), a CDR of 0.5–2, abnormal memory function documented by scoring below the education-adjusted cutoff on the Logical Memory II subscale (Delayed Paragraph Recall) from the Wechsler Memory Scale, and meet the NINCDS/ADRDA criteria for probable Alzheimer’s disease (45).
Cognitive Measures
The CDR score is obtained through semi-structured interviews with patients and informants to evaluate six domains: memory, orientation, judgment and problem solving, community affairs, home and hobbies, and personal care (47). Patients are then classified on the following ordinal scale: 0 (no impairment), 0.5 (questionable impairment), 1 (mild dementia), 2 (moderate dementia), or 3 (severe dementia). Typically, a score of 0.5 is given to individuals with a diagnosis of mild cognitive impairment (48, 49).
Study Participants
The initial participant selection criteria were as follows: available CDR score associated with the time of image acquisition and available 3-T T1 scanning protocol to ensure maximum resolution for the image analysis.
We found 204 unique participants in ADNI-1 with available 3-T T1 MR images. Image data were available for all participants at different time points ranging from baseline to month 24. Because we were interested in predicting static cognition levels (CDR scores, cognitive groups), the time point was irrelevant. We selected one time point per participant to ensure unique participants across groups. To maximize group sizes, we first selected participants with a CDR score of 2, who were in the minority. These participants were excluded from all the other groups. Next, participants with CDR scores of 1 and 0.5 were selected. All the remaining participants not assigned to any group were placed in the CDR 0 group. Individuals with a CDR score of 3 were excluded due to our small sample size. We then identified the 3-T MR scan time points associated with the assigned group labels, and the image data acquired at the selected time points were used for analysis. Thirty-one participants in total were excluded, either because of a mismatch between imaging and CDR score acquisition dates (N=21) or because of image unavailability (N=10). This led to a final sample size of 173 individuals: 67 classified as non-impaired (CDR 0), 48 with questionable impairment (CDR 0.5), 39 with mild dementia (CDR 1), and 19 with moderate dementia (CDR 2).
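The priority-based group assignment described above can be sketched as follows. This is a minimal illustration, not the authors' script: the function name `assign_cdr_group`, the `(time_point, cdr_score)` visit format, and the choice of the first matching visit are all assumptions made for the example.

```python
def assign_cdr_group(visits):
    """Assign one CDR group and one time point to a participant.

    visits: list of (time_point, cdr_score) tuples for a single participant.
    Rarer, more severe scores are claimed first (2, then 1, then 0.5, then 0),
    so each participant contributes to exactly one group.
    Visits scored CDR 3 are ignored; a participant with no usable visit
    yields None (assumed to correspond to exclusion).
    """
    for target in (2, 1, 0.5, 0):
        for time_point, score in visits:
            if score == target:
                # Assumption: the first visit with the target score is used.
                return target, time_point
    return None


# Hypothetical participant who progressed from CDR 0.5 to CDR 1.
print(assign_cdr_group([("bl", 0.5), ("m12", 1), ("m24", 1)]))
```

Taking the most severe score first mirrors the stated goal of maximizing the size of the smallest groups.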
Demographic and clinical characteristics of the included study participants are presented in Table 1 and Table 2. It is noteworthy that to receive a diagnosis of mild cognitive impairment or Alzheimer’s disease, in addition to clinician judgment, intra-individual decline must be established with serial cognitive measurements (multiple CDR scores over time) or by a history of change from previously attained levels (50). Thus, the numbers of participants in the cognitive groups and in the CDR score groups differ.
Image Preprocessing
MR images can have large intensity variations when acquired from different scanners or under different acquisition parameters. ADNI performs several preprocessing steps on magnetization-prepared rapid gradient-echo (MP-RAGE) sequence images. These include gradwarp geometry distortion correction and B1 and N3 intensity non-uniformity corrections (51) to ensure comparability of images across devices and protocols. To ensure the comparability of images across patients, we normalized all images to have a common mean and variance in CSF (52). Texture and volume analyses were performed using the normalized images.
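One way to give all images a common mean and variance in CSF is a linear rescaling based on the CSF voxels of each image. The sketch below assumes that a CSF mask is already available and uses target values of 0 and 1; the function name, the flat-list representation, and the targets are illustrative assumptions, and the exact procedure of the cited method (52) may differ.

```python
from statistics import mean, pstdev


def normalize_to_csf(voxels, csf_mask, target_mean=0.0, target_std=1.0):
    """Linearly rescale intensities so CSF voxels have the target mean and std.

    voxels:   flat list of intensities for one image
    csf_mask: flat list of booleans, True where the voxel is CSF (assumed given)
    """
    csf = [v for v, is_csf in zip(voxels, csf_mask) if is_csf]
    if len(csf) < 2:
        raise ValueError("need at least two CSF voxels")
    mu, sigma = mean(csf), pstdev(csf)
    if sigma == 0:
        raise ValueError("CSF intensities are constant")
    # The same shift and scale are applied to every voxel in the image.
    return [(v - mu) / sigma * target_std + target_mean for v in voxels]


normalized = normalize_to_csf([10.0, 20.0, 30.0, 100.0], [True, True, True, False])
```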
Texture Analysis
The imaging data were imported into the MIPAV (Medical Image Processing, Analysis, and Visualization) application version 7.2.0 (53). To avoid resampling the images, we limited the segmentation of the hippocampus to the coronal view since it provided a common pixel spacing of (1.02, 1.02) mm across all patients. Experts identified three slices with the largest possible view of the bilateral hippocampi and manually placed rectangular regions of interest (ROIs) (16×16 pixels) on the area of the hippocampi, while avoiding inclusion of areas outside the hippocampus (Figure 1A) as much as possible. This segmentation process resulted in six ROIs (3 slices×2 hippocampi) per patient. This segmentation is considered greater than two dimensional and less than three dimensional (often referred to as 2.5D), and it improves the reliability of the sampling process. The ROIs were cropped out of the images and set aside for texture analysis. The individuals who manually placed ROIs on the hippocampi were blinded to the diagnosis; another blinded individual performed quality control checks to ensure ROIs were centrally placed.
Next, we acquired the mean, standard deviation, and range of voxel intensities across the ROIs (subsequently referred to as raw intensity features). We then mapped the dynamic ranges of intensities inside the ROIs to 0–255 as a preprocessing step for characterization of texture. Several statistical and spectral texture analysis methods are included in our radiomics pipeline. Textural features describing patterns or spatial distribution of voxel intensities were calculated from second-order statistical gray level co-occurrence matrices (GLCM) (54), Laplacian of Gaussian Histogram (LoGHist) (55), rotationally invariant Discrete Orthonormal Stockwell Transform (DOST) (56), Gabor filter banks (GFB) (57), and local binary patterns (LBP) (58). These methods were implemented in the Python programming language using custom-written code and open-source libraries (59, 60). In total, we extracted 119 features per ROI: three raw intensity, 26 GLCM, 10 DOST, 36 LoGHist, 12 LBP, and 32 GFB features. Extensive details on these features can be found in Ranjbar et al., Patel et al., and Ramkumar et al. (42, 43, 61). To account for sampling variability, we averaged the features across slices while retaining laterality information, leading to a total of 238 texture features (119 per hippocampus) per patient.
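The first steps of this pipeline (raw intensity features, rescaling to 0–255, and averaging across slices) can be sketched in a few lines. The example is an illustration under stated assumptions, not the study's code: ROIs are represented as nested lists, all function names are hypothetical, and the single GLCM feature shown (contrast for horizontal neighbors at distance 1) stands in for the 26 GLCM features, whose exact definitions are given in the cited references.

```python
from statistics import mean, pstdev


def raw_intensity_features(roi):
    """Mean, standard deviation, and range of the intensities in one ROI."""
    flat = [v for row in roi for v in row]
    return {"mean": mean(flat), "std": pstdev(flat), "range": max(flat) - min(flat)}


def rescale_to_uint8(roi):
    """Map the dynamic range of the ROI linearly onto the integers 0..255."""
    flat = [v for row in roi for v in row]
    lo, hi = min(flat), max(flat)
    if hi == lo:
        return [[0 for _ in row] for row in roi]
    return [[round((v - lo) * 255 / (hi - lo)) for v in row] for row in roi]


def glcm_contrast(roi):
    """GLCM contrast for horizontal neighbors at distance 1.

    Contrast is sum_ij P(i, j) * (i - j)^2, which equals the mean squared
    difference over all neighboring pixel pairs, so the co-occurrence
    matrix itself does not need to be materialized.
    """
    diffs = [(a - b) ** 2 for row in roi for a, b in zip(row, row[1:])]
    return sum(diffs) / len(diffs)


def average_over_slices(per_slice_features):
    """Average each feature across the slices of one hippocampus."""
    keys = per_slice_features[0].keys()
    return {k: mean(f[k] for f in per_slice_features) for k in keys}


# Toy 2x2 "ROI"; the study used 16x16-pixel ROIs on three slices per side.
roi = [[0, 10], [20, 30]]
scaled = rescale_to_uint8(roi)
features = {**raw_intensity_features(roi), "glcm_contrast": glcm_contrast(scaled)}
```

Averaging is done separately for the left and right hippocampus so that laterality is retained, which is why the per-patient feature count doubles.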
Volumetric Features
We used the volBrain system for computation of hippocampal volumetric measurements. Given a stack of MR images, volBrain (62, 63) automatically segments parenchyma, brain tissues, macrostructures, and subcortical structures (shown in Figure 1B) and reports volumetric measurements of the structures. For this study, we used two volumetric features for the hippocampus: relative volume (%) and asymmetry index (%). Relative volume represents the sum of the hippocampal volumes in relation to the volume of the intracranial cavity. The asymmetry index is the difference between right and left volumes divided by their mean.
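The two volumetric features follow directly from the definitions above. The sketch below restates them as code; the function names and the example volumes are hypothetical, and volBrain reports these quantities itself, so the code only makes the arithmetic explicit.

```python
def relative_volume_pct(left, right, intracranial):
    """Sum of the hippocampal volumes as a percentage of intracranial volume."""
    return 100.0 * (left + right) / intracranial


def asymmetry_index_pct(left, right):
    """Right-minus-left volume difference divided by the mean volume, in percent."""
    mean_volume = (left + right) / 2.0
    return 100.0 * (right - left) / mean_volume


# Hypothetical volumes in cubic centimeters.
rel = relative_volume_pct(left=3.0, right=3.3, intracranial=1500.0)
asym = asymmetry_index_pct(left=3.0, right=3.3)
```

A positive asymmetry index indicates a larger right hippocampus under this sign convention.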
Statistical Analysis and Machine Learning
Age and sex differences between groups were tested using Student’s t-test and Pearson’s chi-square test, respectively. Statistical significance was defined as a p value <0.05. We performed univariate analysis to compare the difference in texture and volume feature values for both CDR groups and cognitive groups. The p values were adjusted for multiple comparisons using the Benjamini and Hochberg false discovery rate method (64).
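For readers unfamiliar with the Benjamini and Hochberg adjustment, a minimal sketch of the step-up procedure is given below. The function name and the example p values are illustrative; in practice an existing implementation (for example, `p.adjust` in R with `method = "BH"`) would normally be used.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p values in the original order.

    Each p value is multiplied by m / rank (rank 1 = smallest), and a running
    minimum taken from the largest rank downward keeps the adjusted values
    monotone in the raw p values.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical raw p values from four univariate feature comparisons.
adjusted = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
```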
We applied principal component analysis (PCA) to reduce dimensionality of texture features (65). To maintain interpretability of the principal components, PCA was applied to features stemming from a common texture analysis method. Several comparative datasets were generated with PCA to find the optimal level of variance. The final set of PCs represented 90% of the variance in the original features. Texture PCs combined with volume features were used in supervised classification of two label variables: cognitive groups (cognitively normal, mild cognitive impairment, Alzheimer’s disease) and CDR scores.
Machine learning was conducted using the open-source Python-based package scikit-learn (66) and custom-written scripts. We used a leave-one-out cross-validation (LOOCV) scheme to predict the labels (65) and to select features for training. LOOCV iteratively uses all samples except one for model training. In each round, the left-out sample serves as the test case to assess the generalizability of the trained model on an unseen case. In each round, a trained model was generated using features selected by Sequential Forward Feature Selection (SFFS) (65) and an internal cross-validation (CV). Starting from an empty set, SFFS sequentially added features as long as their addition resulted in a CV accuracy improvement of at least 5%. We used diagonal quadratic discriminant analysis (DQDA) as the classification method (65). DQDA is a naïve Bayes classifier that allows for diagonal class covariance matrices and has been shown to be successful in classification tasks of high-dimensional data with small sample sizes (67). Several studies have shown that DQDA has comparable or better performance than support vector machines in classification of high-dimensional data (68, 69).
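The classification scheme can be illustrated with a small, dependency-free sketch of DQDA, LOOCV, and forward feature selection. Several details are assumptions rather than the authors' implementation: the class and function names are hypothetical, the 5% threshold is read as an absolute gain in accuracy, LOOCV is reused as the internal CV, and a small variance floor is added for numerical stability. The outer LOOCV loop that wraps feature selection in the study is omitted for brevity.

```python
import math
from statistics import mean, pvariance


class DiagonalQDA:
    """Gaussian classifier with a separate diagonal covariance per class."""

    def fit(self, X, y):
        self.params = {}
        for label in sorted(set(y)):
            rows = [x for x, t in zip(X, y) if t == label]
            cols = list(zip(*rows))
            means = [mean(c) for c in cols]
            variances = [pvariance(c) + 1e-9 for c in cols]  # floor avoids /0
            self.params[label] = (means, variances, math.log(len(rows) / len(y)))
        return self

    def _log_posterior(self, x, label):
        means, variances, log_prior = self.params[label]
        return log_prior - 0.5 * sum(
            math.log(2 * math.pi * v) + (xi - m) ** 2 / v
            for xi, m, v in zip(x, means, variances)
        )

    def predict(self, x):
        return max(self.params, key=lambda label: self._log_posterior(x, label))


def loocv_accuracy(X, y):
    """Train on all samples but one, test on the left-out sample, repeat."""
    hits = 0
    for i in range(len(X)):
        model = DiagonalQDA().fit(X[:i] + X[i + 1:], y[:i] + y[i + 1:])
        hits += model.predict(X[i]) == y[i]
    return hits / len(X)


def forward_select(X, y, min_gain=0.05):
    """Add the best feature while it improves CV accuracy by at least min_gain."""
    selected, best = [], 0.0
    remaining = list(range(len(X[0])))
    while remaining:
        scored = [
            (loocv_accuracy([[row[j] for j in selected + [f]] for row in X], y), f)
            for f in remaining
        ]
        accuracy, feature = max(scored)
        if accuracy - best < min_gain:
            break
        selected.append(feature)
        remaining.remove(feature)
        best = accuracy
    return selected, best


# Toy data: two well-separated classes described by two features.
X = [[0.0, 1.0], [0.2, 0.9], [0.1, 1.1], [3.0, 4.0], [3.2, 3.9], [2.9, 4.1]]
y = [0, 0, 0, 1, 1, 1]
selected, accuracy = forward_select(X, y)
```

Because the covariance is diagonal, only one mean and one variance per feature and class are estimated, which is what makes the classifier usable with few samples.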
Our data, by its nature, contained class imbalance, in which dominance of the majority class can hinder the classifier’s ability to learn the inherent properties of each class. To ensure generalizability of the result in experiments with substantial class imbalance, we used an ensemble down-sampling approach coupled with the above-mentioned learning scheme. In each CV round the training samples were divided into majority and minority groups. The majority group was then randomly divided into subsets roughly the same size as the minority group. Each of the subsets was merged with the minority group and served as the training set. The average probability across models for the test sample was used as the probability for that sample. This iterative process allowed every sample in the data set to serve as the left-out sample once.
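The ensemble down-sampling step can be sketched as below. The helper names, the fixed random seed, and the `fit_predict_proba` callback (a stand-in for training a classifier on one balanced set and scoring the left-out sample) are assumptions introduced for illustration.

```python
import random


def balanced_training_sets(majority, minority, seed=0):
    """Split the majority class into minority-sized subsets, each merged with
    the full minority class, so every majority sample is used exactly once."""
    if not minority:
        raise ValueError("minority class is empty")
    rng = random.Random(seed)
    shuffled = majority[:]
    rng.shuffle(shuffled)
    n_subsets = max(1, round(len(shuffled) / len(minority)))
    chunks = [shuffled[i::n_subsets] for i in range(n_subsets)]
    return [chunk + minority for chunk in chunks]


def ensemble_probability(training_sets, fit_predict_proba, test_sample):
    """Average the predicted probability for the test sample across models."""
    probabilities = [fit_predict_proba(train, test_sample) for train in training_sets]
    return sum(probabilities) / len(probabilities)


# Group sizes matching the CDR 0 (67) versus CDR 2 (19) comparison.
majority = [("maj", i) for i in range(67)]
minority = [("min", i) for i in range(19)]
training_sets = balanced_training_sets(majority, minority)
```

With these group sizes the majority class is split into four subsets of 16 or 17 samples, so each model is trained on roughly balanced data.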
The area under the receiver operating characteristic (ROC) curve, sensitivity, and specificity were used to assess classification performance using the open-source software packages R (2.7) (70) and SciPy (0.15.1, Python 2.7) (71). The method of DeLong et al. and the pROC package (72) were used to estimate the ROC curve significance, p values, and 95% confidence intervals (73). A p value below the significance level (p<0.05) indicates that the observed area under the ROC curve differs significantly from the null hypothesis value (area=0.5) and is evidence that the model is able to distinguish between the two groups.
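The three performance measures can be computed directly from predicted scores and labels. The sketch below uses the rank-based (Mann–Whitney) form of the AUC, which is the empirical estimate underlying the DeLong method; the confidence intervals and significance tests are not reproduced here. The function names and example values are illustrative.

```python
def roc_auc(scores_positive, scores_negative):
    """Probability that a random positive case scores higher than a random
    negative case, counting ties as one half (empirical AUC)."""
    wins = 0.0
    for p in scores_positive:
        for n in scores_negative:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(scores_positive) * len(scores_negative))


def sensitivity_specificity(y_true, y_pred):
    """Sensitivity = TP / (TP + FN); specificity = TN / (TN + FP)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    return tp / (tp + fn), tn / (tn + fp)


# Hypothetical model outputs for three impaired and three unimpaired cases.
auc = roc_auc([0.9, 0.8, 0.4], [0.7, 0.3, 0.2])
sens, spec = sensitivity_specificity([1, 1, 1, 0, 0, 0], [1, 1, 0, 0, 0, 1])
```

An AUC of 0.5 corresponds to chance-level ranking, which is the null value against which the reported models were tested.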
Results
The mild cognitive impairment group had a higher proportion of males than the cognitively normal and Alzheimer’s disease groups (Pearson’s χ²=5.2120, df=2, p=0.02). No significant difference was observed in sex ratio of the other groups. Including sex in models with texture did not impact results. As expected, the age of participants in the CDR 2 group was significantly higher than other CDR levels. Including age in models with volume did not impact results.
Figure 2 compares volume features across groups and CDR scores. Figure 3 shows the univariate comparison of features across feature groups. Features extracted from left and right hippocampi showed similar significance levels. Increasing the level of variance included in the principal components of texture features did not improve the results.
Prediction of Cognitive Groups
The areas under the ROC curves (AUCs) for the classification of cognitive groups are shown in Figure 4A. Classification reached AUC levels of 0.89 (CI=0.82–0.94) for cognitively normal compared with Alzheimer’s disease; 0.86 (CI=0.79–0.91) for cognitively normal compared with mild cognitive impairment; and 0.70 (CI=0.61–0.77) for mild cognitive impairment compared with Alzheimer’s disease. The performance measures, selected features, and ROC curve analysis for the cognitive groups are summarized in Table 3. All three models were significant at a p value ≤0.05. Including sex in the models did not affect the results.
Prediction of CDR Scores
The AUCs for the classification of CDR scores are shown in Figure 4B. The AUC levels of our models were: 0.98 (CI=0.93–0.99) for CDRs 0–2; 0.95 (CI=0.9–0.98) for CDRs 0–1; 0.84 (CI=0.76–0.89) for CDRs 0–0.5; 0.73 (CI=0.61–0.83) for CDRs 0.5–2; 0.71 (CI=0.61–0.8) for CDRs 0.5–1; and 0.56 (CI=0.42–0.69) for CDRs 1–2. Overall, models were more successful in classification when the target groups were farther apart on the CDR spectrum. Details of the models’ performance and significance, selected features, and ROC curve statistics for this analysis are presented in Table 4. All models were significant at a p value ≤0.05 except for the classification model CDR 1–2. Relative volume of hippocampi (percent volume) was a predictive feature in two of the six models. We conducted further analysis to assess whether age accounted for the significance of percent volume. When age was included in the model, percent volume remained highly statistically significant (p=0.003), while age was not significant (p=0.35). The AUC only slightly increased from 0.98 (model with percent volume alone) to 0.9910 (model with percent volume and age). A model containing age by itself resulted in an AUC of only 0.785, and the addition of percent volume significantly improved the model fit (p<0.0001). Thus, we conclude that percent volume is meaningful in differentiating between CDR 0 and 2, independent of age.
Discussion
The well-established MR volume features and radiomics texture features had comparable and complementary utility in classifying cognitive groups and CDR categories. There is ample literature on the utility of imaging features extracted from MRI to assist in clinical diagnosis of probable Alzheimer’s disease. Several investigations have focused on using volume, shape, and other structural MR features in identifying cognitively normal, mild cognitive impairment, and Alzheimer’s disease groups (10, 13, 18, 26, 28, 30, 74–78). Texture features have also been used in identifying Alzheimer’s disease (14, 28, 32–35, 79). There is no consensus in the literature about exactly what texture captures in the context of Alzheimer’s disease. Sørensen et al. (14) speculated that texture patterns may provide information on hippocampal function, given their significant correlation with [18F]fluorodeoxyglucose-positron emission tomography uptake. The same group also found that hippocampal texture, followed by hippocampal volume, were the most significant features in their algorithm to discriminate cognitive groups (35).
Our results are consistent with those of Sørensen et al. (14). For example, when they used only volume to discriminate between ADNI cognitively normal individuals and those with Alzheimer’s disease, they achieved an AUC of 0.91. In our case, we achieved an AUC of 0.89 on this task. Sørensen et al. (14) also used texture features to differentiate cognitively normal individuals from those with mild cognitive impairment with an AUC of 0.76, compared with our AUC of 0.86 for the same task.
One technical difference between our methods and those of Sørensen et al. (14) is that Sørensen et al. resampled MR images in order to have consistency in image voxel size across their cohort. Resampling is often a necessary preprocessing step when images are obtained using different imaging protocols or devices. However, resampling involves interpolation, which can affect the spatial frequency content of the image. In order to establish a reliable baseline for the utility of texture features, we focused on images with a common voxel size in this study. We also used 3-T imaging for higher spatial resolution and contrast-to-noise ratios. Another difference between our work and that conducted by Sørensen et al. is that we used texture features to predict CDR scores. We were able to distinguish CDR 0 (no impairment) from 1 (mild dementia) with an AUC of 0.95. This model used a variety of texture features but not hippocampal volume. On the other hand, volume features alone were able to distinguish CDR 0 from 0.5 (questionable impairment) with an AUC of 0.84. They also were able to distinguish CDR 0 from 2 (moderate dementia) with an AUC of 0.98. Overall, our CDR models performed well at distinguishing cognitively normal people from those with early-stage or questionable cognitive impairment.
Distinguishing between CDR 1 and 2 was the most difficult task in our study, and AUC classification performance was poor, not achieving statistical significance (p=0.46). The transition from mild to moderate impairment appears to be a subtle shift without pronounced discernible changes in texture or hippocampal volume. While texture features suggest that CDR scores and neuropathology may have a relationship early in cognitive impairment (that is, early deposition of amyloid or tau), the lack of discrimination accuracy between CDR 1 and 2 suggests that the pathological depositions may not help in improving classification accuracy. Aisen et al. (80) posited that the terminology behind mild and moderate Alzheimer’s disease is inaccurate, because the individual has had the disease present for many years. The clinical staging nomenclature implies a clear distinction between various stages, but in reality, the process progresses in a more continuous manner (80).
As a result of technical limitations of our pipeline, we did not perform three-dimensional segmentation of the hippocampi. Instead, we used a 2.5D segmentation approach in which the hippocampi were segmented on several two-dimensional slices to increase texture sampling. In this approach, we manually placed two-dimensional ROIs (16×16 pixels) on the three slices with the largest cross-sectional view of the hippocampus. We acknowledge that the extracted ROIs may have included adjacent anatomical structures such as the entorhinal cortex, resulting in mixed captured signals. In future studies, we plan to replicate the study using an automatic segmentation process.
Small sample size is another limitation of this study (N=173). When divided between CDR groups, each dataset consisted of few samples with a high-dimensional feature space, two known contributors to model overfitting. Due to the lack of sufficient sample size, we did not split the dataset into train and test sets. In order to provide a realistic estimation of model performance and avoid overfitting, we adopted a nested CV scheme for model training and validation and a rather conservative threshold for feature selection (minimum of 5% CV accuracy improvement). Given that our results are comparable to previous studies, we feel confident that the risk of overfitting was mitigated and that the results presented here are generalizable to external data. In the future, we aim to validate this result on larger external datasets. Lastly, the reader should note that we cannot claim the clinical utility of textural biomarkers introduced here since the models were not tested prospectively.
Conclusions
We used existing resources (ADNI-1 data) to introduce a new application of brain MR radiomics using texture analysis and volumetric features in the field of aging, neuropsychiatry, and dementia. Our study findings support the use of brain MR radiomics features for identifying early cognitive impairment, as many features are sensitive to early Alzheimer’s disease pathology. Future studies need to replicate these findings and should examine the clinical utility of MR texture features as Alzheimer’s disease biomarkers. Beyond volume and texture analysis of T1 images of the hippocampus, future applications should expand to incorporate additional data sources. These could include additional MRI contrasts (for example, diffusion tensor imaging), fMRI, and PET. Additional brain structures known to be involved in Alzheimer’s disease progression could also be investigated.