Alzheimer’s disease is a growing public health problem,1,2 and its prevalence is increasing rapidly with the aging of the baby boomer generation.3 Early diagnosis and treatment can reduce the burden this increase poses to the health care system and society.4 However, Alzheimer’s disease is often underrecognized in community clinical practice settings5–7 because the diagnosis can be difficult8 and may require specialized training. Without fast and reliable screening instruments, it may be difficult for primary care physicians to identify patients who should be referred for a more comprehensive dementia workup.
The clock drawing test is widely used as a screening test for dementia, and neurologists have used clock-drawing and time-telling tests extensively.9 Several factors contribute to the test’s popularity, including its ease of administration and scoring and its evaluation of multiple cognitive domains,10,11 such as executive functioning.11–13 Compared to the Mini-Mental State Examination (MMSE), the clock drawing test is thought to have less educational bias14 and to be better able to detect cognitive decline due to Alzheimer’s disease and other dementias.15 The clock drawing test has also been advocated over the MMSE as an office screening test for dementia in community clinics4 and in acute hospital settings.16 Furthermore, the clock drawing test is suitable for non-English speaking populations.14

There are two general clock drawing test scoring approaches, qualitative and quantitative, with varied scoring systems that emphasize different facets of the clock drawing process. Early quantitative scoring systems were validated to distinguish subjects with moderate or severe Alzheimer’s disease from cognitively healthy comparison subjects and were later adapted for use in mild cognitive impairment and mild Alzheimer’s disease.17–23 Previous studies of objective clock drawing test rating systems identified Alzheimer’s disease with overall diagnostic accuracy ranging from 59% to 85%.24 However, such diagnostic accuracy has not been found in mild cognitive impairment cohorts, in which sensitivities have ranged from 17% to 92%.24 In a retrospective study comparing several clock drawing test scoring systems,24 the scoring system by Mendez et al.19 was found to be the most accurate in distinguishing demented from nondemented individuals, followed closely by the Consortium to Establish a Registry for Alzheimer’s Disease (CERAD) system.25

Though diagnostically useful, quantitative clock drawing test rating schemes18–20,22,23 are rarely used in clinical settings, as they take more time and require trained clinical personnel to score. Moreover, when using the clock drawing test to identify dementia, qualitative ratings by naive judges may be equal to or more accurate than many quantitative scoring systems.24 Though it is widely assumed that dementia specialists are more reliable and valid raters than naive judges, no known studies have evaluated the psychometric properties of clock drawing test ratings made by trained clinicians who use the clock drawing test as part of their regular clinical practice. The present study was performed to determine the interrater reliability, sensitivity, and specificity of qualitative clock drawing test ratings made by clinicians specializing in the assessment of patients with dementia. Two qualitative rating approaches were utilized: a dichotomous rating of impaired versus nonimpaired and a 0–10 ordinal rating scale. A multidisciplinary consensus conference was the gold standard for dementia diagnosis in the current study.
METHODS
Participants
Archival data were extracted from the Boston University Alzheimer’s Disease Core Center registry, an institutional review board-approved, National Institute on Aging-funded Alzheimer’s disease registry26–28 that longitudinally follows older adults with and without memory problems. Participants performed the clock drawing test as part of an annual neurological and neuropsychological examination. All participants were at least 55 years old, were English-speaking community dwellers with no history of major psychiatric or neurological illness or head injury involving loss of consciousness, and had adequate auditory and visual acuity to complete the examination. After data query, there were 506 eligible participants in the Boston University Alzheimer’s Disease Core Center patient/comparison registry who had been diagnosed by a multidisciplinary consensus team (including at least two board-certified neurologists and two neuropsychologists) based on a clinical interview with the participant and an informant, medical history review, and neurological and neuropsychological examination results.

Of the 506 participants, 168 were diagnosed as cognitively normal comparison subjects; 39 as cognitively normal comparison subjects with cognitive complaints reported by self or study partner (worried comparison subjects); 88 as “probable” mild cognitive impairment patients;29,30 106 as “possible” mild cognitive impairment patients (no complaint of cognitive decline, but with objective impairment on one or more primary neuropsychological variables); 55 as probable Alzheimer’s disease patients; and 50 as possible Alzheimer’s disease patients.31 Participants diagnosed as cognitively normal comparison subjects (with or without complaints) were included if they had a Clinical Dementia Rating of 0,32 an MMSE score ≥26,33 and no impairment on any primary neuropsychological test variable (i.e., no scores fell more than 1.5 standard deviations below normative means). Exclusion criteria included dominant hand hemiparesis or other central or peripheral motor impairments, or visual acuity impairment that would preclude clock drawing test completion. The current study utilized the participants’ most recent registry visit data.
Procedures
Trained psychometricians administered the clock drawing test in a standard way to include command (i.e., “I want you to draw the face of a clock, putting in all the numbers where they should go, and set the hands at 10 after 11”) and copy conditions.27 Participants were allowed to make corrections and could attempt to draw the clock a maximum of two times. Only the command condition data were used for the current study.
For the purpose of this study, 25 command clocks were randomly selected from each of the six diagnostic strata described above, yielding 150 clocks from 150 different subjects, 50 from each of the three primary diagnostic groups (i.e., comparison, mild cognitive impairment, Alzheimer’s disease). The clocks were rated independently by four board-certified neurologists and a neurology nurse practitioner, all of whom specialize in dementia. Raters were blinded to participant diagnostic and demographic information. Ratings were made on a binary scale (normal/abnormal) and an ordinal scale (0–10, where 0 signified no impairment and 10 signified complete impairment). The ordinal ratings were examined for potential outliers, and when widely discordant interexaminer ratings (score differences >5; a threshold selected arbitrarily) were found, these clocks were rerated by the original raters. The rerated clock drawing test scores were used in the analyses.
Statistical Analyses
All analyses were performed in SPSS 16 (Chicago) or SAS 9.1 (Cary, NC) software. A one-way analysis of variance (ANOVA) assessed for between-group differences in age, education, MMSE score, and Geriatric Depression Scale score. Significant findings were followed up with Tukey-Kramer post hoc tests to determine specific group differences. Gender and racial differences among the groups were analyzed using the chi-square test of independence.
Estimates of rater agreement were calculated for both the binary and ordinal rating scales. For the binary ratings, kappa statistics were computed.34–36 Because there were five raters, the multiple agreement function in SAS was used for kappa calculations for the binary ratings. For the ordinal ratings, Kendall’s intraclass correlation coefficient (ICC) of concordance was computed using the intraclass correlation functions in SAS, and Spearman rank correlations were generated between individual raters. Agreement for ordinal ratings was also evaluated by calculating the absolute score differences between raters.
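As a concrete illustration (not the study’s actual SAS/SPSS code), the agreement statistics above can be sketched in Python. The ratings here are invented for demonstration, and Fleiss’ kappa is used as the multi-rater generalization of kappa for the binary ratings:

```python
# Sketch of multi-rater agreement statistics; ratings are hypothetical.
import numpy as np
from scipy.stats import spearmanr

def fleiss_kappa(ratings, n_categories=2):
    """Fleiss' kappa: chance-corrected agreement for multiple raters.
    ratings: (n_subjects, n_raters) array of integer category labels."""
    ratings = np.asarray(ratings)
    n_subjects, n_raters = ratings.shape
    # counts[i, j] = number of raters assigning subject i to category j
    counts = np.stack([(ratings == j).sum(axis=1)
                       for j in range(n_categories)], axis=1)
    p_i = ((counts ** 2).sum(axis=1) - n_raters) / (n_raters * (n_raters - 1))
    p_bar = p_i.mean()                               # mean observed agreement
    p_j = counts.sum(axis=0) / (n_subjects * n_raters)
    p_e = (p_j ** 2).sum()                           # chance agreement
    return (p_bar - p_e) / (1 - p_e)

# Binary ratings (0 = intact, 1 = impaired): rows = clocks, columns = 5 raters
binary = np.array([[0, 0, 0, 0, 0],
                   [1, 1, 1, 1, 0],
                   [1, 1, 1, 1, 1],
                   [0, 0, 1, 0, 0]])
kappa = fleiss_kappa(binary)

# Ordinal (0-10) ratings: Spearman rank correlation between two raters,
# plus the absolute score differences of the kind reported in Table 2
rater_a = np.array([0, 2, 7, 9, 4, 1])
rater_b = np.array([1, 2, 8, 10, 3, 0])
rho, _ = spearmanr(rater_a, rater_b)
abs_diff = np.abs(rater_a - rater_b)
```

The same pattern extends to all pairwise rater comparisons; an ICC would additionally account for rater-level variance rather than rank agreement alone.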
To examine the diagnostic utility (based on the three primary diagnostic groups) of the clinicians’ ratings, sensitivity, specificity, and positive likelihood ratios were calculated for both the dichotomous and ordinal ratings. For the dichotomous ratings, two methods were used to summarize the five raters’ data. First, we calculated individual sensitivity and specificity statistics for each rater and then created an average for all five raters. We also created a single summary rating of “impaired” versus “intact” based on whether the majority of the raters rated the clock as impaired or intact. For the ordinal ratings, an average was calculated to summarize the five clinicians’ ratings; then sensitivities and specificities were calculated for this average rating against the true diagnostic classification. The ordinal ratings were used to examine the cutoff score that optimized sensitivity and specificity.
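The dichotomous-rating summaries can likewise be sketched with hypothetical data (again, not the study’s own code): a majority vote collapses the five raters’ calls into a single classification, from which sensitivity, specificity, and the positive likelihood ratio follow.

```python
# Majority-vote summary and diagnostic utility statistics; data are invented.
import numpy as np

def sens_spec_lr(pred_impaired, truly_impaired):
    """Sensitivity, specificity, and positive likelihood ratio
    (LR+ = sensitivity / (1 - specificity)) for binary classifications."""
    pred = np.asarray(pred_impaired, dtype=bool)
    truth = np.asarray(truly_impaired, dtype=bool)
    tp = (pred & truth).sum()
    fn = (~pred & truth).sum()
    tn = (~pred & ~truth).sum()
    fp = (pred & ~truth).sum()
    sens = tp / (tp + fn)
    spec = tn / (tn + fp)
    return sens, spec, sens / (1 - spec)

# Five raters' binary calls (1 = impaired) on six hypothetical clocks
ratings = np.array([[1, 1, 1, 1, 0],
                    [1, 1, 0, 1, 1],
                    [0, 1, 0, 1, 0],
                    [0, 0, 0, 0, 0],
                    [1, 0, 0, 1, 0],
                    [1, 1, 1, 0, 1]])
truth = np.array([1, 1, 1, 0, 0, 0])   # consensus diagnosis (1 = patient)

# Summary rating: "impaired" when at least 3 of the 5 raters say so
majority = ratings.sum(axis=1) >= 3
sens, spec, lr_pos = sens_spec_lr(majority, truth)
```

The rater-averaged approach simply replaces the majority vote with per-rater statistics that are then averaged across the five raters.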
RESULTS
Details of demographic and clinical characteristics are shown in Table 1. There were significant between-group differences for age, female gender, education, and MMSE score. There were no significant between-group differences in Geriatric Depression Scale score.
The clinicians’ interrater reliability was “almost perfect”34–37 for the ordinal system (ICC=0.92) and “substantial”37 for the dichotomous system (κ=0.85). The absolute difference scores between ordinal ratings are presented in Table 2. The five clinicians’ ratings did not differ by more than three ordinal scale units for 69% of the clocks. An absolute score difference of 5 or less was observed in 91% of the clocks, and only five clocks (3%) had an absolute difference of more than seven units on the ordinal scale. Examples of participant clocks that were rated similarly and dissimilarly are presented in Figure 1. Spearman rank correlations between each of the individual raters ranged from 0.64 to 0.82 for the ordinal scale (Table 3).
Based on the five clinicians’ average dichotomous rating, the clinicians differentiated comparison and Alzheimer’s disease participants with a sensitivity of 0.75 and a specificity of 0.81. In comparison, the dichotomous rating based solely on the majority of raters had a sensitivity of 0.84 and specificity of 0.84. In differentiating comparison subjects from mild cognitive impairment participants, the average sensitivity of the five raters’ dichotomous classifications was 0.47, and the average specificity was 0.81. Using a majority for the calculation of sensitivity and specificity resulted in values of 0.50 and 0.84, respectively (Table 4).
The sensitivities, specificities, and positive likelihood ratios for several cutoff scores on the ordinal scale are presented in Table 5. For three of the four comparisons (i.e., Alzheimer’s disease versus comparison, mild cognitive impairment versus comparison, and Alzheimer’s disease + mild cognitive impairment versus comparison), a cutoff score of two or greater maximized sensitivity and specificity for differentiating diagnostic groups. For differentiating Alzheimer’s disease from mild cognitive impairment, a cutoff score of four or greater maximized sensitivity and specificity. As expected, higher ordinal ratings (i.e., more impaired clocks) were associated with a greater likelihood of being diagnosed with either Alzheimer’s disease or mild cognitive impairment, but at the expense of sensitivity.
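The cutoff search described here amounts to sweeping thresholds on the averaged ordinal ratings and keeping the one with the best combined sensitivity and specificity. A minimal sketch, with invented data and using Youden’s J (sensitivity + specificity − 1) as one common combination rule, might look like:

```python
# Cutoff sweep on averaged 0-10 ratings; the data below are hypothetical,
# and Youden's J is assumed as the optimization criterion.
import numpy as np

def best_cutoff(avg_ratings, truly_impaired):
    """Call a clock 'impaired' when its averaged rating is >= c, and
    return the integer cutoff c (0-10) that maximizes Youden's J."""
    avg = np.asarray(avg_ratings, dtype=float)
    truth = np.asarray(truly_impaired, dtype=bool)
    best_c, best_j = None, -1.0
    for c in range(11):
        pred = avg >= c
        sens = (pred & truth).sum() / truth.sum()
        spec = (~pred & ~truth).sum() / (~truth).sum()
        j = sens + spec - 1
        if j > best_j:
            best_c, best_j = c, j
    return best_c, best_j

# Invented averaged ratings for six clocks and their consensus labels
avg = [0.2, 1.0, 3.4, 6.0, 0.8, 4.2]
truth = [0, 0, 1, 1, 0, 1]
cutoff, youden = best_cutoff(avg, truth)
```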
DISCUSSION
Our study sought to investigate the interrater reliability of qualitative clock drawing test ratings made by five dementia clinicians at Boston University Medical Center. The clinicians were reliable clock drawing test raters using both dichotomous (impaired versus intact) and ordinal (0–10 impairment scale) ratings. The interrater reliability for the dichotomous system achieved a kappa of 0.85, and the ordinal rating resulted in an intraclass correlation coefficient of 0.92. These statistics represent excellent interrater reliability values and are comparable to those obtained in our recent work comparing several widely used quantitative clock drawing test scoring systems.27 Our findings demonstrate that, in the absence of objective scoring methods, the clock drawing test can be rated reliably across a cognitive severity spectrum by clinicians who specialize in dementia.
Despite these excellent reliability values, there were several individual instances in which clinicians’ ratings were widely disparate. As seen in Table 2, ratings of nine clocks (6%) differed by six or more points on the ordinal scale after rerating eliminated errors. Multiple factors may explain why the clinicians applied disparate ratings, including spatial configuration, participant self-corrected errors, and the shape of the clock face, as exemplified in Figure 1. The discrepancy among the raters highlights the difficulty that clinicians face when scoring clocks subjectively.
The present study also examined the accuracy of clinician-rated clock drawing tests in differentiating among cognitively normal, mild cognitive impairment, and Alzheimer’s disease diagnostic categories. Despite the substantial37 overall agreement between raters, the accuracy with which qualitative ratings differentiated diagnostic group membership was less robust. Although Alzheimer’s disease patients and comparison subjects could be differentiated with a relatively high degree of accuracy, the ratings were considerably less useful for distinguishing mild cognitive impairment from comparison subjects (less sensitive) or Alzheimer’s disease from mild cognitive impairment (less specific). Therefore, while the clock drawing test may be a good screening instrument for Alzheimer’s disease, it may not be a sensitive instrument for screening for mild cognitive impairment, especially if clinicians use a dichotomous rating. When screening for mild cognitive impairment, an abnormal clock drawing test in isolation (based on subjective clinician rating) may result in a large number of false positive or false negative errors. For the mild cognitive impairment diagnosis, sensitivity and specificity were somewhat improved by using a subjective ordinal rating scale with three or more cutoff points as compared to the dichotomous scale. We therefore suggest using a 3-point subjective ordinal clock drawing test rating scale, such as “normal,” “suspicious,” and “impaired,” rather than the existing dichotomous system to improve mild cognitive impairment diagnosis.
The clinicians who served as raters for the current study are specialists in diagnosing dementia and work in a tertiary care clinical setting and research center. Therefore, these clinicians may represent a more reliable and diagnostically accurate group than nonspecialists in the community, and their expertise in dementia assessment may limit the extent to which the findings can be generalized to other settings and clinicians. Another limitation is that some clinicians were also members of the consensus team that formulated the original diagnostic impressions for our participant cohort. This overlap raises the possibility that the clinicians may not have been completely blinded to diagnostic group membership for the clocks being reexamined, assuming that the clinicians remembered the clocks that were presented in prior consensus conference meetings. However, this overlap would have only affected the diagnostic utility statistics and not the interrater reliability, which was the primary focus of the current study. Finally, we excluded individuals with dementias other than Alzheimer’s disease, individuals with visual impairment, and non-English speakers, which may have inflated the diagnostic utility statistics while limiting the generalizability of our study.
Although the clock drawing test has many advantages as a screening instrument in the assessment of patients with suspected dementia, it is often used qualitatively, or subjectively, in clinical settings. As such, the reliability of these qualitative ratings between clinicians is brought into question. This is the first study to investigate the concordance among clock drawing test ratings by dementia specialists. The current study results indicate that dementia specialists can reliably rate clock drawing test performance using two different qualitative rating approaches. In contrast, the findings do not support the use of the clock drawing test as a stand-alone screening instrument, as the classification accuracy statistics presented suggest that in mild cognitive impairment, the clinician ratings may be susceptible to both false positive and false negative errors. However, the clinicians’ ratings had excellent sensitivity and specificity for distinguishing healthy comparison subjects from those with probable and possible mild Alzheimer’s disease. Future studies should compare the reliability and diagnostic accuracy of qualitative methods to empirically validated quantitative scoring systems.
Acknowledgments
This research was supported by National Institutes of Health grants P30-AG13846 (Boston University Alzheimer’s Disease Core Center), M01-RR00533 (Boston University General Clinical Research Center), K24-AG027841 (RCG), and K23-AG030962 (Paul B. Beeson Career Development Award in Aging; ALJ). The authors thank Sabrina Poon, Melissa Barrup, Laura Byerly, Sita Yermashetti, Pallavi Joshi, Mario Orozco, Amanda Gentile, and Kristen Huber for their assistance with the data, and all the psychometricians, nurse specialists, and clinicians at the Boston University Alzheimer’s Disease Center and GCRC for administering the clock drawing task and providing clinical assessments. In particular, the authors thank the participants of the Boston University Alzheimer’s Disease Center cohort.
A portion of this work was presented in poster format at the American Academy of Neurology Annual Meeting, April 14-16, 2008, in Chicago. None of the authors have any conflicts of interest to disclose. There are no commercial associations that might pose or create a conflict of interest with the information presented in the submitted manuscript.