With expansion of Medicaid eligibility and passage of the Affordable Care Act, there is additional pressure on the mental health care system to efficiently and effectively provide mental health assessment and treatment for millions of additional people seeking care. As measurement-based care becomes the standard for assessment of illness severity and improvement with treatment, well-validated, affordable, and quick measures are needed to help busy clinicians treat patients rapidly and effectively.
Computerized adaptive diagnosis (CAD) and computerized adaptive testing (CAT) have the potential to provide rapid, systematic testing on a population level (1, 2). The paradigm shift between traditional fixed-length tests and adaptive tests is that traditional tests fix the items and allow the measurement precision to vary, whereas adaptive tests fix measurement precision and allow the items to vary. The net result is that it is possible to extract the relevant information contained in a bank of hundreds of symptom-related questions by using only a small number of optimal items for each person. Depending on the application, the degree of required precision can be selected a priori, so that national screening programs can use less precision than clinic screening, which in turn may require less precision than a randomized clinical trial.
Application of CAT differs from standard assessments of symptom severity in several important ways. First, traditional scales may be hampered by a “practice effect,” which results from retaking the same measure repeatedly over time. Because CAT adapts to the patient's current severity level, the patient receives different items each time the test is administered, which eliminates these practice effects. Second, for repeated assessments, traditional tests make no use of the information contained in the preceding test administrations. By contrast, in CAT, the last CAT-based severity measure can be used to start the next CAT, selecting the next most informative item conditional on the estimated severity level from previous sessions. Third, traditional measurement provides a score (typically the sum of the item scores) but no estimate of uncertainty in the score for a given patient. The standard approach of computing a total score also adds potential bias because items with different numbers of response categories (for example, in the Hamilton Rating Scale for Depression [HAM-D-25]) are weighted differently when computing a total score (that is, an item with two categories receives less weight than an item with five categories). Because CAT is based on an underlying statistical model of measurement (item response theory [IRT]), the number of categories no longer differentially weights the importance of the item in computing the severity score, and each estimated score has a corresponding uncertainty estimate. IRT produces the estimate of uncertainty, and CAT mandates that all patients are tested until they achieve a desired level of uncertainty; hence all patients are tested with the same level of precision. Traditional tests lack this desirable statistical property. [An online supplement to this article presents further explanation of CAT and IRT principles.]
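The adaptive principle described above (select the most informative item at the current severity estimate, update the estimate, and stop once a preset level of uncertainty is reached) can be illustrated with a minimal sketch. The sketch makes simplifying assumptions that are not drawn from the CAT-MH: a dichotomous two-parameter logistic (2PL) model rather than the model underlying the CAT-MH, a small hypothetical item bank, a unit-variance normal prior evaluated on a grid, and a deterministic simulated respondent. The names `ITEM_BANK`, `run_cat`, and `simulated_patient` are illustrative only.

```python
import math

# Hypothetical dichotomous item bank: (discrimination a, difficulty b).
# Values are illustrative only and are not taken from the CAT-MH item bank.
ITEM_BANK = [
    (1.2, -2.0), (1.5, -1.5), (1.8, -1.0), (2.0, -0.5), (2.2, 0.0), (2.0, 0.5),
    (1.8, 1.0), (1.5, 1.5), (1.2, 2.0), (1.6, -0.25), (1.6, 0.25), (1.4, 0.75),
]
GRID = [i / 10.0 for i in range(-40, 41)]  # candidate severity values


def p_endorse(theta, a, b):
    """2PL probability of endorsing an item at severity theta."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


def item_information(theta, a, b):
    """Fisher information of a 2PL item at severity theta."""
    p = p_endorse(theta, a, b)
    return a * a * p * (1.0 - p)


def estimate(responses, prior_mean):
    """Posterior mean and SD of severity on GRID (unit-variance normal prior)."""
    weights = []
    for t in GRID:
        w = math.exp(-0.5 * (t - prior_mean) ** 2)
        for a, b, endorsed in responses:
            p = p_endorse(t, a, b)
            w *= p if endorsed else 1.0 - p
        weights.append(w)
    total = sum(weights)
    mean = sum(t * w for t, w in zip(GRID, weights)) / total
    var = sum((t - mean) ** 2 * w for t, w in zip(GRID, weights)) / total
    return mean, math.sqrt(var)


def run_cat(answer, start_theta=0.0, target_se=0.5):
    """Give items until the posterior SD reaches target_se or the bank runs out."""
    remaining = list(ITEM_BANK)
    responses = []
    theta, se = estimate(responses, start_theta)
    while remaining and se > target_se:
        # Pick the unused item that is most informative at the current estimate.
        item = max(remaining, key=lambda ab: item_information(theta, *ab))
        remaining.remove(item)
        responses.append((item[0], item[1], answer(item)))
        theta, se = estimate(responses, start_theta)
    return theta, se, len(responses)


def simulated_patient(true_theta):
    """Deterministic stand-in respondent: endorses items with difficulty below true_theta."""
    return lambda item: true_theta > item[1]


# First session starts from the population mean; a later session can start
# from the previous estimate, as described for repeated assessments.
theta, se, n_items = run_cat(simulated_patient(1.0))
theta_next, se_next, n_next = run_cat(simulated_patient(1.0), start_theta=theta)
print(n_items, round(se, 2), n_next, round(se_next, 2))
```

In this sketch the stopping rule, not the number of items, is fixed in advance, so different respondents may receive different numbers of items while being measured to the same target precision.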
It is also important to note that severity measurement and diagnosis are two very different operations. In severity measurement, we seek to maximize information surrounding the symptom severity of the patient. In diagnosis, we seek to maximize information at the threshold above which the probability of the diagnosis exceeds 50%. Gibbons and colleagues (3) have developed a computerized adaptive diagnostic screener for depression (CAD-MDD). They found that the CAD-MDD could ascertain a diagnosis of major depressive disorder with sensitivity of .95 and specificity of .87 by using an average of four questions and taking less than one minute to administer (mean of 46±29 seconds), making it an exceedingly rapid and effective screener.
If shown to be valid across a wide variety of patient populations, these tools could fill a key void, allowing automated testing of millions of people with a quick, easily administered online tool. Standard scales such as the nine-item Patient Health Questionnaire (PHQ-9) have been validated in a wide range of treatment settings. The CAT for depression severity (CAT-DI), the CAD-MDD (depression diagnostic screener), and CAT-ANX (anxiety severity) have been validated previously in both an academic and a nonpsychiatric community hospital. To assess the validity and potential impact of these tests on general outpatient community psychiatric practice, as well as to provide initial validation of the CAT-MANIA (mania severity), we sought in this study to validate the utility of the CAT-MH (mental health) suite of tests in a nonacademic, community sample of adult psychiatric outpatients.
Methods
Item Bank and Original Calibration Sample
The original studies developed a 1,008-item question bank consisting of 452 depression items, 467 anxiety items, and 89 mania items (1–4). Separate CATs were developed for each of these three primary domains. The items were selected on the basis of a review of more than 100 existing depression or depression-related rating scales, with most items modified to refer to the previous two-week time period and self-rated on a 5-point ordinal scale. These tools and methods have been described in detail elsewhere (1–11) and have been previously validated in an academic center (University of Pittsburgh psychiatric clinics) and a nonpsychiatric community general medical hospital (DuBois Regional Medical Center).
Validation Sample
The VOCATIONS trial (Validation of Computerized Adaptive Testing in an Outpatient Nonacademic Setting) was designed as a prospective cross-sectional validation study of the CAT-MH suite of tests and was conducted between April 18, 2012, and March 29, 2013, at the outpatient clinics of Pine Rest Christian Mental Health Services, located in Grand Rapids, Michigan. Pine Rest is a large, not-for-profit, free-standing psychiatric system with a spectrum of comprehensive psychiatric services ranging from inpatient to partial hospitalization, including a network of outpatient clinics in the surrounding community. In the population served by Pine Rest outpatient clinics, 64% of patients have a commercial insurance plan, 12% are self-pay, 12% are covered by Medicare, and 12% are covered by community mental health contracts (that is, uninsured) or by Medicaid. This study was conducted in compliance with the ethical principles of the Declaration of Helsinki, the U.S. Food and Drug Administration guidelines, and the International Conference on Harmonization’s Good Clinical Practices Guidelines. The Human Participants Review Board at Mercy Health Saint Mary’s approved the study, and individuals signed a written informed consent form prior to initiation of any study procedures.
Participants were a convenience sample of women and men, ages 18–70, who presented to Pine Rest Christian Mental Health Services clinics seeking care and a control sample of adults with no current or past history of a mental disorder. Participants were recruited using institutional review board–approved advertisements in clinic waiting rooms and on the Pine Rest Web site. Patients had to be willing and able to provide written informed consent in order to participate. Exclusion criteria were schizophrenia, schizoaffective disorder, or other psychotic disorder; organic mood disorder due to a general medical condition or a substance use disorder; drug or alcohol dependence in the prior three months; severity of illness sufficient to require inpatient hospitalization because of suicide risk or psychosis; and Alzheimer’s or Parkinson’s disease.
Upon signing informed consent, participants were administered the following assessments by trained raters blinded to the patients’ clinical diagnoses prior to evaluation: Structured Clinical Interview for DSM-IV-TR (SCID) (12), the HAM-D-25 (13), PHQ-9 (14), Center for Epidemiologic Studies Depression Scale (CES-D) (15), Global Assessment of Functioning (GAF) (16), a questionnaire about demographic characteristics, and a study participation evaluation. Participants also took the most recent version of the CAT-MH, which contains the depression, anxiety, and mania-hypomania components of the entire 1,008-item bank, including the CAD-MDD for current depression diagnosis, CAT-DI for current depression severity, CAT-ANX for current anxiety severity, and CAT-MANIA for current manic-hypomanic symptom severity. CAT-MH depression, anxiety, and mania scores were correlated with SCID, HAM-D-25, CES-D, and PHQ-9 scores and with DSM-IV-TR cases of depression, anxiety, and bipolar disorders.
Statistical Methods
Sample size computations were conducted to determine the ability to find significant differences in sensitivity and specificity between the original findings for the CAD-MDD and the results of this validation study. Assuming a type I error rate of 5% and power of 80%, N=150 permits detection of approximately 10% differences in sensitivity (.95 versus .86) and specificity (.87 versus .75).
Data analysis was performed by the senior author (RG) at the University of Chicago. The goal was to test the reproducibility of previous analyses of sensitivity, specificity, and correlation with gold-standard symptom severity scales (HAM-D, CES-D, and PHQ-9) in this community sample. Logistic regression was used to examine relationships between severity scores and the presence or absence of DSM-IV-TR diagnoses.
Results
Participants
A total of 150 patients provided written informed consent. Four did not meet inclusion criteria, and one withdrew consent. A total of 145 patients completed all testing and were included in the analysis. [A CONSORT diagram in the online supplement provides additional details on sample recruitment.]
Patient Demographic Characteristics
Of the 145 adult patients in the sample, 79% were female, 10% were Hispanic, 90% were Caucasian, 5% were African American, 3% were Asian, and 3% indicated other race. In addition, 58% were married, 24% were never married, 5% were living with a partner, and the remainder were divorced (10%), separated (2%), or widowed (<1%). In terms of education, 40% had a college degree or higher, 42% had some college, and 16% had graduated from high school or had a GED (Table 1).
Diagnoses
In terms of current DSM-IV-TR diagnoses, 27 of the 145 patients had major depressive disorder, 27 had generalized anxiety disorder, 13 had bipolar I disorder, 11 had bipolar II disorder, 15 had dysthymic disorder, and 16 had panic disorder. Other diagnoses are shown in Table 2. Many patients had comorbid disorders, which explains why the sum of diagnoses exceeds the sample size. Nineteen of the 145 participants had no current or past history of a DSM-IV-TR diagnosis (control group).
CAD-MDD: A Diagnostic Screen for Major Depression
Given the high degree of pathology and comorbidity in the sample, it was expected that the high sensitivity seen in other studies would be replicated, but with lower specificity. This was found in the overall sample, where sensitivity was .96 (.95 in the original CAD-MDD study) and specificity was .64 (.87 in the original CAD-MDD study (3), which included a much greater number and proportion of individuals with no current or past DSM-IV-TR diagnoses). However, when the sample was restricted to patients meeting DSM-IV-TR criteria for major depressive disorder in the past month and individuals with no current or past DSM-IV-TR diagnoses, sensitivity remained at .96, but specificity increased to 1.00 (that is, there were no false positives and only one false negative in a total of 46 patients). These results are consistent with what would be expected in a primary care setting, where the majority of patients would not meet criteria for a DSM-IV-TR major depressive disorder (17, 18). These results were achieved with an average of 4.1 questions, which took 36.1 seconds to complete.
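The restricted-sample figures can be reproduced from the counts reported in the text. The following is a minimal sketch, assuming the 46 participants comprise the 27 patients with current major depressive disorder and the 19 controls, with one false negative and no false positives; the function name `sensitivity_specificity` is illustrative.

```python
def sensitivity_specificity(tp, fn, tn, fp):
    """Return (sensitivity, specificity) from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)  # proportion of true cases detected
    specificity = tn / (tn + fp)  # proportion of non-cases correctly ruled out
    return sensitivity, specificity


# Counts inferred from the text: 27 cases (1 missed), 19 controls (none flagged).
sens, spec = sensitivity_specificity(tp=26, fn=1, tn=19, fp=0)
print(f"sensitivity={sens:.2f}, specificity={spec:.2f}")
# prints "sensitivity=0.96, specificity=1.00"
```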
CAT-DI: Depression Severity Measure
The dimensional measure of depressive severity (CAT-DI) demonstrated correlations with traditional scales, such as the HAM-D-25 (r=.79), PHQ-9 (r=.90), CES-D (r=.90), and GAF (r=–.70) (Table 3). The CAT-DI correlated highly with the CAT-ANX (r=.82) but less so with the CAT-MANIA (r=.38). In terms of its relationship with current DSM-IV-TR major depressive disorder diagnosis, the CAT-DI had an odds ratio (OR) of 6.97 (p<.001). This means that for every unit increase in CAT-DI score, the odds of a current DSM-IV-TR major depressive disorder diagnosis increased approximately sevenfold. Given that the range of scores on the CAT-DI is from –2 to 2, the actual span gives an OR of 27.88, a 28-fold increase in probability of major depressive disorder from the low to the high end of the CAT-DI scale. This scale took an average of 16.8 items and 3.4 minutes to complete.
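A per-unit OR multiplies the odds of a diagnosis rather than its probability. The following minimal sketch shows the conversion for a one-unit increase in score, assuming a hypothetical baseline probability of .10 that is not taken from the study data; the function name `probability_after_unit_increase` is illustrative.

```python
def probability_after_unit_increase(baseline_prob, odds_ratio):
    """Apply a per-unit odds ratio to a baseline probability."""
    baseline_odds = baseline_prob / (1.0 - baseline_prob)
    new_odds = baseline_odds * odds_ratio  # the OR acts on the odds scale
    return new_odds / (1.0 + new_odds)     # convert odds back to probability


# Hypothetical baseline probability of .10 with the reported per-unit OR of 6.97.
p = probability_after_unit_increase(0.10, 6.97)
print(round(p, 3))
```

Under this assumption, a one-unit increase raises the probability of diagnosis from .10 to roughly .44, a smaller change than a sevenfold increase in probability would imply.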
CAT-ANX: Anxiety Severity Measure
The dimensional measure of anxiety severity (CAT-ANX) demonstrated correlations with traditional scales, such as the HAM-D-25 (r=.73), PHQ-9 (r=.78), CES-D (r=.81), and GAF (r=–.68) (Table 3). These results indicate that depression and anxiety have considerable overlap, which is known to be true neurobiologically and is also observed clinically (19). The CAT-ANX correlated highly with the CAT-DI (r=.82) but less so with the CAT-MANIA (r=.47). In terms of its relationship with current DSM-IV-TR generalized anxiety disorder diagnosis, the CAT-ANX had an OR of 2.88 (p<.001). Given that the range of scores on the CAT-ANX is from –2 to 2, the actual span gives an OR of 11.52, a 12-fold increase in probability of generalized anxiety disorder from the low to the high end of the scale. This scale took an average of 12.9 items and 2.0 minutes to complete.
CAT-MANIA: Mania Severity Measure
The dimensional measure of the hypomania-mania spectrum (CAT-MANIA) demonstrated relatively low correlations with traditional scales, as expected: HAM-D-25 (r=.31), PHQ-9 (r=.37), CES-D (r=.39), and GAF (r=–.29) (Table 3). These results indicate that depression and mania have limited overlap, at least at a single point in time, which has been confirmed clinically: depressive and manic symptoms often co-occur, but true mixed states as defined by DSM-IV-TR are uncommon (20, 21). The CAT-MANIA correlated only modestly with the CAT-DI (r=.38) and the CAT-ANX (r=.47). In terms of its relationship with current DSM-IV-TR bipolar diagnoses (bipolar I disorder, bipolar II disorder, and bipolar disorder not otherwise specified [NOS]), the CAT-MANIA had an OR of 2.89 (p<.002). Given that the range of scores is from –2 to 2, the actual span gives an OR of 11.56, a 12-fold increase in probability of a bipolar disorder diagnosis from the low to the high end of the CAT-MANIA scale. This was the first time the CAT-MANIA had been validated in a clinical sample. This scale took an average of 17.9 items and 3.4 minutes to complete.
Patient Impressions of Usability of the CAT-MH
Participants took, on average, 51.7 items and 9.4 minutes to complete the entire CAT-MH. As summarized in Table 4, patients found the computerized adaptive tests easy overall and acceptable to use, felt comfortable answering personal questions about themselves, answered them honestly, preferred computerized adaptive tests over a pencil-and-paper test, and felt the test accurately reflected their mood. There was some concern that older patients would not find the computerized test as easy to take. This was not found to be the case; correlations with age ranged from .22 to .35.
Discussion
This was the first prospective, cross-sectional study to validate the CAT-MH suite of tests, including the CAT-MANIA scale, in a community outpatient psychiatric setting against gold-standard diagnostic and severity measures, including the SCID for DSM-IV-TR, HAM-D-25, CES-D, PHQ-9, and GAF.
Considering the high rate of DSM-IV-TR disorders in this clinic sample, the high rate of comorbidity, and the small number of individuals with no current or past DSM-IV-TR diagnoses, the CAT-MH performed well. Sensitivity remained at high levels and specificity decreased as expected. However, when the sample was restricted to patients with confirmed major depressive disorder and those with no current or past DSM-IV-TR diagnoses, sensitivity for the CAD-MDD was unchanged, but specificity increased to 1.00 (that is, no false positives). Of 46 participants, there was only one misclassification. This bodes well for applications in primary care, where most patients (90% or more) will not have a current DSM-IV-TR major depressive disorder.
Even though the sample consisted of patients with multiple diagnoses, the three severity tests also performed well. Significant relationships were found between each of the three dimensional measures (CAT-DI, CAT-ANX, and CAT-MANIA) and DSM-IV-TR diagnoses of major depressive disorder, generalized anxiety disorder, and current bipolar disorders, respectively, and the CAT-DI was strongly related to traditional depression severity measures. In general, patients appeared to have a positive overall impression of the test, were comfortable answering questions using a computer interface, found it easy to use, reported answering honestly, and indicated that the questions accurately reflected their mood. Interestingly, 86% indicated that they preferred the computer interface to a traditional paper-and-pencil test.
The strengths of this study included the prospective nature of the evaluations, the broad inclusion criteria that improved generalizability, and the use of gold-standard diagnostic and symptom severity comparators. Limitations included its cross-sectional design that did not allow for test-retest and longitudinal assessment of improvement over time. Given the adaptive nature of the testing and the large question bank from which to draw unique questions, we would expect that these assessments would be superior to standard assessments for longitudinal follow-up and would avoid the potential bias of the practice effect, but this needs to be demonstrated in future studies.
A further limitation of these assessments was the inability to detect lifetime history of psychiatric disorders. For example, longitudinal data are required for the accurate diagnosis of bipolar disorder, whereas the CAT-MANIA scale is useful only in assessing current manic symptoms. Per the SCID for DSM-IV-TR, there were 13 participants with current manic symptoms that met full criteria indicative of bipolar I disorder, 11 with hypomania indicative of bipolar II disorder, and one with current bipolar disorder NOS. When lifetime episodes of mania or hypomania were taken into account by assessment with the SCID for DSM-IV-TR, a total of 20 patients in this cohort had bipolar I disorder, 20 had bipolar II disorder, and three had bipolar disorder NOS (Table 2). This finding is critical, because if those patients (with mania currently in full remission [N=8] or partial remission [N=10]) were incorrectly diagnosed as having unipolar depression, they may have received inappropriate treatment with antidepressants, rather than with mood stabilizers; antidepressants may be ineffective for the treatment of bipolar disorder (22–25).
Finally, the sample, in which 79% were women, 90% were Caucasians, 40% had a college degree, and 42% had some college, is not representative of other, more diverse patient populations. Future testing in these populations is required.
Conclusions
The results of this prospective, cross-sectional validation study suggest that the CAT-MH suite of tests provides a rapidly administered, accurate assessment of depression diagnosis and symptom severity across a broad range of mood and anxiety symptoms in an adult, community outpatient psychiatric population.