Fifty-nine articles met criteria for inclusion. A complete list of the articles and a 25-page evidence table describing the studies and summarizing their findings are available from the first author, as is a companion glossary of the instruments reviewed in the articles. Thirty-nine of the 59 articles (66 percent) were validation studies, 13 (22 percent) were reviews, and seven (12 percent) were outcome studies.
Of more than 40 instruments mentioned in the articles, only five were mentioned in 20 percent or more of the studies. They were the Geriatric Depression Scale (GDS) (36), mentioned in 19 articles; the Beck Depression Inventory (BDI), mentioned in 16 articles; the General Health Questionnaire (GHQ) (37) and the short version of the Zung Depression Scale (SDS) (38), both mentioned in 15 articles; and the Center for Epidemiologic Studies Depression Scale (CES-D) (39), mentioned in 12 articles. The BDI, GDS, and SDS are depression-specific self-administered instruments. The GDS is intended for older persons. The CES-D is a subscale of a larger, population-based research screening tool. The GHQ is not specific for depression.
Review articles
The 13 literature reviews were of three general types: users' guides, which consisted largely of expert opinion about the strengths and weaknesses of different instruments; quantitative comparisons and meta-analyses, which summarized studies that included statistical measures of validity against specific criterion instruments; and reviews of evidence, which examined whether using screening instruments produced favorable outcomes. The articles by Applegate and associates (41), Gallagher (42), Leserman and Koch (43), Kavan and colleagues (44), and Van Gorp and Cummings (45) typify users' guides. Leserman and Koch made a unique contribution by summarizing the studies showing instruments' sensitivity to change with treatment of depression.
Quantitative comparisons varied in sophistication. Mulrow and her co-investigators (46) used a computerized literature search with defined criteria to identify studies systematically, focusing on research comparing instruments to criterion standards in primary care settings. Studies were evaluated based on whether the criterion assessment of patients was independent of screening and whether a large proportion of subjects were both screened and diagnosed. The review includes a meta-analysis of the sensitivities and specificities of the instruments included.
The review by Coulehan and colleagues (47) produced results similar to Mulrow's, although with a less rigorous analysis of the original research. Sensitivity and specificity measures cited in Coulehan's review appeared to be higher than those in Mulrow's review. Both reviews examined the BDI, CES-D, GHQ, and SDS.
Clark and Watson (48) explored the reasons depression screening instruments often lack specificity. Using meta-analytic techniques, the authors identified a nonspecific distress factor that is present to some degree in measures of both depression and anxiety and that confounds screening scales.
The reviews of evidence for effectiveness disagreed in their conclusions. Feightner and Worrall (49) identified only four studies of the effectiveness of early intervention using screening and concluded that instruments have been shown only to increase detection of depression, with no documented impact on outcome. These authors advised caution in screening general populations of primary care patients for depression. Zung's analysis (50) did not include effectiveness studies but did cover six studies demonstrating that offering physicians knowledge of the results of depression screening tests increased their recognition of depression. Zung suggested that depression screening instruments be used as a “depression thermometer,” monitoring a patient's mental state in the same way a thermometer tracks a physical sign of health status.
Validation studies
There is no obvious single classification system to describe the mixture of 39 papers identified under the general heading of validation studies. Several classification approaches may be informative, although not necessarily exhaustive. Three are described here.
One way of classifying the studies is by type of comparison. Two major groups were noted—those comparing an instrument to a criterion measure, such as a DSM-III diagnosis (27 studies), and those comparing clinical judgment or diagnosis to a criterion measure (seven studies). The few studies that did neither ranged widely, comparing an instrument to a subset of itself (for example, Andresen and Malmgren [51]), an instrument to itself (for example, Burrows and associates [52]), or an instrument to measures that are not depression specific (for example, Wilkinson and Barczak [53]).
Studies comparing an instrument to a criterion measure test validity in the classic sense—if the instrument is in agreement with the criterion measure, it can be said to measure what it purports to measure, and therefore is valid, at least under the conditions tested. The circumstances, population, and definition of “agreement” are critical to such validity studies. As described below, the studies differed markedly in each of these areas. For example, five of the studies that compared the instrument and a criterion dealt with general populations. Of those, four reported sensitivity and specificity as the measure of agreement, while one reported area under the receiver operating curve, a summary measure of screening instrument performance in comparison to a criterion.
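The measures of agreement mentioned above follow directly from the familiar two-by-two screening table. As a minimal sketch (the counts are invented for illustration and are not drawn from any study in this review):

```python
# Sensitivity and specificity from a 2x2 table of screening results
# against a criterion diagnosis. All counts below are hypothetical.

def sensitivity_specificity(tp, fp, fn, tn):
    """Return (sensitivity, specificity) for a screening instrument."""
    sensitivity = tp / (tp + fn)  # share of criterion-positive patients detected
    specificity = tn / (tn + fp)  # share of criterion-negative patients correctly cleared
    return sensitivity, specificity

# Hypothetical sample: 100 patients, 20 depressed by the criterion standard,
# of whom 17 screen positive; 12 of the 80 non-depressed also screen positive.
sens, spec = sensitivity_specificity(tp=17, fp=12, fn=3, tn=68)
print(sens, spec)  # 0.85 0.85
```

Because sensitivity and specificity depend on the cutoff score chosen, two studies of the same instrument can report quite different values; this is one reason the circumstances of each validation matter.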
The studies of clinicians' judgment were a diverse group. Three studies—by Pond and associates (
54), Coyne and colleagues (
55), and Gerber and coworkers (
56)—documented relatively poor performance of nonpsychiatrist physicians when their judgments were compared with criterion standards. Spitzer and colleagues (
28) found high specificity but only moderate sensitivity of primary care physicians in detecting depression with an instrument-guided inquiry.
Studies may also be classified according to the population used. Such classification is important because instruments may be optimized for specific groups or because of concern that performance may suffer in particular groups. Twenty-two studies dealt with elderly patients, four specifically comparing cognitively normal groups and impaired groups. Nine focused on more general populations, mostly patients in ambulatory settings. Considered as a group, these papers suggested that several screening instruments, including ones not intended for detecting depression or not specific for depression, are nearly equivalent in measured validity against criterion standards. In general populations, clinicians' judgment did not perform as well as standardized approaches, as noted above.
A third way of looking at these studies is by method of analysis. Two statistical approaches appeared to be prevalent. Eleven of the studies presented results as a receiver operating characteristic (ROC) curve, plotting true versus false positive rates at different cutoff scores of an instrument. These studies often included calculation of the area under the ROC curve as a summary measure of the instrument's performance. Sixteen papers reported sensitivity and specificity of instruments at stated cutoff scores. This approach often included calculation of positive and negative predictive values in the population studied. The remainder, which reported neither ROC nor sensitivity-specificity, employed a broad range of other techniques, such as kappa scores, correlation coefficients, and rates of agreement.
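The ROC approach can be sketched in a few lines: sweep the instrument's cutoff score, tabulate the true- and false-positive rates at each cutoff, and summarize performance as the trapezoidal area under the resulting curve. The scores and diagnoses below are invented for illustration:

```python
# Hypothetical sketch of ROC analysis for a screening instrument.
# A patient screens positive when score >= cutoff.

def roc_points(scores, depressed):
    """(false positive rate, true positive rate) at each possible cutoff."""
    pos = sum(depressed)
    neg = len(depressed) - pos
    points = [(0.0, 0.0)]
    for cutoff in sorted(set(scores), reverse=True):
        tp = sum(1 for s, d in zip(scores, depressed) if s >= cutoff and d)
        fp = sum(1 for s, d in zip(scores, depressed) if s >= cutoff and not d)
        points.append((fp / neg, tp / pos))
    return points

def auc(points):
    """Trapezoidal area under the ROC curve."""
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(points, points[1:]))

scores    = [3, 7, 9, 12, 15, 18, 21, 25]  # invented instrument scores
depressed = [0, 0, 0, 0, 1, 0, 1, 1]       # invented criterion diagnoses
print(round(auc(roc_points(scores, depressed)), 2))  # 0.93
```

An area of 1.0 would indicate perfect separation of depressed from non-depressed patients, while 0.5 indicates chance-level discrimination; the appeal of the area measure is that it does not depend on any single cutoff choice.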
Extensive discussion of the properties of the different analytic measures is beyond the scope of this review. Bailar and Mosteller (57) have briefly described the basic statistical methods, and Somoza and colleagues (58) presented a detailed explanation of the application of ROC methods to validation of depression screening tests.
Finally, one issue raised in five different studies is the minimum number of questions needed for efficient depression screening. Berwick and colleagues (59) reported that carefully selected short subsets of the Mental Health Inventory (MHI) (60) performed nearly as well as the complete instrument in detecting patients with major depression. Indeed, Berwick's team was able to identify a single item that worked comparably to the entire MHI. Broadhead's group (40) studied a set of just four questions and found sensitivity and specificity comparable to published validation studies of established, and much longer, instruments at their optimum cutoff points. Steer and his coworkers (61) identified two symptoms from the BDI that distinguished between anxious and depressed patients almost as well as the entire instrument.
Rost and associates (62) found that two-item subsets of the Diagnostic Interview Schedule (DIS) had 99 percent negative predictive value in three ECA populations when compared with the full DIS. Wyshak and Barsky (63) tested a single question using both physician and patient ratings and found that its performance was comparable to that of longer instruments, at least in the special population studied.
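Negative predictive value, the measure Rost's group reports, is read off the same two-by-two table: among patients who screen negative on the short form, it is the proportion who are also negative on the full criterion instrument. A minimal sketch with invented counts:

```python
# Negative predictive value (NPV) of a short screen against a full
# criterion instrument. Counts are hypothetical.

def negative_predictive_value(tn, fn):
    """Proportion of screen-negative patients who are criterion-negative."""
    return tn / (tn + fn)

# Invented example: of 500 patients screening negative on a two-item subset,
# 495 are also negative on the full diagnostic interview.
print(negative_predictive_value(tn=495, fn=5))  # 0.99
```

Note that when the prevalence of depression is low, a high NPV is relatively easy to achieve, which is one reason very short screens are most attractive for ruling depression out rather than ruling it in.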
Outcome studies
A well-designed outcome study showing significant benefit to persons screened, identified, and treated for depression would be important evidence in a decision to implement routine depression screening. None of the seven studies in this review met that standard. Most dealt with recognition—whether physicians given the results of screening would better recognize depression—and with treatment—whether persons screening positive would be more likely to receive treatment for depression.
Results of recognition studies were mixed. Iliffe and colleagues (64) typified this approach, showing that use of screening instruments led to increased recognition of depression in one of two practices studied. Gold and Baraff (65) showed that providing GHQ scores to emergency physicians increased recognition of depression and referral for psychosocial services. Magruder-Habib and colleagues (66) also documented increased recognition and treatment if physicians were provided SDS scores. On the other hand, Shapiro's group (67) found that although screening did result in improved recognition for at least some patients, it did not lead to increased medical management of depression.
Two studies dealt with longer-term effects of screening. Berwick and associates (68) discovered that patients in a health maintenance organization who scored high on the GHQ were more likely to make medical visits in the subsequent year than those with lower scores. Of course, the GHQ is not specific for depression. Magruder-Habib and her co-investigators (69) found that patients who were identified using the SDS and whose physicians were told the scores were more likely to receive antidepressants than patients whose physicians did not receive the scores, but the difference was not statistically significant. They also noted that levels of depressive symptoms did not change over 12 months of follow-up for all patients.