Fifty-nine articles met criteria for inclusion. A complete list of the articles and a 25-page evidence table describing the studies and summarizing their findings are available from the first author, as is a companion glossary of the instruments reviewed in the articles. Thirty-nine of the 59 articles (66 percent) were validation studies, 13 (22 percent) were reviews, and seven (12 percent) were outcome studies.
Of more than 40 instruments mentioned in the articles, only five were mentioned in 20 percent or more of the studies. They were the Geriatric Depression Scale (GDS) (36), mentioned in 19 articles; the Beck Depression Inventory (BDI), mentioned in 16 articles; the General Health Questionnaire (GHQ) (37) and the short version of the Zung Depression Scale (SDS) (38), both mentioned in 15 articles; and the Center for Epidemiologic Studies Depression Scale (CES-D) (39), mentioned in 12 articles. The BDI, GDS, and SDS are depression-specific self-administered instruments. The GDS is intended for older persons. The CES-D is a subscale of a larger, population-based research screening tool. The GHQ is not specific for depression.
Review articles
The 13 literature reviews were of three general types: users' guides, which consisted largely of expert opinion about the strengths and weaknesses of different instruments; quantitative comparisons and meta-analyses, which summarized studies that included statistical measures of validity against specific criterion instruments; and reviews of evidence, which examined whether using screening instruments produced favorable outcomes. The articles by Applegate and associates (41), Gallagher (42), Leserman and Koch (43), Kavan and colleagues (44), and Van Gorp and Cummings (45) typify users' guides. Leserman and Koch made a unique contribution by summarizing the studies showing instruments' sensitivity to change with treatment of depression.
Quantitative comparisons varied in sophistication. Mulrow and her co-investigators (46) used a computerized literature search with defined criteria to identify studies systematically, focusing on research comparing instruments to criterion standards in primary care settings. Studies were evaluated based on whether the criterion assessment of patients was independent of screening and whether a large proportion of subjects were both screened and diagnosed. The review includes a meta-analysis of the sensitivities and specificities of the instruments included.
The review by Coulehan and colleagues (47) produced results similar to Mulrow's, although with a less rigorous analysis of the original research. Sensitivity and specificity measures cited in Coulehan's review appeared to be higher than those in Mulrow's review. Both reviews examined the BDI, CES-D, GHQ, and SDS.
Clark and Watson (48) explored the reasons depression screening instruments often lack specificity. Using meta-analytic techniques, the authors identified a nonspecific distress factor that is present to some degree in measures of both depression and anxiety and that confounds screening scales.
The reviews of evidence for effectiveness disagreed in their conclusions. Feightner and Worrall (49) identified only four studies of the effectiveness of early intervention using screening and concluded that instruments have been shown only to increase detection of depression, with no documented impact on outcome. These authors advised caution in screening general populations of primary care patients for depression. Zung's analysis (50) did not include effectiveness studies but did cover six studies demonstrating that offering physicians knowledge of the results of depression screening tests increased their recognition of depression. Zung suggested that depression screening instruments be used as a “depression thermometer,” monitoring a patient's mental state in the same way a thermometer tracks a physical sign of health status.
Validation studies
There is no obvious single classification system to describe the mixture of 39 papers identified under the general heading of validation studies. Several classification approaches may be informative, although not necessarily exhaustive. Three are described here.
One way of classifying the studies is by type of comparison. Two major groups were noted—those comparing an instrument to a criterion measure, such as a DSM-III diagnosis (27 studies), and those comparing clinical judgment or diagnosis to a criterion measure (seven studies). The few studies that did neither ranged widely, comparing an instrument to a subset of itself (for example, Andresen and Malmgren [51]), an instrument to itself (for example, Burrows and associates [52]), or an instrument to measures that are not depression specific (for example, Wilkinson and Barczak [53]).
Studies comparing an instrument to a criterion measure test validity in the classic sense—if the instrument is in agreement with the criterion measure, it can be said to measure what it purports to measure, and therefore is valid, at least under the conditions tested. The circumstances, population, and definition of “agreement” are critical to such validity studies. As described below, the studies differed markedly in each of these areas. For example, five of the studies that compared the instrument and a criterion dealt with general populations. Of those, four reported sensitivity and specificity as the measure of agreement, while one reported area under the receiver operating curve, a summary measure of screening instrument performance in comparison to a criterion.
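The measures of agreement mentioned above follow directly from the familiar two-by-two screening table. As a minimal sketch (the counts are invented for illustration and are not drawn from any study in this review):

```python
# Sensitivity and specificity from a 2x2 table of screening results
# against a criterion diagnosis. All counts below are hypothetical.

def sensitivity_specificity(tp, fp, fn, tn):
    """Return (sensitivity, specificity) for a screening instrument."""
    sensitivity = tp / (tp + fn)  # share of criterion-positive patients detected
    specificity = tn / (tn + fp)  # share of criterion-negative patients correctly cleared
    return sensitivity, specificity

# Hypothetical sample: 100 patients, 20 depressed by the criterion standard,
# of whom 17 screen positive; 12 of the 80 non-depressed also screen positive.
sens, spec = sensitivity_specificity(tp=17, fp=12, fn=3, tn=68)
print(sens, spec)  # 0.85 0.85
```

Because sensitivity and specificity depend on the cutoff score chosen, two studies of the same instrument can report quite different values; this is one reason the circumstances of each validation matter.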
The studies of clinicians' judgment were a diverse group. Three studies—by Pond and associates (
54), Coyne and colleagues (
55), and Gerber and coworkers (
56)—documented relatively poor performance of nonpsychiatrist physicians when their judgments were compared with criterion standards. Spitzer and colleagues (
28) found high specificity but only moderate sensitivity of primary care physicians in detecting depression with an instrument-guided inquiry.
Studies may also be classified according to the population used. Such classification is important because instruments may be optimized for specific groups or because of concern that performance may suffer in particular groups. Twenty-two studies dealt with elderly patients, four specifically comparing cognitively normal groups and impaired groups. Nine focused on more general populations, mostly patients in ambulatory settings. Considered as a group, these papers suggested that several screening instruments, including ones not intended for detecting depression or not specific for depression, are nearly equivalent in measured validity against criterion standards. In general populations, clinicians' judgment did not perform as well as standardized approaches, as noted above.
A third way of looking at these studies is by method of analysis. Two statistical approaches appeared to be prevalent. Eleven of the studies presented results as a receiver operating characteristic (ROC) curve, plotting true versus false positive rates at different cutoff scores of an instrument. These studies often included calculation of the area under the ROC curve as a summary measure of the instrument's performance. Sixteen papers reported sensitivity and specificity of instruments at stated cutoff scores. This approach often included calculation of positive and negative predictive values in the population studied. The remainder, which reported neither ROC nor sensitivity-specificity, employed a broad range of other techniques, such as kappa scores, correlation coefficients, and rates of agreement.
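The ROC approach can be sketched in a few lines: sweep the instrument's cutoff score, tabulate the true- and false-positive rates at each cutoff, and summarize performance as the trapezoidal area under the resulting curve. The scores and diagnoses below are invented for illustration:

```python
# Hypothetical sketch of ROC analysis for a screening instrument.
# A patient screens positive when score >= cutoff.

def roc_points(scores, depressed):
    """(false positive rate, true positive rate) at each possible cutoff."""
    pos = sum(depressed)
    neg = len(depressed) - pos
    points = [(0.0, 0.0)]
    for cutoff in sorted(set(scores), reverse=True):
        tp = sum(1 for s, d in zip(scores, depressed) if s >= cutoff and d)
        fp = sum(1 for s, d in zip(scores, depressed) if s >= cutoff and not d)
        points.append((fp / neg, tp / pos))
    return points

def auc(points):
    """Trapezoidal area under the ROC curve."""
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(points, points[1:]))

scores    = [3, 7, 9, 12, 15, 18, 21, 25]  # invented instrument scores
depressed = [0, 0, 0, 0, 1, 0, 1, 1]       # invented criterion diagnoses
print(round(auc(roc_points(scores, depressed)), 2))  # 0.93
```

An area of 1.0 would indicate perfect separation of depressed from non-depressed patients, while 0.5 indicates chance-level discrimination; the appeal of the area measure is that it does not depend on any single cutoff choice.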
Extensive discussion of the properties of the different analytic measures is beyond the scope of this review. Bailar and Mosteller (57) have briefly described the basic statistical methods, and Somoza and colleagues (58) presented a detailed explanation of the application of ROC methods to validation of depression screening tests.
Finally, one issue raised in five different studies is the minimum number of questions needed for efficient depression screening. Berwick and colleagues (59) reported that carefully selected short subsets of the Mental Health Inventory (MHI) (60) performed nearly as well as the complete instrument in detecting patients with major depression. Indeed, Berwick's team was able to identify a single item that worked comparably to the entire MHI. Broadhead's group (40) studied a set of just four questions and found sensitivity and specificity comparable to published validation studies of established, and much longer, instruments at their optimum cutoff points. Steer and his coworkers (61) identified two symptoms from the BDI that distinguished between anxious and depressed patients almost as well as the entire instrument.
Rost and associates (62) found that two-item subsets of the Diagnostic Interview Schedule (DIS) had 99 percent negative predictive value in three ECA populations when compared with the full DIS. Wyshak and Barsky (63) tested a single question using both physician and patient ratings and found that its performance was comparable to that of longer instruments, at least in the special population studied.
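Negative predictive value, the measure Rost's group reports, is read off the same two-by-two table: among patients who screen negative on the short form, it is the proportion who are also negative on the full criterion instrument. A minimal sketch with invented counts:

```python
# Negative predictive value (NPV) of a short screen against a full
# criterion instrument. Counts are hypothetical.

def negative_predictive_value(tn, fn):
    """Proportion of screen-negative patients who are criterion-negative."""
    return tn / (tn + fn)

# Invented example: of 500 patients screening negative on a two-item subset,
# 495 are also negative on the full diagnostic interview.
print(negative_predictive_value(tn=495, fn=5))  # 0.99
```

Note that when the prevalence of depression is low, a high NPV is relatively easy to achieve, which is one reason very short screens are most attractive for ruling depression out rather than ruling it in.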
Outcome studies
A well-designed outcome study showing significant benefit to persons screened, identified, and treated for depression would be important evidence in a decision to implement routine depression screening. None of the seven studies in this review met that standard. Most dealt with recognition—whether physicians given the results of screening would better recognize depression—and with treatment—whether persons screening positive would be more likely to receive treatment for depression.
Results of recognition studies were mixed. Iliffe and colleagues (64) typified this approach, showing that use of screening instruments led to increased recognition of depression in one of two practices studied. Gold and Baraff (65) showed that providing GHQ scores to emergency physicians increased recognition of depression and referral for psychosocial services. Magruder-Habib and colleagues (66) also documented increased recognition and treatment if physicians were provided SDS scores. On the other hand, Shapiro's group (67) found that although screening did result in improved recognition for at least some patients, it did not lead to increased medical management of depression.
Two studies dealt with longer-term effects of screening. Berwick and associates (68) discovered that patients in a health maintenance organization who scored high on the GHQ were more likely to make medical visits in the subsequent year than those with lower scores. Of course, the GHQ is not specific for depression. Magruder-Habib and her co-investigators (69) found that patients who were identified using the SDS and whose physicians were told the scores were more likely to receive antidepressants than patients whose physicians did not receive the scores, but the difference was not statistically significant. They also noted that levels of depressive symptoms did not change over 12 months of follow-up for all patients.