The Hamilton Depression Rating Scale (1) was developed in the late 1950s to assess the effectiveness of the first generation of antidepressants and was originally published in 1960. Although Hamilton (1) recognized that the scale had “room for improvement” (p. 56) and that further revision was necessary, the scale quickly became the standard measure of depression severity for clinical trials of antidepressants (2, 3). The Hamilton depression scale has retained this function and is now the most commonly used measure of depression (3). Our objective in this article is to provide a review of the Hamilton depression scale literature published since the last major evaluation of its psychometric properties, more than 20 years ago (4). More recent reviews have appeared (3, 5–7), but they have not systematically examined the literature with regard to a broad range of measurement issues. Significant developments in psychometric theory and practice have been made since the 1950s and need to be applied to instruments currently in use. We evaluate the Hamilton depression scale in light of these current standards and conclude by presenting arguments for and against retaining, revising, or rejecting the Hamilton depression scale as the gold standard for assessment of depression.
Conclusions
The Hamilton depression scale has been the standard for the assessment of depression for more than 40 years. Researchers and policy makers charged with the task of providing standards to evaluate treatment outcomes in depression are faced with three possible solutions: retain, revise, or reject. The last of these solutions argues for the development of a new instrument or the replacement of the Hamilton depression scale with existing, psychometrically superior instruments.
Many of the psychometric properties of the Hamilton depression scale are adequate and consistently meet established criteria. The internal, interrater, and retest reliability estimates for the overall Hamilton depression scale are mostly good, as are the internal reliability estimates at the item level. Similarly, established criteria are met for convergent, discriminant, and predictive validity, although predictive validity does suffer somewhat because of multidimensionality. At the item level, interrater and retest coefficients are weak for many items, and the internal reliability coefficients indicate that some items are problematic. The lack of individual item reliability is not necessarily a fatal psychometric flaw; what is critical is that the items as a whole provide adequate reliability.
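The distinction between item-level and scale-level reliability can be illustrated with a minimal Python sketch. It computes Cronbach's alpha, one common index of internal reliability, for a small matrix of item ratings and then recomputes it with one item removed. The ratings are invented for illustration and are not Hamilton depression scale data; the function names `cronbach_alpha` and `alpha_if_item_deleted` are hypothetical.

```python
from statistics import variance


def cronbach_alpha(scores):
    """Cronbach's alpha for a list of respondents, each a list of item ratings."""
    k = len(scores[0])  # number of items
    if k < 2 or len(scores) < 2:
        raise ValueError("need at least two items and two respondents")
    # Sample variance of each item (columns of the rating matrix).
    item_var_sum = sum(variance(column) for column in zip(*scores))
    # Sample variance of each respondent's total score.
    total_var = variance([sum(row) for row in scores])
    if total_var == 0:
        raise ValueError("total scores have zero variance")
    return k / (k - 1) * (1 - item_var_sum / total_var)


def alpha_if_item_deleted(scores, index):
    """Alpha recomputed after dropping the item at position `index`."""
    reduced = [row[:index] + row[index + 1:] for row in scores]
    return cronbach_alpha(reduced)


# Invented ratings: four respondents, three items. The third item is
# deliberately out of step with the first two.
ratings = [
    [1, 2, 4],
    [2, 1, 1],
    [3, 4, 3],
    [4, 3, 2],
]

print(round(cronbach_alpha(ratings), 2))
print(round(alpha_if_item_deleted(ratings, 2), 2))
```

In this toy matrix, alpha for all three items is about 0.43 and rises to 0.75 once the third item is dropped, which is the kind of item-level diagnostic that flags a problematic item even when a longer scale as a whole performs acceptably.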
Evaluation of item response shows that many of the individual items are poorly designed and sum to generate a total score whose meaning is multidimensional and unclear. The problem of multidimensionality was highlighted in the evaluation of factorial validity, which showed a failure to replicate a single unifying structure across studies. Although the unstable factor structure of the Hamilton depression scale may be partly attributable to the diagnostic diversity of population samples, well-designed scales assessing clearly defined constructs produce factor structures that are invariant across different populations (88). Finally, the Hamilton depression scale is measuring a conception of depression that is now several decades old and that is, at best, only partly related to the operationalization of depression in DSM-IV.
These findings indicate that continued use of the Hamilton depression scale requires, at the very least, a complete overhaul of its constituent items. Accumulated empirical evidence offers some hope that substantial revision can redress a number of psychometric problems, thereby providing an improved measure. Shortened versions of the Hamilton depression scale converge on a common set of core features and in general have proven more effective in detecting change. The truncated item sets for these instruments, however, are limited in that they do not permit capture of the full depressive syndrome. Other studies based on item response theory methods have indicated that modifications of the rating scheme are readily implemented and can enhance the unidimensionality of these core symptoms in a manner that allows uniform assessment of change. Identifying a core set of symptoms with proven psychometric qualities, along with making rating scheme changes that would allow consistent assessment of the severity of depression, could provide a foundation for a reconstructed scale. One advantage of such a revision is that it would maintain continuity with the long-standing use of the original Hamilton depression scale. This sort of transition is probably more palatable and therefore more readily acceptable to regulatory commissions.
The Depression Rating Scale Standardization Team revised the Hamilton depression scale (i.e., the GRID-HAMD [93, 94]) by employing several of the methodological advances we have been advocating in this article. They used item response theory methods to inform, in part, the revision process; developed clear structured interview prompts and scoring guidelines; and to some extent standardized the scoring system. We nonetheless believe that by making an effort to retain the original 17 items, the Depression Rating Scale Standardization Team failed to address many of the flaws of the original instrument. Most of the items still measure multiple constructs, items that have consistently been shown to be ineffective have been retained, and the scoring system still includes differential weighting of items. Moreover, the GRID-HAMD content is virtually unchanged from the original. All the items that appeared on the Hamilton depression scale in 1960 are included in the GRID-HAMD. Thus, this revision has neither removed items based on outdated concepts nor added items that incorporate contemporary definitions of depression.
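The concern about differential weighting can be made concrete with a short Python sketch. It assumes a hypothetical two-item scale in which one item is rated 0 to 4 and the other 0 to 2; the item names, ranges, and the function names `raw_total` and `rescaled_total` are illustrative assumptions and do not reproduce the GRID-HAMD scoring rules. Rescaling each rating to a proportion of its item maximum is shown only as one possible way to equalize item contributions, not as a recommended scoring method.

```python
def raw_total(ratings):
    """Sum of item ratings as recorded (items with wider ranges count more)."""
    return sum(ratings.values())


def rescaled_total(ratings, item_max):
    """Sum of ratings after dividing each by its item's maximum possible score."""
    return sum(ratings[name] / item_max[name] for name in ratings)


# Hypothetical item ranges: item_a is rated 0-4, item_b is rated 0-2.
item_max = {"item_a": 4, "item_b": 2}

# Two hypothetical patients, each at the ceiling of exactly one item.
patient_x = {"item_a": 4, "item_b": 0}
patient_y = {"item_a": 0, "item_b": 2}

print(raw_total(patient_x), raw_total(patient_y))
print(rescaled_total(patient_x, item_max), rescaled_total(patient_y, item_max))
```

Under raw summation the first patient scores twice as high as the second, although each is at the maximum of a single symptom; after rescaling, the two totals are equal. Whether unequal contributions are desirable is a substantive question about the construct, which the code does not settle.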
Rejection of the Hamilton depression scale and replacement with an alternative existing measure or the implementation of a new instrument has scientifically compelling advantages over revision. The Inventory of Depressive Symptomatology (95) and the Montgomery-Åsberg Depression Rating Scale (96), designed to address the limitations of the Hamilton depression scale, represent two potential replacement alternatives. Although these instruments measure contemporary definitions of depression (33), neither item response theory methods nor other contemporary measurement techniques were employed in their development. As indicated earlier, such techniques, especially item response theory, maximize the capacity of an instrument to detect change. On the other hand, the development and implementation of a new instrument that is based on current knowledge of depression and that takes advantage of psychometric and statistical advances might offer the best solution. The decision to replace the Hamilton depression scale with either an existing instrument or a newly developed instrument would ultimately rest on consensus that such an instrument could capture more adequately the full spectrum of the depression construct and on empirical evidence of the new instrument’s superiority in detecting treatment effects.
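One way to see why item response theory bears on sensitivity to change is through the item information function, which describes how precisely an item measures at each level of the underlying trait. The Python sketch below uses the two-parameter logistic model, a dichotomous simplification (rating scale items are polytomous and would ordinarily call for a graded model). The discrimination and difficulty values are made up for illustration, and the function names are hypothetical.

```python
import math


def endorsement_probability(theta, discrimination, difficulty):
    """Two-parameter logistic model: probability of endorsing an item at trait level theta."""
    return 1.0 / (1.0 + math.exp(-discrimination * (theta - difficulty)))


def item_information(theta, discrimination, difficulty):
    """Fisher information of a 2PL item: a^2 * P * (1 - P)."""
    p = endorsement_probability(theta, discrimination, difficulty)
    return discrimination ** 2 * p * (1.0 - p)


# Two hypothetical items with the same difficulty but different discrimination.
for a in (2.0, 0.5):
    print(a, item_information(theta=0.0, discrimination=a, difficulty=0.0))
```

At the trait level matching the item difficulty, information equals the squared discrimination divided by four, so the more discriminating hypothetical item is far more informative there. Under this model, a scale built from such items registers small shifts in severity near that level with less measurement error.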
In conclusion, we have been struck by the marked contrast between the effort and scientific sophistication involved in designing new antidepressants and the continued reliance on antiquated concepts and methods for assessing change in the severity of the depression that these very medications are intended to affect. Effort in both areas is critical to the accessibility of new medications for patients with depression. Many scales and instruments used in psychiatry today are based on, or at least include, current DSM symptoms, and the measurement of depression should follow this trend. It is time to retire the Hamilton depression scale. The field needs to move forward and embrace a new gold standard that incorporates modern psychometric methods and contemporary definitions of depression.