To the Editor: Drs. Corruble and Hardy correctly note that the 17-item Hamilton depression scale was designed to assess the severity of depression in patients known to be depressed but contend that the preponderance of the studies we reviewed did not use the Hamilton depression scale in this manner, thus negating our claim that the instrument is invalid. We never claimed that the Hamilton depression scale is an invalid instrument. Indeed, we stated that “established criteria are met for convergent, discriminant, and predictive validity” (p. 2174). What we argued is that the Hamilton depression scale lacks factorial and content validity. Many of the studies we reviewed used general psychiatric samples, which likely included patients who would be expected to show negative affect even if they did not all meet the criteria for major depressive disorder. Nonetheless, if we accept Drs. Corruble and Hardy’s more restrictive list and examine factorial validity in the studies that used only depressed samples, we still find no evidence for this type of validity.
As for content validity, it remains the case that the Hamilton depression scale was based on an understanding of depression that is now more than 40 years old. The proliferation of long-form versions of the Hamilton depression scale attests to the perceived need for additional items to capture the full domain of depression. A modern depression severity instrument does not have to mesh perfectly with DSM-IV. Nonetheless, the DSM-IV symptoms are part of our current conceptualization of depression and should at least be evaluated for their potential contribution to the measurement of depression severity. Drs. Corruble and Hardy conclude, without supporting evidence, that the development of a better scale is unlikely. We pointed out that a better scale can already be found within the Hamilton depression scale items themselves. The various short forms all outperform the 17-item version, a consequence of the apparent multidimensionality of the full scale. Our review also noted the problems with simply adopting one of these short forms, and so we look forward to a new instrument that incorporates contemporary psychometric methods and current definitions of depression.
In their letter, Drs. Licht and Bech note that we failed to mention the Bech-Rafaelsen Melancholia Scale as a possible candidate to replace the Hamilton depression scale. The development of this instrument used item-response theory as well as a more comprehensive list of core symptoms. Unfortunately, space limitations prohibited a full discussion of all potential replacement instruments. We agree that the Bech-Rafaelsen Melancholia Scale is an excellent candidate for a “new gold standard,” and we look forward to research comparing this instrument with the other alternatives mentioned in our review—the Inventory of Depressive Symptomatology, the Montgomery-Åsberg Depression Rating Scale, and the measure currently being developed by the Depression Inventory Development Initiative.
Drs. Hsieh and Hsieh suggest that some of the psychometric terms and statistical indices used in our evaluation of the Hamilton depression scale may not be appropriate: for example, that the term “responsiveness” should be used instead of “predictive validity” to describe the capacity of the Hamilton depression scale to detect change in severity of depression. We agree that “responsiveness” is a more precise word, but we deliberately chose to use a more common and conceptually broader psychometric term. In the case of the predictive validity of the Hamilton depression scale, we focused on whether change scores predict change in depressive severity.
We agree that Pearson’s correlations are less than ideal for assessing the reliability of individual item-to-item comparisons, especially when the scaling differs. That said, all of the studies reviewed used this coefficient, so by necessity we relied on it. Note that Pearson’s r likely produces inflated estimates of association relative to weighted kappa, which “corrects” for chance association; as a result, many of the individual Hamilton depression scale items are likely more problematic than we concluded. We would, however, argue that Pearson’s r is appropriate for examining item-to-total correlations, because composite scores such as Hamilton depression scale total scores approach interval-level measurement. Pearson’s r is, moreover, widely used to compare individual items with total scores.
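The inflation we describe can be illustrated with a small sketch. The ratings below are invented, not Hamilton data: two hypothetical raters score the same patients on a 0–4 ordinal item, with rater B systematically one point above rater A. Pearson’s r rewards the perfect linear association, while quadratic weighted kappa penalizes the systematic disagreement.

```python
# Toy ratings on a hypothetical 0-4 ordinal item (invented data):
# rater B systematically scores one point above rater A.
a = [0, 1, 2, 3, 0, 1, 2, 3]
b = [1, 2, 3, 4, 1, 2, 3, 4]

def pearson_r(x, y):
    """Pearson product-moment correlation."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sx = sum((xi - mx) ** 2 for xi in x) ** 0.5
    sy = sum((yi - my) ** 2 for yi in y) ** 0.5
    return cov / (sx * sy)

def weighted_kappa(x, y, n_cat):
    """Quadratic weighted kappa: chance-corrected agreement for ordinal ratings."""
    n = len(x)
    # observed proportion for each (i, j) category pair
    obs = [[0.0] * n_cat for _ in range(n_cat)]
    for xi, yi in zip(x, y):
        obs[xi][yi] += 1 / n
    # marginal category proportions for each rater
    px = [x.count(i) / n for i in range(n_cat)]
    py = [y.count(j) / n for j in range(n_cat)]
    w = lambda i, j: (i - j) ** 2 / (n_cat - 1) ** 2  # quadratic disagreement weight
    num = sum(w(i, j) * obs[i][j] for i in range(n_cat) for j in range(n_cat))
    den = sum(w(i, j) * px[i] * py[j] for i in range(n_cat) for j in range(n_cat))
    return 1 - num / den

print(pearson_r(a, b))          # 1.0 -- perfect linear association despite the shift
print(weighted_kappa(a, b, 5))  # ~0.71 -- the systematic disagreement is penalized
```

The gap between the two coefficients (1.0 versus roughly 0.71 here) is the direction of bias at issue: item-level agreement looks better under Pearson’s r than under a chance-corrected index.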
Finally, Drs. Hsieh and Hsieh suggest that a higher benchmark of internal reliability (e.g., Cronbach’s alpha ≥0.90) should be employed when examining instruments that will be used for the assessment of individuals. Applying that benchmark, only two of 13 studies reported adequate internal reliability. We employed a more liberal benchmark (i.e., Cronbach’s alpha ≥0.70) primarily because we did not want to be accused of applying an overly strict criterion (1, 2). Drs. Hsieh and Hsieh, however, raise a potentially important distinction between group- and individual-level comparisons. The Hamilton depression scale may, in fact, be even weaker than our article suggested, with insufficient reliability for the assessment of depressive severity in individual patients.
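For readers unfamiliar with the statistic, the benchmark comparison works as sketched below. The item columns are made up for illustration and are not Hamilton scale data; the point is only that the same coefficient can clear one threshold while missing the other.

```python
def cronbach_alpha(items):
    """Cronbach's alpha from a list of item-score columns (one list per item).

    alpha = k/(k-1) * (1 - sum(item variances) / variance(total score))
    """
    def var(xs):  # sample variance
        m = sum(xs) / len(xs)
        return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)
    k = len(items)
    totals = [sum(scores) for scores in zip(*items)]
    return k / (k - 1) * (1 - sum(var(it) for it in items) / var(totals))

# Hypothetical item columns (5 respondents each); invented, not Hamilton data.
consistent = [[1, 2, 3, 4, 5]] * 3          # items in lockstep -> alpha = 1.0
mixed = [[1, 2, 3, 4, 5], [2, 1, 3, 4, 5]]  # mild disagreement -> alpha ~ 0.95

for name, data in [("consistent", consistent), ("mixed", mixed)]:
    a = cronbach_alpha(data)
    print(name, round(a, 3), "meets 0.70:", a >= 0.70, "meets 0.90:", a >= 0.90)
```

A scale that hovers near alpha = 0.80 would pass the ≥0.70 group-level benchmark in every study yet fail the ≥0.90 individual-assessment benchmark in every study, which is why the choice of criterion changes the verdict so sharply.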
Dr. Carroll argues that the Hamilton depression scale was developed to record the severity of clinical depressive illness, not to “quantitate the metaphysical construct called ‘major depression.’” We agree that the focus of the instrument is quantification of severity, not the fixing of a diagnosis, but we wonder how one establishes the severity of an illness without carefully considering the diagnostic and associated features of that illness. Quantifying severity does not require a perfect correspondence between the instrument and DSM-IV, but the instrument should be informed by changes in the diagnostic system. It seems unhelpful to retain an item such as “loss of insight,” which makes neither a conceptual nor an empirical contribution. Evaluating the potential contribution of more recently noted symptoms would better serve the measurement of depression. For example, “loss of concentration” is a widely recognized symptom that appears in DSM-IV but not in the Hamilton depression scale. Concern with the outdated item content of the Hamilton depression scale surely drives the proliferation of long-form versions.
Dr. Carroll also argues that the Hamilton depression scale is a “clinimetric index . . . focused on the patient’s burden of illness” and that the wide range of symptoms covered is “consistent with the pleomorphic presentations of clinical depression.” We disagree. Hamilton and colleagues stated clearly that the Hamilton depression scale is a structured rating scale designed to assess depression severity. Patients do present with a wide range of symptoms, but a rating scale for depression should be limited to the symptoms that contribute to its measurement. For this reason, we do not agree that developing “a new scale based on contemporary concepts of major depression is unrealistic.” DSM-IV is far from perfect, but it does represent the official definition of the construct whose severity we are purporting to measure, while it also identifies several potentially important symptoms not included on the Hamilton depression scale. When so much else has changed in our knowledge both of depressive symptoms and of psychometrics, it makes little sense to argue that our best effort occurred in the late 1950s. “Biomarkers” and “endophenotypes” may be a desirable long-term goal, but it is not necessary to use outdated instruments in the interim.
Finally, Dr. Carroll argues that the “Hamilton depression scale is not surpassed on performance by any other scale.” On the contrary, the Hamilton depression scale is, in fact, surpassed by subscales composed of its own items. If a 6-item subscale outperforms the 17-item full scale, it would appear that a majority of the items are actually compromising the use of the total score. One study (3) found that the use of a Hamilton depression scale subscale would allow sample sizes to be cut by one-third without compromising power. We never claimed that the Hamilton depression scale was insensitive to change, and the predictive validity section of our article (pp. 2172–2173) reviews several studies that demonstrate the capacity of the Hamilton depression scale in this regard. What we did argue was that the multidimensional structure of the Hamilton depression scale makes the evaluation of specific treatment effects difficult.
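The arithmetic behind a one-third saving is easy to sketch. This is our back-of-envelope illustration, not the cited study’s calculation, and the specific numbers (a full-scale effect size of d = 0.5 and a 1.5-fold gain in squared signal for the subscale) are assumed for the example.

```python
import math

# Standard two-arm sample-size approximation for a two-sample t-test:
#   n per arm ~ 2 * (z_alpha/2 + z_beta)^2 / d^2,
# where d is the standardized drug-placebo difference.
Z_ALPHA = 1.96    # two-sided alpha = 0.05
Z_BETA = 0.8416   # power = 0.80

def n_per_arm(d):
    return math.ceil(2 * (Z_ALPHA + Z_BETA) ** 2 / d ** 2)

d_full = 0.5                      # assumed effect size on the full 17-item scale
d_sub = d_full * math.sqrt(1.5)   # a subscale with 1.5x the squared signal-to-noise

print(n_per_arm(d_full))  # 63 patients per arm
print(n_per_arm(d_sub))   # 42 patients per arm -- exactly one-third fewer
```

Because required n scales with 1/d², even a modest improvement in how cleanly a subscale captures treatment-sensitive variance translates into a substantial reduction in patients recruited.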
Dr. Bech and associates from the Depression Rating Scale Standardization Team are concerned with our view that the GRID-HAMD was “virtually unchanged from the original,” arguing that their group, in fact, implemented many changes. We feel that this quotation may have been taken out of context. We wrote that “the GRID-HAMD content is virtually unchanged from the original” (p. 2174). In fact, we acknowledge that the GRID-HAMD offers much that we believe is necessary for construction of a better measure of depression severity, and we were concerned that the effort would be hampered by retention of the original item content. That said, we have learned that recent efforts of the Depression Rating Scale Standardization Team have been directed toward carefully developing new items as well, a development that we applaud.