To the Editor: We agree in part with Dr. Janca that our very high levels of interrater reliability on the DSM-IV axis V clinician rating scales may have been influenced by extensive training, high motivation on the part of the clinicians, and the clinicians’ working within a larger research protocol. It is also fairly common for interrater reliability on a variety of clinical conditions or constructs to be higher for raters at the same site than for raters across sites (1, 2). However, our results are quite similar to those from a number of other studies involving the Global Assessment of Functioning Scale and its predecessor, the Global Assessment Scale (3–10). This prior research has demonstrated interrater reliability for the Global Assessment of Functioning Scale in the “good” or “excellent” range (ICC/κ=0.60–0.74 and ICC/κ=0.75 or more, respectively [11]). In addition, the WHO Short Disability Assessment Schedule, which has subcomponents similar to those of the Global Assessment of Relational Functioning Scale and the Social and Occupational Functioning Assessment Scale, has shown “good” interrater reliability (ICC=0.62) in at least one multisite field trial (12).
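The ICC benchmarks cited above can be made concrete with a small computation. The following is a minimal sketch, not drawn from any of the cited studies: it assumes a two-way random-effects, absolute-agreement, single-rater ICC (Shrout and Fleiss’s ICC(2,1)) and a hypothetical set of GAF-style scores from two raters; the band cutoffs are those quoted in the text (“good” 0.60–0.74, “excellent” 0.75 or more).

```python
import numpy as np

def icc_2_1(ratings):
    """Two-way random-effects, absolute-agreement, single-rater ICC.

    ratings: (n_subjects, n_raters) array of scores.
    """
    x = np.asarray(ratings, dtype=float)
    n, k = x.shape
    grand = x.mean()
    row_means = x.mean(axis=1)
    col_means = x.mean(axis=0)
    # Mean squares from a two-way ANOVA without replication
    msr = k * np.sum((row_means - grand) ** 2) / (n - 1)   # subjects
    msc = n * np.sum((col_means - grand) ** 2) / (k - 1)   # raters
    resid = x - row_means[:, None] - col_means[None, :] + grand
    mse = np.sum(resid ** 2) / ((n - 1) * (k - 1))         # residual
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)

def band(icc):
    # Cutoffs quoted in the letter: "good" 0.60-0.74, "excellent" >= 0.75
    if icc >= 0.75:
        return "excellent"
    if icc >= 0.60:
        return "good"
    return "below good"

# Hypothetical GAF-style scores from two raters on six patients
scores = [[55, 58], [70, 71], [40, 45], [62, 60], [80, 78], [51, 50]]
print(band(icc_2_1(scores)))  # prints "excellent"
```

As the sketch illustrates, two raters who agree closely relative to the spread of patient severity yield an ICC in the “excellent” range, which is the pattern our study reported.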
Furthermore, we disagree with Dr. Janca’s conclusion that our findings may not represent a true psychometric evaluation of these scales. We base this disagreement on three related issues that warrant further research in the assessment of multiaxial psychiatric functioning. These issues are particularly relevant to the rating of patient-clinician interactions and interview narratives in psychology and psychiatry.
First, the high level of agreement between the two raters in our study of the DSM-IV axis V scales suggests that these measures can be used to rate reliably the general severity of psychopathology and relational, social, and occupational functioning. The specific rating criteria developed for the DSM-IV axis V scales appear sufficiently clear to produce high levels of interrater reliability, and the extensive supervised training of raters in the use of these scales likely contributed to the high level of agreement. The low interrater reliability coefficients for the DSM-IV axis V scales found in other studies should therefore not be assumed to reflect poor coding criteria or scale definitions; they may instead reflect poor or inadequate rater training.
While time constraints may prohibit such extensive training, it provides an optimal level of familiarity with the DSM-IV axis V scales and helps raters make subtle distinctions between scores before rating the patients included in the data analyses. The excellent interrater reliability coefficients achieved in our study suggest that the general severity of psychopathology and relational, social, and occupational functioning can be reliably coded, and they underscore the importance of training judges before coding begins.
Second, we encourage future investigators to examine the differential impact of interview length on reliability. The interviews used in most reliability field trials last from approximately 45 minutes to 2 hours; the ratings in our original study were based on two sessions, each lasting approximately 3 hours. The higher levels of interrater reliability found in our work may be related to the clinicians’ spending this additional time interacting with the patients. The implications of interview length for reliability have rarely been discussed in the psychiatric literature, and given current pressure from third-party payers and the reduced support for more thorough evaluations (13, 14), this seems an especially important issue. If clinicians are unduly limited in the time spent on an assessment, lower reliability, misdiagnosis, and potential problems for treatment may result.
In addition to the extra time the clinicians spent, both in training and in interacting with their patients, our study also focused parts of the interview on key relational episodes from the patients’ lives. This focus on patient narratives during the interview (15), together with the organization of the interview and feedback session according to a therapeutic assessment model (16), may have contributed to the higher reliability of the interview and videotape raters. Rather than focusing simply on the description of psychiatric symptoms or on a structured interview (i.e., the Structured Clinical Interview for DSM), the patients were encouraged to describe and explore the relational interactions (thoughts, feelings, and fantasies) associated with the appearance of their symptoms. In this manner, the clinicians attempted to enlist the patients in clarifying the impact of these experiences, both past and present, on their functioning. This relationally based exploration helped the clinicians gain a better understanding of the personal meaning of life experiences related to psychiatric symptoms and explore prior successful and unsuccessful ways of coping with problems or symptoms.
The amount of prerequisite training on any scale applied to interview data (or to any patient-clinician interaction) will invariably affect the subsequent reliability of that scale or measure. It is also possible that additional time and a relational focus during the interview can help clinicians make more reliable assessments of the general severity of psychopathology and relational, social, and occupational functioning. Perhaps clinicians examining these domains should first train to an acceptable criterion of accuracy on a given scale, spend additional time with the patient, and then examine psychiatric symptoms and relational, social, and occupational functioning within an interpersonal and narrative context. Conversely, when adequate prerequisite training, involved patient-clinician interaction, and exploration of functioning within a relational context are absent, the true psychometric properties of any clinician rating scale may be underestimated.