In nearly all domains of medicine, quantified measures of outcome are used to characterize changes in a patient's symptoms during the course of treatment (for example, monitoring changes in blood pressure before and after the prescription of medication to determine its efficacy). Psychiatry has lagged behind other medical disciplines with respect to using standardized assessments of outcome to guide clinical decision making (1,2). In mental health care, assessments of outcome are often based on unstructured conversations between the client and prescriber that yield impressionistic judgments of progress rather than quantifiable data (1).
The Group for the Advancement of Psychiatry recommends that health care systems implement standardized outcome assessments for individuals with mental illnesses (2). This approach, known as measurement-based care, has been found to be both feasible and effective (3). Standardized self-report measures or brief symptom scales have been suggested as practical ways to monitor changes in key symptoms in routine practice (1–3).
Use of brief symptom rating scales is a component of the Texas Implementation of Medication Algorithms (TIMA), a disease management program that combines specific medication recommendations (including dose and duration) with standardized brief assessments for monitoring outcomes among individuals with psychiatric disorders who are treated by publicly funded mental health organizations in Texas. TIMA is based on the Texas Medication Algorithm Project (4) and was used statewide during the period of this work. For individuals with schizophrenia, two semistructured assessments are administered to track changes in symptomatology and guide clinical decision making: the Positive Symptom Rating Scale (PSRS) and the Brief Negative Symptom Assessment (BNSA) (5).
The extent to which these brief structured interviews can be reliably applied in routine clinical practice settings by direct-care staff is unclear. We investigated this question in a large, publicly funded community mental health center (CMHC).
Our project had three objectives: first, to determine the level of interrater reliability among case management staff using the PSRS and BNSA; second, to provide training to improve interrater reliability and prevent rater drift; and third, to observe case managers during the administration of these assessments to evaluate interviewing techniques and appropriate use of anchor points (6).
Methods
Participants were 82 direct-care staff responsible for administering the PSRS and BNSA every three months to track symptom changes among their patients. Twenty-eight participants had master's degrees, and 52 had bachelor's degrees; the level of education was not documented for two participants. The mean±SD number of years the staff had worked at the CMHC was 4.18±4.05. Phases 1 to 4 of the project, described below, took place between November 2006 and August 2008. Data were analyzed with the SAS statistical package. Under the regulations of the institutional review board of the University of Texas Health Science Center, the project was not considered human subjects research because it was part of a quality improvement initiative at the CMHC; therefore, informed consent was not required.
In phase 1, direct-care staff participated in an initial rater assessment to determine whether their ratings on the combined brief scales agreed with established gold standards. Participants viewed and scored three brief interviews. To attain acceptable reliability, a rater needed to be within 1 point of the criterion rating on 80% of items.
In phase 2, experts provided detailed individual training to staff not meeting criteria for reliability. Anchor points of the scales were reviewed, and a detailed explanation of the criterion ratings was provided. Trainers also conducted on-site observation of direct-care staff to evaluate interviewing techniques and the application of the rating scales with actual consumers. The focus of the observation was on the staff member's ability to elicit clear statements regarding the presence or absence of symptoms, their frequency and severity, and the extent to which they interfered with the client's daily life.
In phase 3, raters scored a series of new interviews to determine whether the training improved the reliability of raters who had not met the criterion and to assess rater drift among raters already certified as reliable.
In phase 4, we used a standardized rater training program for all new hires at the agency.
The PSRS assesses the four psychosis items from the expanded version of the Brief Psychiatric Rating Scale (BPRS) (7): hallucinations, unusual thought content, conceptual disorganization, and suspiciousness. The BNSA contains four items drawn from the Scale for the Assessment of Negative Symptoms (8) and the Negative Symptom Assessment (9): prolonged time to respond, reduced social drive, poor grooming and hygiene, and blunted affect. Each rater was identified as meeting or failing to meet the criterion for reliability for each item, and overall reliability was calculated on the basis of the eight items of the two brief scales combined.
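To make the reliability criterion concrete, the following minimal sketch (written in Python rather than the SAS package actually used for analysis) shows one way to compute whether a rater passes: the percentage of item ratings, pooled across the eight PSRS and BNSA items of the interviews viewed, that fall within 1 point of the gold-standard (criterion) ratings must reach 80%. The function names and example ratings are illustrative assumptions, not part of the project's materials.

# Illustrative sketch only (not the project's actual analysis code).
# A rater passes if >= 80% of item ratings, pooled across all rated
# interviews (8 items each), fall within 1 point of the criterion ratings.

def percent_within_one_point(rater_scores, criterion_scores):
    """Percentage of items rated within 1 point of the criterion rating."""
    assert len(rater_scores) == len(criterion_scores)
    hits = sum(abs(r - c) <= 1 for r, c in zip(rater_scores, criterion_scores))
    return 100.0 * hits / len(rater_scores)

def meets_reliability(rater_scores, criterion_scores, threshold=80.0):
    """True if the rater meets the 80% agreement criterion."""
    return percent_within_one_point(rater_scores, criterion_scores) >= threshold

# Hypothetical example: one taped interview, 8 items (4 PSRS + 4 BNSA),
# each rated on an ordinal severity scale.
rater = [3, 4, 2, 5, 1, 2, 3, 4]
criterion = [3, 5, 2, 3, 1, 2, 4, 4]  # gold-standard ratings for the tape
print(percent_within_one_point(rater, criterion))  # 87.5
print(meets_reliability(rater, criterion))  # True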
Results
Of the 82 direct-care staff members, 57% (N=47) met criteria for rating reliably, and 43% (N=35) did not. Meeting criteria for reliability was not related to degree attained or to years of service. The mean±SD reliability score for individuals meeting criteria was 90.1%±6.4%; for those not meeting criteria, the average score was 69.8%±8.9%.
Phase 2 observation of individuals who did not reach reliability at baseline revealed that several were not using the structured interview questions or were not consulting the anchor points when making ratings. Trainers indicated that the interview questions and anchor points should be consulted in every case. Trainers also reminded raters to preface questions with statements that reminded consumers of the time frame and to obtain information on each symptom's frequency, severity, and interference with daily functioning before moving to the next question.
In phase 3, there were three opportunities for staff members to review taped interviews. For each participant, scores were averaged across the interviews that he or she rated (38% completed one tape, 40% completed two tapes, and 22% completed all three). Of the 35 individuals who did not reach the reliability criterion in phase 1, 29 participated in retesting; of these, 55% (N=16) achieved reliability, and 45% (N=13) remained below criterion. Average reliability on this retest was 81.1%±11.8%. Therefore, of the original 82 participants, 77% (N=63) reached reliability on the brief scales component of training.
With respect to rater drift, of the original 47 individuals who met the criterion for reliability at baseline, 36 viewed additional tape-recorded interviews that used the brief scales. About 67% (24 of 36) maintained reliability, and the remaining 33% (N=12) did not.
In phase 4, we rolled out standardized rater training for all new employees. Ten new individuals were trained in our formalized program; 80% (eight of ten) met reliability criteria after this training, and the remaining two missed the criterion by 1 percentage point.
In all phases, all raters were made aware of rating deficits. There was some variability in the reliability of specific scale items. Overall, item 1 on the BNSA, “Prolonged time to response,” had the highest pass rate (that is, the percentage of raters scoring within 1 point of the criterion rating; 95.9%), and item 2 on the BNSA, “Unchanging facial expression,” had the lowest (68.7%). Item 1 has very clear behavioral criteria for rating, whereas item 2 is based solely on observation and may be the most subjective item on the brief scales. In general, individuals with more disorganized speech and behavior were the most difficult to rate, with average failure rates across items ranging between 22% and 24%.
Discussion and conclusions
It is important for psychiatry to move toward measurement-based care, although doing so presents a number of challenges. Results of this study suggest that rating scales can be reliably applied by a majority of direct-care staff. However, a training program is needed to ensure the reliability of these ratings. Moreover, rater drift must be considered, and periodic recalibration of raters is important (7). Some staff members were not able to use the rating scales reliably even after the standard training was provided; whether these individuals would improve with more targeted or longer training would need to be investigated in a follow-up program.
Other approaches to measurement-based care include having the physicians rather than case management staff conduct the ratings or using self-report measures of symptomatology (1,3). Although self-report may be feasible for individuals with schizophrenia, problems with insight and delusional thinking may interfere with validity (10).
The move to measurement-based care in psychiatry is important to ensure that we are helping individuals attain the most favorable outcomes. Once measurement-based care is implemented, steps must be taken to ensure that the measures used are reliable and valid as administered by direct-care staff.
Acknowledgments and disclosures
This work was supported in part by grant R24-MH072830 from the National Institute of Mental Health.
Dr. Miller reports receiving grant funds from Pfizer, Inc., and he is a consultant for RBM, Inc. The other authors report no competing interests.