The purpose of this study was to develop and test the feasibility, reliability, and validity of the First-Episode Psychosis Services Fidelity Scale (FEPS-FS). Fidelity refers to the degree of implementation of an evidence-based practice (1). Fidelity scales provide a list of objective criteria by which a program or intervention is judged to adhere to a reference-standard intervention. Such scales have multiple applications in research, quality management, and accreditation (2–4).
The application of fidelity scales in first-episode psychosis services (FEPS) has been limited. In the United Kingdom, the EDEN study (Evaluating the Development and Impact of Early Intervention Services) developed a fidelity scale by using an expert clinician consensus process, and the scale was refined by researchers (5). In the United States, EASA (Oregon Early Assessment and Support Alliance) developed a fidelity scale by using a process of expert committees; the scale has been used in support of program implementation and quality control (6). Also in the United States, the RAISE Connections program (Recovery After an Initial Schizophrenia Episode) reported on fidelity by using routinely collected program data from two program sites (7). None of these three scales was developed with a three-step knowledge synthesis process comprising systematic reviews, evidence ratings, and international expert consensus.
Fidelity scales can be developed on the basis of a successful program with proven effectiveness or from systematic reviews of the literature, provided there is sufficient available research (8). We determined that there was sufficient evidence to develop a scale based on the evidence for the effectiveness of individual components of FEPS and on evidence from robust large-scale randomized controlled trials of the effectiveness of FEPS (9–12). In contrast to a scale developed from a single effective program, a scale derived from a comprehensive review of the literature can be more easily applied to a broad range of programs. We designed this study to develop, refine, and test the reliability and face validity of a fidelity scale for FEPS based on systematic review, ratings of evidence, and international expert consensus.
Methods
We first identified 32 essential components of FEPS (13) and then identified characteristics of effective treatment teams from a systematic review of the mental health team literature. To transform these essential components into a useful fidelity scale, a group consisting of a fidelity scale expert, a health services researcher, experts in first-episode psychosis, and an epidemiologist converted the components into operational definitions linked to specified anchor points on a 5-point scale. This was achieved through an iterative process of reviewing evidence on team functioning and component integration and, where possible, evidence supporting the “dosing” of interventions. This process resulted in a 32-component scale comprising 22 individual treatment components and ten team-based components. Each component is rated from 1 to 5, and the total fidelity score is obtained by summing the item ratings.
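To make the scoring rule concrete, the following minimal Python sketch sums item ratings into a total fidelity score and expresses it as a percentage of the total potential score; the component names and ratings shown are hypothetical and are not drawn from the scale itself.

```python
# Hypothetical sketch of FEPS-FS scoring: each component is rated 1-5,
# and the total fidelity score is the sum of the item ratings.
ratings = {
    "assigned_case_manager": 5,
    "family_education": 4,
    "cbt_by_trained_therapist": 3,
    # ...remaining components omitted for brevity
}

total = sum(ratings.values())             # total fidelity score
max_possible = 5 * len(ratings)           # total potential score
percent = 100 * total / max_possible      # percentage of total potential score
print(f"Total: {total}/{max_possible} ({percent:.0f}%)")
```

The same arithmetic applies to the individual-level version described below, with ratings anchored to counts of services received rather than to percentages of patients served.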
We also developed an individual-level version of the scale that includes only the intervention components. In this version, the component descriptors remain the same, but the ratings change from the percentage of patients who received a service to the number of services received by an individual patient. The individual scale is designed to permit comparisons between team-based and non–team-based services. Finally, we developed a manual to guide the process of fidelity assessment.
The next step in development was a study of the feasibility, reliability, and face validity of the draft FEPS-FS. The project received institutional review board approval from the University of Calgary.
We first assessed one program with the first draft of the scale and the draft manual. Information on the program had been obtained by video-recording interviews with key informants, collecting administrative data, and reviewing ten randomly selected health records. The results were presented to and discussed with the other investigators at a two-day meeting. This discussion resulted in significant modifications of both the scale and the rating manual; for example, we specified that clinical nurse practitioners as well as psychiatrists can prescribe medications and that cognitive-behavioral therapy (CBT) should be delivered by a therapist with formal training in CBT or by a therapist trained to follow a formal manual based on CBT principles, such as Individual Resiliency Training (a component of the RAISE protocol [14]).
The scale was further tested during fidelity site visits in the first six months of 2015 by using combinations of two or three raters, all of whom had participated in the assessment of the first program. Site visits have been established as a best practice for fidelity assessment (15). The four U.S. sites included two small rural programs embedded in community mental health centers and two large downtown urban programs. Each U.S. site assessment was conducted by the same three raters. At the U.S. sites, the assessment included administration of both the FEPS-FS and an existing first-episode fidelity scale, the EASA scale (6), which is used routinely as part of a statewide quality improvement process conducted by a state-funded technical assistance center. The two Canadian sites were located in urban areas, and each of those fidelity visits was conducted in one day.
All fidelity visits included a review of program documents, including Web sites, policies and practices, presentations, and handouts for community partner education and for client and family education. Fidelity raters examined administrative data, finding that although the content varied across programs, the data always included current staffing, annual admissions and discharges, and time from referral to face-to-face meetings. During the fidelity visit, the raters observed a team meeting; reviewed ten health records; and met with the program manager and senior administrator, clinicians, psychiatrists or other prescribers, family members, and clients. Raters completed their ratings independently during fidelity visits. Within a week of the site visit, raters held a teleconference, reported their individual ratings, and ultimately reached a consensus rating. The interrater reliability data were based on the independent ratings completed before consensus ratings were determined. The process of arriving at a consensus often generated suggestions for modifying item component descriptors or rating scales.
Interrater reliability of the FEPS-FS was assessed by calculating the intraclass correlation coefficient (ICC) between raters’ independent ratings on individual items, by using a two-way random-effects model. We first calculated the ICC over the 31 items for each site. Because the raters for the four U.S. sites were the same, we also calculated the ICC of the overall fidelity scores across the four sites. The ICC was calculated by using Stata, version 14.0.
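The study computed ICCs in Stata, version 14.0; purely for illustration, a roughly equivalent computation in Python is sketched below by using the pingouin library, with hypothetical ratings arranged in long format (one row per item-rater pair). In pingouin’s output, ICC2 corresponds to a two-way random-effects model with single raters.

```python
# Illustrative ICC computation (the study itself used Stata 14).
# Ratings are hypothetical: three raters ("A", "B", "C") independently
# rating three items on the 1-5 scale.
import pandas as pd
import pingouin as pg

data = pd.DataFrame({
    "item":  [1, 1, 1, 2, 2, 2, 3, 3, 3],
    "rater": ["A", "B", "C"] * 3,
    "score": [4, 5, 4, 3, 3, 4, 5, 5, 5],
})

icc = pg.intraclass_corr(data=data, targets="item",
                         raters="rater", ratings="score")

# "ICC2" is the two-way random-effects, single-rater estimate.
print(icc.loc[icc["Type"] == "ICC2", ["ICC", "CI95%"]])
```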
The final version of the FEPS-FS was compared with the other fidelity scales: the EDEN fidelity scale (5), the EASA fidelity scale (6), and the RAISE-C monitoring tool (7). This descriptive comparison covered the development processes used, the number of components included, the items or domains assessed that were common to all scales, and the number of items or domains shared by the FEPS-FS and each of the other scales. The component descriptions across the scales were not identical; therefore, the term “domain” is used here to indicate that the same underlying concept is being addressed.
Results
The basic structure of the scale did not change during the study; however, two items were dropped and one was added, yielding a 31-item final version. In addition, the wording of component descriptors was clarified, and some changes to the rating descriptors were made. [The full final version of the 31-item scale is available in an online supplement to this report.] A manual was developed to support the raters in making reliable ratings of the descriptors and is available on request from the first author.
Collecting data from multiple sources in order to score the FEPS-FS proved feasible, and raters integrated all sources of data to reach their best estimate for each component. The interrater reliability of ratings calculated at the item level was high: the ICC across items, based on the independent ratings made before the consensus discussion, was .842 (95% confidence interval [CI]=.795–.882) overall, with per-site values ranging from .741 (CI=.590–.854) to .972 (CI=.951–.986). These results were achieved for the four programs rated by the same three raters. The ICC across the overall scores of the four U.S. sites was .928 (CI=.581–.995). [Means and standard deviations for each item are presented in the online supplement.]
Programs considered to have adequate fidelity had a mean consensus-rated score of 86% of the total potential score (range 81%–89%). The single program considered not to meet standards according to an established fidelity scale (EASA) scored 70%. We took this result to indicate that the FEPS-FS was capable of differentiating between programs of different quality. We also considered that this finding provided early support for a potential cutoff of 80% of the total potential score as a satisfactory rating, equivalent to an average rating of 4 on each 5-point item.
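As a quick check of the arithmetic behind this proposed cutoff, assuming the 31-item final scale: 80% of the total potential score works out to an average rating of 4 per item.

```python
# Worked arithmetic for the proposed 80% fidelity cutoff (31-item scale).
n_items = 31
max_per_item = 5

max_total = n_items * max_per_item        # 155, the total potential score
cutoff_total = 0.80 * max_total           # 124.0
avg_at_cutoff = cutoff_total / n_items    # 4.0, an average rating of 4 per item

print(max_total, cutoff_total, avg_at_cutoff)
```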
Content validity was supported by comparison with other fidelity scales (Table 1). A total of 17 components are assessed by all four scales. Because of its shorter length, the FEPS-FS has the highest proportion of components common to all scales (53%). In addition, the FEPS-FS has the highest proportion of overlap with each of the other scales (average of 75%); that is, the FEPS-FS has more components in common with each of the other scales than any of them has in common with the others. Each fidelity scale was developed by using a different process, and each used different methods for undertaking the fidelity assessment [see online supplement].