Systematic reviews have summarized extensive evidence supporting the superiority of intensive, team-based care for treating a first episode of a psychotic disorder (1). In the United States, a large cluster-randomized controlled trial demonstrated the feasibility of delivering such programs in community mental health services while retaining their clinical superiority (2).
In 2014, the U.S. Congress approved legislation that mandated states to direct 5% of funding (increased to 10% in 2016) received through the Mental Health Block Grant (MHBG) to early intervention programs for people with first-episode psychosis (3). This 10% set-aside fund has contributed to the establishment of 236 coordinated specialty care (CSC) programs across the United States (4). A self-report survey of CSC program leaders identified 107 measures used to assess fidelity to CSC components in first-episode psychosis services (4). Most of these measures were not formal fidelity scales, suggesting the need for a reliable, valid, and feasible fidelity scale.
Fidelity scales have been developed in other mental health services to assess implementation and the degree to which programs adhere to evidence-based practices (5). An international review identified several fidelity scales for first-episode psychosis services (FEPSs) (6); however, most research studies on the effectiveness of FEPSs have lacked the fidelity assessments that would support successful implementation. In the Recovery After an Initial Schizophrenia Episode study, Mueser et al. (7) comprehensively assessed fidelity for the treatment components by using a diverse set of assessment procedures, including direct observation of clinicians' practice through videos and adaptation of assessment methods used in other psychosocial interventions for individuals with severe mental illness. Such labor-intensive methods are feasible in research studies but not for wide-scale assessment of fidelity in routine practice. We sought to develop a scale that would be useful for both research and quality-improvement purposes.
This study is drawn from a national prospective study of first-episode programs. The research team invited 250 CSC programs across the United States receiving MHBG 10% set-aside funds to participate and ultimately enrolled 36 sites. The sites were selected to ensure geographic diversity, rural representation, and diverse service models. Participating programs received a payment to cover the costs of study participation. One component of this study was assessing fidelity. The fidelity measure that the research team selected was the First-Episode Psychosis Services Fidelity Scale (FEPS-FS) (8). Unlike fidelity scales based on specific program models (9, 10), the FEPS-FS was developed systematically, following a standardized methodology for constructing fidelity scales (11). This process involved a systematic review to identify program components, an assessment of the level of evidence supporting these components, and an international expert consensus process to identify the essential components (12). The scale has been tested for face validity, interrater reliability, and feasibility in programs in Canada and the United States (12, 13). It has not yet been tested for predictive validity.
During the time when Addington et al. (8, 12) were developing and testing the FEPS-FS, Heinssen and colleagues (14) proposed the CSC model for first-episode psychosis programs. This model includes 12 key components: six key roles and clinical services and six core functions. The model has provided a shared framework for the emerging expansion of first-episode programs in the United States in response to the MHBG 10% set-aside funding. If validated, this framework promises to promote communication among researchers and program managers, permitting standardized comparisons among programs. A conceptual framework alone, however, is not sufficient for operationalizing a program model; a fidelity scale is also needed to provide a roadmap for program leaders and to measure the components of the model.
In this article, we examine the interrater reliability of a revised fidelity scale, the FEPS-FS–Revised (FEPS-FS-R), and evaluate the feasibility of its application in a remote-assessment process. We also describe the variability of fidelity across the sample and the degree to which the sites adhered to the CSC components. Because clusters of programs received training and support from different groups, we also explored potential differences in fidelity among these programs.
Methods
Overview
As part of a 3-year project evaluating a national sample of first-episode psychosis programs funded through the MHBG 10% set-aside, we conducted a psychometric study of the FEPS-FS-R. During the first project year, the fidelity research team modified the FEPS-FS, revised the fidelity assessment manual, and trained the primary fidelity rater on the assessment procedures through a 2-day training program: the first day comprised a review of the scale and the manual, and the second day involved rating a real program. Follow-up training comprised joint interviews of site staff and discussions about individual study site ratings.
We made minor modifications to the scale content on the basis of the CSC framework and feedback from our funders and research collaborators. Major modifications concerned the use of remote-assessment procedures. Addington et al. (8) had worked with experienced first-episode psychosis clinicians as assessors to develop and validate onsite observational procedures for conducting the fidelity assessment. For the current project, we redesigned the procedures to accommodate large-scale fidelity assessment conducted without a site visit. We developed a structured interview that could be conducted by phone or in person, procedures for abstracting health record data that could be uploaded to a secure website, and remote training for onsite health record abstractors.
During the second project year, we piloted the new assessment procedures at the study sites. We further improved the scale during this phase on the basis of feedback from multiple sources, including regular discussions among the fidelity research team after individual and joint fidelity assessments, the results of an interrater reliability study, and findings from another study of the FEPS-FS (13). During the third project year, we assessed the 36 study sites using the revised scale and assessment procedures.
Study Sites
Thirty-six sites were selected from the 250 sites receiving MHBG 10% set-aside funds to include all U.S. regions and to represent rural, urban, and suburban areas.
Data Collection Procedures
For each site, a trained fidelity assessor conducted a fidelity review and completed the fidelity ratings. Three researchers served as fidelity assessors, completing 26, six, and four reviews, respectively. The data used to rate fidelity included staff interviews, a review of health records, and program documents. We interviewed at least four staff members at each site, including the team leader, a prescriber, a case manager or care coordinator, and a supported employment or education specialist. Each site identified a local health record abstractor to complete the health record review. To enhance the efficiency of data collection and ensure that the fidelity assessor obtained the best available data, we developed procedures to educate the site's team leader and health record abstractor.
First, a research assistant contacted the site’s primary contact (typically the team leader) to schedule a team leader orientation, consisting of a 30-minute, one-on-one webinar with the study coordinator, outlining fidelity review procedures and requirements. The fidelity review team also provided the team leader with a fidelity scale guide for site leaders. Next, each site’s health record abstractor participated in a 30-minute webinar with the study coordinator, explaining how to complete the health record checklist. The fidelity review team also provided a health record abstractor guide (included in the site leader guide). The health record abstractor then recorded deidentified client information extracted from 10 randomly selected client charts. Each site also identified a staff member to upload required documents for the fidelity review and the completed health record checklist. These materials were uploaded to a secure document transfer portal and retrieved by the fidelity research team.
The administrative data and the health record review required transcribing factual information, such as staffing levels and the medications and dosages prescribed for individual clients, directly from the site's records. Apart from clerical errors, the accuracy of these data can be presumed to be high. Other data were cross-checked during the interviews. Some health record review data, such as the number of psychoeducation sessions attended, did require record reviewer judgment. The FEPS-FS-R review manual further details the data collection and fidelity item rating procedures.
Fidelity Measure
The FEPS-FS-R consists of 33 distinct items, with supported employment and supported education labeled 27A and 27B. Each item is rated on a 5-point behaviorally anchored continuum, with a rating of 5 indicating full adherence to the evidence base (or best practice guidelines) and a rating of 1 representing a substantial lack of model adherence. Ratings of 4, 3, and 2 represent equally spaced increments between these two extremes. To improve the reliability of the ratings, we designed the fidelity scale to consist of items that require little clinician inference.
Following the scoring conventions used with standardized evidence-based practice fidelity scales (15), we summed the item ratings to obtain a total score, with values ranging from 33 to 165. A total score of ≥149 (i.e., ≥90% of the maximum) is considered "excellent" fidelity, a score between 132 and 148 (80%–89%) "good" fidelity, a score between 116 and 131 (70%–79%) "fair" fidelity, and a score of <116 (<70%) "poor" fidelity.
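As an illustration of these scoring conventions, the following minimal Python sketch maps a set of item ratings to a total score and fidelity category (the function name and example ratings are hypothetical, not part of the published scale):

```python
def feps_fs_r_total(item_ratings):
    """Sum 33 item ratings (each 1-5) and map the total score
    to the fidelity categories described above (maximum: 165)."""
    assert len(item_ratings) == 33
    assert all(1 <= r <= 5 for r in item_ratings)
    total = sum(item_ratings)  # possible range: 33-165
    if total >= 149:    # >=90% of the maximum
        category = "excellent"
    elif total >= 132:  # 80%-89%
        category = "good"
    elif total >= 116:  # 70%-79%
        category = "fair"
    else:               # <70%
        category = "poor"
    return total, category

# A program rated 4 on every item totals 132, i.e., "good" fidelity.
print(feps_fs_r_total([4] * 33))  # -> (132, 'good')
```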
Reliability Substudy
During the final project year, four independent raters used the 33-item FEPS-FS-R to rate a convenience sample of five sites, independently reviewing each site's administrative and health record data. Each rater also attended the phone interviews with the four staff members at each site. These interviews were conducted by an interviewer who was also one of the raters. Each rater then independently rated fidelity on the basis of data from these three information sources.
Data Analysis
For the interrater reliability substudy, we used intraclass correlation coefficients (ICCs) for each of the five sites and across the total scores to assess reliability. We selected a two-way, mixed-effects model (single-rater type and absolute agreement) (16) and conducted the analyses with IBM SPSS Statistics, version 25. We used the single-rater type because the standard fidelity assessment protocol calls for a fidelity rating by one person. We used the absolute agreement option for estimating reliability because the fidelity scale ultimately will be used to establish benchmarks, which depend on specific cutoff scores. We assessed feasibility by asking these sites about the time required for data collection and submission, and we recorded the time required for the fidelity interviews.
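For readers without SPSS, this coefficient can be reproduced with open-source tools. The sketch below uses the pingouin Python library on simulated long-format ratings (the column names and data are illustrative assumptions, not the study's data); pingouin's ICC2 row reports the single-measure, absolute-agreement coefficient, which for this design is computed with the same formula as the two-way mixed-effects ICC(A,1) used here:

```python
import numpy as np
import pandas as pd
import pingouin as pg  # pip install pingouin

rng = np.random.default_rng(0)

# Hypothetical long-format ratings for one site:
# 33 items x 4 raters, each score a 1-5 fidelity rating.
ratings = pd.DataFrame({
    "item": np.repeat(np.arange(1, 34), 4),
    "rater": np.tile(["A", "B", "C", "D"], 33),
    "score": rng.integers(3, 6, size=33 * 4),  # toy ratings of 3-5
})

icc = pg.intraclass_corr(data=ratings, targets="item",
                         raters="rater", ratings="score")
# The ICC2 row is the single-measure, absolute-agreement coefficient.
print(icc.set_index("Type").loc["ICC2", ["ICC", "CI95%"]])
```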
Next, we calculated M and SD values; distributions of item scores across the sites for full (rating=5), adequate (rating=4), and poor (rating=1–3) fidelity; and site scores at 18 months. We also tested potential predictors of fidelity, including urbanicity of the community served, adherence to specific CSC models, and program size (measured by number of clients served and by full-time equivalent staffing), using analysis of variance and Pearson correlations.
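A minimal SciPy sketch of these predictor analyses follows, with simulated site-level values standing in for the study's data (all numbers below are illustrative assumptions):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Hypothetical total fidelity scores grouped by urbanicity
# (group sizes and distributions are simulated).
urban = rng.normal(loc=138, scale=8, size=20)
suburban = rng.normal(loc=137, scale=8, size=9)
rural = rng.normal(loc=136, scale=8, size=7)
f_stat, p_anova = stats.f_oneway(urban, suburban, rural)

# Hypothetical Pearson correlation of program size with fidelity.
clients_served = rng.integers(15, 80, size=36)
total_scores = rng.normal(loc=138, scale=8, size=36)
r, p_corr = stats.pearsonr(clients_served, total_scores)

print(f"ANOVA across urbanicity groups: F={f_stat:.2f}, p={p_anova:.2f}")
print(f"Program size vs. fidelity: r={r:.2f}, p={p_corr:.2f}")
```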
Following HIPAA requirements, the research team did not receive any data with personal identifying information. Local ethics review boards reviewed and approved the research study protocol. The administrative data and those abstracted from health records were uploaded to a secure website.
Results
All 36 (100%) of the MHBG study sites completed the fidelity reviews. The number of staff interviewed at each site varied, and the primary interviews were supplemented with information from other interviewees when the designated interviewee was unable to provide critical information. As shown in Table 1, site-level interrater reliability ranged from good to excellent (ICC ≥0.80) for all five sites. The ICC for the total scores across the five sites was 0.91 (95% confidence interval=0.72–0.99, p<0.001), also suggesting excellent reliability. The mean range in total scores across the four raters was 5.2 (N=5 sites); excluding the least experienced fidelity rater (who generally gave lower ratings), the mean range was 3.4, further suggesting convergence in ratings. We examined item-level ratings for the five sites and found no obvious pattern in rating discrepancies by item.
The mean time the CSC teams required for collecting and submitting data and for completing staff interviews was 10.5 hours (Table 2). In Table 3, we present descriptive statistics (M, SD, and frequency distributions) for the 32 items of the FEPS-FS-R in the sample of 36 sites. Most sites achieved high fidelity on a substantial majority of items. The mean item rating was 4.16, and 23 items (72%) were implemented with good or excellent fidelity. Two sites (6%) achieved excellent fidelity, 25 (69%) achieved good fidelity, and nine (25%) achieved fair fidelity. No site scored below the minimum for fair fidelity.
Five items received low-fidelity ratings (median rating <4) at most sites: community served (<60% of incident cases from the catchment area were admitted to the program), duration of first-episode psychosis program (program served patients for ≤2 years), early intervention (≥40% of patients had a psychiatric hospitalization before CSC admission), family psychoeducation (<70% of families received an evidence-based family education program), and supported employment (employment services met no more than five of seven evidence-based elements of supported employment).
The results for the fidelity ratings of the 12 CSC components (14) are presented in Table 4. Two components, transition of care and assuring fidelity to the early intervention treatment model, were not assessed. All but one site achieved good fidelity to the first three components (team-based approach, team leadership, and case management). One-third of the sites delivered supported employment with good fidelity, and two-thirds provided supported education with good fidelity. More than four-fifths of the sites delivered evidence-based psychotherapy, pharmacotherapy, and health management with good fidelity. Regarding family education and support, families were consistently involved in the initial assessment at all but two sites; however, a minority of sites delivered ongoing family education and support with good fidelity.
We examined three potential predictors of fidelity: urbanicity of the community served, adherence to specific CSC models, and program size. Urban, suburban, and rural sites did not differ significantly in fidelity. Sites also did not differ in fidelity according to which of five CSC models they primarily followed: the Early Assessment and Support Alliance (EASA), Coordinated Specialty Care for First Episode Psychosis (FIRST), NAVIGATE Coordinated Specialty Care, OnTrackNY Coordinated Specialty Care, and the Portland Identification and Early Referral program (PIER). The overall F test for differences among groups was not statistically significant. Neither measure of program size correlated with fidelity (number of clients served: r=0.21, p=0.23; full-time equivalent staffing: r=0.18, p=0.30).
Discussion
In this study, we evaluated the feasibility and reliability of the FEPS-FS-R procedures for remote assessment of the fidelity of first-episode psychosis services to the CSC model, finding that these procedures can be conducted with good to excellent interrater reliability. The average time commitment for a program completing an initial review was 10.5 hours, suggesting that the process is both feasible from a program perspective and sustainable from a system perspective. We identified five CSC components that most programs found challenging to implement with fidelity. Supported employment received low ratings at some sites because of a lack of training in, or awareness of, evidence-based principles. The early intervention item also received low ratings; this item is rated by using the percentage of the caseload with a psychiatric hospitalization before program enrollment. Our findings that 60%–80% of patients had received inpatient care before program enrollment are consistent with those from the Recovery After an Initial Schizophrenia Episode study, in which 78% of patients had undergone a previous hospitalization (17). Most sites reported that CSC enrollment was usually limited to 2 years; however, research increasingly supports the benefits of providing services for longer periods, that is, >2 years (18–21). Community served is an item comparing the caseload size of a site with the estimated incidence of first-episode psychosis in the communities the site serves. Most sites fell short of this standard, which estimates the demand for services for all those who might benefit.
Fidelity of evidence-based practices is commonly assessed by computing a total score based on the sum of the individual items, without weighting the items (6). Such a unit-weighting scheme has been shown to be statistically justified, and a simple scoring system is more readily adopted for quality improvement in routine practice. Nonetheless, we recommend that the total score not be used in isolation but interpreted in light of the individual item scores. In addition, further research should determine whether some items should be removed for either psychometric or substantive reasons. Unlike some other first-episode psychosis fidelity scales, the FEPS-FS-R does not categorize items into essential and secondary components (9), a categorization that would lead to the reporting of subscale scores.
The small sample size suggests caution in interpreting the lack of statistically significant predictors of fidelity, and we note the nonsignificant differences in fidelity across teams adopting one of several well-known CSC models, such as NAVIGATE. Thus, we found no evidence that the FEPS-FS-R is inappropriate for a specific "brand name" model. This finding may reflect the fact that the FEPS-FS-R was developed from a systematic review of the FEPS literature (8, 12) rather than from a single effective program model, as was done previously (9, 22). We also did not find that the type of community served, adherence to specific CSC models, or program size correlated with fidelity. The FEPS-FS-R items gave good coverage of most components identified in the CSC model, suggesting that this scale is an appropriate measure for assessing fidelity to the CSC model.
The limitations of this study included a small and nonrandomly selected sample, both of which limit the generalizability of the findings. The correlational analyses were limited by the small sample size and a restricted range of scores. Some items exhibited little variation, suggesting the need for recalibration of the item rating scale. Some fidelity item ratings were based exclusively on the administrative and health record data provided by the site, the accuracy of which was not independently verified. Staff interviews may have introduced self-report bias. Finally, future research is needed to assess whether the interrater reliability found here generalizes to other fidelity reviewers.
Conclusions
The findings of this study support the conclusion that remote fidelity assessment with the FEPS-FS-R has face validity and can be both reliable and feasible. Importantly, the scale distinguished among sites rated as having excellent, good, or fair fidelity, suggesting that it is sensitive to differences among sites due to specific practice patterns. Finally, the findings indicate that the scale assesses the components of the CSC model.
The main finding regarding fidelity to CSC in the United States is that most sites achieved high fidelity on most items. Further research with larger samples is needed to evaluate whether the CSC model is robust and feasible to implement in a range of settings. These results support the use of the FEPS-FS-R as an outcome measure for implementation studies and as a practical tool for quality assurance by health care providers or funders. Further research is required to test the predictive validity of the scale.