The growth of evidence-based practices (EBPs) in mental health treatment, particularly since the late 1990s (1), has increased the demand for fidelity assessment (2). EBPs have demonstrated effectiveness, and their implementation is expected to achieve similar results in all treatment settings (2). Supported employment as operationalized by the individual placement and support (IPS) model is an EBP that has demonstrated effectiveness in improving vocational outcomes for persons with mental disorders (3,4). However, when the model is introduced at a new site with new personnel, it may not be implemented properly and thus may not achieve the intended results (5).
Fidelity scales examine the extent to which a program is implementing the core principles and procedures of an EBP (6). Assessors follow a protocol to gather information from a variety of sources. In-person visits typically include interviews with multiple stakeholders, including program leadership, staff implementing the program, and clients. Program documentation, including client charts and other clinical records, is typically reviewed (2).
Independent fidelity assessment can be expensive and time consuming, and as the number of EBPs grows, it can be difficult for agencies to identify qualified assessors. The intensive one- or two-day process can also be burdensome for program sites (7). Consequently, some programs have begun conducting self-assessments to complement and supplement independent assessments (7), for example by undertaking self- and independent assessments in alternate years. Studies of assertive community treatment have shown that self- and independent assessments can yield comparable results under some circumstances (8,9). However, these results may not be generalizable to all EBPs; self-assessments may be best undertaken in stable programs with a history of good fidelity (8), where staff are following a defined protocol (7).
In this study, we examined how the two assessment methods compare in an IPS implementation model in which programs receive extensive training and support to collect self-reported data following the IPS fidelity protocol.
Methods
Fidelity assessments were conducted by program staff (self-assessments) and by independent expert raters (independent assessments) at 11 personalized recovery-oriented services (PROS) programs across New York State (NYS). PROS is an outpatient mental health program model that sets a clear expectation for the implementation of recovery-oriented EBPs. Through funding policies, the NYS Office of Mental Health provides incentives for adoption of these practices, which include IPS (10).
Fidelity assessments were a component of a comprehensive training and implementation technical assistance package offered to PROS programs across NYS by the Center for Practice Innovations (CPI) (10). Programs participated in regional learning collaboratives that provided face-to-face and online training and support.
A continuous quality improvement process served as the foundation for learning collaborative activity. Participating programs routinely collected and shared data, including performance indicators and fidelity ratings. Leaders of each learning collaborative structured the process so that programs experienced the use of data as helpful for their implementation efforts and not as punitive. In the learning collaboratives, PROS program staff were taught about IPS fidelity generally and about how to conduct fidelity self-assessments specifically, through Webinars and program-specific consultation calls and visits.
A total of 52 PROS programs completed fidelity self-assessments during the last quarter of 2014. Programs used the IPS Supported Employment Fidelity Scale (3,11), which consists of 25 items clustered into three sections (staffing, organization, and services). Each item is scored on a 5-point scale, and the maximum total score is 125.
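To make the scoring concrete, the following sketch (in Python, with an invented set of item ratings; the helper names are ours, not part of the scale) shows how the 25 item ratings sum to a total score and how that total maps onto the fidelity bands referenced later in the Results (good fidelity, total score >99; fair fidelity, 75–99; not IPS, <75).

def total_fidelity_score(item_ratings):
    # The IPS Supported Employment Fidelity Scale has 25 items, each rated 1-5,
    # so the maximum total score is 125.
    assert len(item_ratings) == 25, "the scale has 25 items"
    assert all(1 <= r <= 5 for r in item_ratings), "each item is rated 1-5"
    return sum(item_ratings)

def fidelity_band(total):
    # Bands as described in the Results section of this report.
    if total > 99:
        return "good fidelity"
    if total >= 75:
        return "fair fidelity"
    return "not IPS"

ratings = [4] * 25  # hypothetical program rated 4 on every item
total = total_fidelity_score(ratings)
print(total, fidelity_band(total))  # prints: 100 good fidelity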
The programs that completed self-assessments were clustered into four regions and were placed in random order within each region. Programs were then contacted in that order and asked to participate voluntarily in an independent fidelity assessment. A total of 20 programs were contacted before three programs in each region agreed to participate in an independent assessment. One of these 12 programs did not have an independent assessment because of scheduling issues. The independent assessments occurred during the second quarter of 2015. The time between the 2014 self-assessments and the independent assessments ranged from two to eight months, with a mean of five months. The eight invited programs that did not participate in an independent assessment cited lack of time or lack of interest, or they simply did not respond to requests. Mean self-assessment scores did not differ significantly between the 11 programs that agreed to be independently assessed and the eight invited programs that did not. Mean self-assessment scores for the 11 programs also did not differ significantly from those of the 41 other programs that completed self-assessments.
Two independent raters, external to the agencies and to CPI, conducted the independent assessments. One rater was trained by the developers of IPS and has conducted independent assessments for many years. The other rater was trained by the first rater through didactics, modeling, and coaching. Two independent assessments were conducted by both raters, and nine were conducted by one of the two raters. The number of interviews varied by the composition of program staff but generally included the program director, the supported employment supervisor, one or more supported employment workers, one or more clinicians, and up to five clients. In addition, assessors reviewed clinical documentation, including a sample of client charts, the supported employment caseload, and job development logs. The independent assessments were completed in one day because of the typically small scale of IPS implementation at these program sites (only two of the 11 programs had more than 1.0 full-time-equivalent staff). For comparison, among the 130 programs participating in the IPS learning community nationwide, the median number of IPS specialists per program is three (personal communication, Bond G, 2016).
Fidelity scores for the two assessment methods were compared by using paired t tests and two-way mixed-effects intraclass correlation coefficients (ICCs) with the consistency definition (single measurement). We also examined the effect size of the differences between the assessment scores by using Cohen's d. Analyses were conducted with IBM SPSS, version 23. This program evaluation did not constitute human subjects research as defined by the Institutional Review Board (IRB) of the NYS Psychiatric Institute, and thus no IRB approval was needed.
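The analyses were run in SPSS; as a rough illustration of the same comparisons, the sketch below (in Python with numpy and scipy, using hypothetical score vectors rather than the study data) shows a paired t test, an ICC(3,1) computed from the two-way ANOVA mean squares (the Shrout and Fleiss two-way mixed-effects, consistency, single-measurement model), and one common pooled-SD variant of Cohen's d; the exact Cohen's d formula used in the original analysis is not specified here.

# Illustrative re-creation of the comparisons described above.
# The score vectors are hypothetical, not the study data.
import numpy as np
from scipy import stats

self_scores = np.array([95, 88, 102, 79, 91, 84, 97, 105, 73, 90, 86], dtype=float)
indep_scores = np.array([92, 90, 100, 76, 93, 80, 96, 101, 70, 88, 85], dtype=float)

# Paired t test on total fidelity scores.
t_stat, p_value = stats.ttest_rel(self_scores, indep_scores)

# ICC(3,1): two-way mixed-effects, consistency, single measurement,
# computed from the two-way ANOVA mean squares (programs x methods).
ratings = np.column_stack([self_scores, indep_scores])  # n programs x k raters
n, k = ratings.shape
grand = ratings.mean()
ss_rows = k * ((ratings.mean(axis=1) - grand) ** 2).sum()   # between programs
ss_cols = n * ((ratings.mean(axis=0) - grand) ** 2).sum()   # between methods
ss_total = ((ratings - grand) ** 2).sum()
ss_error = ss_total - ss_rows - ss_cols
bms = ss_rows / (n - 1)                    # between-targets mean square
ems = ss_error / ((n - 1) * (k - 1))       # residual mean square
icc_3_1 = (bms - ems) / (bms + (k - 1) * ems)

# Cohen's d using the pooled standard deviation of the two sets of scores
# (one common variant; the variant used in the study is not stated).
pooled_sd = np.sqrt((self_scores.var(ddof=1) + indep_scores.var(ddof=1)) / 2)
cohens_d = (self_scores.mean() - indep_scores.mean()) / pooled_sd

print(f"paired t={t_stat:.2f}, p={p_value:.3f}, ICC(3,1)={icc_3_1:.2f}, d={cohens_d:.2f}")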
Results
As shown in Table 1, mean total scores for the independent assessments and the self-assessments did not differ significantly and indicated fair interrater agreement (ICC=.52) (12). Both mean scores fell within the range that IPS guidelines define as fair fidelity to the IPS model (total score of 75–99 out of 125) (11). The independent assessments found three programs with good fidelity (total score >99) and seven with fair fidelity (total score 75–99) and deemed one program "not IPS" (total score <75). The self-assessments found four programs with good fidelity and six with fair fidelity and also deemed one program "not IPS."
Although the mean scores did not differ significantly, we found significant variation on some of the individual scale items. For two items, paired t tests showed significant differences between the self- and independent assessments: time-unlimited follow-along supports (p=.01) and work incentives planning (p=.04). In addition, differences on seven of the 25 items approached a medium effect size (Cohen's d ≥.4). Moreover, ICCs on eight of the 25 items were below .00, which can occur in two-way mixed-effects ICC models, and another five items had ICCs below .40, indicating poor interrater agreement (12). Thus some variability in individual items was observed in this small sample.
Discussion
Is there a place for fidelity self-assessment? This issue has received attention recently (7,13) and, given the increased demand for fidelity assessment that accompanies widespread adoption of EBPs, will continue to benefit from close examination. Bond (13) cautioned against replacing independent fidelity reviews with self-assessments while also noting the usefulness of self-assessment for quality improvement. Can self-assessments be trusted? If so, under what conditions? The data presented here may help move the discussion along.
Across the 11 programs, no significant differences were found between mean total fidelity scores for the self-assessments and the independent assessments, and all scores were within the range of fair fidelity. This suboptimal fidelity points to opportunities, across the state and in individual programs, for continuous quality improvement efforts. Only two items differed significantly between assessment methods. Independent raters gave lower ratings (an average difference of .91 points) on time-unlimited follow-along supports. In the PROS programs, this IPS component has a complicated definition, because clients step down from intensive PROS services to less intensive ongoing rehabilitation and support services when they obtain a competitive job. Thus continuity of care between intensive and stepped-down services may have been interpreted differently by the self-assessors and the independent assessors. In addition, work incentives planning was rated higher by independent assessors than by self-assessors (an average difference of .73 points), which may reflect modesty among self-assessors about their incentives planning, changes in the programs between the two assessment periods, or other differences in interpretation. We also found some variability across items, as indicated by low ICCs and by Cohen's d effect sizes approaching a medium level, in this small sample of 11 programs. If this variation proves stable across other samples, it may indicate that self-assessments can provide a valid snapshot of overall program functioning but that independent assessors are better at identifying nuanced areas for improvement on individual items.
Self-reports are often subject to bias (14,15), so it is worth asking why the two assessment methods agreed as closely as they did; several conditions may have contributed to these findings. The fidelity scale is well designed and contains many concrete details and operational definitions to guide its use. This user-friendly aspect should not be overlooked. As noted previously, PROS program staff were taught about IPS fidelity and how to conduct fidelity self-assessments, and it appears that they learned well. It is also possible that the learning collaboratives' emphasis on continuous quality improvement created an implementation environment that participants experienced as safe enough to report data honestly and without bias. In addition, our ongoing contact with and knowledge about these programs may have reduced the likelihood of dishonest reporting, although this is speculation.
This study had clear limitations, including a small sample, an average of five months between the two assessments, the small number of employment staff per program, the substantial training made available to program staff (which may not be representative of the training typically available to programs attempting self-assessment), and the inability to empirically test the conditions contributing to the findings. Future studies may address these issues and attempt to answer important questions, such as when fidelity self-assessments may (and may not) be appropriate, what circumstances indicate the need for independent assessors, and whether self-assessments and independent assessments differ in their impact when used for continuous quality improvement.
Conclusions
This study, which used the IPS Supported Employment Fidelity Scale, focused on the relationship between self-assessment and independent assessment of fidelity. No significant differences were found between mean total fidelity scores when the two methods were used to assess 11 community mental health programs. However, we found some variation on individual scale items. Future research should examine whether these trends characterize larger samples. The results may suggest that self-assessments are useful under certain circumstances but that independent assessors are able to identify nuances and differences in individual items. Both self- and independent assessments may be useful for programs and policy makers in appropriate contexts.