Verification of program fidelity (1) is helpful in addressing the problem of inadequate implementation of evidence-based practices (2) and the associated decline in program outcomes (3,4). However, it is time-intensive, often requiring a day or more for an onsite visit and another day to score and write the report. A 2007 national task force identified several innovative approaches to address practical concerns such as costs and burden of quality improvement (5), including the use of alternative methods such as phone-administered assessments.
In a previous study of assertive community treatment (ACT), we found that fidelity assessment by phone was reliable and produced scores comparable to those from onsite assessment. To facilitate phone-administered fidelity assessment, we created a self-report protocol (6) that required team leaders to gather and report information sufficient to score each fidelity item from the Dartmouth Assertive Community Treatment Scale (DACTS) (7). Our impression was that phone-administered assessment mostly verified information already captured in the self-report protocol, leading us to speculate that self-reported assessment might be an even less burdensome alternative method of fidelity assessment. In the study reported here, we examined the interrater reliability and concurrent validity of self-reported and phone-administered fidelity assessment.
Methods
Twenty-four ACT teams in Indiana were invited to participate. Eight teams declined, having chosen not to maintain ACT certification because of changes in state funding. All 16 participating ACT programs (67%) had been in operation for a minimum of five years, followed Indiana ACT standards, and received annual fidelity assessments to verify certification as an ACT provider (6). Data were collected between December 2010 and May 2011. ACT team leaders provided written informed consent to participate in the study. Study procedures were approved by the Indiana University–Purdue University Indianapolis Institutional Review Board.
The 28-item DACTS (7) was used to assess ACT fidelity. The DACTS provides a total score and three subscale scores: human resources (for example, psychiatrist on staff), organizational boundaries (for example, explicit admission criteria), and nature of services (for example, in vivo services). Items are rated on a 5-point behaviorally anchored scale (5, fully implemented; 1, not implemented). Items with scores of 4 and higher are considered well implemented. The DACTS has good interrater reliability (8) and can differentiate between ACT and other types of intensive case management (7).
A self-report fidelity protocol was used to collect all data needed to score the DACTS (6,7). The protocol consists of nine tables designed to summarize data efficiently: staffing, caseload and discharges, admissions, hospitalizations, client contact hours and frequency, services received outside ACT, engagement, substance abuse treatment, and miscellaneous (program meeting, practicing team leader, crisis services, and use of informal supports). A critical aspect of the protocol is the conversion of subjective, global questions into objective, focused questions (for example, rather than asking for a global evaluation of treatment responsibility, team leaders record the number of clients who received each service on a list of services outside ACT during the past month).
Teams received the self-report fidelity protocol two weeks before the phone interview. Team leaders consulted clinical and other program records to complete the protocol and returned it before the call. ACT staff contacted the research team with questions.
Phone interviews were conducted with the ACT team leader. For three of the 16 teams (19%), additional individuals participated in the call as observers (for example, the medical director). Phone interviews were conducted jointly by the first and second authors and focused on reviewing the self-report fidelity protocol for accuracy. Ten teams submitted incomplete protocols; the missing data were identified by research staff and were provided by site staff before the call (two teams), gathered during the phone interview (six teams), or submitted within one week after the call (two teams). Raters independently updated the self-report fidelity protocol to reflect information gathered during or after the interview and independently scored the DACTS on the basis of the revised information. Discrepant DACTS items were then identified, and raters met to discuss and assign the final consensus scores.
The self-reported assessment was conducted by two new raters (third and fourth authors), who did not participate in the phone-administered assessments. They independently scored the DACTS with information only from the self-report fidelity protocol as originally provided by the sites or as amended with missing data provided before the phone call (two teams). Scoring using the self-reported data was completed after the phone interviews but did not include information obtained during the phone interviews. Raters left DACTS items blank (unscored) if data were unscorable or missing. Discrepant DACTS items were then identified, and raters met to discuss and assign consensus scores. All four raters had at least one year of training and experience in conducting DACTS assessments as part of previous fidelity studies.
Two indicators were used to assess interrater agreement (reliability) and intermethod agreement (validity): consistency, calculated with the intraclass correlation coefficient (ICC), and consensus, estimated from the mean of the absolute value of the difference between raters or methods (9). Scores were compared for the DACTS total scale and for each subscale. Sensitivity and specificity were calculated to assess classification accuracy.
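To make the consensus indicator concrete, it can be written as the mean absolute difference in scores across the participating teams; the notation below ($x_j$, $y_j$, $n$) is ours and is not part of the published scoring materials:

\[
\text{consensus} \;=\; \frac{1}{n}\sum_{j=1}^{n} \left| x_j - y_j \right| ,
\]

where $x_j$ and $y_j$ are the DACTS total or subscale scores assigned to team $j$ by the two raters (for reliability) or by the two assessment methods (for validity) and $n$ is the number of teams. Smaller values indicate closer agreement; in the results that follow, a difference of less than .25 points corresponds to 5% of the 1-to-5 scoring range.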
Results
Self-reported data were missing for nine of the 16 teams. The maximum number of missing items for a team was two (mean±SD=.81±.83). Because phone raters gathered missing data during or immediately after the interview, there were no missing data for the phone-administered assessment. DACTS total and subscale scores were calculated by using the mean of nonmissing items for the self-reported assessments.
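As a sketch of this prorated scoring (again in our own notation), a team's total or subscale score is the average over the items that were actually scored,

\[
\bar{X} \;=\; \frac{1}{m}\sum_{i=1}^{m} x_i ,
\]

where $x_1, \dots, x_m$ are the team's nonmissing item scores on that scale and $m$ is their number; for example, a team missing one item would have its DACTS total computed over the remaining 27 items.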
Reliability of the phone-administered assessment was generally good. Interrater reliability (consistency) was very good for the total DACTS (ICC=.98) and for the human resources (ICC=.97) and nature of services (ICC=.97) subscales, and it was adequate for the organizational boundaries subscale (ICC=.77) (Table 1). Absolute differences between raters were small, indicating good consensus, for the total DACTS (mean difference=.04; differences <.25 [5% of the scoring range] for all 16 sites) and for the human resources subscale (mean difference=.05; differences <.25 for 15 of 16 sites), the organizational boundaries subscale (mean difference=.06; differences <.25 for all 16 sites), and the nature of services subscale (mean difference=.07; differences <.25 for 15 of 16 sites).
Reliability of the self-reported assessment varied by subscale. Interrater reliability (consistency) was acceptable for the total DACTS (ICC=.77) and the nature of services subscale (ICC=.86) but below recommended standards for the organizational boundaries subscale (ICC=.61) and the human resources subscale (ICC=.47) (Table 1). Absolute differences between raters (consensus) were small to medium for the total DACTS (mean difference=.14; differences <.25 for 13 of 16 sites) and the organizational boundaries subscale (mean difference=.13; differences <.25 for 13 of 16 sites) but were somewhat larger for the nature of services subscale (mean difference=.20; differences <.25 for 11 of 16 sites) and the human resources subscale (mean difference=.25; differences <.25 for ten of 16 sites).
The self-reported fidelity assessment was an accurate and valid predictor of the phone-administered fidelity assessment (that is, it demonstrated acceptable levels of consistency and consensus) (Table 2). ICCs indicated moderate to strong agreement (consistency) for the total DACTS (ICC=.86) and the nature of services subscale (ICC=.92) and adequate agreement for the human resources subscale (ICC=.74) and the organizational boundaries subscale (ICC=.71). Absolute differences between self-reported assessments and phone-administered assessments (consensus) tended to be small for the total DACTS (mean difference=.13; differences <.25 for 15 of 16 sites) and the organizational boundaries subscale (mean difference=.08; differences <.25 for 15 of 16 sites) but were somewhat larger for the nature of services subscale (mean difference=.20; differences <.25 for 12 of 16 sites) and the human resources subscale (mean difference=.15; differences <.25 for ten of 16 sites) (Table 2). Of interest, self-reported DACTS total scores underestimated the phone-administered scores for 12 of 16 sites.
The sensitivity and specificity of the self-reported assessment method were calculated to determine whether this method made accurate classifications in situations that required a dichotomous judgment (for example, ACT versus non-ACT). The DACTS total score obtained via phone-administered assessment served as the criterion, and teams scoring 4.0 or higher were classified as meeting ACT fidelity standards. In predicting the outcome of phone-administered assessment, self-reported assessment had a sensitivity of .77, a specificity of 1.00, a false-positive rate of .00, a false-negative rate of .23, and an overall predictive power of .81.
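These indices follow their standard definitions; the sketch below states them in our own notation, with a "positive" team defined as one whose phone-administered DACTS total score was 4.0 or higher (the underlying cell counts are not reported here):

\[
\text{sensitivity} = \frac{TP}{TP + FN}, \qquad
\text{specificity} = \frac{TN}{TN + FP}, \qquad
\text{overall predictive power} = \frac{TP + TN}{TP + FP + TN + FN},
\]

where $TP$ and $TN$ count teams that the self-reported assessment correctly classified as meeting or not meeting the 4.0 criterion, $FP$ and $FN$ count teams classified incorrectly, the false-positive and false-negative rates are $FP/(FP+TN)$ and $FN/(TP+FN)$, and overall predictive power is taken here to mean the overall proportion of teams classified correctly.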
Item-level analyses were undertaken to help identify potential problem items. Mean differences between raters of the self-reported assessment (reliability) and between the consensus ratings for the self-reported and phone-administered assessments (validity) were examined to identify highly discrepant items. Mean differences exceeding .25 in absolute value were found between raters of the self-reported assessment for seven items: vocational specialist on team (.46), time-unlimited services (–.42), contacts with informal support system (–.38), staff continuity (–.37), dual-diagnosis model (–.33), intake rate (–.31), and nurse on team (–.31). Mean differences exceeding .25 in absolute value were found between the two assessment methods in consensus ratings for five items: dual-diagnosis model (–.76), vocational specialist on team (–.63), contacts with informal support system (–.44), 24-hour crisis services (–.38), and peer counselor on team (–.37). Most differences were attributable either to site errors in reporting data or to rater errors in judging correctly reported data. Changes to the protocol were identified that could improve two items, namely 24-hour crisis services and the presence of a trained vocational specialist on the team; for example, the protocol could ask respondents to specify the percentage of clients calling crisis services who spoke directly to an ACT team member.
Discussion and conclusions
The results provide preliminary support for the reliability and validity of the DACTS total scale when scores are calculated solely from self-reported data. When restricted to the total DACTS score, which is used to make overall fidelity decisions, rating consistency and consensus were good to very good between raters of the self-reported fidelity assessment (reliability) and between the raters’ mean scores for the phone-administered assessment and the self-reported assessment (validity). In addition, self-reported assessment was accurate; for the total DACTS score, it agreed with the phone-administered assessment within .25 scale points (5% of the scoring range) for 15 of the 16 sites (94%) and had a sensitivity of .77, a specificity of 1.00, and overall predictive power of .81 for dichotomous judgments. Moreover, there was no evidence for inflated self-reporting. Self-reported fidelity assessment underestimated phone-administered fidelity assessment for most sites. These findings are in contrast to those of previous research indicating that self-report data are generally less accurate and positively biased (10–12), especially when data are subjective or require nuanced clinical ratings. However, prior research has not tested a self-report protocol specifically created to improve accuracy by reducing subjectivity and deconstructing complicated judgments. Moreover, prior studies allowed self-reporters to score their own program. In our study, the self-report data were scored by independent raters.
Results for the DACTS subscales were mixed. Although reliability was very good for the nature of services subscale, it ranged from low-acceptable to unacceptable for the organizational boundaries and human resources subscales. A similar pattern was found for validity; it was excellent for the nature of services subscale but acceptable to low-acceptable for the other two subscales. Differences in the number of problem items across subscales appear to underlie the lower reliability and validity (two of the ten nature of services items were problem items, compared with four of the 11 human resources items and three of the seven organizational boundaries items). Of interest, initial results from an ongoing study comparing onsite fidelity assessment with phone-administered and self-reported assessments that incorporate the modifications identified for the two problem items show improvements in subscale reliability and validity.
The study had several limitations. The sample consisted of previously certified ACT teams in a single state with clearly defined standards for ACT certification. In addition, they were mature teams with extensive experience in fidelity assessment, had received prior technical assistance, had a history of generally good fidelity, and were willing to commit the time required for a detailed self-assessment. These characteristics limit both the generalizability of the findings and the range of fidelity explored. Also, the carefulness, comprehensiveness, and accuracy of the self-report data may have been affected either positively (team leaders knew the data would be checked) or negatively (errors could be corrected later) by the requirement for concurrent phone-administered assessment. In addition, we used phone-administered fidelity assessment as the criterion fidelity measure because previous research had demonstrated evidence of its validity (6). However, future research is needed to confirm that self-reported assessment of ACT fidelity is valid compared with onsite assessment. Similarly, because phone-administered assessment and self-reported assessment shared a rating source, conclusions about agreement are limited to comparisons across collection methods and not across collection sources (for example, independent data collection by an onsite rater). Also, the DACTS includes several objective items that do not require clinical judgment, which may mean that the findings generalize only to fidelity scales with similar types of items.
Despite these limitations, this study provides preliminary evidence for the viability of self-reported assessment of ACT fidelity. However, several caveats apply to its use. First, self-reported fidelity assessment is most clearly indicated for gross, dichotomous judgments of adherence based on the total scale, and it is likely not as useful or sensitive for identifying problems at the subscale or individual-item level (for example, for quality improvement). Second, although the self-report method entailed some time savings for the assessor, there was little time saved for the site beyond not having to participate in a phone interview. Moreover, the self-report method entailed a cost in missing data and in lower overall reliability and validity.
Third, as is true for phone-administered fidelity assessment, self-reported assessment cannot, and should not, replace onsite assessment of fidelity. Instead, all three methods could be integrated into a hierarchical fidelity assessment approach (6,13). For example, onsite assessment is likely needed when the purpose of the assessment is quality improvement; it is also likely needed for assessing new teams and teams experiencing a major transition or trigger event (for example, high team turnover or decrements in outcomes). Self-reported fidelity assessment is likely appropriate to assess overall adherence for stable, existing teams with good prior fidelity. In addition, self-reported assessment probably is most appropriate as a screening assessment, confirming that prior levels of fidelity remain stable, rather than as the sole indicator of changed performance. That is, evidence for substantial changes at a less rigorous level of assessment (for example, self-reported assessment) will require follow-up assessment that uses more rigorous methods (for example, phone-administered assessment followed by onsite assessment) to confirm changes.
Acknowledgments and disclosures
This study was funded by an Interventions and Practice Research Infrastructure Program grant from the National Institute of Mental Health (R24 MH074670; Recovery Oriented Assertive Community Treatment). The authors thank the ACT team members and other program staff for collection of data for this study.
The authors report no competing interests.