Dozens of clinical and legal settings call for violence risk assessment and management by mental health professionals (
1,
2). One example is release from forensic psychiatric hospitalization, the setting of the study reported in this article.
A prominent development in the risk assessment field has been the focus of research on instrumentation and models of decision making (
3,
4,
5,
6,
7). Two traditional methods for making decisions—clinical and actuarial models—have been discussed in the medical and behavioral sciences literatures (
8,
9,
10) and have been applied to violence risk assessment. The clinical method has been described as an "informal, 'in the head,' impressionistic, subjective conclusion, reached (somehow) by a human clinical judge" (
9). In contrast, the actuarial method has been described as "a formal method" that "uses an equation, a formula, a graph, or an actuarial table to arrive at a probability, or expected value, of some outcome" (
9).
Some consensus exists among commentators that sole reliance on unstructured clinical decision making is inadequate for conducting risk assessments (
11). Actuarial prediction methods have been applied to samples of psychiatric patients and have achieved high levels of statistical accuracy (
4,
5,
12). Despite this achievement, commentators (
11,
13,
14,
15,
16,
17,
18,
19,
20,
21,
22,
23,
24) have noted potential shortcomings associated with strict actuarial models of prediction, including potential lack of generalizability and applicability beyond samples of development; difficulty of replicating clinical reality by using actuarial methods; tendency of actuarial methods to exclude potentially important risk factors; rigidity and lack of sensitivity to change; and failure optimally to inform violence prevention and risk management.
Another research-based model of risk assessment—structured professional judgment—uses a professional guideline approach to decision making (
14,
17,
18,
19,
22). Several sets of professional guidelines have been developed under the structured professional judgment approach (
3,
6,
25,
26,
27), including the HCR-20 violence risk assessment scheme (
6), so named for its 20 risk factors in three domains—historical, clinical, and risk management. In structured professional judgment risk assessment, evaluators review all relevant clinical data to determine the presence of specific risk factors, which themselves are included (defined and operationalized) in professional manuals on the basis of their association with violence in the scientific and professional literatures. On the basis of these factors, an overall judgment of risk is made, referred to here as the structured final risk judgment.
Although there are no fixed guidelines about how risk factors are combined to reach an overall judgment, structure is imposed on the decision-making process in several ways: specifying a list of empirically supported risk factors, operationalizing these risk factors, providing fixed scoring guidelines for the factors, and providing some guidance for making final decisions of low, moderate, or high risk (again, the structured final risk judgment). A key assumption underlying the structured professional judgment approach is that professional discretion is potentially valuable and appropriate for the assessment of risk, although a degree of structure is necessary to reduce the complexity of the clinical task and guide the exercise of discretion.
There are several steps in the validation of the structured professional judgment model and its specific instruments (
14), including establishing whether the risk factors chosen for inclusion in schemes can be scored reliably and whether they actually relate to violence. A substantial body of research addresses these questions: a series of studies has established the interrater reliability of evaluators' judgments about the presence of various risk factors, as well as the validity of those judgments, in both retrospective and prospective studies (
12,
28,
29,
30,
31,
32,
33,
34,
35,
36,
37,
38,
39).
However, few studies have examined the reliability or validity of the structured final risk judgments made under the structured professional judgment model (
35,
40), despite the fact that these judgments represent the central intended clinical use of these measures. In the existing measures, structured final risk judgments of low, moderate, or high risk typically are made according to the likelihood of violence and the degree of intervention the case will require. The ratings reflect not only the presence of risk factors but also their perceived relevance, interactions among them, and the likelihood that they will be managed effectively through monitoring, treatment, supervision, and victim safety enhancement. Such overall judgments are critically important in clinical decisions about such matters as case prioritization and intervention.
In this study, we evaluated the structured final risk judgments made on the basis of the HCR-20 violence risk assessment scheme (
6), which is the structured professional judgment measure about which the most research has been published. We had three research questions. First, can HCR-20 items, scales, and—particularly—structured final risk judgments be made with acceptable interrater reliability? Second, are HCR-20 items, scales, and—particularly—structured final risk judgments associated with future violence? Finally, what is the incremental validity of structured final risk judgments with respect to actuarial (arithmetic) combinations of HCR-20 item ratings?
To address these questions, we conducted a pseudo-prospective community follow-up study of 100 forensic psychiatric patients. The design was pseudo-prospective in the sense that the predictive measures (that is, the HCR-20 items and judgments) were completed on the basis of information available at discharge (1996 to 1997), although the actual research coding was done later (2000 to 2001). The follow-up period (from discharge up to 2001) therefore occurred earlier than the time frame for coding the HCR-20, rather than after it. This approach is common to risk assessment research (
5,
12).
Methods
Study participants
The sample consisted of 100 forensic psychiatric patients who were found not guilty by reason of a mental disorder and who subsequently had one or more release hearings. The participants were drawn from a larger, ongoing prospective study of the predictive validity of the HCR-20. The larger study of 175 patients uses version 1 of the HCR-20 (
41) and does not include the structured final risk judgments. Several preliminary reports of data from the larger study have been made (
42,
43,
44). From 116 of 175 patients who were released from hospitalization in the period 1996 to 1997 (the study period), 100 were randomly selected to serve as participants in the study reported here. The remaining 16 patients participated in piloting and training activities. The study was approved by the institutional review board of Simon Fraser University.
In the sample of 100, most participants were male (91 percent), did not have children (30 percent), and were unmarried (67 percent). A majority of the patients were unemployed (93 percent). Less than half (40 percent) had completed high school. Most (92 percent) had a previous violence charge, 48 percent had a previous violence conviction, a majority (79 percent) had a current violent index offense, and many (34 percent) had a juvenile record. A majority (96 percent) had previously received psychiatric treatment, including inpatient treatment (83 percent). Primary diagnoses were schizophrenia (73 percent), mood disorders (18 percent), substance-related disorders (5 percent), and other (3 percent). A quarter of the sample (24 percent) had a diagnosis of a personality disorder.
Procedure
HCR-20 (version 2). The HCR-20 comprises 20 key risk factors in three domains (historical, clinical, and risk management), which are listed in Table 1. The historical domain reflects factors related to past conduct, mental disorder, and social adjustment, which typically are documented or established in official records. The clinical domain reflects factors related to current psychological functioning, which typically are observed or inferred from recent behavior. The risk management domain reflects factors related to future adjustment problems, which are speculated or anticipated on the basis of historical and clinical factors as well as plans and goals.
As recommended in the test manual, the 20 items were rated 0, absent; 1, possibly or partially present; or 2, definitely present. As is commonly done for research, we summed the numerical item ratings to yield four dimensional scores: historical scores, clinical scores, risk management scores, and HCR-20 total scores (all items). Thus possible HCR-20 total scores ranged from 0 to 40. Raters then made structured final risk judgments of 1, low; 2, moderate; or 3, high (explained in further detail in the manual). Several other ratings were used as covariates in some analyses—for example, confidence ratings and whether HCR-20 items were considered to be "critical" or "criminogenic." A rating of critical indicates that the raters believed that the item could, on its own, compel a rating of high risk (
27). A rating of criminogenic indicates that the raters believed the items in the case at hand were relevant to risk of violence.
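The numerical scoring described above can be sketched in a few lines. This is a minimal illustration, not the authors' software: the function name, the dictionary keys, and the example ratings are hypothetical, and the split of the 20 items into 10 historical, 5 clinical, and 5 risk management items is assumed from the structure of the HCR-20.

```python
from typing import Dict, List

# Assumed item counts per domain (10 + 5 + 5 = 20 items).
DOMAIN_SIZES = {"historical": 10, "clinical": 5, "risk_management": 5}
VALID_RATINGS = {0, 1, 2}  # 0 absent, 1 possibly/partially present, 2 definitely present


def score_hcr20(ratings: Dict[str, List[int]]) -> Dict[str, int]:
    """Sum item ratings into three scale scores and a total score."""
    scores: Dict[str, int] = {}
    for domain, size in DOMAIN_SIZES.items():
        items = ratings[domain]
        if len(items) != size:
            raise ValueError(f"{domain}: expected {size} items, got {len(items)}")
        if any(r not in VALID_RATINGS for r in items):
            raise ValueError(f"{domain}: every rating must be 0, 1, or 2")
        scores[domain] = sum(items)
    # The total is the sum of the three scale scores (possible range 0 to 40).
    scores["total"] = sum(scores.values())
    return scores


# Hypothetical ratings for one patient (not data from the study).
example = {
    "historical": [2, 2, 1, 2, 1, 2, 0, 2, 1, 1],
    "clinical": [1, 1, 0, 2, 1],
    "risk_management": [1, 2, 1, 1, 1],
}
result = score_hcr20(example)
print(result)
```

The structured final risk judgment (1, 2, or 3) is deliberately left out of this sketch, because it is a separate professional rating rather than an arithmetic function of the item scores.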
The two raters were master's-level clinicians who were also clinical psychology graduate students. This small number of raters was chosen to minimize rater effects. Both raters had clinical experience in psychiatric, correctional, and forensic settings; had completed core courses in clinical psychology and forensic mental health; and were trained specifically to use the HCR-20, including completion of three sample cases. After training, the raters independently completed the HCR-20 for five individuals who had been acquitted by reason of a mental disorder and then resolved any difficulties with the trainer. The raters gathered clinical information from the full clinical-legal files of participants as they existed at the time of discharge. The files were rich and detailed, containing social, psychological, psychiatric, medical, criminal, and legal information. Each rater assessed 75 participants, with an overlap of 50 participants (50 percent of the sample): one rater coded participants 1 through 75 and the other coded participants 26 through 100, so that participants 26 through 75 were coded twice, for interrater reliability. This training and coding procedure is representative of HCR-20 research and consistent with the manual.
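The coding allocation can be checked with simple set arithmetic. The sketch below only restates the participant ranges given in the text; the variable names are illustrative.

```python
# Participant numbers assigned to each rater, as described in the text.
rater_a = set(range(1, 76))    # participants 1 through 75
rater_b = set(range(26, 101))  # participants 26 through 100

overlap = rater_a & rater_b    # cases coded by both raters
coded = rater_a | rater_b      # cases coded by at least one rater

print(len(rater_a), len(rater_b), len(overlap), len(coded))
```

Each rater codes 75 cases, all 100 participants are covered, and the 50 doubly coded cases form the interrater reliability subsample.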
Detection of violence. Violence in the community was coded both on the basis of criminal records of convictions and from clinical files after discharge from the hospital by separate raters who were blinded to HCR-20 ratings. Clinical files were based primarily on outpatient forensic psychiatric, psychological, and nursing contacts, usually conducted at regularly scheduled intervals—for example, monthly. However, clinicians were not part of data collection, so standardized violence interviewing procedures were not possible. The clinical files included reports from patients, families, and treating professionals during the course of the follow-up. Thus, although only two file-based sources were used to detect violence, one of these included self-reports and collateral reports. In accordance with the HCR-20 manual, violence was defined as actual, attempted, or threatened physical harm of another person. Acts of violence were divided into broad categories of any violence, physical violence, and nonphysical violence, which is consistent with the approaches used in other risk assessment research (
12,
45,
46,
47).
Statistical analyses
Intraclass correlations (ICCs) were used for reliability analyses. The ICC is a measure of chance-corrected agreement rather than association (such as Pearson's r) and hence is sensitive to additive and multiplicative biases between raters (
48). ICC is mathematically equivalent to a weighted kappa (
49,
50).
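To make the distinction between agreement and association concrete, the following sketch computes one-way random-effects ICCs (the model the Results section reports using) from a small table of ratings. It is an illustration under stated assumptions rather than the SPSS procedure used in the study: the function name and the toy ratings are hypothetical.

```python
from typing import List, Tuple


def icc_one_way(ratings: List[List[float]]) -> Tuple[float, float]:
    """One-way random-effects ICC for single and averaged ratings.

    ratings[i] holds the k ratings given to target i.
    Returns (single-rater ICC, average-of-k-raters ICC).
    """
    n = len(ratings)
    k = len(ratings[0]) if n else 0
    if n < 2 or k < 2 or any(len(r) != k for r in ratings):
        raise ValueError("need >= 2 targets, each with the same number (>= 2) of ratings")

    grand_mean = sum(sum(r) for r in ratings) / (n * k)
    row_means = [sum(r) / k for r in ratings]

    # One-way ANOVA mean squares: between targets and within targets.
    ss_between = k * sum((m - grand_mean) ** 2 for m in row_means)
    ss_within = sum((x - m) ** 2 for r, m in zip(ratings, row_means) for x in r)
    ms_between = ss_between / (n - 1)
    ms_within = ss_within / (n * (k - 1))
    if ms_between == 0:
        raise ValueError("no variance between targets; ICC is undefined")

    single = (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)
    average = (ms_between - ms_within) / ms_between
    return single, average


# Toy data: the second rater always scores one point higher than the first.
biased = [[1, 2], [2, 3], [3, 4]]
single_icc, average_icc = icc_one_way(biased)
print(round(single_icc, 2), round(average_icc, 2))
```

In this toy example the two raters' scores are perfectly correlated (Pearson's r would be 1), yet the single-rater ICC is well below 1 because the constant one-point difference counts as disagreement.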
To address validity analyses, several statistical procedures were used, including receiver operating characteristic (ROC) analysis (
51). ROC analysis is independent of the criterion base rate and produces an effect size, the area under the curve (AUC), by plotting sensitivity against 1 minus specificity for each possible cutoff score on a measure. The AUC equals the probability that a randomly selected violent person will receive a higher score on the predictor than a randomly selected nonviolent person.
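The probabilistic reading of the AUC can be computed directly by comparing every violent case with every nonviolent case. The sketch below is illustrative only: the function name and scores are hypothetical, and counting ties as half a "win" is a common convention that is assumed here, not something stated in the text.

```python
from typing import Sequence


def auc_by_pairs(violent: Sequence[float], nonviolent: Sequence[float]) -> float:
    """Proportion of (violent, nonviolent) pairs in which the violent case scores higher."""
    if not violent or not nonviolent:
        raise ValueError("both groups must be non-empty")
    wins = 0.0
    for v in violent:
        for n in nonviolent:
            if v > n:
                wins += 1.0
            elif v == n:
                wins += 0.5  # assumed convention: a tie counts as half
    return wins / (len(violent) * len(nonviolent))


# Hypothetical final risk judgments (1 = low, 2 = moderate, 3 = high).
violent_scores = [3, 3, 2, 2, 1]
nonviolent_scores = [1, 1, 2, 1, 3, 2, 1, 1]
auc_value = auc_by_pairs(violent_scores, nonviolent_scores)
print(auc_value)
```

Because the calculation uses only within-pair comparisons, changing the ratio of violent to nonviolent cases does not by itself change the expected AUC, which is the sense in which the statistic is independent of the base rate.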
Survival analysis was used to evaluate whether HCR-20 structured final risk judgments added incrementally to numerical scores. This analysis uses time to an event as the dependent measure, models the time to an event, and controls for unequal follow-up times between participants (
52).
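How survival analysis handles unequal follow-up can be illustrated with a bare-bones Kaplan-Meier estimator. This is a sketch under stated assumptions, not the SPSS routine used in the study: the function name and the follow-up data are hypothetical, and cases censored at the same time as an event are treated as still at risk at that time.

```python
from typing import List, Sequence, Tuple


def kaplan_meier(times: Sequence[float], events: Sequence[bool]) -> List[Tuple[float, float]]:
    """Return (time, estimated survival probability) at each distinct event time.

    times[i]  : months from release to violence or to end of follow-up
    events[i] : True if violence occurred, False if the case was censored
    """
    if len(times) != len(events):
        raise ValueError("times and events must have the same length")
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve: List[Tuple[float, float]] = []
    i = 0
    while i < len(data):
        t = data[i][0]
        n_events = 0
        n_leaving = 0
        # Gather everyone whose follow-up ends at time t.
        while i < len(data) and data[i][0] == t:
            n_events += 1 if data[i][1] else 0
            n_leaving += 1
            i += 1
        if n_events:
            survival *= 1.0 - n_events / at_risk
            curve.append((t, survival))
        # Censored cases leave the risk set without lowering the curve.
        at_risk -= n_leaving
    return curve


# Hypothetical follow-up data for six patients (months, violence observed?).
months = [5, 12, 12, 20, 30, 40]
violent = [True, True, False, True, False, False]
km_curve = kaplan_meier(months, violent)
for t, s in km_curve:
    print(t, round(s, 3))
```

Patients who reach the end of follow-up without violence still contribute to the denominator for as long as they were observed, which is how the method accommodates follow-up periods of different lengths.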
All statistical analyses were conducted with use of SPSS, version 10.1 (
53).
Results
Descriptive variables
The mean±SD total HCR-20 score of the 100 patients in our sample was 24.70±4.64, and the range was 11 to 36. For the historical scale, the mean was 14.14±2.79 (range, 6 to 19); for the clinical scale, 4.68±2.02 (range, 0 to 10), and for the risk management scale, 5.88±1.49 (range, 2 to 9). The proportions of patients who were violent after release were 14 percent for nonphysical violence, 15 percent for physical violence, and 22 percent for any violence. The average time from release to follow-up was 42.91±13.29 months (median, 45.27 months; range, .13 to 63.07 months).
Reliability
The ICCs of the two raters for the HCR-20 items and scales, based on the 50 overlapping patients, are listed in Table 1. A one-way random-effects model of the ICC was used for both the reliability of single-rater ratings (ICC₁) and averaged ratings (ICC₂). ICC₁ was considered the primary index of reliability, but ICC₂ was used to gauge the potential reliability of averaged ratings. For individual historical items, ICC₁ ranged from .41 (historical scale, item 4) to 1.0 (historical scale, item 7). Because the latter item reflected mere transcription of preexisting psychopathy scores (54), item 8 on the historical scale attained the highest actual interrater reliability (ICC₁=.89). Most ICC₁ values (eight of ten) were equal to or greater than .70. ICC₂ values paralleled this pattern, but, as expected, were higher. Most ICC₂ values (eight of ten) were equal to or greater than .80. ICC₁ values for the clinical scale ranged from .34 (item 5) to .69 (item 3). None of the values was greater than .70. Items on the risk management scale were problematic; ICC₁ values ranged from .01 (item 5) to .54 (item 3). None of the values was greater than .60.
Agreement for the structured final risk judgments is summarized in Table 2. The two raters agreed in the case of 35 (70 percent) of the 50 overlapping patients, and there were no "low/high-risk" errors. Chance-corrected agreement (ICC₁, or weighted kappa) was .61 (p≤.001, 95 percent confidence interval [CI]=.41 to .76); ICC₂ was .76 (p≤.001, CI=.58 to .86).
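The single-rater and averaged-rating coefficients reported above are linked by the Spearman-Brown relationship, which holds for the one-way random-effects model. The short check below is a sketch for the reader, not a description of how the authors obtained their values; the function name is hypothetical.

```python
def spearman_brown(single_icc: float, k: int = 2) -> float:
    """Reliability of the mean of k raters, given the single-rater reliability."""
    return k * single_icc / (1 + (k - 1) * single_icc)


raw_agreement = 35 / 50                # proportion of identical judgments
averaged_icc = spearman_brown(0.61)    # from the reported single-rater value
averaged_lower = spearman_brown(0.41)  # from the reported lower CI bound
averaged_upper = spearman_brown(0.76)  # from the reported upper CI bound

print(round(raw_agreement, 2), round(averaged_icc, 2),
      round(averaged_lower, 2), round(averaged_upper, 2))
```

Rounded to two decimals, the computed values match the reported figures of 70 percent agreement, .76 for the averaged-rating coefficient, and a confidence interval of .58 to .86.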
Validity
The proportions of each type of violence across structured final risk judgments of low, moderate, and high risk are shown in Table 3. These judgments were related to each type of violence. AUC values from ROC analyses for these HCR-20 clinical judgments were statistically significant for each outcome criterion, as can be seen from Table 4. AUCs for the HCR-20 structured final risk judgments varied between .68 and .74, depending on the violence index.
Kaplan-Meier bivariate survival analysis was conducted for each outcome criterion. When "any violence" was used as the dependent measure, structured final risk judgments emerged as a significant predictor (log rank=21.1, p<.001). Results were similar when "physical violence" was used as the outcome (details can be obtained from the first author). The survival function is shown in Figure 1, illustrating that patients who were judged to be high risk were more likely to be violent, and to be violent sooner, than other patients.
Multivariate analyses
Cox regression analyses were carried out with use of "any violence" as the outcome. The results were highly similar for physical violence. First, scores on the historical, clinical, and risk management scales were directly entered as block 1. On block 2, the HCR-20 structured final risk judgments were entered by using the forward conditional method. This entry procedure was used so that all the HCR-20 numerical scores were included in the final model but that HCR-20 structured final risk judgments would be included only if they significantly improved the overall model.
The results are presented in Table 5. When "any violence" was used as the outcome measure, the historical, clinical, and risk management scores together produced a significant model fit on block 1 (-2 log likelihood=181.754, χ²=9.904, df=3, p≤.05). Only the clinical scale was a significant predictor in the model. The HCR-20 structured final risk judgments were then entered as block 2 and produced a significant improvement to the model's fit (χ² change=9.828, df=1, p≤.01). The HCR-20 structured final risk judgments were most strongly related to violence, over and above the actuarial scores.
Analyses were conducted that also added potentially relevant covariates to block 1 of the Cox regression model: psychopathy score on the Psychopathy Checklist-Revised (PCL-R) (54), gender, violent index offense, critical item summation, criminogenic item summation, and numerical confidence (scored 1 to 10). This initial block was not significant, although the clinical domain was (-2 log likelihood=178.556, χ²=12.276, df=9, p=.20). This result probably stemmed from lower power associated with a higher number of predictors. Use of a forward conditional (stepwise) entry procedure resulted in a significant overall model, because fewer variables entered the overall model. The addition of the structured final risk judgments as the second block improved the model's fit (χ² change=9.615, df=1, p≤.01). Only the structured final risk judgments were significant in the final model. As such, these analyses show that structured final risk judgments (that is, clinical judgments) added incrementally not only to numerical HCR-20 scores but also to other potential predictors used actuarially.
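The significance levels attached to the model chi-square statistics can be reproduced from the chi-square distribution. The sketch below is a reader's check using only the Python standard library, not the authors' SPSS output; the function name is hypothetical, and it handles integer degrees of freedom only.

```python
import math


def chi2_upper_tail(x: float, df: int) -> float:
    """P(X >= x) for a chi-square variable with integer df >= 1."""
    if x < 0 or df < 1:
        raise ValueError("x must be >= 0 and df must be a positive integer")
    if x == 0:
        return 1.0
    half = x / 2.0
    # Closed-form starting values: df = 1 (odd df) or df = 2 (even df).
    if df % 2 == 1:
        q, a = math.erfc(math.sqrt(half)), 0.5
    else:
        q, a = math.exp(-half), 1.0
    # Recurrence on the regularized upper incomplete gamma function:
    # Q(a + 1, h) = Q(a, h) + h**a * exp(-h) / Gamma(a + 1), with a = df / 2.
    while a < df / 2.0:
        q += math.exp(a * math.log(half) - half - math.lgamma(a + 1.0))
        a += 1.0
    return q


p_block1 = chi2_upper_tail(9.904, 3)       # three HCR-20 scale scores
p_block2 = chi2_upper_tail(9.828, 1)       # change when final judgments are added
p_covariates = chi2_upper_tail(12.276, 9)  # scales plus six covariates

print(round(p_block1, 3), round(p_block2, 3), round(p_covariates, 2))
```

The first two values fall below .05 and .01, respectively, and the third is close to .20, in line with the p values reported for those blocks.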
Discussion
Actuarial and structured professional judgment models of risk assessment have been developed in response to unstructured clinical prediction. In this study we sought to evaluate the HCR-20 generally and to evaluate one aspect of the HCR-20 and the structured professional judgment model specifically—that is, the model's structured final risk judgments intended for use by clinicians. Bivariate analysis showed that such judgments predicted violence with moderate to large statistical effects. That is, AUC values from .68 to .74, converted into Cohen's d with transformational procedures provided by Dunlap (
55) and on the basis of Cohen's (
56) suggestions for guidance regarding the size of effects (d≥.80 is considered large), suggested that these AUCs were moderate to large. Importantly, the judgments added incrementally to models consisting of HCR-20 numerical indexes and to models including other potentially important covariates. These validity findings based on the HCR-20 are consistent with those of studies of other structured professional judgment measures (
35,
40).
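One common way to translate an AUC into Cohen's d assumes that scores in the violent and nonviolent groups are normally distributed with equal variance, in which case d equals the square root of 2 times the standard normal quantile of the AUC. The sketch below uses that conversion; it is an assumption made for illustration and may differ in detail from the procedure cited in the text. The function name is hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def auc_to_cohens_d(auc: float) -> float:
    """Convert an AUC to Cohen's d under an equal-variance normal model."""
    if not 0.0 < auc < 1.0:
        raise ValueError("auc must lie strictly between 0 and 1")
    return sqrt(2.0) * NormalDist().inv_cdf(auc)


d_low = auc_to_cohens_d(0.68)   # lower end of the reported AUC range
d_high = auc_to_cohens_d(0.74)  # upper end of the reported AUC range
print(round(d_low, 2), round(d_high, 2))
```

Under this assumption the reported AUCs of .68 and .74 correspond to d values of roughly .66 and .91, which straddle the conventional threshold of .80 for a large effect and so fit the description "moderate to large."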
Interrater reliability of structured final risk judgments (.61 and .76) was "good" (
49) to "substantial" (
50), without instances of low/high-risk disagreements between raters. These findings also parallel those of others (
35), who reported ICC values of .57 to .61 for structured final risk judgments with use of another measure (
27). Some authors have warned against the use of clinical judgments in risk assessment (
5,
9) and clinical decision making more generally (
10,
57,
58,
59). The study reported here, along with other research (
35,
40), suggests that this "lack of utility" position ought to be revisited with respect to structured clinical decisions made on the basis of the structured professional judgment model of risk assessment. This model addresses some of the concerns about unstructured clinical judgment, such as subjectivity, lack of attention to important risk factors, and variability across clinicians. It also has been shown in three of three direct comparisons (including in this study) to be more strongly related than actuarial predictions to relevant outcome criteria (
35,
40).
Limitations of this study include its pseudo-prospective design and use of ratings based solely on data obtained from patients' files. It is likely that both of these factors limited reliability and validity, because they precluded optimal measurement procedures and the specific targeting of study constructs by raters, and they necessitated reliance on the clinical reporting of other professionals. However, in this study file-based clinical information included numerous richly descriptive psychiatric, psychological, social work, nursing, legal, and other reports and documents. Furthermore, the design is a reasonable alternative to a true prospective design if raters are kept blinded, because it permits statements to be made about the relationship between predictors and subsequently occurring criteria.
Another limitation was the use of only two raters. This approach was used to minimize substantial rater effects. However, subsequent research will need to address such possible effects; for example, does the rater's gender or race matter? Our study was considered a prerequisite to these other necessary efforts. Furthermore, the relationship between numerical HCR-20 ratings and violence, although statistically significant (in the case of total HCR-20 scores and clinical scores), was smaller than in some previous HCR-20 studies (
12).
Whether our findings would be replicated if the numerical findings were stronger is an empirical question. Another important question is whether the findings would be generalizable if more serious forms of violence could be measured (
4). The results of our study were similar whether physical (more serious) or nonphysical (less serious) violence was used as a criterion. Similar findings have been reported for other HCR-20 studies (
12).
Measures were coded for research purposes, so HCR-20 scores did not follow the patients. However, the treating psychiatrists probably would have included an HCR-20 completed independently for clinical practice, or a risk assessment of some kind, in their discharge summaries. It is unclear what, if any, effect this practice would have on the validity of the HCR-20 indexes collected in this study for research purposes. A higher HCR-20 score could cause increased surveillance, leading to observation of more violence. It also reasonably could lead to more effective risk management and treatment, leading to fewer episodes of violence to observe. Whatever the effect, it was indirect because the outpatient clinicians did not have the HCR-20 protocols used in this study.
Conclusions
The results of this study provide reasonable support for the decision-making scheme of the structured professional judgment model of risk assessment and are inconsistent with the position that clinical decisions about violence risk are perforce unreliable and invalid. The study evaluated one piece of the structured professional judgment model. Subsequent studies should consider whether use of the structured professional judgment model actually prevents subsequent violence—an explicit goal of the structured professional judgment model (
13,
14,
17,
18,
19) and an emerging conceptual theme in the field more broadly (
15,
60,
61). That is, the structured professional judgment model intends to guide clinicians to arrive at a risk level through consideration of what risk factors are present, their salience for the individual at hand, and the attendant risk management and intervention strategies that will reduce these factors.
Acknowledgments
The authors thank Jodi Viljoen and Steph Griffiths for their research assistance. They also thank Christopher D. Webster, Ph.D., and Derek Eaves, M.B., Ch.B., who along with Dr. Hart received a grant from the British Columbia Health Research Foundation that funded the original study from which the participants in this study were selected.