In 1986, in a seminal article on the need for a theory of psychiatric treatment systems for persons with serious mental illness, Hargreaves (1) wrote, “To optimize treatment system effectiveness within available resources, we need some logical tools to help us calculate the implications of the knowledge gained from clinical trials. . . . This requires a theory of the way treatment systems interact with the life course of persons in each major target group. . . . Such a theory of mental health services must be sufficiently detailed and valid to forecast an array of impacts of proposed system changes. Such a theory would be a stimulus and guide to research, as well as a tool for program management.”
Hargreaves proposed that a stochastic model (that is, one based on outcomes expressed as probabilities) in the form of a discrete first-order stationary Markov process (that is, one consisting of a fixed spectrum of outcome states and probabilities of transitioning between states that remain the same in different time periods) provides a promising approach to the formulation of such a theory. This article assumes a theory similar to that of Hargreaves, operationally defining outcomes as transition probabilities (TPs) from functional levels (FLs) (described below) prior to the receipt of services to destination FLs after receipt and estimating how service system characteristics (also described below) affect these transitions. An example of a TP would be the probability of moving from having acute symptoms to not being acutely symptomatic after receiving recovery-oriented services for one month.
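To make the definition concrete, the following minimal sketch projects a cohort forward under a discrete first-order stationary Markov process. The two states and the monthly TPs are hypothetical values chosen only for illustration; they are not estimates from any study discussed here.

```python
# Hypothetical monthly TP matrix: rows are origin states, columns are destinations.
STATES = ["acute", "not_acute"]
TP = [
    [0.70, 0.30],  # from acute
    [0.10, 0.90],  # from not_acute
]

def step(dist, tp):
    """Advance a state distribution by one period using the TP matrix."""
    n = len(tp)
    return [sum(dist[i] * tp[i][j] for i in range(n)) for j in range(n)]

def project(dist, tp, periods):
    """Apply the same TP matrix every period (stationary); only the
    current distribution is needed for each step (first order)."""
    for _ in range(periods):
        dist = step(dist, tp)
    return dist

start = [1.0, 0.0]                  # everyone begins acutely symptomatic
after_one = project(start, TP, 1)   # distribution after one month
after_year = project(start, TP, 12) # distribution after twelve months
```

Because the matrix is reused unchanged at every step, the sketch embodies the stationarity assumption; a nonstationary model would need a different matrix for each period.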
Populating a model for program management and planning requires different types of data, but, as Hargreaves noted, perhaps the most difficult to estimate for a Markov model is data on treatment system outcomes in the form of TPs. Fortunately, since the publication of Hargreaves’ article, a number of studies of psychiatric treatment systems presenting TP data have appeared. This article presents the first random-effects meta-analysis of such data known to us. Random-effects meta-analyses synthesize the outcomes of related but not identical studies, taking into account both within- and between-study sources of sampling error (2).
For this meta-analysis, we identified studies of persons with diagnoses of schizophrenia, bipolar disorder, borderline disorder, and antisocial personality disorder, referring to these collectively as “serious mental illness.” Our theoretical rationale for this focus was not that these diagnoses provide an exhaustive account of serious mental illness (they do not [3–6]) but that persons in these groups have similar needs for community systems, are typically treated in public systems, are frequently classified in ways that align with the FL system we use, and are generally included in definitions of serious mental illness. Our practical reason was that persons with these diagnoses are frequently the focus of studies of public mental health systems.
There is widespread agreement that public community mental health systems for persons with serious mental illness are in crisis, which has resulted in high rates of incarceration and homelessness and has strained emergency room and inpatient resources. Data from 2002 and 2005 suggest that census declines in state psychiatric hospitals are reversing (7), such that some have called for a “return to the asylum” (8). Although this crisis is undoubtedly a result of resource constraints, it is also attributable to inadequate systems planning of services and unrealistic estimates of the resources required to provide these services. In 1979, Bachrach (9) noted that planning for deinstitutionalization was inadequate and later wrote (10), “Although some planners and planning agencies continue to stress the development of model programs as solutions for the varied problems of deinstitutionalization, discrepancies between isolated successful model endeavors and widespread service system failures are becoming so apparent that the need for systems-oriented planning strategies is increasingly acknowledged.”
Markov TP–based planning models have the potential to improve mental health system planning by clarifying services and resources necessary to adequately care for persons with serious mental illness. A concrete example resulting from the application of a Markov model to settle a right-to-treatment suit in Arizona was provided by Leff and colleagues (11). Although by no means guaranteeing that services and resources required will be provided, Markov planning models can alert mental health system stakeholders to relationships between needs and services, magnitude of need, and extent and consequences of shortfalls. Markov models also allow for more nuanced theories of service recipient outcomes through subgroup analyses and system component planning. As James and colleagues (12) noted, “Health state models have several distinct advantages over traditional . . . approaches to analyzing data for complex diseases such as schizophrenia. First, they provide a convenient framework for performing longitudinal analyses. . . . Second, the partitioning of the population into health states leads to a more richly informative analysis of the differences between populations than simply examining mean differences. For example, it may be the case that one population does not dominate the other in terms of overall level of health but that extreme states are more common in one group than the other. Finally, stationary distributions can be combined with a wide variety of outcome variables, such as costs [for planning].”
Markov models can be usefully contrasted with conventional growth modeling approaches. As Jung and Wickrama (13) noted, conventional growth modeling approaches assume that individuals come from a single population and that a single growth trajectory can adequately approximate an entire population. These approaches also assume that independent variables and covariates affecting growth factors influence each individual in the same way. Yet, theoretical frameworks and existing studies, such as the one reported here, often categorize individuals into distinct subpopulations differentially affected by treatments and covariates. Markov approaches more fully represent the heterogeneity of subpopulation growth trajectories within larger populations.
This study had two goals: to contribute to our theoretical understanding of psychiatric treatment system effectiveness and to generate TP inputs from multiple independent studies for more realistic Markov modeling.
Specific Objectives
In this article, we describe a methodology for meta-analyzing system outcomes in the form of Markov TPs between discrete states. We use the generic term “discrete states” (14), rather than “health states,” because some studies base states on services used rather than on functioning or symptoms. We also generated outcome estimates for analyzing the performance of systems and modeling by measuring TPs associated with different system types.
Furthermore, we tested an evidence-oriented theory of mental health systems proposing that systems consisting of more comprehensive, evidence-based, and rehabilitation-oriented services would produce better outcomes than systems that are less comprehensive. Specifically, we tested the hypothesis that TPs for service systems coded as recovery oriented (R) (services more comprehensive, evidence based, and rehabilitation oriented) would be more positive and less negative than TPs for systems coded as basic (B) (services least comprehensive, minimally evidence based, and not rehabilitation oriented) or as maintenance oriented (M) (services moderately comprehensive, treatments as usual, minimally evidence based, and rehabilitation oriented). We did so by testing predictions that TPs would be more positive for R systems than for B and M systems and would be more positive and less negative for M systems than for B systems. Our theory that service systems could be coded as B, M, and R was based on a body of evaluation and planning studies typically comparing two to four systems categorized as “lower cost,” “services as usual,” “more restrictive,” “lower quality,” or “minimal” with those categorized as “higher cost,” “enhanced,” “less restrictive,” “higher quality,” “community based,” or “evidence based” (15–20). Our goal was to better understand service recipient and study factors that might influence or bias TPs and identify scientific questions for further study in order to contribute to guidelines for collecting, synthesizing, and reporting TP data for scientific and planning purposes (21, 22).
Methods
Studies were eligible if they were in English and reported Markov analyses of treatments for persons with serious mental illness. Studies could be of any type. Bibliographic databases searched included Alt-HealthWatch (EBSCOhost); BIOSIS Previews (ISI Web of Knowledge); CAB Abstracts Archive; History of Science, Technology, and Medicine; PsycINFO (EBSCOhost); PubMed (MEDLINE); and Science Citation Index Expanded (ISI Web of Science). [More details of the bibliographic databases searched are provided in an online supplement to this article.]
The total number of candidate studies identified and retrieved was 61. A total of 42 studies (69%) were excluded because they focused on disorders or conditions other than serious mental illnesses (for example, depression, posttraumatic stress disorder, anorexia, substance use disorders, and suicide), and seven (11%) were excluded for one or more of the following reasons: states could not be cross-walked to the common FL framework; there was insufficient information provided on services to code systems as B, M, or R (the case for several studies of psychiatric medications); and other data necessary to weight probabilities in the synthesis, such as numbers of observations on which TPs were based, were not provided. Twelve usable studies (20%) remained, which provided data for 19 study-level systems (1, 12, 23–32; personal communication, Hughes D, April 2015).
Procedure
All steps in the procedure were reviewed by the first and second authors. Coding reliability for FLs and service system type was assessed (described below). Service recipient states were cross-walked to a common FL framework employed to align different states, the Resource Associated Functional Level Scale (RAFLS) (described below). Transitions between states, whether as distributions of persons or probabilities, were represented for each system as a TP matrix of originating and destination FL states. If time periods were other than one month, we converted TPs into monthly rates by assuming that clients exit from current states at an exponential rate, a standard assumption when analyzing transitions from a Markov perspective (33–35).
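The conversion to monthly rates can be sketched as follows. Under a constant (exponential) exit rate, the probability of remaining in the origin state for T months is exp(-rate × T), so the one-month static probability is the T-month static probability raised to the power 1/T. The article does not say how exits were divided among destination states; the sketch assumes, purely for illustration, that monthly exits are split in the same proportions as the observed T-month exits. The function name and the example row are hypothetical.

```python
import math

def to_monthly(row, origin, months):
    """Convert one TP row observed over `months` months to a one-month row.

    Assumes a constant exit hazard from the origin state and (an added
    assumption) that exits are split among destinations in the observed
    proportions.
    """
    stay = row[origin]
    if not 0.0 < stay <= 1.0:
        raise ValueError("static probability must be in (0, 1]")
    rate = -math.log(stay) / months      # monthly exit hazard
    monthly_stay = math.exp(-rate)       # equals stay ** (1 / months)
    exit_total = 1.0 - stay
    monthly = []
    for j, p in enumerate(row):
        if j == origin:
            monthly.append(monthly_stay)
        elif exit_total > 0:
            monthly.append((1.0 - monthly_stay) * p / exit_total)
        else:
            monthly.append(0.0)
    return monthly

# Hypothetical six-month row for persons starting in the first state.
six_month_row = [0.50, 0.30, 0.20]
monthly_row = to_monthly(six_month_row, origin=0, months=6)
```

Compounding the resulting monthly static probability over six months recovers the original six-month value, which is a quick consistency check on the conversion.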
All study-level systems were coded as B, M, or R. Variables were extracted for studies (for example, publication date), originating FL states (for example, number of observations), and populations (for example, percentage with schizophrenia). TPs for the same system types were synthesized. The random-effects option of Comprehensive Meta-Analysis was used (36, 37). Rows of TPs for originating FLs and study-level systems were assembled to create full TP matrices for B, M, and R systems, and cells were compared to test study predictions. TP matrices were characterized in terms of average net-positive TPs (ANPTPs) (defined below), and these measures were correlated, when data permitted, with service recipient and study characteristics by using meta-regression.
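For readers unfamiliar with random-effects pooling, the sketch below shows one common approach: the DerSimonian-Laird estimator applied to proportions on the logit scale. This is an illustrative assumption, not a description of the computations actually performed by the software used in this study; the function name and the counts are hypothetical, and the sketch requires every observed proportion to lie strictly between 0 and 1.

```python
import math

def pool_random_effects(events, totals):
    """DerSimonian-Laird random-effects pooling of proportions (logit scale)."""
    if len(events) != len(totals) or len(events) < 2:
        raise ValueError("need matching lists with at least two studies")
    y, v = [], []
    for e, n in zip(events, totals):
        p = e / n
        y.append(math.log(p / (1 - p)))             # logit of the observed TP
        v.append(1 / (n * p) + 1 / (n * (1 - p)))   # approximate within-study variance
    w = [1 / vi for vi in v]                        # fixed-effect weights
    fixed = sum(wi * yi for wi, yi in zip(w, y)) / sum(w)
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, y))  # Cochran's Q
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - (len(y) - 1)) / c)         # between-study variance
    w_star = [1 / (vi + tau2) for vi in v]          # random-effects weights
    pooled_logit = sum(wi * yi for wi, yi in zip(w_star, y)) / sum(w_star)
    pooled = 1 / (1 + math.exp(-pooled_logit))      # back to a probability
    return pooled, tau2, q

# Hypothetical counts of persons making a given transition in three systems.
pooled_tp, tau2, q = pool_random_effects([80, 150, 45], [100, 200, 50])
```

When the studies agree closely, the between-study variance estimate falls to zero and the result coincides with a fixed-effect pooled estimate.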
Variables
FLs.
The common FL framework for this meta-analysis was the RAFLS, a reliable and valid measure of FL for persons with serious mental illness (11). Similar FL measures have been used frequently in mental health systems evaluation and planning (38–41). The RAFLS levels are as follows: FL 1, at risk, acutely symptomatic, unable or unwilling to participate in own care; FL 2, at risk, acutely symptomatic, able and willing to participate in own care; FL 3, symptoms not acute but lacking activities of daily life (ADL) skills; FL 4, possesses ADL skills, lacks community living skills; FL 5, possesses community living skills, vulnerable to stresses of everyday life; FL 6, requires specialty care but able to function except under unusual stress; FL 7, independent of the mental health system, can use generic health and human services. [Fuller definitions of these levels are provided in the online supplement.] The cross-walk was based on definitions of consumer behaviors before and after receipt of services. When only information on transitions to and from services was provided, FL states were coded on the basis of behaviors typically associated with the services described. Likely errors associated with the latter approach are discussed below. The first and second authors coded FLs. Interrater reliability, calculated as the joint probability of agreement, was .9. Where coding differed, authors discussed discrepancies. Consensus was possible in all cases.
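The joint probability of agreement is simply the share of items that both coders coded identically. A minimal sketch, using hypothetical codes rather than the study's actual coding data:

```python
def joint_agreement(codes_a, codes_b):
    """Proportion of items on which two coders assigned the same code."""
    if len(codes_a) != len(codes_b) or not codes_a:
        raise ValueError("need two equal-length, non-empty code lists")
    matches = sum(a == b for a, b in zip(codes_a, codes_b))
    return matches / len(codes_a)

# Hypothetical FL codes assigned by two coders to ten study-level states.
coder_1 = [1, 2, 3, 3, 4, 5, 5, 6, 6, 2]
coder_2 = [1, 2, 3, 4, 4, 5, 5, 6, 6, 2]
agreement = joint_agreement(coder_1, coder_2)  # coders differ on one item
```

Unlike chance-corrected statistics such as Cohen's kappa, this measure does not adjust for agreement expected by chance.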
Service system type.
Service systems were coded as predominantly B, M, and R on the basis of system descriptions in the studies. If references were made to other articles or Web sites for fuller descriptions, these were consulted. Systems including only inpatient, emergency, and limited outpatient follow-up were coded as B. Systems also including a range of non–evidence-based community mental health center treatments and custodial services, such as day care, were coded as M. If reference was made to one or more evidence-based programs or to community support, psychosocial rehabilitation, or recovery, systems were coded as R. These categories are subsumptive because typically R systems offer the services of M systems and M systems offer the services of B systems. Systems might also be mixed; however, the number of studies available and level of detail about services did not support exploring this. The first and second authors also coded system type. Interrater reliability, calculated as the joint probability of agreement, was .8. If coding differed, authors discussed differences. Consensus was possible in all cases.
Study variables.
Study-level variables coded (Table 1) were system type, study-level system description, first author, publication date, material type, study-level state measure, and RAFLS FLs coded.
Attributes of systems coded as B, M, and R.
Table 2 lists attributes of system types: number of systems, number of service recipients (unique), number of observations for transitions or TPs, percentage of studies appearing or completed in 2000 or later, designs, study-level state measure, and data sources for state information.
Study-level Markov property variables.
The accuracy of predictions based on matrices of Markov TPs is a function of the degree to which the TPs are shown to have “Markov properties” (1, 33). Study findings were coded with respect to the three most commonly implemented tests for Markov properties: tests of “stationarity” or the stability (for example, reliability) of TPs over time (12, 23), tests comparing whether current state alone or current state in combination with other variables best predicted subsequent states (termed first- versus second-order properties) (24), and tests of the predictive validity of Markov TPs based on the ability of a set of TPs for one sample to predict transitions for different or hold-out samples.
Service recipient variables coded.
Table 3 lists sociodemographic and clinical variables of service recipients by system type: percentages of persons in studies who received a diagnosis of schizophrenia or a related diagnosis, bipolar disorder, depression, and comorbid substance abuse; average age; percentage male; and percentage white.
ANPTPs.
Exploring the relationship between TPs and service recipient and study variables through meta-regression required calculating a standardized summary measure of how well persons were being served for each B, M, and R matrix. We considered an outcome positive if there was a transition to FL 5 or 6 (including static TPs), and we considered an outcome negative when there was a transition to FL 1, 2, 3, or 4 (including static TPs). For each origin FL and each service type, we next computed the net-positive TP, equal to the probability of a positive outcome minus the probability of a negative outcome. To obtain an ANPTP measure for each matrix type, we then averaged the net-positive TPs over the origin FLs.
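The ANPTP calculation described above can be sketched as follows. The destination labels and the two-row matrix are hypothetical; following the definition as stated, destinations other than FLs 1 through 6 (for example, disappearance) count as neither positive nor negative.

```python
POSITIVE = {"FL5", "FL6"}                 # destinations counted as positive
NEGATIVE = {"FL1", "FL2", "FL3", "FL4"}   # destinations counted as negative

def net_positive_tp(row):
    """P(positive outcome) minus P(negative outcome) for one origin FL."""
    pos = sum(p for dest, p in row.items() if dest in POSITIVE)
    neg = sum(p for dest, p in row.items() if dest in NEGATIVE)
    return pos - neg

def anptp(matrix):
    """Average the net-positive TPs over all origin FLs in the matrix."""
    nets = [net_positive_tp(row) for row in matrix.values()]
    return sum(nets) / len(nets)

# Hypothetical matrix with two origin FLs; static TPs are included.
example = {
    "FL4": {"FL3": 0.05, "FL4": 0.80, "FL5": 0.10, "disappeared": 0.05},
    "FL5": {"FL4": 0.05, "FL5": 0.85, "FL6": 0.05, "disappeared": 0.05},
}
example_anptp = anptp(example)  # (-0.75 + 0.85) / 2
```

Because static TPs dominate each row, origin FLs below FL 5 contribute strongly negative values and FLs 5 and 6 strongly positive ones, so the average is sensitive to which origin FLs have data.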
Data Analysis
Comparison of TP matrices.
Our procedures yielded a matrix of 54 cells for each system type (six origin FLs and nine destination states). Our theory yielded predictions for comparing each cell. For each of the 54 comparisons, the values in one matrix can be greater than, smaller than, or tied with the values in the other. Comparisons can be made by row and by matrix. These differences can be consistent with our predictions (+) or inconsistent (–), or in the case of ties, they can be nondiscriminating (=). The sign test is a nonparametric statistical test fitting data of this type (42), calculating probabilities for numbers of +s and –s, with ties being excluded.
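A minimal sketch of an exact sign test follows. Under the null hypothesis, each non-tied comparison is equally likely to be + or –, so the number of +s follows a binomial distribution with probability .5. The function name and the counts in the example are hypothetical.

```python
from math import comb

def sign_test(n_plus, n_minus, one_sided=True):
    """Exact sign test p value; ties must already have been excluded."""
    n = n_plus + n_minus
    if n == 0:
        raise ValueError("no non-tied comparisons")
    # P(X >= n_plus) for X ~ Binomial(n, 0.5)
    upper = sum(comb(n, k) for k in range(n_plus, n + 1)) / 2 ** n
    if one_sided:
        return upper
    # P(X <= n_plus), used to form a two-sided p value
    lower = sum(comb(n, k) for k in range(0, n_plus + 1)) / 2 ** n
    return min(1.0, 2 * min(upper, lower))

# Hypothetical row: nine comparisons consistent with prediction, one not.
p_one_sided = sign_test(9, 1)
p_two_sided = sign_test(9, 1, one_sided=False)
```

The one-sided form matches a directional prediction (more +s than –s); the two-sided form applies when no direction is hypothesized.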
Meta-regression.
Using Pearson product-moment and point-biserial correlations, we related the ANPTP measure to service recipient and study variables.
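A point-biserial coefficient is the Pearson product-moment coefficient computed with one variable coded 0 or 1, so a single function covers both cases. The sketch below uses hypothetical ANPTP values and a hypothetical binary study variable; it is unweighted, whereas a full meta-regression would typically weight systems by their precision.

```python
import math

def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length lists."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two equal-length lists with at least two values")
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data: 1 = states based on functioning or symptoms, 0 = on services.
state_is_functional = [0, 0, 1, 1]
anptp_values = [0.02, 0.03, 0.05, 0.06]
r_point_biserial = pearson_r(state_is_functional, anptp_values)
```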
Results
Table 1 shows that study dates ranged from 1981 through 2013. Ten appeared as peer-reviewed articles. Two were theses. The 12 studies yielded 19 study-level systems, five of which were coded as B, seven as M, and seven as R. In eight instances, study-level states were based on FL or symptoms. In the 11 others, study-level state was based on service types or locations. Coding FL 7 was possible for at least one example of B, M, and R systems.
System Type Variables
Table 2 shows that B study-level systems were typically termed “services or clinical care as usual”; M systems were typically termed “community mental health centers or clinics”; and R systems were typically termed “community support, enhanced, or specialized programs.” Study-level systems coded R had the largest number of unique service recipients (13,675) and the largest number of transition observations (268,168). Systems coded M had the next largest numbers (5,951 and 59,528, respectively), and B systems had the lowest (3,688 and 8,713, respectively). Because our analyses of matrix differences and correlations with service recipient and study-level variables were reflective of thousands of individuals and tens of thousands of transition observations, we discuss moderate-to-high effect sizes despite the fact that they may have been associated with moderate p values. As Cohen (43) noted, “[T]he primary product of a research inquiry [should be] one or more measures of effect size, not p values.” Moreover, estimates of p values based on Ns for FLs and TP matrix cells ranging from six to 54 almost certainly would have been lower if we had had access to individual-level data.
To focus on the most notable differences among intervention types, systems coded as B were smaller and most likely to have included studies in which service recipient states were based on treatment or service types or locations (60%) and to have extracted data from patient registries (60%). M systems were intermediate in size, least likely to have appeared in studies completed in 2000 or after (43%), most likely to include descriptive studies (71%), and most likely to have used ratings data from research or evaluations (71%). R systems were the largest in size and most likely to have appeared in studies appearing or completed in 2000 or after (71%) and to have based states on FL ratings (43%).
With respect to Markov properties, stationarity was empirically disconfirmed for one (20%) B study-level system (23), for three (43%) M systems (1, 24), and for one (14%) R system (23) [see online supplement]. In all studies in which stationarity was not confirmed, the cause was a subgroup of service recipients who tended to transition less than others (1, 23, 24), thereby increasing the proportions of persons in “static” TP cells. No study provided actual analyses of how changers differed from nonchangers. However, study authors speculated that nonchangers might be persons with certain diagnoses, older persons, or persons adhering to patterns of previous service use.
Testing whether TPs had first-order properties was done as follows. Testing for one of the B systems (20%) showed that the fit between expected and observed transitions was greater if persons were grouped into those who transitioned more and less frequently (23). For two of the M systems (29%), testing indicated that a second-order model based on prior service utilization fit the observed cell value data better than a first-order model (24). For three of the R systems (43%), goodness-of-fit tests for two systems (25, 26) supported first-order properties while one system (23) was consistent with second-order properties. Despite the fact that some studies suggested higher-order models, only one (24) reported second-order Markov TPs. Predictive validity was found for all systems in which this was tested (23, 26, 27).
For no clinical or sociodemographic variable were data presented for all study-level systems (Table 3). Compared with R and M systems, B systems had higher percentages of service recipients with diagnoses of schizophrenia or related disorders, service recipients were slightly older, and the percentage of males was higher. Data on the remaining variables were too sparse for comment.
Synthesized TP Matrices
For each system type and TP cell, Table 4 summarizes the numbers of observations on which TPs were based along with the numbers of systems represented in the cell. ND indicates no data found for TPs to death. Also shown with asterisks are cells for which the probability of Q, a measure of interstudy heterogeneity, was less than the equivalent of .05 adjusted for the large number of comparisons. A total of 53 cells (35%) with data were found to have adjusted Qs with p values equivalent to <.05. These Q values probably reflect a mixture of true differences in services and service recipients between systems correctly coded as similar and coding error.
Looking across system types, TPs have common features, many found in earlier studies. First, the most common one-month TP was to the same FL. Without new arrivals, a static TP of .938, the highest in the table, would leave only 46% of original persons in that FL after 12 months. Next, for TPs from FL 3 and above (non-“acute” FLs), the next-highest TPs, grouped by whether they were forward or backward, were to immediately adjacent FLs (1). A high proportion of large changes should not be expected in short time periods. TPs from FLs 1 and 2, more “acute” states, had more variable destinations, suggesting that positive symptoms respond to medication more quickly, returning persons to a variety of baseline FLs, whereas negative symptoms and behaviors that cause people to be categorized at FL 3 and above require remediation by slower-acting psychosocial interventions.
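The 12-month figure for the highest static TP follows from compounding the monthly probability. A minimal check, assuming no new arrivals and ignoring persons who leave the FL and later return (the function name is illustrative):

```python
def share_remaining(static_tp, months):
    """Share of an initial cohort that stays in its origin FL for
    `months` consecutive months under a constant monthly static TP."""
    return static_tp ** months

remaining = share_remaining(0.938, 12)  # about 46% of the original cohort
```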
For all but one FL group (persons originating from FL 6 in R systems), disappearance and death are the only ways persons exit mental health systems. Without disappearances, the number of persons in systems increases continuously, straining system capacities. Because of this, Levin (44) has suggested that disappearance rates may be “the solution, not the problem” in providing care to meet expressed demand. Except for persons at FL 6 in R systems, there was no evidence of movement to independence from the system. Although it is possible that some persons disappearing from systems had become system independent without its being recognized, the evidence suggests that even with our most effective services, “graduating people” to system independence is a rare event that needs to be better understood. Backward movement was present for all FLs in all types of systems. Not all services work for all recipients all the time. Systems must make provisions for recipients for whom first-line services do not work or have adverse consequences.
Table 5 shows that, as predicted, ANPTPs for systems coded as R were greater than those for B (p=.01) and M (p=.02) systems. Also as predicted, average probabilities of transitioning to disappearance and of remaining the same were lower for R systems than for B and M systems, although these differences were generally small, and sign tests indicated that they could have occurred by chance. Contrary to predictions, the findings for backward movement in Table 5 show that the means of TPs indicating backward movement were higher for R systems than for B and M systems in many comparisons. Once again, sign tests showed that these differences could have occurred by chance. Nevertheless, these post hoc findings are interesting and may indicate a negative effect of “high expectation” programs on some service recipients, a finding in previous research on the effect of expectations on outcomes (45–48).
Predictions for M systems compared with B systems were in expected directions only for static rates. Differences were small, and sign tests indicated that the differences observed probably occurred by chance.
Meta-Regressions for Service Recipient and Study Variables
Service recipient variables.
The high number of cells with values of Q unlikely by chance suggests that TPs were influenced by factors in addition to originating FL and study-level system type. Although these Q values may partially reflect coding errors, several studies have suggested that differences between interventions might also be attributable to service recipient variables (1, 23, 24). ANPTPs were regressed on service recipient variables to explore this possibility.
Systems consisting solely of persons with a diagnosis of schizophrenia or related disorders had lower ANPTPs than diagnostically mixed subgroups (t=−1.99, df=12, one tailed p=.03) [see online supplement]. Systems with more persons classified as white had higher ANPTPs, although the number of systems with data was very small (N=5). Subgroups with more males had higher ANPTPs, but p was above .20. Subgroups formed on the basis of age did not differ. Multicollinearity among variables is possible. Findings also raise the possibility that ANPTP differences between service types may have been moderated by the percentage of service recipients with diagnoses of schizophrenia and by methods with which functional level states were measured. Unfortunately, lack of data prevented further analyses of these possibilities.
Study variables.
Except for the variable study-level state, all sign tests for study variables were two-tailed because our only hypothesis for these was that TPs based on functioning and symptoms would be higher than TPs based on service type or location, which could be constrained by service availability. The ANPTP for functioning- or symptom-based TPs was almost twice the size of the one for service-based TPs (.059 versus .030), suggesting that TPs based on service use should be considered low-side estimates of movement, although the value of p was .30. Correlations of ANPTPs with stationarity testing, predictive validity testing, and publication date had p values above .30, giving no reason to believe that these variables influenced ANPTPs.
Discussion and Conclusions
Discrete states from diverse studies can be aligned with a common FL framework, making synthesis of TPs possible. Mental health systems described in diverse studies can be usefully characterized as B, M, and R. Although findings for interstudy heterogeneity suggest that systems may be further subdivided, current data are insufficient for this. First-order Markov TPs are highly informative ways to represent system outcomes, although in some cases it may be desirable to characterize persons more complexly, by calculating TPs for subgroups or estimating higher-order TPs.
As hypothesized, R systems generated better outcomes than B or M systems, except that R systems also produced more backward movement, suggesting negative effects of high-expectation systems on some persons. The ubiquity of backward movements suggests that all types of systems should include services for persons who do not respond to or who are negatively affected by first-line services. Contrary to our hypotheses, M service packages did not outperform B ones in expected ways. Further research into this finding is needed, especially because M systems are common. Consistent with a theory that one size does not fit all, all systems produced diverse and complex outcomes for all FLs, with some probabilities of forward and backward movements, stasis, and disappearances. We did not find and should not expect to find “magic bullet” systems for persons with serious mental illness that produce only positive outcomes for all persons all the time. Again, all types of systems should include services for service recipients who do not respond to or are negatively affected by first-line services.
Most studies lacked TP information on death, an important omission given concerns about premature mortality (49, 50). The ways in which these new TPs will affect overall system outcomes are not immediately obvious because of the complex backward and forward nature of the TPs observed and the role that disappearance rates play. Lower disappearance rates for R systems, especially compared with M systems, could substantially increase service use and costs. It will be important to use these TPs in simulation models to explore how interactions between these variables affect system outcomes, service utilization, and costs over time (11, 51).
Simulation modeling holds promise for increasing the scientific understanding of mental health systems for persons with serious mental illness and for making system planning and implementation more realistic. The TP estimates provided here should be used in simulations to project how outcomes, service utilization, and costs can vary over time with different system configurations. We expect B, M, and R systems will be shown to differ in costs, both in simulation and in empirical studies (a review of empirical costs studies was beyond the scope of this study). To the extent that R systems are more effective than M and B systems, stakeholders are likely to prefer such systems. However, cost estimates will provide information on the extent to which R systems are affordable and will inform discussions about what system configurations are possible given resource constraints.
If mental health system evaluation and planning through simulation is to progress, researchers need to reduce interstudy heterogeneity by agreeing on a common set of methods and standards for conceptualizing, estimating, and reporting the inputs required by models for implementing and reporting clinical trials (52). TPs should be based on FL assessments, not on service utilization. TPs to death should be estimated. Disappearance should be studied to clarify its meaning and implications. There should be a common time period for TP assessments: one month seems most reasonable, although “semi-Markov models” with varying time periods are possible and should be explored (53). Studies should collect and report agreed-upon clinical and sociodemographic data to explore what works (and does not work) for whom, and certain methodological features, such as testing for stationarity, should be routinely implemented. In addition, other system attributes thought to be related to performance should be provided and included in analyses (for example, information about how mental health services are financed). The development of such data, methods, and standards will enrich the data available and improve the quality of syntheses for estimating model inputs. This is required for the good science and more realistic and detailed planning that the current crisis in treating persons with serious mental illness demands, especially given the advent of integrated care.