In comparative effectiveness studies, treatment effects are calculated by contrasting outcomes between patients who have been assigned to different treatment groups. While randomized treatment assignments are preferred, constraints on resources, timeliness of results, ethical concerns, low frequency of outcomes, and demands for patient subgroup analyses often lead psychiatric investigators to rely on observational data after patients and their physicians have self-selected their treatments (
1). In observational studies, researchers must account for how patients come to select their treatments so that estimates of treatment effects can be properly adjusted for the resulting selection biases.
In this issue of the
Journal, Leon et al. (
2) provide a strong example of an informative observational study. Their analysis was conducted to further test the 2009 Food and Drug Administration warnings that suicidal behavior could accompany the use of antiepileptic medications. To account for selection biases in this observational cohort, the authors applied an advanced statistical method called “propensity scoring” (
3,
4). When the authors analyzed panel data for 199 participants with bipolar disorder who were followed for 30 years, they found no association between antiepileptic medication use and risk of suicide attempts or completed suicides.
The natural question for the practitioner is: Do these analytic techniques lead to scientifically valid findings that can guide clinical decisions? To make this judgment, it might be helpful to explain how statisticians control for selection biases.
Leon et al. (
2) computed a treatment effect size by comparing suicidal outcomes (rate of suicide attempt or suicide) between treatment groups (patients who were exposed and who were not exposed to antiepileptic medications) after adjusting for differences in patient demographic and clinical factors. To yield valid findings, these adjustments must be based on all relevant confounding factors and computed using a correctly specified outcomes model.
To be relevant, confounding factors must 1) vary across treatment groups and 2) be expected to affect patient outcomes directly. Randomized treatment assignments that yield equivalent treatment groups are said to be unconditionally “exogenous”: no factor is expected to vary systematically across treatment groups, and calculating effect sizes reduces to simple comparisons of outcomes between groups. However, when patients and their physicians self-select treatments, treatment groups are not expected to be equivalent. Researchers analyzing the outcomes must then identify all confounding factors (e.g., in the Leon et al. study, clinical and demographic characteristics), select covariates from the data set that measure these confounding factors (e.g., in this case, prior symptom severity, suicidal behaviors, and comorbidities as clinical factors, and socioeconomic status, marital status, age, and gender as demographic factors), and then specify an outcomes model to compute effect sizes adjusted for these covariates (e.g., here, a mixed-effects, grouped-time survival model).
Outcomes models specify the outcome as the dependent variable (here, the time from the initial period to the onset of suicidal behavior, if any) and the confounding covariates, along with a treatment indicator variable, as the independent variables. The treatment indicator takes the value 1 when the patient selects the treatment of interest (e.g., is exposed to an antiepileptic in the initial period) and 0 otherwise (e.g., is not exposed). The outcomes model is fitted to the data set, and the effect size is computed from the estimated model parameters.
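To make this structure concrete, the sketch below (Python, with simulated data and hypothetical variable names) fits a simple covariate-adjusted outcomes model in which a treatment indicator and two confounding covariates predict the outcome. It uses an ordinary discrete-time logistic model purely for illustration; it is not the mixed-effects, grouped-time survival model that Leon et al. actually fitted.

```python
# Sketch of a covariate-adjusted outcomes model: a discrete-time logistic
# model fitted to simulated person-period data (hypothetical variables).
# This is NOT the mixed-effects, grouped-time survival model of Leon et al.
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 2000  # simulated person-period records

df = pd.DataFrame({
    "severity": rng.normal(size=n),           # prior symptom severity
    "low_ses": rng.integers(0, 2, size=n),    # low socioeconomic status
    "treated": rng.integers(0, 2, size=n),    # 1 = exposed to an antiepileptic
})
# Simulated outcome depends on the covariates only, so the true treatment
# effect is zero.
logit = -3.0 + 0.8 * df["severity"] + 0.5 * df["low_ses"]
df["event"] = rng.binomial(1, 1.0 / (1.0 + np.exp(-logit)))

# Outcomes model: event ~ treatment indicator + confounding covariates
X = sm.add_constant(df[["treated", "severity", "low_ses"]])
fit = sm.Logit(df["event"], X).fit(disp=0)

# The coefficient on "treated" is the covariate-adjusted (log-odds) effect
# size; exponentiate it for an odds ratio.
print(fit.params["treated"], np.exp(fit.params["treated"]))
```

The coefficient on the treatment indicator is the covariate-adjusted effect size; if a relevant confounder is omitted from the model, that coefficient absorbs the confounder's influence and the estimate is biased.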
Few medical data sets will contain all relevant confounding factors (e.g., patient access to the means to commit suicide, patient access to psychiatric care for symptom relief). To account for these unobserved covariates, instrumental variables (
5) are added to the list of independent variables in the outcomes model (e.g., here, the geographic location of patient residence, reflecting variations in gun regulations, drug trafficking enforcement, and availability of psychiatric services). Instruments must be observable in the data set, vary by treatment group, and be associated with one or more of the unobserved confounding factors. Unlike covariates, instruments are not expected to drive patient outcomes directly. Thus, any association observable in the data set between an instrumental variable and the outcome variable can be attributed to the instrument's association with one or more unobserved factors. If the observable covariates and instrumental variables included in the outcomes model reflect all relevant confounding factors, we say that the treatment assignment is exogenous conditional on the data, or “conditionally exogenous.”
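One common way instruments are operationalized in practice is a two-stage construction: treatment is first predicted from the instrument plus the observed covariates, and that prediction then replaces the treatment indicator in the outcomes model. The sketch below illustrates this two-stage least squares idea on simulated data with a hypothetical regional instrument; it uses linear models purely for simplicity and is not the estimator used in the Leon et al. study.

```python
# Two-stage least squares (2SLS) sketch with a hypothetical instrument
# ("region"). Linear models and simulated data are used purely for
# illustration; this is not the estimator used in the Leon et al. study.
import numpy as np

rng = np.random.default_rng(1)
n = 5000
covariate = rng.normal(size=n)        # observed confounder
unobserved = rng.normal(size=n)       # unobserved confounder
region = rng.integers(0, 2, size=n)   # instrument: predicts treatment but has
                                      # no direct effect on the outcome

# Treatment depends on both confounders and on the instrument.
treatment = (0.8 * region + 0.5 * covariate + 0.5 * unobserved
             + rng.normal(size=n) > 0.9).astype(float)
# Outcome depends on the confounders but NOT on treatment (true effect = 0).
outcome = covariate + unobserved + rng.normal(size=n)

def ols(y, X):
    """Ordinary least squares coefficients."""
    return np.linalg.lstsq(X, y, rcond=None)[0]

ones = np.ones(n)
# Naive adjustment is biased because "unobserved" is missing from the model.
naive = ols(outcome, np.column_stack([ones, treatment, covariate]))[1]

# Stage 1: predict treatment from the instrument plus observed covariates.
stage1 = np.column_stack([ones, region, covariate])
treat_hat = stage1 @ ols(treatment, stage1)
# Stage 2: the predicted treatment replaces the treatment indicator.
iv = ols(outcome, np.column_stack([ones, treat_hat, covariate]))[1]

print(f"naive estimate: {naive:.3f}   2SLS estimate: {iv:.3f}   (truth: 0)")
```

Because the instrument affects the outcome only through treatment, the second-stage coefficient is purged of the bias introduced by the unobserved confounder, whereas the naive covariate-adjusted estimate is not.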
The second problem is how to specify the outcomes model. Outcomes models that do not reflect the data set's true “data-generating process” are said to be
misspecified (
6). Adjusting for confounding factors using misspecified models could also lead to incorrect estimates of effect size (
7).
To address both the exogeneity and the specification problems in their outcomes model, Leon et al. summarized the covariates and instruments into a single score. This score was estimated by fitting a second model to the data set. Unlike outcomes models, these propensity models are designed to predict treatment assignment (e.g., exposed or not exposed to treatment with antiepileptics during the initial period), with the covariates and instruments as independent variables. Effect sizes are then computed by comparing outcomes between exposed and unexposed patients who have been matched on their respective propensity scores.
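A minimal sketch of this two-step logic, using simulated data and hypothetical variable names, is shown below: a logistic propensity model predicts treatment assignment from covariates and instruments, treated and untreated subjects are matched one-to-one on the estimated scores, and the effect size is the outcome difference within the matched sample. Leon et al.'s actual implementation, which works with propensity groups over patient-time intervals, differs in its details.

```python
# Sketch of propensity-score estimation and 1:1 nearest-neighbor matching on
# simulated data with hypothetical variables; Leon et al.'s implementation
# (propensity groups over patient-time intervals) differs in its details.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)
n = 1000
X = rng.normal(size=(n, 4))  # covariates and instruments
# Treatment assignment depends on the covariates/instruments.
treated = rng.binomial(1, 1.0 / (1.0 + np.exp(-(X @ [0.8, 0.5, 0.3, 0.2]))))
# Outcome depends on the first covariate only (true treatment effect = 0).
outcome = rng.binomial(1, 1.0 / (1.0 + np.exp(-(-2.0 + X[:, 0]))))

# Propensity model: predict treatment assignment from covariates/instruments.
ps = LogisticRegression(max_iter=1000).fit(X, treated).predict_proba(X)[:, 1]

# Greedy 1:1 matching of each treated subject to the closest untreated score.
available = set(np.where(treated == 0)[0])
pairs = []
for i in np.where(treated == 1)[0]:
    if not available:
        break
    j = min(available, key=lambda k: abs(ps[k] - ps[i]))
    available.remove(j)
    pairs.append((i, j))

# Effect size: difference in outcome rates within the matched sample.
t_rate = np.mean([outcome[i] for i, _ in pairs])
c_rate = np.mean([outcome[j] for _, j in pairs])
print(f"matched pairs: {len(pairs)}   risk difference: {t_rate - c_rate:.3f}")
```

In a real analysis, treated subjects whose scores have no close untreated counterpart would be left unmatched, which is the source of the discarded patient-time intervals discussed below.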
There are advantages to the Leon et al. approach. Combining covariates and instruments into a single score 1) reduces the number of free parameters in the outcomes model and thus increases the power to detect treatment effects; 2) permits more variables to be included in analyses of small samples; and 3) reduces the exogeneity problem to searching for variables that predict treatment assignment, and the specification problem to determining how patients should be divided into discrete propensity groups.
But these advantages do not come without a price. The more successfully the propensity model predicts treatment assignment, the harder it becomes to find untreated and treated patients with matchable propensity scores (e.g., in the Leon et al. study, 21% of sampled patient-time intervals could not be matched). Replacing covariates and instruments with a single score may also introduce misspecification error, because the impact of each variable on outcomes is assessed only through its association with the propensity score. When the study's purpose is to determine whether exposure to antiepileptic medication increases the hazard rate for suicidal behaviors, what is needed is the propensity for suicidal behaviors rather than the propensity for medication exposure. For instance, both severe symptoms and low socioeconomic status are positively associated with suicidal behaviors (
8), while severe symptoms but high socioeconomic status often drive the decision to use medication (
2). If these characteristics hold, then low-socioeconomic-status patients with severe symptoms would have a very different initial suicidal behavior profile than their high-socioeconomic-status counterparts with mild symptoms, although the two groups may have comparable propensity scores.
While citing prior successes is informative, findings should be tested for robustness each time an analytic method is applied to a given data set. Leon et al. did show that results were stable across different approaches to classifying patients into discrete propensity groups. However, more can be done here to help the practitioner judge the validity of the reported findings. For instance, a test for robustness inspired by White and Lu (
9) and Rubin and Thomas (
10) involves recomputing effect size estimates in which exposed and unexposed patients are rematched based on the propensity score
plus one or more selected confounding covariates (e.g., propensity scores and socioeconomic status). Since both the matched and the rematched estimates are designed to measure the same effect size, a significant difference between them would allow the investigator to reject the null hypothesis that the estimates are robust. By repeating this across different sets of selected covariates (e.g., propensity and marital status, propensity and age group), the rematched sample that yields the greatest deviation from the original effect size estimate can be identified and tested for significance by bootstrapping the original data set.
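The sketch below illustrates the rematching idea on simulated data with hypothetical variable names: the effect estimate from matching on the propensity score alone is compared with the estimate from matching on the score plus one selected covariate, and the gap between the two is bootstrapped to gauge whether it exceeds sampling variability. It is a simplified stand-in for the fuller procedure described above, not an analysis of the Leon et al. data.

```python
# Sketch of the rematching robustness check on simulated data with
# hypothetical variables: compare the effect estimate from matching on the
# propensity score alone with the estimate from matching on the score plus
# one selected covariate, then bootstrap the gap between the two.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(3)
n = 400
ses = rng.integers(0, 2, size=n)        # e.g., socioeconomic status
severity = rng.normal(size=n)
treated = rng.binomial(1, 1.0 / (1.0 + np.exp(-(0.7 * severity - 0.5 * ses))))
outcome = rng.binomial(1, 1.0 / (1.0 + np.exp(-(-2.0 + severity + ses))))

def matched_effect(features, treated, outcome):
    """Greedy 1:1 matching of treated to untreated on Euclidean distance."""
    available = set(np.where(treated == 0)[0])
    diffs = []
    for i in np.where(treated == 1)[0]:
        if not available:
            break
        j = min(available, key=lambda k: np.sum((features[k] - features[i]) ** 2))
        available.remove(j)
        diffs.append(outcome[i] - outcome[j])
    return float(np.mean(diffs))

def estimate_gap(ses, severity, treated, outcome):
    """Effect from score-only matching minus effect from score + SES matching."""
    X = np.column_stack([ses, severity])
    ps = LogisticRegression(max_iter=1000).fit(X, treated).predict_proba(X)[:, 1]
    e_score = matched_effect(ps[:, None], treated, outcome)
    e_rematch = matched_effect(np.column_stack([ps, ses]), treated, outcome)
    return e_score - e_rematch

observed_gap = estimate_gap(ses, severity, treated, outcome)

# Bootstrap the gap to judge whether it exceeds sampling variability.
gaps = []
for _ in range(100):
    b = rng.integers(0, n, size=n)
    gaps.append(estimate_gap(ses[b], severity[b], treated[b], outcome[b]))
print(f"observed gap: {observed_gap:.3f}   bootstrap SD: {np.std(gaps):.3f}")
```

Repeating this comparison over other candidate covariates and retaining the largest deviation approximates the search-and-test strategy described above.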
This discussion is intended to point to an “analysis gap” between the advanced analytic methods known to mathematical and computational statisticians and the methods that medical researchers actually apply in observational studies. The Leon et al. study offers a good example of methodologists and clinical investigators working closely together to narrow that gap and apply advanced statistical methods to observational outcome studies. As the National Institutes of Health continues its support for observational studies (
1), medical researchers should comb the statistical literature, apply the analytic methods best suited to their study purpose, and test the applicability of those methods against their data set, rather than restating theory, reciting prior successes, or limiting results to what a popular commercial software package can compute. Only then can practitioners have confidence that observational findings offer correct statistical inferences on the risks and benefits of medical treatments.