Enhancement of cognition in schizophrenia has become a major public health goal and drug development challenge. It has become increasingly clear that control of psychotic symptoms alone is not sufficient for community adaptation in schizophrenia, and efforts have been made to address unmet needs, including improvements in negative symptoms and cognitive impairment. To stimulate the development of cognition-enhancing drugs for schizophrenia, the National Institute of Mental Health (NIMH) created the Measurement and Treatment Research to Improve Cognition in Schizophrenia (MATRICS) initiative (1–3). One key product of the MATRICS initiative was the selection of a standard cognitive battery for use in clinical trials: the MATRICS Consensus Cognitive Battery (MCCB) (4–6).
Typically the U.S. Food and Drug Administration (FDA) requires that a new drug improve a single outcome measure to receive approval for marketing. However, there are specific situations in which a single outcome measure is considered insufficient because the clinical endpoint is multifaceted. In such cases, approval may depend on the drug improving two complementary measures that reflect different aspects of treatment response. These complementary measures are referred to as “coprimary outcomes.” The FDA has applied this coprimary requirement to drugs for improvement of cognition in schizophrenia: a drug for this use must show improvement both in cognitive performance and on a functionally meaningful measure in clinical trials (1). The FDA decided that measurement of cognitive performance with an accepted performance battery could not serve as a sole endpoint because it did not have an obvious and intuitive connection to improved overall outcome.
Although the endpoint for cognitive performance went through an extensive consensus process (2, 3), the FDA did not provide firm guidance on the definition of a functionally meaningful coprimary outcome measure. Community functioning is a logical coprimary measure, but it is not practical because it is unlikely to change in the course of a typical clinical trial. In the context of clinical trials of cognition-enhancing drugs, practical coprimary measures might be intermediate between cognitive performance and daily functioning, for example, performance-based simulations of daily activities or interviews for cognition (see below). The FDA requirement for a coprimary measure presents a challenge to the field because of the absence of validated measures for this purpose.
MATRICS Coprimary and Translation (MATRICS-CT) is an NIMH initiative to further facilitate development of pharmacological agents for cognitive impairments in schizophrenia (7). A partnership of pharmaceutical companies supports it through donations to the Foundation at National Institutes of Health to address two remaining issues from the MATRICS initiative. One is to evaluate potential coprimary measures using a consensus process and collection of empirical data; these findings are summarized in this article. The second is to develop and validate high-quality translations of the MCCB for international trials (see http://www.matricsinc.org).
Potential coprimary measures use two approaches: performance-based and interview-based assessments. Performance-based (also called functional capacity) measures assess capacity to perform key tasks of daily living by asking participants to simulate real-world activities such as holding a social conversation, selecting grocery items to prepare a meal, and planning a trip using public transportation (8, 9). Good performance on such measures means that the person has the ability to perform the task, but not necessarily that they will perform the task in the community. Interview-based approaches ask people to estimate their cognitive abilities or the extent to which their daily lives are affected by cognitive impairment. Recently, cognitive assessment interviews have been developed specifically for patients with schizophrenia (10, 11).
An initial examination of coprimary measures was conducted in the MATRICS Psychometric and Standardization Study (PASS). The coprimary measures included in that study were suggested by the MATRICS Outcome Committee (A. Bellack, chair) but did not undergo a systematic selection process, as that was not the primary goal (12). MATRICS-CT sponsored a validation study called the Validation of Intermediate Measures (VIM) Study specifically to evaluate potential coprimary measures that were systematically selected. The study was designed and overseen by the MATRICS-CT VIM Committee, the members of which are coauthors of this article. The study aims were to examine 1) the psychometric characteristics of selected coprimary measures (test-retest reliability, interrater reliability, and utility as a repeated measure), 2) the validity of the measures (correlation with cognitive performance and community functioning), and 3) the measures' practicality and tolerability (ease of setup, tester training and scoring, amount of missing data, assessment duration, and subject satisfaction ratings).
Method
Study Design and Participants
Clinical interviews were administered to determine eligibility and to assess cognitive performance and community functioning at baseline. Assessments on all coprimary measures and clinical symptoms were completed at baseline and at a 4-week follow-up.
To be eligible, participants had to be outpatients 18–60 years of age with a DSM-IV diagnosis of schizophrenia based on a diagnostic interview for this study or a previous one; have an understanding of spoken English adequate to comprehend testing procedures; have the ability to comprehend the consent form; and not have previously received the performance-based intermediate measures in this study, the MCCB, or similar cognitive assessment, within 6 months of study entry. Participants had to be clinically stable, as indicated by having no significant psychotropic medication changes in the past 2 months and none anticipated for the next month; showing evidence of stable symptomatology for at least 3 months; having Positive and Negative Syndrome Scale (PANSS) scores ≤ 4 (moderate) on P1 (delusions), P2 (conceptual disorganization), P3 (hallucinatory behavior), P5 (grandiosity), P6 (suspiciousness), and G8 (unusual thought content); having a PANSS score ≤ 15 on the negative symptoms subscale; and showing evidence that mood symptoms, if present, had been stable for at least 3 months.
Exclusion criteria were alcohol or other substance dependence in the past 6 months; alcohol or other substance abuse in the past 3 months; clinically significant neurological disease; head injury with loss of consciousness for more than 1 hour; a current medical condition that would interfere with valid assessment; dystonia or parkinsonism that would affect the validity of assessment; pregnancy or nursing; and current use of clozapine, potentially procognitive medications, antidementia medications, amphetamine, lithium, monoamine oxidase inhibitors, or tricyclic antidepressants. No benzodiazepines, sedatives, or anticholinergic medications were administered within 12 hours of assessment. After receiving a complete description of the study, participants provided written informed consent.
Sites
Four sites were selected by the VIM Committee. Each site had extensive experience in conducting schizophrenia clinical trials and local expertise in cognitive and performance-based assessment. Two academic sites (UCLA/Greater Los Angeles VA Healthcare System, the coordinating site in Los Angeles, and Harvard/Beth Israel Deaconess in Boston) and two freestanding clinical trial sites (Collaborative Neuroscience Network in Garden Grove, Calif., and Uptown Research Institute in Chicago) were selected.
Selection of Coprimary Measures
The selection process for study measures was modeled on the consensus and RAND panel process used by MATRICS for MCCB measure selection (see Figure S1 in the data supplement that accompanies the online edition of this article). Briefly, a MATRICS-CT subcommittee (K. Nuechterlein, chair) determined key criteria for selection of coprimary measures. Nominations were solicited broadly through announcements and e-mailings for measures that were categorized either as performance- or interview-based. The VIM Committee selected a subset of nominated measures for further evaluation. A comprehensive database of selected measures was developed by UCLA staff according to the evaluation criteria. The VIM Committee convened a RAND panel meeting in February 2008 (13) to review the database and make recommendations for the VIM study. The RAND panelists (see acknowledgments) were selected as excellent representatives of their respective areas of expertise and for absence of any conflict of interest with the coprimary measures under consideration. Based on the RAND panel ratings and discussion, the VIM Committee selected the study measures described below.
Performance-Based Measures
Independent Living Scales (ILS)
The ILS assesses adults' competence in instrumental activities of daily living (14). The items, which target situations relevant to independent living, require the examinee to solve problems, demonstrate knowledge, or perform a task. The ILS includes 70 items in five subscales: memory/orientation, managing money, managing home and transportation, health and safety, and social adjustment. The test yields two factors: problem solving, comprising primarily items that require knowledge of relevant facts, abstract reasoning, and problem solving ability; and performance/information, comprising primarily items that require general knowledge, short-term memory, and performance of simple, everyday tasks. The full scale score is a standardized score with a mean of 100 and a standard deviation of 15.
Test of Adaptive Behavior in Schizophrenia (TABS)
The TABS includes five test areas (medication management, empty bathroom, shopping skills, clothes closet, and work and productivity) and one observed area (social skills) to assess skills needed for daily functioning (15). It focuses on initiation and problem identification. Props are used, such as pill containers in the medication management component and doll clothing in the clothes closet component. TABS scores are calculated as percent correct for each area; thus, the scores range from 0 to 100 per area. The total score is the mean of the six areas.
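The TABS scoring rule described above can be illustrated with a minimal sketch. The area identifiers, the function names, and the points-earned/points-possible inputs are hypothetical and chosen only for illustration; the actual item counts and scoring procedures are defined in the TABS manual, not here.

```python
from statistics import mean

# The six TABS areas named in the text (five test areas plus the observed
# social skills area). The identifier spellings are illustrative only.
TABS_AREAS = (
    "medication_management",
    "empty_bathroom",
    "shopping_skills",
    "clothes_closet",
    "work_and_productivity",
    "social_skills",
)


def tabs_area_score(points_earned: float, points_possible: float) -> float:
    """Percent correct for one area, on a 0-100 scale."""
    if points_possible <= 0:
        raise ValueError("points_possible must be positive")
    if not 0 <= points_earned <= points_possible:
        raise ValueError("points_earned must lie between 0 and points_possible")
    return 100.0 * points_earned / points_possible


def tabs_total_score(area_scores: dict[str, float]) -> float:
    """Total score: the unweighted mean of the six area percentages."""
    missing = [area for area in TABS_AREAS if area not in area_scores]
    if missing:
        raise ValueError(f"missing areas: {missing}")
    return mean(area_scores[area] for area in TABS_AREAS)
```

Because the total is an unweighted mean of percentages, each area contributes equally regardless of how many items it contains.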
UCSD Performance-Based Skills Assessment (UPSA)
The UPSA was designed to assess ability to perform everyday tasks needed for independent community functioning (16, 17). The UPSA evaluates five areas: household chores, communication, finance, transportation, and planning and recreational activities. It uses role play tasks that are administered as simulations of events that the person may encounter in the community. Raw scores from each subtest are transformed to yield comparable scores (ranging from 0 to 20) for each and a summary score ranging from 0 to 100 (higher scores reflect better performance).
Short Forms
The three performance-based measures were evaluated in their full forms. In addition, the TABS and UPSA each had a short form, and the subtests that made up the short forms were administered first to allow separate evaluation. The TABS short form included the medication and work and productivity subtests; the UPSA short form included the communication and finance components. Administering the short forms of these tests saves an estimated 15 minutes. The ILS does not have an identifiable short form, but the instrument yields two factor scores (performance and problem solving) that were evaluated separately.
Interview-Based Measures
Cognitive Assessment Interview (CAI)
The CAI (18) is derived from two interview-based instruments, the Clinical Global Impression Scale for Cognition (11) and the Schizophrenia Cognition Rating Scale (10). Item response theory was used to select the items that provided the most information about the latent construct of interest, namely, interview-assessed cognitive deficit. From the original 41 items in both scales, 10 were selected that performed best across analyses, at several levels of the cognitive deficit construct, and showed good internal consistency. The CAI includes items that assess six of the seven MATRICS cognitive domains (all except visual learning). Items are rated on a 7-point scale with defined anchor points referenced to healthy people of similar educational and sociocultural background. Higher scores reflect greater cognitive deficits that affect everyday functioning and/or greater need for support in performing those functions. In addition to the total score (the sum of the 10 items), which was the dependent measure in this study, the CAI includes a global rating of cognition (on a 100-point scale) similar to a Global Assessment of Functioning score (19).
Clinical Global Impression Scale for Cognition (CGI-Cognition)
In an experimental component of the VIM study, a single-item 7-point scale was included to assess whether clinical raters can reliably rate cognitive impairment solely on the basis of a clinical symptom interview. The CGI-Cognition was modeled on the commonly used Clinical Global Impressions scale for symptom severity and did not include a manual or detailed anchor points.
Additional Measures
The VIM study included three additional measures. The MCCB global composite score (4, 5) was included to examine the relationship of the coprimary measures to cognitive performance. The Heinrichs-Carpenter Quality of Life Scale summary score (20) was used to examine relationships with community functioning. This score excluded seven intrapsychic items that assess negative symptoms. The PANSS was used to evaluate psychopathology (21).
Training on each of the coprimary measures was conducted at an in-person start-up meeting and in subsequent teleconferences. Trainees had prior experience working with schizophrenia patients. Training on the CAI, TABS, and UPSA was conducted by a developer of the measure. ILS training was provided by an investigator with extensive experience with that measure.
To ensure independence of the ratings, each site used at least three raters. One determined study eligibility, collected demographic information, and administered the PANSS, CGI-Cognition, and Quality of Life Scale at baseline and the PANSS and CGI-Cognition at week 4. A second completed the performance-based intermediate measures at both assessment points and the MCCB at baseline, and a third rater completed the CAI at both assessment points. Assessments were completed with information from participants only; although informants' ratings can be included in the CAI and Quality of Life Scale, they were not used in this study, in order to approximate the likely conditions of multisite clinical trials. The performance-based measures were administered in a counterbalanced order to allow examination of order effects.
Results
A total of 196 participants gave consent and were screened. Of these, 166 received baseline assessments (27 were ineligible; three withdrew consent). Three patients were excluded because of invalid data, leaving 163 participants with valid baseline assessments. Of these, 144 (88.3%) were tested at the 4-week follow-up.
The sample was comparable to other clinically stable samples in schizophrenia trials in gender, age, education, and symptom severity (22, 23). The sample was about one-third female (N=58, 35.6%), with a mean age of 43.9 years (SD=10.1) and a mean of 12.3 years of education (SD=2.1). The mean duration of illness was 20.3 years (SD=10.6, range=1–43). Participants were clinically stable over the course of the study; the mean total PANSS score was 61.5 (SD=12.6) at baseline and 61.8 (SD=13.2) at 4 weeks.
The summary data for the main variables are presented in Table 1. All scores were inspected for range and distribution, and no transformations were deemed necessary. The MCCB score indicates that the sample was slightly more than two standard deviations below the mean for age- and gender-corrected norms, comparable to the MATRICS PASS study (24). There were no significant order effects.
Key Scientific Criteria
In evaluating the measures, an a priori decision was made to prioritize test-retest reliability and correlation with cognitive performance, characteristics that are considered to be most important for a coprimary measure in trials of cognition-enhancing drugs. Test-retest reliability translates directly into power estimates and sample size requirement, and it was considered the most important test characteristic during the MATRICS initiative (3). Correlation with cognitive performance was also considered essential given the definition of a coprimary measure: for a drug to receive FDA approval for cognition enhancement, it has to significantly improve both cognitive performance and the coprimary measure. Hence, correlation of the two outcome measures is viewed as desirable so that a drug would not need to affect two independent constructs.
Psychometrics and validity data are presented in Table 2. Regarding test-retest reliability, 0.70 is a conventional cutoff for acceptability for measures of this type. The ILS, CAI, and UPSA had intraclass correlation coefficients (ICCs) above this value, and the TABS and CGI-Cognition just below (ICC=0.69). For the short forms, test-retest reliability values were all very close to 0.70. In addition to test-retest reliability, interrater agreement is an important consideration for the CAI. Interrater reliability (ICC=0.73) was determined on the basis of a set of eight tapes that were rated by all study CAI raters.
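For readers unfamiliar with the statistic, the following is a minimal sketch of how an ICC can be computed from baseline and follow-up scores. The text does not state which ICC form was used, so the choice here of ICC(2,1) (two-way random effects, absolute agreement, single measurement) is an assumption, and the function name is hypothetical.

```python
def icc_two_way_absolute(scores: list[list[float]]) -> float:
    """ICC(2,1): two-way random effects, absolute agreement, single measurement.

    scores[i][j] is participant i's score at occasion (or by rater) j.
    This ICC form is an assumption; the study does not specify which was used.
    """
    n = len(scores)
    k = len(scores[0]) if n else 0
    if n < 2 or k < 2 or any(len(row) != k for row in scores):
        raise ValueError("need a complete table with >= 2 rows and >= 2 columns")

    grand = sum(sum(row) for row in scores) / (n * k)
    row_means = [sum(row) / k for row in scores]
    col_means = [sum(row[j] for row in scores) / n for j in range(k)]

    # Sums of squares for the two-way layout without replication.
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in scores for x in row)
    ss_error = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)                 # between-participant mean square
    ms_cols = ss_cols / (k - 1)                 # between-occasion mean square
    ms_error = ss_error / ((n - 1) * (k - 1))   # residual mean square

    denominator = ms_rows + (k - 1) * ms_error + k * (ms_cols - ms_error) / n
    return (ms_rows - ms_error) / denominator
```

For test-retest reliability, each row would hold one participant's baseline and 4-week scores (k=2); for interrater reliability, each row would hold one tape and each column one rater. An absolute-agreement form penalizes systematic shifts between occasions, such as practice effects, whereas a consistency form would not.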
For correlation with cognitive performance, there were considerable differences among the full measures. Overall, the performance measures had much greater overlap with cognitive performance (up to 45% shared variance with the UPSA) than the interview-based measures (5% and 14% shared variance for the CAI and CGI-Cognition, respectively). Paired contrasts are reported in a footnote to Table 2. For the short forms, three of the measures were identical (28% shared variance); the ILS problem solving component was lower (15%).
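Shared variance is the square of the correlation coefficient, so the percentages above can be converted back to correlation magnitudes: 45% corresponds to |r| of about 0.67, 28% to about 0.53, and 5% to about 0.22. The sketch below shows the computation; the function names are hypothetical, and the sign of each correlation cannot be recovered from the shared variance alone.

```python
from math import sqrt


def pearson_r(x: list[float], y: list[float]) -> float:
    """Pearson product-moment correlation between two equal-length samples."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (>= 2)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return cov / sqrt(var_x * var_y)


def shared_variance(r: float) -> float:
    """Proportion of variance shared by two measures: r squared."""
    return r * r


def implied_abs_r(shared: float) -> float:
    """Magnitude of the correlation implied by a shared-variance proportion."""
    if not 0 <= shared <= 1:
        raise ValueError("shared variance must lie between 0 and 1")
    return sqrt(shared)
```

For example, `implied_abs_r(0.45)` gives the correlation magnitude behind the UPSA's 45% overlap with the MCCB composite.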
Additional Scientific Criteria
Table 2 also presents data for utility as a repeated measure and relationship to community functioning. All of the measures had small to modest practice effects (the largest effect sizes were 0.24 for both full measures and short forms). The number of scores at floor or ceiling was not considered to be problematic for any measure at either testing time. Correlation with community functioning was generally low for all measures. The CGI-Cognition and the short forms of the TABS and UPSA had the lowest correlations, in the range of 0.12–0.15. Other measures were in the 0.23–0.30 range.
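A practice effect of this kind is commonly expressed as a standardized mean change from baseline to retest. The sketch below uses one common definition (mean change divided by the baseline standard deviation); the text does not specify the exact formula used in the study, so this definition and the function name are assumptions.

```python
from statistics import mean, stdev


def practice_effect_size(baseline: list[float], retest: list[float]) -> float:
    """Standardized mean change: (mean retest - mean baseline) / SD of baseline.

    Assumes paired scores in the same participant order. Other definitions
    (for example, dividing by the SD of the change scores) give other values.
    """
    if len(baseline) != len(retest) or len(baseline) < 2:
        raise ValueError("need paired samples with at least two participants")
    sd_baseline = stdev(baseline)
    if sd_baseline == 0:
        raise ValueError("baseline scores have zero variance")
    return (mean(retest) - mean(baseline)) / sd_baseline
```

On this scale, a value of 0.24 means that the average retest score was about a quarter of a baseline standard deviation higher than the average baseline score.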
Practicality and Tolerability
Data on practicality and tolerability were collected only on full measures (Table 3), and they were not collected on the CGI-Cognition, which is a single-item measure. Practicality and tolerability were each rated on a 7-point scale, with higher scores indicating better ratings. Practicality was rated by five testers for ease of set-up, administration, and scoring. Study participants rated tolerability based on how pleasant or unpleasant they found the test to be. Practicality showed a range across measures, with the UPSA scoring the best and the CAI the poorest. All of the measures were well tolerated by patients, with mean scores ranging from 5.4 to 6.0. The administration times differed notably across measures; the ILS took the longest time (46 minutes), and the CAI the shortest (25 minutes). Each administration time was statistically different from every other. The amount of missing data was very small for all measures (1.6% for the TABS, 0.6% for the UPSA, and none for the others, across both assessments).
Relationships to Symptoms
Table 4 presents the correlations with the five factors of the PANSS (25). In general, the correlations with positive and negative symptoms were relatively low. The largest correlations were seen with the disorganized thought factor. The measure with the largest correlation with clinical symptoms was the CGI-Cognition, which was completed by the PANSS rater.
Site Differences
One-way analyses of variance were conducted to examine site effects. There were no site differences in age and illness chronicity, but there were in education (F=8.72, df=3, 162, p<0.001) and PANSS total score (F=59.02, df=3, 162, p<0.001). Participants at the two academic sites were more educated and less symptomatic than those at the two freestanding sites. There were site differences on all coprimary measures (ILS: F=7.81, df=3, 163, p<0.001; TABS: F=4.38, df=3, 159, p=0.005; UPSA: F=5.86, df=3, 161, p=0.001; CAI: F=23.86, df=3, 162, p<0.001; CGI-Cognition: F=4.79, df=3, 162, p=0.003). Differences for the three performance-based measures were mainly due to better performance among the patients at UCLA; differences in the interview-based measures were primarily due to greater impairment among participants at Collaborative Neuroscience Network.
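As an illustration of the site comparison, a one-way analysis of variance can be sketched as follows. This is a generic textbook computation, not the authors' analysis code, and the function name is hypothetical. With four sites, the between-groups degrees of freedom equal 3, as in the values reported above.

```python
from statistics import mean


def one_way_anova_f(groups: list[list[float]]) -> tuple[float, int, int]:
    """Return (F, df_between, df_within) for a one-way ANOVA.

    Each inner list holds the scores from one group (here, one study site).
    """
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    if k < 2 or any(len(g) == 0 for g in groups) or n_total <= k:
        raise ValueError("need >= 2 non-empty groups and more scores than groups")

    grand_mean = sum(sum(g) for g in groups) / n_total
    group_means = [mean(g) for g in groups]

    # Variation of group means around the grand mean, weighted by group size.
    ss_between = sum(len(g) * (m - grand_mean) ** 2
                     for g, m in zip(groups, group_means))
    # Variation of individual scores around their own group mean.
    ss_within = sum((x - m) ** 2
                    for g, m in zip(groups, group_means) for x in g)
    if ss_within == 0:
        raise ValueError("F is undefined when within-group variance is zero")

    df_between = k - 1
    df_within = n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within
```

A large F indicates that the site means differ by more than would be expected from the variability of participants within sites.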
Discussion
In this four-site study, we examined the psychometrics, validity, and practicality of candidate performance- and interview-based measures for coprimary outcome in clinical trials of cognition enhancement in schizophrenia. Assessments were conducted at baseline on 163 participants and on more than 88% at 4-week follow-up. Full measures and short forms were evaluated separately. The full forms of the performance-based measures performed the best overall.
Regarding the key scientific criteria, all of the full measures and short forms had acceptable test-retest reliability, with correlations around 0.70, and the ILS, CAI, and UPSA exceeded that threshold. For the interview-based CAI, interrater reliability was an additional source of variability. VIM study raters received extensive, ongoing training by one of the CAI developers, and the same rater completed both ratings 88% of the time. Studies that use the same rater less frequently should expect lower levels of reliability than those reported here.
Regarding the relationship of these measures to cognitive performance, there were significant and substantial differences among them, with the performance-based measures showing much more overlap (shared variance of 26%–45%) compared with the interview-based measures (5%–14%). With only 5% shared variance between the CAI and the MCCB, it appears that the CAI primarily measures a different construct than cognitive performance.
Differences among measures were not as pronounced for the secondary scientific criteria (utility as a repeated measure and community functioning), and all of the measures were well tolerated. Correlations with community functioning were relatively low for all measures, perhaps in part because of the absence of informant ratings for the Quality of Life Scale. The results also suggest that being able to perform an activity does not necessarily mean that the person does so in the community. The full measures differed substantially in administration time, with the ILS taking the longest.
Based on these data, the VIM Committee considered the UPSA to be the leading coprimary measure among the full measures because it had several strong features: good test-retest reliability, excellent shared variance with cognitive performance, good utility as a repeated measure with no problematic floor or ceiling effects, and reasonable tolerability and practicality. Among the short forms, three of the measures performed comparably across the criteria (the TABS, the UPSA, and the ILS performance factor). The committee considered the TABS and UPSA short forms to have an advantage over the ILS performance factor because the short forms are self-contained. The ILS performance factor includes items administered throughout the test, and it is not known whether the same psychometric properties and validity would be obtained if those items were administered without the rest of the test.
The study findings reveal the inherent trade-offs in using short forms of the coprimary measures. Although these short forms save some administration time, they have lower reliability and lower shared variance with cognitive functioning. For the UPSA, the difference in reliability between short form and full measure (ICC=0.69 compared with ICC=0.74) could translate into a notable difference in the sample size required for adequate statistical power, depending on the study design and covariates. Similarly, the difference in shared variance with cognitive performance (28% compared with 45%) influences the confidence that improvement in cognitive performance would be accompanied by improvement in the coprimary measure.
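To make the reliability trade-off concrete, the sketch below estimates the sample size per group for a simple two-arm comparison. It rests on several assumptions that are not from the study: a classical test theory model in which the observed standardized effect equals the true effect multiplied by the square root of the reliability, a two-sided test with a normal approximation, and no covariates. The function name and the default alpha and power values are hypothetical.

```python
from math import ceil, sqrt
from statistics import NormalDist


def n_per_group(true_effect: float, reliability: float,
                alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate participants per arm for a two-group mean comparison.

    Assumes measurement error attenuates the standardized effect by
    sqrt(reliability) and uses n = 2 * (z_alpha/2 + z_power)^2 / d_observed^2.
    """
    if not 0 < reliability <= 1:
        raise ValueError("reliability must lie in (0, 1]")
    if true_effect == 0:
        raise ValueError("true_effect must be non-zero")
    observed_effect = true_effect * sqrt(reliability)
    z_total = (NormalDist().inv_cdf(1 - alpha / 2)
               + NormalDist().inv_cdf(power))
    return ceil(2 * z_total ** 2 / observed_effect ** 2)


# Under these assumptions, required n scales with 1 / reliability, so the
# short form (0.69) needs about 0.74 / 0.69 times the sample of the full form.
ratio_short_to_full = 0.74 / 0.69
```

Under this simplified model the short form would require roughly 7% more participants than the full UPSA; as the text notes, the actual difference depends on the study design and covariates.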
One limitation of this study is that it was conducted in the United States and used English-language versions of the tests. The RAND panel commented that some nominated coprimary tests would be particularly hard to adapt in other cultures (e.g., those that involve videotaped stimuli or specific skills training components), and these tests were not included in the VIM study. There are data indicating that modified versions of coprimary tests perform similarly across different Western cultures (26), but we do not know the extent to which these measures will need modification for broader international applications. A separate component of MATRICS-CT conducted international surveys with clinicians to start to address the question of global adaptation. Another limitation is that the study was not a treatment trial and therefore was not designed to assess the sensitivity of coprimary measures to change in the context of cognitive enhancement. Nonetheless, the VIM study provides psychometric and validity data on the coprimary measures that suggest the likelihood of detecting underlying changes in cognition when such changes occur.
Placed in the broader context of the MATRICS and MATRICS-CT initiatives, this study is an important methodological step in a pathway to FDA drug approval for critical unmet needs in schizophrenia. The rationale for these initiatives is that substantial gains in recovery from schizophrenia will require treatments for cognitive impairment and negative symptoms. The pathway for drug approval for cognition has specific requirements: a scientific consensus about definition and measurement of cognitive performance and a coprimary measure with more face validity for patient improvement. Unlike the situation with cognitive performance measures, the FDA has not taken the position that a single coprimary measure be identified. The data from this study provide guidance for selection among currently available measures, as well as guidance for development of new coprimary measures in this rapidly developing area. From the clinician's perspective, the study serves as a reminder that schizophrenia patients who live in the community with relatively few and stable symptoms have substantial cognitive impairments that are correlated with difficulties in performing clinically meaningful daily tasks, such as planning an outing to the park or using the telephone to make an appointment. Development of treatments for these impairments is an urgent need.
Acknowledgments
The authors thank the members of the MATRICS-CT (Coprimary and Translation) Scientific Board, which consisted of representatives from academia, the pharmaceutical industry, NIMH, and the Foundation at NIH. The board provided excellent input and guidance for this study. The authors also thank the members of the MATRICS-CT RAND panel who are not listed as authors on this paper: Drs. Deanna Barch, John Brekke, Judith Cook, Patrick Corrigan, Michael Egan, Helena Kraemer, William Lawson, Andy Leon, Steve Romano, and Sophia Vinogradov.