New Research

Published Online: 1 January 2013

DSM-5 Field Trials in the United States and Canada, Part II: Test-Retest Reliability of Selected Categorical Diagnoses

Darrel A. Regier, M.D., M.P.H., William E. Narrow, M.D., M.P.H., Diana E. Clarke, Ph.D., M.Sc., Helena C. Kraemer, Ph.D., S. Janet Kuramoto, Ph.D., M.H.S., Emily A. Kuhl, Ph.D., and David J. Kupfer, M.D.

Publication: American Journal of Psychiatry

Volume 170, Number 1

https://doi.org/10.1176/appi.ajp.2012.12070999

Abstract

Objective

The DSM-5 Field Trials were designed to obtain precise (standard error <0.1) estimates of the intraclass kappa as a measure of the degree to which two clinicians could independently agree on the presence or absence of selected DSM-5 diagnoses when the same patient was interviewed on separate occasions, in clinical settings, and evaluated with usual clinical interview methods.

Method

Eleven academic centers in the United States and Canada were selected, and each was assigned several target diagnoses frequently treated in that setting. Consecutive patients visiting a site during the study were screened and stratified on the basis of DSM-IV diagnoses or symptomatic presentations. Patients were randomly assigned to two clinicians for a diagnostic interview; clinicians were blind to any previous diagnosis. All data were entered directly via an Internet-based software system to a secure central server. Detailed research design and statistical methods are presented in an accompanying article.

Results

There were a total of 15 adult and eight child/adolescent diagnoses for which adequate sample sizes were obtained to report adequately precise estimates of the intraclass kappa. Overall, five diagnoses were in the very good range (kappa=0.60–0.79), nine in the good range (kappa=0.40–0.59), six in the questionable range (kappa=0.20–0.39), and three in the unacceptable range (kappa values <0.20). Eight diagnoses had insufficient sample sizes to generate precise kappa estimates at any site.

Conclusions

Most diagnoses adequately tested had good to very good reliability with these representative clinical populations assessed with usual clinical interview methods. Some diagnoses that were revised to encompass a broader spectrum of symptom expression or had a more dimensional approach tested in the good to very good range.

A crucial issue both for clinical decision making and for clinical research progress in psychiatry is the quality of diagnosis, that is, the reliability and the validity of diagnosis. Since the 1970s, the validity of psychiatric diagnoses has largely been supported by expert clinical consensus, based on a wide range of clinical experience and increasingly buttressed by basic, clinical, and epidemiological research (1–3). DSM work groups have been assigned to evaluate all available research evidence for the existing diagnoses and to propose any necessary modifications. More than a decade has been devoted to the DSM-5 expert analysis effort, which has been extended to include a web site invitation (www.dsm5.org) for substantial comments from the interested general public, including clinicians and research investigators. Proposed revisions have been subjected to the field trials reported here and to reviews, independent of the work group members, which focus on their validity and clinical utility.

Following a historic recommendation by Stengel (4) and demonstration that the use of more explicit diagnostic criteria could improve the reliability of international psychiatric diagnosis (5, 6), the National Institute of Mental Health (NIMH) supported a number of research studies (7, 8) using explicit Research Diagnostic Criteria (9, 10). Since then, the major focus of the field trials—those for DSM-III (11), DSM-IV (12), and ICD-10 (13, 14)—has been on the reliability of diagnosis, that is, the degree to which the diagnosis of a mental disorder by one clinician would likely be replicated by a second clinician interviewing the same patient. Although the use of explicit “operational criteria” is essential for obtaining reliable clinical assessments, the reliability of such criteria does not guarantee that they are the most valid representation of an underlying pathological process. Criteria for validating diagnoses have been used for the past 40 years (15), among them tests of diagnostic boundaries to see if they clearly separate disorders from each other, have a common clinical course, have a common response to treatment, cluster in families, and have common laboratory test findings (which more recently may include temperament/personality traits, neurocircuitry imaging, pathophysiology, and genetic markers) (16).

However, if the diagnostic criteria defining a disorder in a given group of patients cannot be assessed reliably by two or more clinicians, then patients with those diagnoses cannot be expected to have common treatment responses or similar etiological and laboratory findings. In this sense, without reliability, there can be no validity of a diagnosis. Thus, reliability studies set the stage for validity studies beyond face/construct validity. On the other hand, it is possible to have reliability without validity—if there are only partial criteria (such as a fever) that can be reliably assessed and there are missing criteria necessary to identify a “valid” homogeneous group with a disease or disorder characterized by a common underlying pathological process, clinical course, or causal agents.

Previous Field Trials

In order to understand the difference between the results of DSM-5 and previous DSM field trials, it is critical to note the differences in the conceptual approach and design of these trials. With DSM-III (17), the major focus was on demonstrating that the introduction of explicit diagnostic criteria, as recommended by Stengel (4), could improve reliability, that is, the degree to which two or more clinicians could agree on a diagnostic assessment. The focus on reliability has been a highly successful innovation for psychiatry and has clearly advanced clinical communication and the feasibility of more productive research investigations.

Critical to assessment of the reliability of DSM-III diagnoses was the newly developed intraclass kappa, a statistical measure of agreement between two raters that would adjust for chance agreement. The intraclass kappa statistical approach had recently been developed by Fleiss, Spitzer, Endicott, and Cohen specifically to address measurement of the reliability of mental disorder diagnoses (18). Given the state of the science in 1980, limited attention was given to the precision of the kappa statistic—that is, the standard error of any kappa statistic estimate, which was the major focus of the DSM-5 Field Trials (19).
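As an illustration of the chance-corrected agreement described above, the following minimal sketch computes an unweighted intraclass kappa for two clinicians and a single present/absent diagnosis. It assumes simple random sampling and interchangeable raters; the field trials themselves used sampling weights, which this sketch omits. The function name and the example counts are hypothetical.

```python
def intraclass_kappa(both_pos, both_neg, disagree):
    """Unweighted intraclass kappa for two raters and a binary diagnosis.

    both_pos: patients given the diagnosis by both clinicians
    both_neg: patients given the diagnosis by neither clinician
    disagree: patients given the diagnosis by exactly one clinician
    """
    n = both_pos + both_neg + disagree
    if n == 0:
        raise ValueError("at least one patient is required")
    # Raters are treated as interchangeable, so one pooled prevalence
    # is estimated from all 2n ratings.
    p = (2 * both_pos + disagree) / (2 * n)
    observed = (both_pos + both_neg) / n
    # Agreement expected by chance if both ratings were independent draws
    # with probability p of a positive diagnosis.
    expected = p * p + (1 - p) * (1 - p)
    if expected == 1.0:
        raise ValueError("kappa is undefined when every rating is identical")
    return (observed - expected) / (1 - expected)


# Hypothetical counts, not taken from the field trials.
example = intraclass_kappa(both_pos=20, both_neg=60, disagree=20)
```

A kappa of 1 indicates perfect agreement, and a kappa of 0 indicates agreement no better than chance at the pooled prevalence.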

It is noteworthy that both the DSM-5 and DSM-III Field Trials attempted to obtain representative patient samples in general clinical settings, used usual clinical interview approaches (instead of structured research interviews), and had test-retest interviews with two clinicians assessing the same patient on separate occasions—although one-third of the DSM-III patients had interrater assessments with two clinicians rating the same interview conducted by one of them (17).

In order to obtain more precise kappa estimates, the sampling strategy for DSM-5 was designed to ensure the availability for assessment of a greater number of patients with relatively rare disorders in clinical settings than could be obtained from the random sampling strategy used for the DSM-III Field Trials. The stratified sampling approach for DSM-5, described in the accompanying article by Clarke et al. (20), increased the availability of patients with rare diagnoses, although there were still some diagnoses for which it was not possible to obtain an adequate number of patients for analysis. In the DSM-III trials, which used a random sampling strategy, many of the diagnoses had fewer than six patients in any diagnostic group—with others combined into larger groups of schizophrenic, substance use, anxiety, somatoform, sexual, and personality disorders when estimating kappa—with no confidence intervals to assess the precision of the estimates. The ability to assess the precision of the kappa estimates depends on an adequate sample size, which is very difficult to achieve with a random sample (20).

In contrast to the DSM-III Field Trials' focus on obtaining a random sample of patients, the DSM-IV Field Trials focused on select clinical populations in specialty settings in which diagnoses were made with structured research interviews. The major objective of the DSM-IV Field Trials was to evaluate the changes in prevalence that would occur for these highly selected patients if DSM-III, DSM-III-R, ICD-10, or the proposed DSM-IV criteria were used, with reliability assessments considered less important (21–23). When reliability trials were conducted, many of the interviews were videotaped and interrater assessments of the interview were conducted by two clinicians viewing the identical interview (24).

Our expectation in the DSM-5 Field Trials (20, 25) was that the levels of reliability would be lower than those seen in the DSM-IV Field Trials (which were conducted with carefully selected patients and highly trained clinicians) and more like the reliabilities seen for medical diagnoses assessed in clinical settings. The DSM-5 Field Trials design can be compared to NIH-supported clinical treatment effectiveness trials that focused on representative samples of patients with comorbid diagnoses in clinical settings—in contrast to industry-supported treatment efficacy trials of “pure” disorders used to establish medication indications (26). Inclusion criteria for “pure” disorder groups may be successful in generating good internal validity for the patients studied, but they lack external validity when treatment outcome or kappa reliability estimates are applied to patients with comorbid disorders in usual clinical settings (27).

Method

The DSM-5 Field Trials were designed to evaluate the test-retest reliability, convergent validity, clinical utility, and feasibility of selected DSM-5 diagnoses that were considered to be of high public health importance or were proposed for new additions to the manual. The nature of the clinical population in which the field trials took place, the clinical assessment approach, and the primary objective of obtaining statistically precise estimates of the kappa measure of reliability were all important components of this approach. The latter focus on obtaining kappa estimates for disorders with a standard error of less than 0.1 (a kappa confidence interval size of 0.5 or less) required the use of a stratified sampling design in general clinical populations of consecutive patients. No attempt was made to exclude comorbidities at entry or at either clinical interview or to exclude patients with unclear presentations, which were often excluded from past field trials. This approach was intended to enhance the possibility of examining patients for all expressions of psychopathology as they appear in typical clinical settings.
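The relation between the two precision targets named above (standard error and confidence interval size) can be sketched with a normal approximation. This is an assumption made for illustration only; the interval method actually used in the field trials is described in the accompanying article and may differ. The function names are hypothetical.

```python
Z_95 = 1.96  # two-tailed 95% quantile of the standard normal distribution


def ci_width_from_se(se, z=Z_95):
    """Width of a symmetric normal-approximation interval: 2 * z * SE."""
    return 2 * z * se


def max_se_for_width(width, z=Z_95):
    """Largest standard error whose normal-approximation interval fits the width."""
    return width / (2 * z)


width_at_target_se = ci_width_from_se(0.1)   # about 0.39
se_at_width_ceiling = max_se_for_width(0.5)  # about 0.13
```

Under this approximation, a standard error of 0.1 corresponds to an interval roughly 0.39 wide, which falls inside the 0.5 ceiling.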

The objective of this analysis is to document the reliability of the categorical diagnoses obtained by two independent clinicians using their usual clinical interviews followed by use of a computer-assisted checklist to document the presence or absence of the symptomatic criteria needed to support their clinical diagnosis. Detailed information on the site characteristics, study design, patient characteristics, clinician characteristics, diagnostic distributions of eligible clinical populations, and analytic approach is provided in an accompanying article (20). In addition to the focus in this article on the reliability of categorical diagnoses, the DSM-5 Field Trials included dimensional assessments of cross-cutting symptom domains obtained from patients who were able to fill out this “review of mental systems” on a laptop computer or tablet in the waiting room. Results of this review were summarized and made available electronically to the clinician before the clinical interview. The reliability of the patient cross-cutting dimensional ratings in two separate interviews is described in an accompanying article (28).

Site Selection

From a total of 49 applicants responding to an April 2010 request for field trial applications, four child/adolescent sites were selected: Baystate Medical Center (Springfield, Mass.), Colorado Children's Hospital (Denver), Columbia/Cornell Medical Centers (New York), and Stanford University Hospital (Palo Alto, Calif.). Seven adult sites were selected: the Center for Addiction and Mental Health (Toronto), DeBakey VA Medical Center and Menninger Clinic at Baylor University (Houston), Dallas VA Medical Center (Dallas), Mayo Clinic (Rochester, Minn.), UCLA (Los Angeles), University of Pennsylvania (Philadelphia), and the University of Texas at San Antonio (San Antonio). Human subject clearance was provided by institutional review boards (IRBs) at each university as well as by the IRB of the American Psychiatric Institute for Research and Education of the American Psychiatric Foundation.

The seven adult and four child/adolescent academic institutions applied to study an average of slightly over five diagnoses per site for a total of 60 site-specific trials. These trials examined a total of 33 separate diagnoses—an average of about two separate field trials per diagnosis. This redundancy was important because some sites were unable to obtain the full complement of patients with the specific diagnoses they thought would be feasible to recruit under their field trial contracts. Equally important was the opportunity to examine site-specific variations in the reliability of those diagnoses that could be studied at more than one site.

Results

To illustrate the stratified sample design at each site that enabled calculation of the intraclass kappa reliability results, a summary from the combined Houston VA/Menninger Clinic is provided in Table 1. Column 1 identifies the target DSM-5 diagnoses for this site: PTSD, alcohol use disorder, major depressive disorder, mild traumatic brain injury, borderline personality disorder, and an “other diagnosis” stratum. As seen in column 2, patients could be assigned to these six strata based on their DSM-IV diagnosis or qualifying symptom profile. Column 3 identifies the proportion of the clinic population that qualifies for each stratum; note that these proportions are the same for each target diagnosis row and, because of comorbidity, total >1.0. Column 4 identifies the percentage of the sample assigned for sampling into each stratum (those classified into more than one stratum were assigned for sampling to the rarest stratum at that site). These provide the sampling weights for computations.

TABLE 1. Test-Retest Reliability of Categorical DSM-5 Criteria Tested at Houston VA/Menninger (N=264)

Target DSM-5 Diagnosis	Stratum Assignment^a	Proportion of Clinic Population Qualifying for Stratum	Sample Weight	Numbers in Stratum: One Visit	Numbers in Stratum: Two Visits	Patients With Target Diagnosis^b: N	Patients With Target Diagnosis^b: Proportion	Agreement^c: Intraclass Kappa	Agreement^c: 95% CI	Interpretation^d	DSM-5 Prevalence (95% CI)^e
Posttraumatic stress disorder (PTSD)	PTSD	0.47	0.18	44	43	34	0.79	0.69	0.59–0.78	Very good	0.42 (0.37–0.46)
	AUD	0.26	0.19	49	46	17	0.37
	MDD	0.34	0.18	62	57	16	0.28
	mTBI	0.15	0.15	46	39	37	0.95
	BPD	0.13	0.13	46	44	14	0.32
	Other	0.16	0.16	40	35	9	0.26

Alcohol use disorder (AUD)	PTSD	0.47	0.18	50	43	21	0.49
	AUD	0.26	0.19	47	46	28	0.61	0.40	0.27–0.54	Good	0.29 (0.24–0.33)
	MDD	0.34	0.18	62	57	16	0.28
	mTBI	0.15	0.15	46	39	19	0.49
	BPD	0.13	0.13	46	44	14	0.32
	Other	0.16	0.16	40	35	8	0.23

Major depressive disorder (MDD)	PTSD	0.47	0.18	50	43	26	0.60
	AUD	0.26	0.19	49	46	27	0.59
	MDD	0.34	0.18	60	57	28	0.49	0.25	0.13–0.36	Questionable	0.36 (0.31–0.40)
	mTBI	0.15	0.15	46	39	27	0.69
	BPD	0.13	0.13	46	44	17	0.39
	Other	0.16	0.16	40	35	13	0.37

Mild traumatic brain injury (mTBI)	PTSD	0.47	0.18	50	43	7	0.16
	AUD	0.26	0.19	49	46	1	0.02
	MDD	0.34	0.18	62	57	0	0.00
	mTBI	0.15	0.15	40	39	21	0.54	0.36	0.13–0.55	Questionable	0.08 (0.06–0.10)
	BPD	0.13	0.13	46	44	2	0.04
	Other	0.16	0.16	40	35	1	0.03

Borderline personality disorder (BPD)	PTSD	0.47	0.18	50	43	4	0.09
	AUD	0.26	0.19	49	46	2	0.04
	MDD	0.34	0.18	62	57	4	0.07
	mTBI	0.15	0.15	46	39	0	0.00
	BPD	0.13	0.13	45	44	25	0.57	0.34	0.18–0.51	Questionable	0.08 (0.06–0.10)
	Other	0.16	0.16	40	35	3	0.09

Other diagnosis (Other)	PTSD	0.47	0.18	50	43	1	0.02
	AUD	0.26	0.19	49	46	16	0.35
	MDD	0.34	0.18	62	57	28	0.49
	mTBI	0.15	0.15	46	39	2	0.05
	BPD	0.13	0.13	46	44	22	0.50
	Other	0.16	0.16	36	35	19	0.54	0.34	0.22–0.46	Questionable	0.27 (0.24–0.31)

Looking at the first bolded row for PTSD, note that 47% of the clinic population qualified for PTSD but only 18% of the sample was assigned to this stratum. At the Houston site, PTSD was the most common stratum and was often comorbid with other target diagnoses. Consequently patients with, for example, PTSD and mild traumatic brain injury were assigned to mild traumatic brain injury for sampling, since it was rarer than PTSD (15% compared with 47%). Columns 5 and 6 identify the total number of patients in each stratum who had one visit and two visits, respectively, and the two-visit number is the number evaluated for the reliability of diagnoses. Column 7 identifies the number identified by one or both clinicians with the target DSM-5 diagnosis from each stratum—e.g., 34 of the 43 patients in the PTSD stratum with two visits were diagnosed with DSM-5 PTSD (79% as indicated in column 8) by one or both clinicians, as were 17 from the alcohol use disorder stratum, 16 from the major depressive disorder stratum, 37 from the mild traumatic brain injury stratum, 14 from the borderline personality disorder stratum, and nine from the other diagnosis stratum.
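The rarest-stratum assignment rule and the column 8 proportion described above can be sketched as follows. The stratum proportions are those in column 3 of Table 1; the function names are hypothetical, and the sketch makes no claim about how the study software implemented the rule.

```python
# Column 3 of Table 1: proportion of the Houston VA/Menninger clinic
# population qualifying for each stratum.
HOUSTON_STRATUM_PROPORTIONS = {
    "PTSD": 0.47,
    "AUD": 0.26,
    "MDD": 0.34,
    "mTBI": 0.15,
    "BPD": 0.13,
    "Other": 0.16,
}


def assign_sampling_stratum(qualifying, proportions=HOUSTON_STRATUM_PROPORTIONS):
    """Return the rarest stratum among those a patient qualifies for."""
    qualifying = list(qualifying)
    if not qualifying:
        raise ValueError("patient must qualify for at least one stratum")
    # The Table 1 proportions contain no ties, so tie-breaking is not addressed.
    return min(qualifying, key=lambda stratum: proportions[stratum])


def proportion_with_diagnosis(n_diagnosed, n_two_visits):
    """Column 8: share of two-visit patients diagnosed by one or both clinicians."""
    return round(n_diagnosed / n_two_visits, 2)


stratum = assign_sampling_stratum(["PTSD", "mTBI"])  # "mTBI"
ptsd_share = proportion_with_diagnosis(34, 43)       # 0.79
```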

Clinicians participating in the reliability study were encouraged to identify comorbid diagnoses where they existed. Thus the same patient may be counted multiple times in columns 7 and 8. Column 9 provides the intraclass kappa statistic for each disorder (a kappa of 0.69 for PTSD in this setting). Column 10 gives the 95% two-tailed confidence interval (CI) and column 11 suggests the interpretation of the kappa reliability results, as previously documented (25). Finally, column 12 provides the clinic prevalence of the DSM-5 disorder, which can be contrasted in most cases with the DSM-IV prevalence or symptomatic presentation in column 3. Both the intraclass kappa and the prevalence are based on the use of sampling weights. Comparable tables from each site provide a wealth of information about the performance of the target diagnoses and will be provided in more detailed forthcoming analyses of individual sites and diagnoses.

Table 2 provides data for all field trials for adult disorders assessed at the seven sites that were successful in obtaining kappa estimates with a two-tailed 95% CI size ≤0.5. Included in this table for each disorder are the intraclass kappa statistic, 95% CI, interpretation of the kappa, the weighted DSM-IV/symptomatic screening prevalence, and the DSM-5 prevalence (with CI) in the site-specific clinical samples. For disorders that were studied at multiple sites, a pooled kappa is provided with these combined results being emphasized with a notation if the CIs at the multiple sites failed to overlap—indicating the need for caution in the interpretation of the pooled kappa. The disorders are arranged in the order that they will appear in the revised organizational structure of DSM-5.

TABLE 2. Test-Retest Reliability of Target DSM-5 Diagnoses at the Adult Field Trial Sites^a

Target DSM-5 Diagnosis and Field Trial Site	Intraclass Kappa	95% CI	Interpretation	DSM-IV Prevalence	DSM-5 Prevalence (95% CI)
Schizophrenia
CAMH	0.50	0.33–0.64	Good	0.53	0.37 (0.30–0.43)
UTSA	0.39	0.15–0.58	Questionable	0.16	0.13 (0.09–0.16)
Pooled	0.46	0.34–0.59	Good
Schizoaffective disorder (CAMH)	0.50	0.30–0.65	Good	0.14	0.18 (0.14–0.24)
Bipolar I disorder
Mayo	0.73	0.57–0.85	Very good	0.25	0.25 (0.21–0.30)
UTSA	0.27	0.08–0.44	Questionable	0.28	0.28 (0.24–0.33)
Pooled^b	0.56	0.45–0.67	Good
Major depressive disorder
Dallas VA	0.27	0.11–0.43	Questionable	0.49	0.37 (0.31–0.44)
Houston VA/Menninger	0.25	0.13–0.36	Questionable	0.34	0.36 (0.31–0.40)
UCLA	0.42	0.26–0.55	Good	0.26	0.28 (0.23–0.33)
UTSA	0.13	−0.06 to 0.30	Unacceptable	0.21	0.19 (0.15–0.24)
Pooled	0.28	0.20–0.35	Questionable
Mixed anxiety-depressive disorder
Penn	0.19	−0.07 to 0.42	Unacceptable	n/a^c	0.07 (0.05–0.10)
UCLA	−0.04	−0.13 to 0.08	Unacceptable	n/a^c	0.10 (0.07–0.13)
Pooled	−0.004	−0.10 to 0.09	Unacceptable
Generalized anxiety disorder (Penn)	0.20	0.02–0.36	Questionable	0.34	0.20 (0.16–0.24)
Posttraumatic stress disorder
Dallas	0.63	0.48–0.75	Very good	0.50	0.46 (0.40–0.54)
Houston VA/Menninger	0.69	0.59–0.78	Very good	0.47	0.42 (0.37–0.46)
Pooled	0.67	0.59–0.75	Very good
Complex somatic symptom disorder revised (Mayo)	0.61	0.40–0.77	Very good	0.10^d	0.08 (0.06–0.11)
Binge eating disorder (Penn)	0.56	0.32–0.77	Good	n/a^c	0.05 (0.03–0.07)
Alcohol use disorder (Houston VA/Menninger)	0.40	0.27–0.54	Good	0.26^e	0.29 (0.24–0.33)
Mild neurocognitive disorder
Mayo	0.76	0.60–0.88	Very good	n/a^c	0.10 (0.07–0.14)
UCLA	0.18	0.03–0.32	Unacceptable	n/a^c	0.14 (0.10–0.17)
Pooled^b	0.48	0.38–0.58	Good
Major neurocognitive disorder
Mayo	0.75	0.59–0.90	Very good	0.07^f	0.07 (0.05–0.09)
UCLA	0.80^g	0.65–0.90	Very good	0.22^f	0.19 (0.16–0.21)
Pooled	0.78	0.68–0.88	Very good
Mild traumatic brain injury (Houston VA/Menninger)	0.36	0.13–0.55	Questionable	n/a^c	0.08 (0.06–0.10)
Antisocial personality disorder (Dallas VA)	0.21	−0.02 to 0.47	Questionable	0.05	0.03 (0.02–0.04)
Borderline personality disorder
CAMH	0.75	0.55–0.88	Very good	0.06	0.04 (0.03–0.05)
Houston VA/Menninger	0.34	0.18–0.51	Questionable	0.13	0.08 (0.06–0.10)
Pooled^b	0.54	0.43–0.66	Good

^a Kappa estimates shown are those with standard errors ≤0.1 and 95% CI of sizes ≤0.5.

^b Since the individual intraclass kappas for the stratified samples and their 95% CIs do not overlap, the pooled intraclass kappa needs to be interpreted with caution.

^c Not applicable because the diagnosis is new to DSM.

^d Estimated DSM-IV prevalence represents the DSM-IV diagnosis of any somatoform disorder, excluding conversion and body dysmorphic disorders.

^e Estimated DSM-IV prevalence represents the DSM-IV diagnosis of alcohol abuse and/or alcohol dependence.

^f Estimated DSM-IV prevalence represents the DSM-IV diagnosis of any dementia disorder.

^g The kappa interpretation provided here is based on the nonrounded estimate, which was below 0.80.

A total of 25 trials in these seven adult field trial sites were successful in obtaining adequate samples that permitted reliable kappa estimates for 15 separate diagnoses. Three diagnoses were in the very good (kappa 0.60–0.79) range: PTSD, complex somatic symptom disorder, and major neurocognitive disorder. Seven were in the good (kappa 0.40–0.59) range: schizophrenia, schizoaffective disorder, bipolar I disorder, binge eating disorder, alcohol use disorder, mild neurocognitive disorder, and borderline personality disorder. Four diagnoses were in the questionable (kappa 0.20–0.39) range: major depressive disorder, generalized anxiety disorder, mild traumatic brain injury, and antisocial personality disorder. Finally, one newly proposed diagnosis, mixed anxiety-depressive disorder, was in the unacceptable (kappa <0.20) range.
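The interpretation ranges used in this article can be expressed as a small lookup, sketched below. The cutoffs follow the ranges stated in the text; the label returned for values of 0.80 and above is a placeholder, since this section does not name that range, and the function name is hypothetical.

```python
def interpret_kappa(kappa):
    """Map a nonrounded kappa estimate to the interpretation labels in the text."""
    if kappa < 0.20:
        return "Unacceptable"
    if kappa < 0.40:
        return "Questionable"
    if kappa < 0.60:
        return "Good"
    if kappa < 0.80:
        return "Very good"
    # Placeholder: the ranges named in this section stop at 0.79.
    return "Above very good range"


labels = {
    "PTSD (pooled)": interpret_kappa(0.67),
    "Generalized anxiety disorder": interpret_kappa(0.20),
    "Mixed anxiety-depressive disorder (pooled)": interpret_kappa(-0.004),
}
```

Because the cutoffs are applied to the nonrounded estimate, a value that displays as 0.80 after rounding can still fall in the very good range.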

Table 3 contains results from seven adult trials in which sample sizes were insufficient to obtain a kappa estimate with adequate precision to provide a stable population estimate, i.e., either the kappa CI size was >0.5 or fewer than seven patients were recruited into the target stratum. These results are presented in the interest of complete reporting. Even though some of these kappas are in the good range, inadequate precision for the kappa estimates limits their interpretability.

TABLE 3. DSM-5 Adult Field Trials Unsuccessful in Obtaining Accurate Estimates of Kappa^a

Target DSM-5 Diagnosis and Field Trial Site	Intraclass Kappa	95% CI	DSM-IV Prevalence	DSM-5 Prevalence (95% CI)
Attenuated psychotic symptoms syndrome (CAMH)	0.46	0.001–0.81	n/a^b	0.04 (0.02–0.08)
Schizotypal personality disorder (CAMH)	—	—	0.03	0.00
Bipolar II disorder (Mayo)	0.40	0.09–0.64	0.18	0.09 (0.06–0.14)
Hoarding disorder (Penn)	0.59	0.17–0.83	n/a^b	0.05 (0.02–0.08)
Mild neurocognitive disorder (Dallas VA)	0.43	0.12–0.66	n/a^b	0.06 (0.03–0.10)
Mild traumatic brain injury (Dallas VA)	0.68	0.19–0.87	n/a^b	0.04 (0.01–0.07)
Obsessive-compulsive personality disorder (Penn)	0.31	−0.03 to 0.80	0.07	0.02 (0.01–0.04)

^a An unsuccessful field trial refers to one in which the size of the 95% CI around the reliability coefficient was >0.5, which indicates a lack of precision (SE>0.1) in the estimation of the reliability coefficient (20). Narcissistic personality disorder was assessed at Houston/Menninger, but no data are shown because fewer than seven patients were studied.

^b Not applicable because the diagnosis is new to DSM.
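The precision criterion in the table note (a 95% CI size greater than 0.5 marks a trial as unsuccessful) can be checked directly against the reported intervals. The sketch below uses the Table 3 intervals; schizotypal personality disorder is omitted because no interval is reported for it. The function and variable names are hypothetical.

```python
def ci_size(ci_low, ci_high):
    """Size of a confidence interval, rounded to the two decimals reported."""
    return round(ci_high - ci_low, 2)


def is_precise(ci_low, ci_high, max_size=0.5):
    """True when the interval meets the precision target of size <= max_size."""
    return ci_size(ci_low, ci_high) <= max_size


# 95% CIs as reported in Table 3.
TABLE_3_INTERVALS = {
    "Attenuated psychotic symptoms syndrome (CAMH)": (0.001, 0.81),
    "Bipolar II disorder (Mayo)": (0.09, 0.64),
    "Hoarding disorder (Penn)": (0.17, 0.83),
    "Mild neurocognitive disorder (Dallas VA)": (0.12, 0.66),
    "Mild traumatic brain injury (Dallas VA)": (0.19, 0.87),
    "Obsessive-compulsive personality disorder (Penn)": (-0.03, 0.80),
}

imprecise = [
    name for name, (low, high) in TABLE_3_INTERVALS.items()
    if not is_precise(low, high)
]
```

Every interval listed here exceeds the 0.5 ceiling, whereas an interval such as the one reported for PTSD at Houston VA/Menninger in Table 2 (0.59–0.78) meets it.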

In general, there was a small decrease in prevalence between the DSM-IV clinical screening diagnoses for entry into the study and the DSM-5 prevalence estimates in these weighted stratified samples. A small but significant decrease was noted for single sites studying schizophrenia, major depression, generalized anxiety disorder, PTSD, major neurocognitive disorder, borderline personality disorder, and bipolar II disorder. Since general population epidemiological studies are never conducted until after a DSM revision is completed, it remains to be seen whether the changes in diagnostic criteria for these disorders will result in any changes in population prevalence in contrast to the shifts in clinical site prevalence in these field trial settings.

Table 4 contains results from 14 trials in four child/adolescent sites that were successful in obtaining adequate samples that permitted precise kappa estimates for eight separate diagnoses. Two were in the very good (kappa=0.60–0.79) range: autism spectrum disorder and ADHD. There were two in the good (0.40–0.59) range: avoidant/restrictive food intake disorder (a newly proposed disorder in DSM-5) and oppositional defiant disorder. Two diagnoses were in the questionable (0.20–0.39) range: major depressive disorder (as in the adult studies) and disruptive mood dysregulation disorder, a new diagnosis proposed for DSM-5 (the latter pooled estimate was heavily influenced by site variability). Finally, two disorders were in the unacceptable (kappa values <0.20) range: mixed anxiety-depressive disorder (as in the adult studies) and nonsuicidal self-injury, a new diagnosis that includes patients with frequent self-inflicted cutting.

TABLE 4. Test-Retest Reliability of Target DSM-5 Diagnoses at the Child/Pediatric Field Trial Sites^a

Target DSM-5 Diagnosis and Field Trial Site	Intraclass Kappa	95% CI	Interpretation	DSM-IV Prevalence	DSM-5 Prevalence (95% CI)
Autism spectrum disorder^b
Baystate	0.66	0.51–0.79	Very good	0.23	0.24 (0.20–0.30)
Stanford	0.72	0.54–0.86	Very good	0.26	0.19 (0.15–0.24)
Pooled	0.69	0.58–0.79	Very good
ADHD
Baystate	0.71	0.56–0.82	Very good	0.59	0.69 (0.62–0.74)
Columbia	0.45	0.29–0.62	Good	0.55	0.58 (0.51–0.65)
Pooled	0.61	0.51–0.71	Very good
Disruptive mood dysregulation disorder
Baystate	0.06	−0.07 to 0.29	Unacceptable	n/a^c	0.05 (0.03–0.08)
Colorado	0.49	0.33–0.66	Good	n/a^c	0.15 (0.11–0.19)
Columbia	0.11	−0.09 to 0.37	Unacceptable	n/a^c	0.08 (0.04–0.12)
Pooled^d	0.25	0.15–0.36	Questionable
Mixed anxiety-depressive disorder
Colorado	0.02	−0.09 to 0.20	Unacceptable	n/a^c	0.07 (0.04–0.09)
Stanford	0.13	−0.04 to 0.45	Unacceptable	n/a^c	0.04 (0.02–0.06)
Pooled	0.05	−0.08 to 0.17	Unacceptable
Major depressive disorder
Colorado	0.33	0.14–0.52	Questionable	0.21	0.12 (0.09–0.15)
Stanford	0.23	0.03–0.41	Questionable	0.21	0.12 (0.08–0.15)
Pooled	0.28	0.15–0.41	Questionable
Avoidant/restrictive food intake disorder (Stanford)	0.48	0.25–0.68	Good	n/a^c	0.11 (0.07–0.15)
Oppositional defiant disorder (Columbia)	0.40	0.18–0.61	Good	0.22	0.17 (0.12–0.22)
Nonsuicidal self-injury (Baystate)	−0.03	−0.05 to −0.01	Unacceptable	n/a^c	0.03 (0.01–0.04)

^a Kappa estimates shown are those with standard errors ≤0.1 and 95% CI sizes ≤0.5.

^b For autism spectrum disorder, the estimated DSM-IV prevalence represents the DSM-IV diagnosis of autistic disorder, Asperger’s disorder, or pervasive developmental disorder not otherwise specified.

^c Not applicable because the diagnosis is new to DSM.

^d Since the individual intraclass kappas for the stratified samples and their 95% CIs do not overlap, the pooled intraclass kappa needs to be interpreted with caution.
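The caution flag on pooled estimates can be illustrated with a simple overlap check on the site-level intervals. The sketch assumes that a pooled kappa is flagged when any pair of site intervals fails to overlap, which is an interpretation of the table note rather than a documented rule; the function names are hypothetical.

```python
from itertools import combinations


def intervals_overlap(a, b):
    """True when closed intervals a=(low, high) and b=(low, high) share a point."""
    return a[0] <= b[1] and b[0] <= a[1]


def pooled_needs_caution(site_intervals):
    """Flag a pooled kappa when any two site-level 95% CIs do not overlap."""
    return any(
        not intervals_overlap(a, b)
        for a, b in combinations(site_intervals, 2)
    )


# Site-level 95% CIs as reported in Table 4.
dmdd_sites = [(-0.07, 0.29), (0.33, 0.66), (-0.09, 0.37)]  # Baystate, Colorado, Columbia
adhd_sites = [(0.56, 0.82), (0.29, 0.62)]                   # Baystate, Columbia

dmdd_flag = pooled_needs_caution(dmdd_sites)  # True
adhd_flag = pooled_needs_caution(adhd_sites)  # False
```

Under this assumption the check reproduces the notation in Table 4: the pooled estimate for disruptive mood dysregulation disorder is flagged, while the pooled ADHD estimate is not.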

Table 5 contains results from four child/adolescent trials in which sample sizes were insufficient to obtain a kappa estimate with adequate precision to provide a stable population estimate. As noted previously, the kappa statistics are reported for completeness, but without further comment because of their lack of precision.

TABLE 5. DSM-5 Child/Pediatric Field Trials Unsuccessful in Obtaining Accurate Estimates of Kappa^a

Target DSM-5 Diagnostic Criteria and Field Trial Site	Intraclass Kappa	95% CI	DSM-IV Prevalence	DSM-5 Prevalence (95% CI)
Bipolar I/II disorder (Baystate)	0.52	0.13–0.80	0.06	0.05 (0.03–0.07)
Posttraumatic stress disorder – child/adolescent (Colorado)	0.34	0.04–0.62	0.14	0.04 (0.02–0.06)
Conduct disorder (Colorado)	0.46	0.16–0.69	0.08	0.08 (0.05–0.12)
Callous/unemotional specifier	0.28	−0.05 to 0.54	n/a^b	0.05 (0.02–0.08)
Nonsuicidal self-injury (Columbia)	0.77	0.25–0.99	n/a^b	0.05 (0.02–0.09)

^a An unsuccessful field trial refers to one in which the size of the 95% CI around the reliability coefficient was >0.5, which indicates a lack of precision (SE>0.1) in the estimation of the reliability coefficient (20).

^b Not applicable because the diagnosis is new to DSM.

Comorbidity of Diagnoses

To complement Table 1, which provides detailed results from the Houston VA/Menninger site, Figure 1 provides a graphic presentation of weighted prevalence rates for four disorders—major depressive disorder, posttraumatic stress disorder, alcohol use disorder, and generalized anxiety disorder—as well as the prevalence rates for comorbid combinations of these four disorders. It is instructive to note that a minority of patients had a single or “pure” diagnosis of any of these conditions and that these are the patients who were typically included in the DSM-IV Field Trials and who are typically selected for clinical efficacy trials.

FIGURE 1. Comorbidity of Major Depressive Disorder, Posttraumatic Stress Disorder, Alcohol Use Disorder, and Generalized Anxiety Disorder^a
^a Rates are average weighted percentages from Houston VA/Menninger (N=264).

Discussion

Adult Patient Results

Among the disorders identified in adults with good to very good reliability levels, PTSD was one of the most reliable mental diagnoses in the field trials. Studied in two Veterans Health Administration sites, the slightly revised and reorganized criteria functioned well, with additional detailed evaluations showing close comparability between DSM-IV and DSM-5 diagnoses (data not shown). Major neurocognitive disorder and complex somatic symptom disorder both incorporated multiple previous diagnoses into a broader spectrum with specifiers to identify different presentations. Likewise, alcohol use disorder combined the previous abuse and dependence into a single dimensional rating with 11 symptoms and thresholds of two, four, and six symptoms representing mild, moderate, and severe substance use disorder, respectively. Schizophrenia, schizoaffective disorder, and bipolar disorder had some modifications from DSM-IV that eliminated subtypes of schizophrenia, added greater stability to schizoaffective disorder, and added a “mixed” specifier to bipolar disorder. Borderline personality disorder had a pooled kappa in the good range in this first field trial with personality trait-defined diagnostic criteria. Finally, two new diagnoses, mild neurocognitive disorder and binge eating disorder, performed in the good range for reliability.
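The symptom-count thresholds for alcohol use disorder described above can be written as a short rule. The sketch follows the thresholds stated in the text (two, four, and six of 11 symptoms); returning no severity label below two symptoms is an inference from those thresholds, and the function name is hypothetical.

```python
AUD_SYMPTOM_TOTAL = 11


def aud_severity(symptom_count):
    """Map a count of endorsed symptoms to the severity level named in the text."""
    if not 0 <= symptom_count <= AUD_SYMPTOM_TOTAL:
        raise ValueError("symptom count must be between 0 and 11")
    if symptom_count >= 6:
        return "severe"
    if symptom_count >= 4:
        return "moderate"
    if symptom_count >= 2:
        return "mild"
    # Below the lowest stated threshold: no severity level applies.
    return None


levels = [aud_severity(count) for count in (1, 2, 4, 6)]
```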

In contrast, the newly proposed diagnosis of mixed anxiety-depressive disorder, which is commonly identified in primary care settings, could not be reliably separated from major depression or generalized anxiety disorder in these specialty mental health settings.

However, special attention must be given to understanding why the two established DSM-IV diagnoses of major depressive disorder and generalized anxiety disorder had reliabilities in the questionable range (0.20–0.39). It should be noted that the diagnosis of major depressive disorder has not changed from DSM-IV and that it was possible to analyze the generalized anxiety disorder findings with the DSM-IV criteria and the proposed revisions for DSM-5, which did not substantially change the reliability measures. Thus, the results for these two disorders provide a yardstick to measure the difference in kappa reliability estimates obtained with the different DSM-IV and DSM-5 design and assessment methods. Additional attention will be given to these disorders later in the discussion on comorbidity.
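For readers who wish to apply the interpretive ranges used in this report (very good, 0.60–0.79; good, 0.40–0.59; questionable, 0.20–0.39; unacceptable, <0.20), a minimal sketch follows. The function name is hypothetical, and the label returned for values of 0.80 or higher is an assumption, since no diagnosis in these trials is reported in that range.

```python
def kappa_range(kappa: float) -> str:
    """Return the interpretive range for a kappa estimate.

    Bands follow the ranges stated in this report. The label for values
    of 0.80 or higher is an assumption and is not taken from the report.
    """
    if kappa >= 0.80:
        return "above reported ranges"  # assumed label
    if kappa >= 0.60:
        return "very good"
    if kappa >= 0.40:
        return "good"
    if kappa >= 0.20:
        return "questionable"
    return "unacceptable"


if __name__ == "__main__":
    for value in (0.67, 0.49, 0.28, 0.05):
        print(value, kappa_range(value))
```

Because published kappas are rounded to two decimal places, an unrounded estimate near a boundary could fall in a different band than its rounded value suggests.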

The diagnoses in Table 3 for which adequate sample sizes were not obtained to produce precise kappa estimates with acceptable confidence intervals illustrate the challenge of determining reliability levels for the large majority of diagnoses in DSM. For example, the diagnosis of bipolar II disorder is unchanged from DSM-IV, but sample size was insufficient to assess the kappa. Two newly proposed diagnoses, attenuated psychotic symptoms syndrome disorder and hoarding disorder, have a substantial body of research supporting their inclusion in DSM-5, but their reliability could not be assessed with the available samples at large academic settings. Likewise, although sample sizes for borderline personality disorder were adequate and the diagnosis performed in the good to very good range, the diagnosis of antisocial personality disorder was in the questionable range, and there were insufficient numbers of patients meeting criteria for schizotypal, obsessive-compulsive, and narcissistic personality disorders to assess reliability.

Child/Adolescent Patient Results

Among the disorders identified in child/adolescent patients with good to very good reliability levels, the diagnosis of autism spectrum disorder also incorporated multiple previous diagnoses into a broader spectrum with dimensional ratings to identify different presentations. ADHD was in the very good range, followed by oppositional defiant disorder and the newly proposed diagnosis of avoidant/restrictive food intake disorder—both in the good range.

As with the adult patient tests, the diagnosis of major depressive disorder was in the questionable range and mixed anxiety-depressive disorder was in the unacceptable range. However, disruptive mood dysregulation disorder and nonsuicidal self-injury disorder, two newly proposed diagnoses that were in the questionable and unacceptable range, respectively, will benefit from some additional discussion below.

Autism spectrum disorder.

The relative prevalence levels for DSM-IV and DSM-5 autism spectrum disorder have been an issue of considerable interest in the scientific and general community (29, 30). As can be seen from Table 4, there was no significant change in prevalence at one site, but there was some decrease in the DSM-5 autism spectrum rates at the second site. A careful review of data from both sites showed that the decrease at the Stanford site was offset by movement into a new DSM-5 diagnosis called social (or pragmatic) communication disorder (data not shown). Since autism spectrum disorder requires both deficits in social communication and fixated interests/repetitive movement, the more specific deficit assessments in DSM-5 should facilitate more focused treatments for those with social communication deficits only.

The combined prevalence of DSM-IV autistic disorder, Asperger’s disorder, and pervasive developmental disorder not otherwise specified in the clinic population at Baystate was 0.23, and at Stanford it was 0.26. The prevalence of autism spectrum disorder in each clinic population, based on DSM-5 criteria, was 0.24 (95% CI=0.20–0.30) at Baystate and 0.19 (95% CI=0.15–0.24) at Stanford. When patients meeting full criteria for DSM-5 autism spectrum disorder and those meeting full criteria for DSM-5 social (pragmatic) communication disorder were combined as a single group, the DSM-5 prevalence estimates were 0.28 for Baystate and 0.24 for Stanford.

Disruptive mood dysregulation disorder.

This condition was proposed to differentiate patients with pediatric bipolar disorder from those with some similar symptoms who did not progress to adult bipolar disorder. However, the field trial data present a challenging case in which one site, the one with the largest number of patients (predominantly from the inpatient service), produced a good level kappa of 0.49 (95% CI=0.33–0.66), whereas two other sites that obtained patients primarily from outpatient settings produced unacceptable kappas. Since the major distinction between this disorder and oppositional defiant disorder or intermittent explosive disorder is the presence of persistent irritability and anger mood states between frequent rage reactions, it appears that this disorder is more reliably diagnosed in its more severe form and with the longitudinal assessment associated with inpatient hospitalization. This is consistent with the “severe mood dysregulation” disorder, which has been differentiated from pediatric bipolar disorder in the NIMH studies that provided the research basis for this proposal (31). Similarly, nonsuicidal self-injury, which was intended to differentiate patients with nonsuicidal self-cutting from those with serious suicidal risk, had unsuccessful field trials at two sites (inadequate sample sizes) and a successful field trial at the third, but the kappa there was unacceptable. Such results stress the importance of site differences in the quality of diagnoses, another issue left unaddressed in previous field trials.

Comorbidity and Its Impact on Reliability

In evaluating the data on comorbidity for major depressive disorder, generalized anxiety disorder, alcohol use disorder, and posttraumatic stress disorder in Figure 1, it is readily apparent that if one clinician provides a thorough assessment in available clinic time and identifies full criteria for all four diagnoses, while the second clinician considers that all of the symptoms can be accommodated by the PTSD diagnosis, the reliability for PTSD would be high (as it is), and kappas for major depressive disorder, alcohol use disorder, and generalized anxiety disorder would be much lower (as they are at this site). The presence of comorbidity of mild traumatic brain injury and PTSD is also documented in Table 1—where there were more DSM-5 diagnoses of PTSD coming from the mild traumatic brain injury stratum than from the PTSD stratum. This high level of comorbidity is also associated with a lower kappa for mild traumatic brain injury at the Houston site.
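The mechanism described above can be made concrete with a small worked sketch. The ratings below are entirely hypothetical (10 invented patients, not field trial data), and the calculation uses the usual unweighted sample form of the intraclass kappa for two binary ratings per patient; it ignores the sampling weights used in the stratified design, so it should be read as an illustration rather than as the study's estimator.

```python
def intraclass_kappa(first, second):
    """Unweighted intraclass kappa for two binary ratings per patient.

    Chance agreement is computed from the pooled proportion of positive
    ratings across both interviews. Stratum weights are not applied.
    """
    n = len(first)
    if n == 0 or n != len(second):
        raise ValueError("rating lists must be non-empty and equal in length")
    observed = sum(a == b for a, b in zip(first, second)) / n
    pooled = (sum(first) + sum(second)) / (2 * n)
    expected = pooled * pooled + (1 - pooled) * (1 - pooled)
    if expected == 1.0:
        raise ValueError("kappa is undefined when every rating is identical")
    return (observed - expected) / (1 - expected)


# Hypothetical ratings for 10 patients (1 = diagnosis present, 0 = absent).
# Clinician A records every diagnosis for which criteria are met; clinician B
# attributes depressive symptoms to PTSD in the six patients who have PTSD.
ptsd_a = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]
ptsd_b = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]
mdd_a = [1, 1, 1, 1, 1, 1, 1, 1, 0, 0]
mdd_b = [0, 0, 0, 0, 0, 0, 1, 1, 0, 0]

if __name__ == "__main__":
    print("PTSD kappa:", round(intraclass_kappa(ptsd_a, ptsd_b), 2))
    print("MDD kappa:", round(intraclass_kappa(mdd_a, mdd_b), 2))
```

In this invented example the two clinicians agree on PTSD for every patient, so the PTSD kappa is 1.0, whereas the major depressive disorder kappa falls below zero even though both clinicians elicited the same symptoms. The disagreement arises from how comorbid symptoms are assigned to diagnoses, not from unreliable symptom assessment.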

High levels of comorbidity were noted immediately after the publication of DSM-III in the Epidemiologic Catchment Area (ECA) study (32, 33) and subsequently in the DSM-III-R- and DSM-IV-based National Comorbidity Survey (34). Equally informative were the findings from primary care practices (35) that used the specific screening instruments developed by Spitzer, Williams, and Kroenke for DSM-IV major depressive disorder (the PHQ-9 [36]), generalized anxiety disorder (the GAD-7 [37]), and somatization disorder (the PHQ-15 [38]). These studies found that pure mood, anxiety, or somatic disorders were the exception and that mixtures of symptoms, or even full diagnostic criteria for all three disorders (somatic anxious depression), were most common.

The presence of a strict hierarchy of diagnoses in DSM-III and the earlier Feighner criteria eliminated comorbidity by having higher-level disorders such as schizophrenia, major depression, or autism trump lower-level conditions such as generalized anxiety disorder, PTSD, panic disorder, or ADHD—with the rationale that more severe underlying psychopathology would encompass lower levels. The ECA study demonstrated that a substantial amount of valuable clinical information would be lost if the strict hierarchy was maintained, and the subsequent DSM-III-R and DSM-IV editions removed some but not all hierarchical exclusions (39).

Limitations

A significant limitation of this study was an inability to obtain adequate sample sizes for all disorders studied at all sites. Although the stratified sampling design yielded greater statistical power to assess the precision of the kappa estimates than was accomplished in earlier DSM field trials, we still did not obtain adequate sample sizes for some conditions. The difficulty in recruiting patients with relatively low-prevalence mental disorders was not fully appreciated by either the sites or the investigators. In the future, specific pilot studies for such disorders should be considered. These include some newly proposed disorders: hoarding disorder, nonsuicidal self-injury, and attenuated psychotic symptoms syndrome disorder. Also, three of the revised personality disorders would have benefited from additional field trial evidence to complement evidence obtained from literature reviews and other sources. It should also be noted that a “questionable” or “good” kappa value of 0.20 to 0.59 may indicate that a single diagnostic assessment is insufficient for some diagnoses. It is well known that different informants, clinicians, and patient presentations will elicit new information in successive interviews that may enhance diagnostic precision for some disorders.

As with all diagnoses in previous DSM editions, decisions on inclusion of diagnoses in each edition are based on a wide range of research and clinical evidence, including the data obtained from field trials. Such additional evidence is gathered after each diagnostic revision is published and the criteria are subjected to intensive basic, epidemiological, and clinical research evaluations. Data emerging from such studies since DSM-IV are summarized in the many publications contained in the research resources section of the www.dsm5.org web site.

Conclusions

The DSM-5 Field Trials are the first to directly include patients with diagnostic comorbidity when the explicit diagnostic hierarchies of DSM-III and DSM-IV are removed. However, most of the DSM disorders studied with their proposed revisions have demonstrated that multiple clinicians can communicate effectively about the nature of their patients’ diagnoses with adequate precision to carry out the clinical treatment goals first predicted by Stengel in the early recommendations to develop explicit diagnostic criteria (4).

For some common disorders, such as major depressive disorder and generalized anxiety disorder, the marked heterogeneity of patients who meet these diagnostic criteria and their comorbidity with other disorders are associated with lower reliability levels. Greater attempts to improve both the reliability and validity of these diagnoses are called for. The initial steps taken to address these issues for major depressive disorder in DSM-5 include adding the mixed specifier (to identify closer links with bipolar disorder) and an anxious distress specifier (to identify a subset with anxious depression).

Some disorders that have moved to a more inclusive or dimensional approach (e.g., major neurocognitive disorder, complex somatic symptom disorder, autism spectrum disorder, borderline personality disorder, and alcohol use disorder) demonstrated good to very good reliability and offer a new paradigm for future disorder revisions. However, maximizing the reliability of our current categorical diagnostic conventions is not the only or ultimate goal. As with all of medicine, the goal is to move beyond reliability to a better assessment of the validity of disorders identified by our diagnostic criteria. The DSM-5 proposal to obtain “cross-cutting” measures of 13 psychological symptom domains, described by Narrow et al. (28), is intended to provide a more dimensional description of patient presentations than can be captured by existing DSM-IV diagnostic criteria and boundaries. This approach is also consistent with the NIMH Research Domain Criteria (RDoC) project, which is attempting to identify both biological and symptomatic dimensional measures of psychopathology that correlate with genetic, neuroimaging, and neuropsychological factors irrespective of current diagnostic boundaries (40).

Emil Kraepelin, who pioneered the separation of schizophrenic and affective psychoses into separate diagnostic groups in 1898 (41), noted later in a 1920 publication—prescient in its anticipation of a current polygenetic-environmental interaction model of mental disorders—that the strict separation of these categorical diagnoses was not supported (42). We are now coming to the end of the neo-Kraepelinian era initiated in the U.S. by Robins and Guze (15) with a renewed appreciation of both the benefits and limitations of a strict categorical approach to mental disorder diagnosis (43).

The ultimate goal is to build on the progress achieved with categorical diagnoses by continuing with longitudinal follow-up of patients with these diagnoses, incorporating cross-cutting dimensional measures judiciously into the diagnoses where they prove useful, and in some cases recommending simple external tests (such as a cognitive test for mild neurocognitive disorder) that might improve the reliability and move toward a more mature scientific understanding of mental disorders. A noted philosopher of science, Carl Hempel, observed that “although most sciences start with a categorical classification of their subject matter, they often replace this with dimensions as more accurate measurements become possible” (44).

Clinicians think dimensionally and adjust treatments to target different symptom expressions in patients who may have the same categorical diagnosis. The intent of DSM-5 is to provide a diagnostic structure that will more fully support such dimensional assessments with diagnostic criteria revisions, specifiers, and cross-cutting symptom domain assessments. The goal is to support better measurement-based care and treatment outcome assessment in an era when quality measurement and personalized medicine will require new diagnostic approaches.

Acknowledgments

The authors wish to acknowledge the extensive efforts of the participating clinicians at each of the DSM-5 Field Trial sites. Principal Investigators: Bruce Pollock, M.D., Ph.D., F.R.C.P.C., Michael Bagby, Ph.D., C. Psych., and Kwame McKenzie, M.D. (Centre for Addiction and Mental Health, Toronto, Ont., Canada); Carol North, M.D., M.P.E., and Alina Suris, Ph.D., A.B.P.P. (Dallas VA Medical Center, Dallas, Tex.); Laura Marsh, M.D., and Efrain Bleiberg, M.D. (Michael E. DeBakey VA Medical Center and the Menninger Clinic, Houston, Tex.); Mark Frye, M.D., Jeffrey Staab, M.D., M.S., and Glenn Smith, Ph.D., L.P. (Integrated Mood Clinic & Unit and the Behavioral Medicine Program at Mayo Clinic, Rochester, Minn.); Helen Lavretsky, M.D., M.S. (The Semel Institute for Neuroscience and Human Behavior, Geffen School of Medicine, University of California Los Angeles, Los Angeles, Calif.); Mahendra Bhati, M.D. (University of Pennsylvania School of Medicine, Philadelphia, Pa.); Mauricio Tohen, M.D., Dr.P.H., M.B.A. (University of Texas San Antonio School of Medicine, San Antonio, Tex.); Bruce Waslick, M.D. (Child Behavioral Health, Baystate Medical Center, Springfield, Mass.); Marianne Wamboldt, M.D. (The Children’s Hospital, Aurora, Colo.); Prudence Fisher, Ph.D. (New York State Psychiatric Institute at Columbia University, New York, N.Y.; Weill Cornell Department of Psychiatry at Payne Whitney Manhattan Division, New York, N.Y.; North Shore Child and Family Guidance Center, Roslyn Heights, N.Y.; and Weill Cornell Department of Psychiatry at Payne Whitney Westchester Division, Westchester, N.Y.); Carl Feinstein, M.D., and Debra Safer, M.D. (Stanford University Child & Adolescent Psychiatry Clinic and the Behavioral Medicine Clinic, Palo Alto, Calif.).

The authors also wish to acknowledge the contributions of the DSM-5 work group members who provided the revised diagnostic criteria for DSM-5. Chairs for each of the DSM-5 work groups and study groups: Dan Blazer, M.D., Ph.D., M.P.H. (Chair, Neurocognitive Disorders); William T. Carpenter, Jr., M.D. (Psychotic Disorders); Joel E. Dimsdale, M.D. (Somatic Symptom and Related Disorders); Jan A. Fawcett, M.D. (Mood Disorders); Dilip V. Jeste, M.D. (Chair Emeritus, Neurocognitive Disorders); Charles O’Brien, M.D., Ph.D. (Substance Use and Addictive Disorders); Ronald Petersen, M.D., Ph.D. (Co-Chair, Neurocognitive Disorders); Daniel Pine, M.D. (Child and Adolescent Disorders); Katharine A. Phillips, M.D. (Anxiety, Obsessive-Compulsive and Related, Trauma and Stress-Related, and Dissociative Disorders); Charles F. Reynolds III, M.D. (Sleep-Wake Disorders); David Shaffer, M.D. (ADHD and Disruptive Behavior Disorders); Andrew E. Skodol, M.D. (Personality Disorders); Susan Swedo, M.D. (Neurodevelopmental Disorders); B. Timothy Walsh, M.D. (Eating Disorders); Kenneth J. Zucker, Ph.D. (Sexual and Gender Identity Disorders); Jack D. Burke, Jr., M.D., M.P.H. (Diagnostic Instruments); Steven E. Hyman, M.D. (Diagnostic Spectra); Jane S. Paulsen, Ph.D. (Impairment and Disability Assessment); Susan K. Schultz, M.D. (Lifespan and Development); Lawson Wulsin, M.D. (Psychiatric/General Medical Interface); and Kimberly Yonkers, M.D. (Gender and Cross-Culture).

Finally, the authors wish to acknowledge the ongoing contributions of APA staff, whose extensive support efforts for the DSM-5 work groups made the field trials of proposed DSM-5 diagnostic criteria possible: Jennifer Shupinka, Seung-Hee Hong, Anne Hiller, Alison Beale, and Spencer Case.

References

Kupfer DJ, First MB, Regier DA (ed): A Research Agenda for the DSM-V. Washington, DC, American Psychiatric Association, 2002
