Family, twin, and adoption studies strongly support the importance of genetic factors in the etiology of schizophrenia (1–3) and bipolar disorders (4). However, it has been difficult thus far to identify consistent genetic linkages for either disorder, and the imperfect accuracy of diagnostic methods is probably one of the obstacles (5–8). Indeed, linkage analysis can be extremely sensitive to diagnostic misclassification (9, 10). Our review (8) of the diagnostic methods used in genetic linkage studies of schizophrenia and bipolar disorders identified several methodological limitations that may have decreased the studies' power to detect linkage, among which are the following: 1) few studies assessed the reliability of diagnoses; 2) in most studies, diagnoses were made by investigators who were not blind to relatives' diagnoses; and 3) few studies used a best-estimate diagnosis procedure, i.e., a diagnosis based on personal interview, family history from family informants, and medical records. Such a best-estimate diagnosis is considered the most valid method for diagnosing psychiatric disorders (5, 11, 12).
Unfortunately, relatively little is known about the reliability of the best-estimate diagnosis procedure, since only a few studies have examined its interrater (8, 13–16) or test-retest (15, 17, 18) reliability. These studies generally indicated satisfactory reliability, although even a modest level of disagreement may present serious problems for genetic linkage studies. Moreover, these studies relied on kappa statistics, which suffer from several limitations.
First, the kappa is influenced by the sensitivity and specificity of the diagnostic method and by the prevalence of the disorder; poor specificity is probably more worrisome for linkage analyses, owing to the detrimental effect of the false positive diagnoses that result. A group of strategies using latent class analyses has been proposed to address these problems (
19,
20). In the absence of a diagnostic gold standard, latent class analyses assume that the true classification of an individual is unknown but that it exists at some unobserved or latent level and that the various diagnostic assessments are imperfect indicators of this latent classification. By using latent class analyses, it is possible to estimate the sensitivity and specificity with which each of these diagnostic assessments reflects this latent classification. We are aware of no previous study using latent class analyses to evaluate the accuracy of best-estimate diagnoses.
Second, the kappa does not allow identification of the source of disagreement, i.e., which diagnoses are difficult to distinguish from which others. For example, in a genetic linkage study of bipolar disorder, confusing bipolar I disorder with schizophrenia does not have the same implications as confusing bipolar I disorder with bipolar II disorder, given usual diagnostic hierarchies. This issue can be addressed by using confusability analyses (21), which allow identification of the most problematic diagnostic distinctions.
Third, using kappa statistics does not allow identification of factors associated with the reliability of best-estimate diagnoses. Knowing these factors could help one to identify problematic cases and implement strategies for decreasing their impact on linkage analyses (
22). Despite the scarcity of such studies, two potentially important factors have been identified, i.e., a diagnosis of schizoaffective disorder (
8,
14,
15,
18) and blindness to the probands' and relatives' diagnoses (
8). Indeed, concerning the latter factor, we found that unblind consensus best-estimate diagnoses, compared to consensus best-estimate diagnoses made by investigators who were blind to relatives' diagnoses, were biased toward greater continuity with the most prevalent diagnosis of the pedigree (
8).
Therefore, the goals of the present study were 1) to quantify the reliability and accuracy of the best-estimate diagnostic procedure and 2) to identify factors related to the reliability of best-estimate diagnoses, with a particular emphasis on blindness.
METHOD
Subjects
The subjects were drawn from large multigenerational pedigrees densely affected by schizophrenia or bipolar disorders (
23). The DSM-III-R diagnoses arrived at by blind consensus best-estimate diagnosis were as follows: schizophrenia, N=43; schizoaffective disorder, N=7; schizophreniform disorder, N=4; bipolar I disorder, N=39; bipolar II disorder, N=10; major depression, single episode and recurrent, N=29; delusional disorder, N=1; psychosis not otherwise specified, N=1. Unaffected subjects were not included in the analyses, since we achieved perfect agreement on them (
8). All subjects were personally informed of the objectives and the methods of the study, and they each signed a consent form approved by the ethics committee of our research center. The present study group (N=134; mean age=50 years, SD=17) is a major extension of the group used in a previous study (
8).
Consensus Best-Estimate Diagnosis
The diagnostic methods have been previously described (
8). Briefly, the diagnosis was based on an audiotaped Structured Clinical Interview for DSM-III-R (SCID) (
24) obtained from the subject, medical records, and family history interviews with family respondents chosen for their familiarity with the subject. DSM-III-R criteria were used, with a designation of “possible,” “probable,” or “definite” to indicate the degree of certainty about the diagnosis.
In the first step, the SCID information, medical records, and family history were reviewed by a research psychiatrist (M.M. or M.-A. R.) in close collaboration with an experienced research assistant (psychiatric nurse) (L.C. or M.T.), who had spent on average 25 hours gathering data and performing several clinical and epidemiological ratings on the subject from the data gathered from the three sources. The diagnosis made by this field team was termed “unblind consensus best-estimate diagnosis” because, inevitably, it could not be blind to the relatives' diagnoses.
In a second step, the raw information, including the interview and the audiotape, the medical records, and the family history, was edited to remove information on family relationships, the unblind consensus best-estimate diagnosis, and previous clinical diagnoses. The edited information was reviewed independently by two blind research psychiatrists (D.C., J.-P.F., N.M., L.N., A. Pirés, H.W., A.-M.P., Y.G., C.D., J.-C.L., and/or A. Potvin). Therefore, apart from the details edited to secure blindness, the blind and the unblind psychiatrists were exposed to similar information. These diagnoses were termed “blind independent best-estimate diagnoses.”
In the third step, the edited information was presented to and discussed by a panel of four psychiatrists, including the first two blind psychiatrists. These diagnoses were termed “blind consensus best-estimate diagnoses.”
Reliability and Accuracy
Interrater reliability for the agreement between the unblind and blind consensus best-estimate diagnoses and between the blind independent best-estimate diagnoses was computed by using a weighted kappa (
25), for which 1- and 2-point disagreements on certainty had weights of 0.67 and 0.33, respectively.
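The weighting scheme described above can be illustrated with a minimal sketch of a weighted kappa computed from a contingency table and a matrix of agreement weights. The function name, the four-category layout (three certainty levels of one diagnosis plus a "different diagnosis" category credited with 0), and all counts are hypothetical and are not taken from the study data.

```python
def weighted_kappa(table, weights):
    """Weighted kappa from a square contingency table and agreement weights.

    table[i][j]   : subjects placed in category i by rater 1 and j by rater 2
    weights[i][j] : agreement credit between 0 and 1 (1.0 on the diagonal)
    """
    k = len(table)
    n = sum(sum(row) for row in table)
    row_tot = [sum(table[i]) for i in range(k)]
    col_tot = [sum(table[i][j] for i in range(k)) for j in range(k)]
    # Weighted observed agreement
    observed = sum(weights[i][j] * table[i][j]
                   for i in range(k) for j in range(k)) / n
    # Weighted agreement expected by chance from the marginal totals
    expected = sum(weights[i][j] * row_tot[i] * col_tot[j]
                   for i in range(k) for j in range(k)) / (n * n)
    return (observed - expected) / (1.0 - expected)


# Assumed layout: definite, probable, possible, different diagnosis.
# Certainty disagreements of 1 and 2 points get credits of 0.67 and 0.33.
credit = {0: 1.0, 1: 0.67, 2: 0.33}
weights = [[0.0] * 4 for _ in range(4)]
for i in range(3):
    for j in range(3):
        weights[i][j] = credit[abs(i - j)]
weights[3][3] = 1.0

# Hypothetical counts, for illustration only
table = [
    [30, 4, 1, 1],
    [3, 10, 2, 1],
    [1, 2, 5, 1],
    [1, 1, 1, 20],
]
kappa_example = weighted_kappa(table, weights)
```

With an identity weight matrix the same function reduces to the unweighted Cohen's kappa.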
We also used diagnostic confusability analyses, as described by Übersax (
21). Confusability (C) is defined as the number of times (N) one rater diagnoses A and the other rater diagnoses B ($N_{AB}$) plus the number of times the opposite occurs ($N_{BA}$), divided by the geometric mean of the frequencies with which each of the two diagnoses appears in any pair of opinions: $C = (N_{AB} + N_{BA})/\sqrt{N_A N_B}$. ($N_A$ and $N_B$ are the numbers of times A and B, respectively, are diagnosed by both diagnosticians.)
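As a rough illustration of the confusability coefficient, the sketch below computes C from a list of paired diagnoses. It assumes that the denominator frequencies count every opinion given by either rater, which is one reading of the definition above; the function name, diagnosis labels, and counts are hypothetical.

```python
from collections import Counter
from math import sqrt


def confusability(pairs, a, b):
    """Confusability of diagnoses a and b from (rater 1, rater 2) pairs."""
    # Times one rater said a while the other said b, in either order
    cross = sum(1 for r1, r2 in pairs if (r1, r2) in ((a, b), (b, a)))
    # Assumption: each rater's opinion counts once toward a diagnosis total
    totals = Counter()
    for r1, r2 in pairs:
        totals[r1] += 1
        totals[r2] += 1
    return cross / sqrt(totals[a] * totals[b])


# Hypothetical paired ratings, for illustration only
pairs = ([("SZ", "SZ")] * 8 + [("SZ", "SA")] + [("SA", "SZ")]
         + [("SA", "SA")] * 2)
c_example = confusability(pairs, "SZ", "SA")
```

The function raises a ZeroDivisionError if either diagnosis never appears, which seems a reasonable signal that the coefficient is undefined for that pair.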
Latent class analyses were performed by using the program developed by Szatmari et al. (
20), yielding, for each diagnosis, estimates of sensitivity, specificity, and prevalence and a global fit. Three diagnoses were included in the analyses, i.e., the two blind independent best-estimate diagnoses and the unblind consensus best-estimate diagnosis. Because of the assumption of independence of observations that underlies latent class analyses, we could not include the blind consensus best-estimate diagnosis, because it was not independent from the blind independent best-estimate diagnoses.
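To make the latent class idea concrete, here is a minimal sketch of a two-class latent class model for three binary ratings (diagnosis present or absent), fitted by expectation-maximization. It is not the program of Szatmari et al., it does not impose the equality constraints used in this study, and the function names, starting values, and toy parameters are all assumptions chosen for illustration.

```python
from itertools import product


def pattern_probs(y, prev, sens, spec):
    """Joint probability of rating pattern y with a true case and a non-case."""
    p_case, p_non = prev, 1.0 - prev
    for j, yj in enumerate(y):
        p_case *= sens[j] if yj else (1.0 - sens[j])
        p_non *= (1.0 - spec[j]) if yj else spec[j]
    return p_case, p_non


def fit_latent_class(counts, n_iter=10000, prev=0.5, start_sens=0.7, start_spec=0.7):
    """EM estimates of prevalence and per-rater sensitivity and specificity.

    counts maps each rating pattern, e.g. (1, 0, 1), to its frequency.
    """
    n_raters = len(next(iter(counts)))
    sens = [start_sens] * n_raters
    spec = [start_spec] * n_raters
    total = sum(counts.values())
    for _ in range(n_iter):
        # E-step: posterior probability that each pattern is a true case
        post = {}
        for y in counts:
            p_case, p_non = pattern_probs(y, prev, sens, spec)
            post[y] = p_case / (p_case + p_non)
        # M-step: re-estimate parameters from the expected class memberships
        cases = sum(counts[y] * post[y] for y in counts)
        noncases = total - cases
        prev = cases / total
        sens = [sum(counts[y] * post[y] * y[j] for y in counts) / cases
                for j in range(n_raters)]
        spec = [sum(counts[y] * (1.0 - post[y]) * (1 - y[j]) for y in counts) / noncases
                for j in range(n_raters)]
    return prev, sens, spec


def expected_counts(n, prev, sens, spec):
    """Expected pattern frequencies under the model, used here as toy data."""
    return {y: n * sum(pattern_probs(y, prev, sens, spec))
            for y in product((0, 1), repeat=len(sens))}


# Toy data generated from assumed parameters, not from the study
toy = expected_counts(1000, 0.3, [0.85, 0.80, 0.75], [0.95, 0.90, 0.92])
est_prev, est_sens, est_spec = fit_latent_class(toy)
```

With three unconstrained raters the model uses all seven available degrees of freedom, so no goodness-of-fit test remains; constraining two raters to share sensitivity and specificity, as described later in the text, is what frees degrees of freedom for testing.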
Prediction of Disagreements
Variables potentially associated with disagreements were chosen on the basis of prior evidence that they may predict diagnostic disagreements (
8,
26,
27). These variables were rated by the field researchers, who used the information gathered for diagnosis and were blind to the present research questions and to the concordance between the various best-estimate diagnoses. The independent variables included 1) number of psychiatric hospitalizations; 2) age at onset of the first episode meeting the criteria for diagnosis; 3) Global Assessment Scale (
28) score during the premorbid period (5 years preceding onset) and during periods of stabilization between hospitalizations; 4) duration of illness since onset; 5) gender; 6) presence of mixed psychotic and affective symptoms (considered present when either the unblind or the blind consensus best-estimate diagnosis was schizoaffective disorder or affective disorder with mood-incongruent features; repeating the analyses when only the unblind or the blind consensus best-estimate diagnosis was used to define mixed psychotic and affective symptoms led to similar results); 7) level of certainty (definite versus probable/possible) of the unblind field consensus best-estimate diagnosis (using the level of certainty of the blind consensus best-estimate diagnosis led to similar results); 8) quality of information, rated as excellent/very good versus acceptable/poor (scale provided on request); and 9) the predominant diagnosis in the pedigree, either schizophrenia or bipolar disorders.
Three types of disagreements were analyzed. First, 20 disagreements on diagnosis between the unblind and blind consensus best-estimate diagnoses were compared to 86 agreements between the unblind and blind consensus best-estimate diagnoses on diagnosis and level of certainty. Second, 14 diagnostic disagreements between the blind independent best-estimate diagnoses were compared to 100 agreements between the independent best-estimate diagnoses. Third, 28 disagreements on certainty about the same diagnosis between the unblind and blind consensus best-estimate diagnoses were compared to the 86 agreements on certainty. We did not analyze disagreements on certainty between the blind independent best-estimate diagnoses because there were only nine such instances.
We first performed univariate logistic regression on each independent variable (Wilcoxon tests for continuous variables led to similar conclusions). In a second step, we used multivariate stepwise (forward) logistic regression, with a p value of 0.05 as the threshold for inclusion in the final model. All statistical tests reported were two-tailed.
RESULTS
Reliability
The overall weighted kappa for agreement between the blind and the unblind consensus best-estimate diagnoses was 0.69 (95% confidence interval=0.62–0.76). For individual diagnoses (table 1), according to standard guidelines for judging agreement (25), agreement could be considered very poor for schizoaffective disorder, fair to good for schizophreniform disorder and recurrent major depression, and excellent (i.e., ≥0.75) for the other diagnoses.
The overall weighted kappa for agreement between blind independent best-estimate diagnoses was 0.80 (95% confidence interval=0.72–0.88). Agreement could be considered good for schizoaffective disorder and bipolar II disorder and excellent for the other diagnoses (
table 1). It is noteworthy that, qualitatively, for every diagnosis except bipolar II disorder the level of agreement between the blind independent best-estimate diagnoses was better than the agreement between the unblind and blind consensus best-estimate diagnoses, although the confidence intervals overlapped, except for schizoaffective disorder.
Confusability
Table 2 shows the confusability coefficients. For unblind versus blind consensus best-estimate diagnoses, schizoaffective disorder was involved in all of the three situations with the highest confusability coefficients, i.e., schizoaffective disorder versus schizophreniform disorder, bipolar I disorder, and schizophrenia, and the coefficient was greater than 0.1 in all three instances. In decreasing order of diagnostic difficulty, other high confusability coefficients included those for single-episode major depression versus recurrent major depression, schizophrenia versus schizophreniform disorder, bipolar I disorder versus schizophreniform disorder, bipolar I disorder versus bipolar II disorder, and bipolar I disorder versus schizophrenia.
Conversely, among the blind independent best-estimate diagnoses no confusability coefficient reached a level of 0.1. In this situation, the most difficult distinctions were, in decreasing order, schizoaffective disorder versus bipolar I disorder, recurrent major depression versus single-episode major depression, bipolar I disorder versus bipolar II disorder, schizophrenia versus schizophreniform disorder, and bipolar II disorder versus single-episode major depression.
Latent Class Analyses
For each diagnosis, two latent class analysis models were contrasted. First, in the full model, sensitivity and specificity for the independent psychiatrists were constrained to be similar, since these diagnoses were derived by a single group of diagnosticians who were permuted; however, sensitivity and specificity were allowed to differ between the blind independent best-estimate diagnoses and the unblind consensus best-estimate diagnoses. Second, in the restricted model, sensitivity and specificity were constrained to be similar for all three types of diagnoses. Therefore, comparing the chi-square goodness of fit values for the full and the restricted models provides a test (df=2) of the null hypothesis that the degrees of diagnostic accuracy are similar among the diagnostic methods, i.e., that blindness does not influence diagnostic accuracy.
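The model comparison can be sketched as follows, assuming it is the usual chi-square difference test for nested models. For 2 degrees of freedom the upper-tail probability of a chi-square variate has the closed form exp(-x/2), so no statistics library is needed. The fit statistics below are hypothetical and are not values from table 3.

```python
from math import exp


def chi2_sf_df2(x):
    """Upper-tail probability of a chi-square variate with 2 degrees of freedom."""
    return exp(-x / 2.0)


# Hypothetical goodness-of-fit statistics, for illustration only
chi2_restricted = 14.2   # one sensitivity and specificity for all three diagnoses
chi2_full = 3.1          # unblind diagnosis allowed its own sensitivity and specificity

# Two extra free parameters in the full model give a 2-df difference test
delta = chi2_restricted - chi2_full
p_value = chi2_sf_df2(delta)
```

A small p value here would mean that the restricted model fits significantly worse, i.e., that accuracy differs between the blind and the unblind diagnoses.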
Table 3 shows the results of these latent class analyses. The overall fit of the full models was appropriate (p>0.05) for all diagnoses except schizoaffective disorder (χ²=9.91, df=3, p<0.05). Specificity was very good, the lowest value being 0.89, for bipolar I disorder from unblind consensus best-estimate diagnoses. There were more-severe problems with sensitivity, particularly for schizoaffective disorder derived from unblind consensus best-estimate diagnoses. Therefore, these analyses identify sensitivity as the primary source of imperfect reliability.
For schizophreniform disorder and bipolar II disorder, the full model was not found to be significantly better than the restricted one, meaning that there was no evidence for an effect of blindness. Conversely, for the following disorders the fit was significantly better for the full model: schizophrenia, schizoaffective disorder, bipolar I disorder, recurrent major depression, and single-episode major depression (table 3).
Effect of Blindness
The potential influence of blindness to probands' and relatives' diagnoses was more directly tested by examining the 20 cases of diagnostic disagreement between the unblind and blind consensus best-estimate diagnoses. For each of these 20 cases, we determined which diagnosis, unblind or blind, had greater continuity with the most predominant diagnosis of the pedigree, according to diagnostic hierarchies used in our linkage studies (
23), which are shown in
table 4. For example, in a predominantly schizophrenia pedigree, a diagnosis of schizophrenia would be considered to have greater continuity than a diagnosis of bipolar disorder. We found greater continuity for the unblind consensus best-estimate diagnosis in 19 cases (95%). If blindness had no effect on the diagnoses, the blind and the unblind diagnoses would each have had a 50/50 chance of having greater continuity. We tested the statistical significance of this departure from a 0.50 proportion by using a binomial test and found the difference to be highly significant (p<0.0001). Since the exact placement of schizoaffective disorder is the subject of considerable controversy, we repeated the analyses by putting schizoaffective disorder at level 2 of the hierarchies for both disorders, which did not affect the results at all.
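The binomial test on the 19 of 20 cases can be reproduced with a short exact calculation. The sketch below assumes a two-sided test against a null proportion of 0.50, for which the two tails are symmetric; the function name is illustrative.

```python
from math import comb


def binomial_two_sided_p(successes, trials):
    """Exact two-sided binomial p value against a null proportion of 0.5."""
    # Under p = 0.5 the distribution is symmetric, so double the outer tail
    extreme = max(successes, trials - successes)
    tail = sum(comb(trials, i) for i in range(extreme, trials + 1)) / 2 ** trials
    return min(1.0, 2.0 * tail)


# 19 of the 20 disagreements favored the unblind diagnosis
p = binomial_two_sided_p(19, 20)
```

The outer tail contains 21 of the 2^20 equally likely outcomes, so the doubled probability is 42/1,048,576, or about 0.00004, in line with the reported p<0.0001.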
Predictors of Disagreements
Diagnostic disagreements between unblind and blind consensus diagnoses. Table 5 (first column) shows univariate comparisons of diagnostic disagreements and agreements between unblind and blind consensus diagnoses. The following variables were associated with diagnostic disagreements: mixed psychotic and affective symptoms (χ²=22.24), a predominant pedigree diagnosis of bipolar disorder (χ²=22.24), and a lower level of certainty of the unblind consensus best-estimate diagnosis (χ²=4.88).
We then used stepwise forward multivariate logistic regression, as described in the Method section, to determine the best set of predictors. The final best-fitting model included three variables: 1) mixed psychotic and affective symptoms (χ²=17.53, odds ratio=23.26, p<0.0001); 2) shorter duration of illness (χ²=4.88, odds ratio=0.63 for a 5-year increment, p<0.05); and 3) lower level of certainty of the unblind consensus best-estimate diagnosis (χ²=9.43, odds ratio=12.82, p<0.01). Additional logistic regressions revealed that the effect of the predominant pedigree diagnosis was lost whenever the presence of mixed psychotic and affective symptoms was included in the model, because 29% of the subjects in bipolar disorder pedigrees had mixed psychotic and affective symptoms, compared with only 11% of the subjects in schizophrenia pedigrees. When a predicted probability of diagnostic disagreement of 0.14 was used as a cutoff, this model had sensitivity and specificity of 0.80 and correctly classified 80% of the subjects.
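The classification figures quoted for the 0.14 cutoff can be understood through a small sketch that turns predicted probabilities into sensitivity, specificity, and overall accuracy. The function name and the ten predicted probabilities are hypothetical; only the cutoff value comes from the text.

```python
def classify_at_cutoff(probs, disagreed, cutoff):
    """Sensitivity, specificity, and accuracy of a probability cutoff.

    probs     : predicted probability of diagnostic disagreement per subject
    disagreed : 1 if the diagnoses actually disagreed, 0 otherwise
    """
    tp = sum(1 for p, d in zip(probs, disagreed) if d and p >= cutoff)
    fn = sum(1 for p, d in zip(probs, disagreed) if d and p < cutoff)
    tn = sum(1 for p, d in zip(probs, disagreed) if not d and p < cutoff)
    fp = sum(1 for p, d in zip(probs, disagreed) if not d and p >= cutoff)
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    accuracy = (tp + tn) / len(probs)
    return sensitivity, specificity, accuracy


# Hypothetical model output, for illustration only
probs = [0.05, 0.10, 0.12, 0.20, 0.30, 0.60, 0.08, 0.50, 0.13, 0.02]
disagreed = [0, 0, 0, 1, 0, 1, 0, 1, 1, 0]
sens, spec, acc = classify_at_cutoff(probs, disagreed, cutoff=0.14)
```

A cutoff well below 0.50 is natural here because disagreements are the minority outcome (20 disagreements versus 86 agreements in this comparison).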
Diagnostic disagreements between blind independent diagnoses. The second column of
table 5 shows univariate comparisons of diagnostic disagreements and agreements between two independent psychiatrists. These analyses were performed to disentangle the effect of blindness from the effects of other independent variables in the previous analyses comparing unblind and blind consensus best-estimate diagnoses. It is noteworthy that only three out of 14 disagreements between the blind independent best-estimate diagnoses overlapped the diagnostic disagreements between the unblind and blind consensus best-estimate diagnoses. This limited overlap prevented these analyses from being redundant. These analyses revealed that mixed psychotic and affective symptoms did not predict disagreement between the blind independent best-estimate diagnoses, contrary to what was observed in the comparisons of diagnostic disagreements and agreements between the unblind and blind consensus best-estimate diagnoses. For the other variables the patterns were similar, although they achieved an expected lesser degree of statistical significance because of the smaller number. Multivariate analyses were not used because of the insufficient number of subjects.
Disagreements on certainty between unblind and blind consensus diagnoses. In univariate analyses, a predominant pedigree diagnosis of bipolar disorder and poorer quality of information predicted disagreement on the degree of certainty (table 5, third column). In multivariate analyses, only poorer quality of information (χ²=8.55) still yielded a significant effect.
DISCUSSION
Reliability and Accuracy
We observed in this extended (8) study group a very satisfactory degree of diagnostic agreement, comparable to what was observed in previous studies of the best-estimate diagnosis (13–16). Moreover, we probably underestimated the kappas by excluding unaffected subjects, on whom agreement was perfect. While such reliability figures may be satisfactory for other research purposes, this proportion of disagreements nevertheless constitutes a problem for genetic linkage studies because of their sensitivity to phenotypic misclassification (7–10).
Confusability coefficients provide important insights into the source of imperfect reliability. Schizoaffective disorder was particularly problematic, with frequent disagreements with schizophrenia, schizophreniform disorder, and bipolar I disorder. The consequences of such disagreements for linkage analyses can be considerable. For example, if unblind diagnosis were used for linkage analyses of pedigrees with predominantly schizophrenia, a “true” case of bipolar I disorder could end up being confused with schizoaffective disorder. In most linkage studies of schizophrenia, schizoaffective disorder is located at the first or second level of the diagnostic hierarchy, while bipolar I disorder is located at a very low level. Consequently, such a “real” case of bipolar I disorder would end up being included in the restrictive definition of the phenotype. While restricted definitions are aimed at identifying a core phenotype to minimize the risk of false positives, such diagnostic misclassifications pose the risk of introducing false positives.
Latent class analyses revealed that the main source of imperfect reliability was poor sensitivity, a finding similar to previous results by Faraone et al. (
29). This result could be reassuring in the context of genetic analyses, since poor specificity, which leads to false positive diagnoses, has a more serious impact on linkage analyses than does poor sensitivity, which leads to false negative diagnoses. To verify this, we computed the positive predictive value (last two columns of
table 3) and found less than optimal positive predictive value, especially for schizoaffective disorder. For example, if unblind consensus best-estimate diagnoses were used, almost all of the diagnoses of schizoaffective disorder would in fact be false positives, while if blind independent best-estimate diagnoses were used, a sizable proportion (36%) would be considered false positives. Such rates of false positives are likely to pose serious problems in linkage analyses.
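The reason that good specificity does not guarantee a good positive predictive value follows from Bayes' rule, which the sketch below applies to sensitivity, specificity, and prevalence. The input values are hypothetical and are not the table 3 estimates.

```python
def positive_predictive_value(sensitivity, specificity, prevalence):
    """Probability that a positive diagnosis reflects a true (latent) case."""
    true_pos = sensitivity * prevalence
    false_pos = (1.0 - specificity) * (1.0 - prevalence)
    return true_pos / (true_pos + false_pos)


# Hypothetical values: a rare diagnosis with modest sensitivity
ppv_rare = positive_predictive_value(0.50, 0.95, 0.05)
# Same accuracy, but a more common diagnosis
ppv_common = positive_predictive_value(0.50, 0.95, 0.30)
```

With identical sensitivity and specificity, the rarer diagnosis yields a much lower positive predictive value, because the few true cases are outnumbered by false positives drawn from the large pool of non-cases. This is consistent with the particularly low values reported for an infrequent diagnosis such as schizoaffective disorder.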
Prediction of Disagreements
Five factors were found to be significantly associated with reliability.
1. Blindness. As previously reported (
8), blindness to the most predominant diagnosis in a pedigree multiply affected with either schizophrenia or bipolar disorders had significant effects on diagnostic outcome, as suggested by a) the greater continuity of the unblind consensus best-estimate diagnosis with the most predominant diagnosis in the pedigree in cases of disagreement between the unblind and blind consensus best-estimate diagnoses and b) the significant difference in sensitivity and specificity between the unblind consensus and blind independent best-estimate diagnoses, revealed by the latent class analyses, for five out of seven diagnostic categories. The potential implications of these findings suggest the need for further studies. Indeed, we could locate only one other study addressing the impact of blindness to diagnoses in relatives (
30); that study showed no such influence. However, that study used a relatively small number of subjects, used diagnoses only from personal interview, and focused exclusively on affective disorders; these methodological differences render comparison with the present study difficult.
2. Mixed psychotic and affective symptoms. Diagnoses involving mixed symptoms, including schizoaffective disorder and affective disorders with mood-incongruent features, were strongly associated with disagreements between unblind and blind consensus best-estimate diagnoses. However, diagnoses with mixed psychotic and affective symptoms were not associated with disagreements between two blind independent best-estimate diagnoses, suggesting that the effect of mixed psychotic and affective symptoms was linked to that of blindness. This suggests that cases with both psychotic and affective features are particularly difficult to diagnose. On the basis of clinical experience, one can speculate that the observed difficulties in the differential diagnosis of cases with mixed psychotic and affective symptoms according to DSM-III-R criteria include a) retrospective assessment of the temporal dissociation of psychotic and affective symptoms, particularly affecting the distinction between bipolar I disorder and schizoaffective disorder, which confusability analyses revealed as problematic; b) inconsistencies in psychotic episodes across a lifetime; for example, early-onset bipolar disorders often have a predominantly psychotic onset, which may render the distinction between schizophrenia and bipolar disorder difficult (31, 32); and c) inaccurate assessment of the relative durations of affective and schizophrenic features, which makes it difficult to distinguish between schizophrenia and schizoaffective disorder (33), a distinction that was also found to be a problem in the confusability analyses. Our data suggest that when such difficulties are encountered, unblind diagnosticians will be more inclined to assign diagnoses that have greater continuity with the predominant diagnosis of the pedigree.
3. Level of certainty. Cases with lower levels of certainty were more likely to lead to diagnostic disagreements, suggesting that diagnosticians can identify the difficult cases that are more likely to lead to diagnostic disagreements.
4. Duration of illness. A shorter duration of illness predicted disagreement between unblind and blind consensus best-estimate diagnoses, a finding that is consistent with results of studies of agreement between family-history and best-estimate diagnoses (
26,
27). Two competing explanations can be offered: a) cases with longer durations of illness yielded more clinical information and b) the clinical picture in cases of mixed psychotic and affective symptoms becomes clearer with time. Indeed, in many cases the onset of bipolar disorder is predominantly psychotic (
31,
32), with the affective phenomena often becoming clearer over time. Unfortunately, the present data do not allow us to perform definitive tests of these competing hypotheses.
5. Quality of information. As expected, poorer quality of information was associated with disagreement on certainty. Also, it was more difficult to reach a definitive diagnosis for cases with poorer information, since additional analyses (available on request) revealed an association between quality of information and level of certainty.
Implications
These data have important practical implications for linkage and family studies of psychiatric disorders. First, a crucial issue is whether unblind or blind consensus best-estimate diagnoses should be used for linkage analyses. Examining the potential consequences of unblind versus blind diagnoses may guide a discussion of this issue. Since unblind diagnoses had greater continuity with the most prevalent diagnosis in the pedigree, they might generate more false positive diagnoses for genetic linkage analyses; e.g., the diagnosis assigned to a subject might be at too high a level in the diagnostic hierarchy. Conversely, blind diagnoses could increase the risk of false negative diagnoses, e.g., subjects being assigned diagnoses at levels lower than they should be. Moreover, latent class analyses revealed that the positive predictive value was higher in most instances for blind diagnoses, adding further evidence that unblind diagnoses are more likely to yield false positives than are blind diagnoses. Since linkage analyses are more sensitive to false positive than to false negative diagnoses, the present results suggest using blind diagnoses, as proposed by other investigators (5–7).
Second, our data and those of other groups (
13–
15,
18) suggest a need for a careful handling of cases with mixed psychotic and affective symptoms in genetic linkage analyses, owing to a) the modest reliability of schizoaffective disorder; b) the strong association of mixed symptoms with diagnostic disagreements and the liability of cases with mixed symptoms to be affected by lack of blindness; c) confusability analyses suggesting that the diagnostic disagreements regarding such cases affect diagnoses at very different levels of the diagnostic hierarchies. These findings suggest that it may be safer not to place schizoaffective disorder at the same level in the diagnostic hierarchy as schizophrenia or bipolar disorders when conducting linkage analyses for either disorder.
Third, the association between duration of illness and reliability suggests that longitudinal follow-up may increase diagnostic reliability.
Fourth, our results suggest that it is possible to identify cases that are more likely to lead to diagnostic disagreements. Indeed, mixed psychotic and affective symptoms, shorter duration of illness, less certainty of diagnosis, and poorer quality of information were associated with poor reliability. These variables could be used to grade the certainty of diagnosis in linkage analyses, by using logistic regression models (22). Alternatively, the probability of belonging to the latent class could be used to weight the diagnostic certainty. The use of such strategies in linkage analyses requires further investigation.
Limitations
In interpreting the present findings, several limitations must be taken into account. First, the present study group included members of families with severe and highly familial psychotic disorders. Therefore, the generalizability of the present findings to families with nonpsychotic disorders or nonfamilial disorders is unknown. Second, the modest size of some subgroups may have decreased our power to detect the effect of some variables associated with reliability. Third, we do not know whether the present conclusions apply to diagnostic criteria other than DSM-III-R.