Sources of Bias in Guideline Development
To the Editor: Treatment guidelines have become increasingly important for institutional policies, quality management, and jurisdiction in mental health care. We were engaged in the development of the treatment guideline on aggressive behavior for the German Association for Psychiatry and Psychotherapy from 2007 to 2008 (1). According to the standard most frequently used in guideline development, a clearly defined algorithm was utilized for the step from evidence to recommendations. This method was adopted from the U.S. Agency for Healthcare Research and Quality (AHRQ). It comprises different levels of evidence, from the highest (meta-analysis of at least three randomized controlled studies) to the lowest (expert opinions).
Using this method, for example, we found good evidence from several meta-analyses for use of antipsychotics and benzodiazepines as emergency medications. Consequently, we decided to recommend these drugs with the highest level of evidence. However, representatives of service users did not agree. Not surprisingly, they did not discuss issues such as p values and effect sizes but rather expressed their general concerns about the use of coercion and involuntary medication. Perhaps this was attributable to their unfamiliarity with methods of evidence-based medicine, but perhaps they were simply right in some way. This led us to reconsider some aspects of methodology in the development of guidelines. We identified at least five sources of bias in the AHRQ method and thus in many existing guidelines.
First, levels of evidence are related to the quality of studies, not to reported effect sizes. Thus high-quality evidence of only a small effect can lead to a strong recommendation. Second, the external validity of randomized controlled trials is rather limited. This is particularly the case for issues such as violence and coercion: patients who give informed consent for randomized controlled studies often differ considerably from real-world patients. A third source of bias is that the absence of evidence for older treatment options leads to treatment recommendations for newer, well-examined, and frequently more expensive options without evidence of superiority.
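The first source of bias can be illustrated with a minimal sketch. It assumes a strict rule in which the grade of recommendation is read off the study design alone; the names `EvidenceBase`, `DESIGN_RANK`, and `strict_grade` and the numeric effect sizes are hypothetical and are not taken from the AHRQ scheme, which names only the highest and lowest levels in this letter.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvidenceBase:
    design: str         # best available study design (hypothetical labels)
    effect_size: float  # standardized effect size (illustrative values only)


# Hypothetical ranking: only the top and bottom levels are named in the letter.
DESIGN_RANK = {
    "meta_analysis_of_rcts": 1,  # highest level of evidence
    "expert_opinion": 4,         # lowest level of evidence
}


def strict_grade(evidence: EvidenceBase) -> int:
    """Return the grade of recommendation under a strict algorithm.

    The grade depends only on study design; effect_size is never consulted.
    """
    return DESIGN_RANK[evidence.design]


small_effect = EvidenceBase("meta_analysis_of_rcts", effect_size=0.1)
large_effect = EvidenceBase("meta_analysis_of_rcts", effect_size=0.8)

# Both treatments receive the same, highest grade despite very different effects.
same_grade = strict_grade(small_effect) == strict_grade(large_effect)
```

Under such a rule, a treatment with a marginal effect and one with a substantial effect are indistinguishable in the resulting recommendation.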
Fourth, the ethical framework of many clinically relevant objectives cannot be represented sufficiently in randomized controlled trials. In particular, issues such as involuntary treatment and use of coercive measures have outcomes not only on the patient level but also on the level of staff, patients' relatives, and society as a whole, which should be taken into account. Finally, existing evidence is biased by a predominance of pharmacotherapy, which may decrease acceptance among service users.
The most recent guidelines, such as the update on schizophrenia by the National Institute for Health and Clinical Excellence (NICE) (2) and the German guideline on unipolar depression (3), weaken the strong link between evidence levels and recommendations. In the German guideline, four levels of evidence still correlate with four levels of recommendations. However, during guideline development the recommendations could be modified (upgraded or downgraded) after the developers took into account ethical obligations, clinical relevance of the effectiveness measures used, applicability of results to certain patient groups, preferences of patients, and the likelihood of implementation in routine clinical practice. The NICE update utilized a different approach. As a result of meta-analyses conducted by the guideline development group, a short "evidence summary" is given, which is the basis for making or not making a clinical recommendation.
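The approach of the German guideline, a default correspondence between evidence level and recommendation level that consensus may shift, might be sketched as follows. This is an illustration under stated assumptions, not the guideline's actual procedure: the numeric levels (1 = strongest, 4 = weakest), the one-step adjustments, and the names `Recommendation` and `derive_recommendation` are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Recommendation:
    level: int                                   # 1 = strongest, 4 = weakest (assumed scale)
    reasons: list = field(default_factory=list)  # documented grounds for each shift


def derive_recommendation(evidence_level, adjustments=()):
    """Map an evidence level to a recommendation level, then apply shifts.

    Each adjustment is a (step, reason) pair: step +1 upgrades the
    recommendation by one level, step -1 downgrades it by one level.
    """
    if evidence_level not in (1, 2, 3, 4):
        raise ValueError("evidence_level must be 1, 2, 3, or 4")
    level = evidence_level  # default: recommendation level equals evidence level
    reasons = []
    for step, reason in adjustments:
        if step not in (-1, 1):
            raise ValueError("step must be +1 (upgrade) or -1 (downgrade)")
        # An upgrade lowers the number; the result stays within levels 1 to 4.
        level = min(4, max(1, level - step))
        reasons.append(reason)
    return Recommendation(level, reasons)


# Highest evidence level, downgraded once because of limited applicability.
downgraded = derive_recommendation(1, [(-1, "limited applicability to patient group")])
```

Recording the reason for every shift is one way to preserve some transparency once the strict link between evidence and recommendation is loosened.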
Such modifications of links between evidence (more correctly, study quality) and recommendations can avoid much of the bias described above and highlight the role of consensus. The price, however, is a loss of transparency regarding how recommendations are derived from evidence. Upgrading the role of consensus implies that the composition of the group and the type of consensus techniques applied have a strong and largely unknown influence. This pertains also to use of the GRADE grid in developing guidelines, which has recently been suggested (5). In the GRADE grid, evidence is classified as high, moderate, low, or very low, and recommendations are classified as strong, weak, or none. Without a strict algorithm from evidence to recommendations, the latter can be upgraded by aspects such as high effect sizes and a strong dose-effect relationship and can be downgraded by study limitations, inconsistent results, and other sources of bias. In addition to being affected by the level of evidence, the level of recommendation can be influenced by values and preferences, cost, and the balance between desired and undesired effects.
In the development of future clinical guidelines, much attention should be paid to these methodological issues, with the aim of achieving maximum transparency.