This guideline was developed using a process intended to meet standards of the Institute of Medicine (2011; now known as the National Academy of Medicine). The process is fully described in a document available on the APA Web site at: www.psychiatry.org/psychiatrists/practice/clinical-practice-guidelines/guideline-development-process.
Management of Potential Conflicts of Interest
Members of the Guideline Writing Group (GWG) are required to disclose all potential conflicts of interest before appointment, before and during guideline development, and upon publication. If any potential conflicts are found or disclosed during the guideline development process, the member must recuse themselves from any related discussion and from voting on any related recommendation. The members of both the GWG and the Systematic Review Group (SRG) reported no conflicts of interest. The Disclosures section includes more detailed information for each GWG and SRG member involved in the guideline’s development.
Guideline Writing Group Composition
The GWG was initially composed of six psychiatrists with general research and clinical expertise (G. A. K., J. M. A., S. B., J. M. L., R. M., M. S.). This non-topic-specific group was intended to provide diverse and balanced views on the guideline topic and thereby minimize potential bias. Three psychiatrists (L. C.-K., K. J. N., J. M. O.) and one psychologist (C. S.) were added to provide subject matter expertise in BPD. One fellow (A. D.) also participated in the guideline development process. The vice-chair of the GWG (L. J. F.) provided methodological expertise on such topics as appraising the strength of research evidence. The GWG was also diverse and balanced with respect to other characteristics, such as geographical location and demographic background. Emotions Matter and the National Council for Mental Wellbeing reviewed the draft and provided perspectives from patients, families, and other care partners.
Systematic Review Methodology
The methods for this systematic review follow the Agency for Healthcare Research and Quality (AHRQ) Methods Guide for Effectiveness and Comparative Effectiveness Reviews (available at www.effectivehealthcare.ahrq.gov/methodsguide.cfm) and the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) checklist (Moher et al. 2015). The final protocol of this review was registered on PROSPERO (Registration #: CRD42020194098). All methods and analyses were determined a priori.
This guideline is based on an initial systematic search of available research evidence conducted by Dr. Evidence (Santa Monica, CA) using the DOC Data 2.0 software platform, and an updated search conducted by RTI. The systematic search of available research evidence used MEDLINE, Cochrane Library, EMBASE, and PsycINFO databases, with specific search terms and limits as described in Appendix B. Results covered the period from the start of each database to June 15, 2020, with additional searches in MEDLINE and PsycINFO through September 24, 2021. Search strategies used a variety of terms, medical subject headings (MeSH), and major headings and were limited to English-language and human-only studies (see Appendix B). Case reports, comments, editorials, and letters were excluded. To minimize retrieval bias, we manually searched reference lists of landmark studies and background articles on this topic for relevant citations that electronic searches might have missed.
Studies were included if participants were 13 years of age or older and diagnosed with BPD as defined by DSM-IV, DSM-IV-TR, DSM-5 (Section II or Section III), or ICD-10, as applicable. Interventions of interest included psychotherapies, pharmacotherapies, and other interventions. Comparator conditions included active interventions, placebo, treatment as usual, waiting-list controls, or GPM. Multiple outcomes were included related to key symptoms and domains of BPD, functioning, quality of life, adverse effects, and study withdrawal rates, among others (see Appendix B). Studies were excluded if individuals with BPD did not account for at least 75% of the total sample. Other exclusion criteria included small sample size (N < 50 for nonrandomized clinical trials or observational studies), lack of a comparator group, short treatment duration (< 8 weeks), or conduct of the study outside of countries with a very high Human Development Index (HDI). Citations to registry links, abstracts, and proceedings were not included unless also published in a peer-reviewed journal, because they do not include sufficient information to evaluate a study’s risk of bias.
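For illustration only, these eligibility thresholds can be expressed as a simple screening filter. The sketch below is a hypothetical Python rendering; the Study record and its field names are assumptions rather than part of the review protocol, and it encodes only the numeric criteria above (it omits the diagnostic and intervention eligibility requirements).

```python
from dataclasses import dataclass

@dataclass
class Study:
    # Hypothetical record; field names are illustrative, not from the protocol.
    min_participant_age: int      # youngest participant, in years
    pct_bpd: float                # % of the total sample diagnosed with BPD
    sample_size: int              # N
    randomized: bool              # RCT vs. nonrandomized/observational
    has_comparator: bool          # any comparator condition present
    treatment_weeks: int          # treatment duration in weeks
    very_high_hdi_country: bool   # conducted in a very high HDI country

def meets_numeric_criteria(study: Study) -> bool:
    """Apply the numeric eligibility thresholds described above."""
    if study.min_participant_age < 13:        # participants must be >= 13 years old
        return False
    if study.pct_bpd < 75.0:                  # BPD must account for >= 75% of sample
        return False
    if not study.randomized and study.sample_size < 50:
        return False                          # N < 50 excludes only non-RCTs
    if not study.has_comparator:              # a comparator group is required
        return False
    if study.treatment_weeks < 8:             # treatment must last >= 8 weeks
        return False
    return study.very_high_hdi_country
```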
For each trial identified for inclusion from the search, detailed information was extracted by RTI, with processes that included verifications and quality checks on data extraction. In addition to specific information about each reported outcome, extracted information included citation; study design; treatment arms (including dosages, sample sizes); co-intervention, if applicable; trial duration and follow-up duration, if applicable; country; setting; funding source; sample characteristics (e.g., mean age, % non-White, % female, % with co-occurring condition); and rates of attrition, among other data elements. Summary tables (see Appendix D and Appendix G) include specific details for each study identified for inclusion from the literature search. Factors relevant to risk of bias were also identified for each RCT that contributed to a guideline statement. Risk of bias was determined using the Cochrane Risk of Bias 2.0 tool (Sterne et al. 2019), and ratings are included in summary tables (see Appendix D), with specific factors contributing to the risk of bias for each study shown in Appendix E (McGuinness and Higgins 2021).
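As a concrete illustration of these data elements, the sketch below collects them into a single record type. This is a hypothetical Python rendering; the class and field names are assumptions and do not reflect the actual extraction schema used by RTI.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class ExtractionRecord:
    """Hypothetical sketch of per-trial data elements; field names are
    illustrative only and do not reflect the actual extraction schema."""
    citation: str
    study_design: str                 # e.g., "RCT", "observational"
    treatment_arms: list              # per arm: intervention, dosage, sample size
    co_intervention: Optional[str]    # if applicable
    duration_weeks: int               # trial duration
    followup_weeks: Optional[int]     # if applicable
    country: str
    setting: str
    funding_source: str
    mean_age: float
    pct_non_white: float
    pct_female: float
    pct_co_occurring: float
    attrition_rate: float
    outcomes: dict = field(default_factory=dict)  # outcome name -> reported result
```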
Available guidelines from other organizations were also reviewed (see Appendix F) (Canadian Agency for Drugs and Technologies in Health 2018; Finnish Medical Society Duodecim 2020; Herpertz et al. 2007; National Health and Medical Research Council 2012; National Institute for Health and Care Excellence 2009; Simonsen et al. 2019).
Rating the Strength of Supporting Research Evidence
Strength of supporting research evidence describes the level of confidence that findings from scientific observation and testing of an effect of an intervention reflect the true effect. Confidence is enhanced by such factors as rigorous study design and minimal potential for study bias.
Ratings were determined, in accordance with the AHRQ’s Methods Guide for Effectiveness and Comparative Effectiveness Reviews (Agency for Healthcare Research and Quality 2014), by the methodologist (L. J. F.) and reviewed by members of the SRG and GWG. Available clinical trials were assessed across four primary domains: risk of bias, consistency of findings across studies, directness of the effect on a specific health outcome, and precision of the estimate of effect.
The ratings are defined as follows:
▫ High (denoted by the letter A) = High confidence that the evidence reflects the true effect. Further research is very unlikely to change our confidence in the estimate of effect.
▫ Moderate (denoted by the letter B) = Moderate confidence that the evidence reflects the true effect. Further research may change our confidence in the estimate of effect and may change the estimate.
▫ Low (denoted by the letter C) = Low confidence that the evidence reflects the true effect. Further research is likely to change our confidence in the estimate of effect and is likely to change the estimate.
The AHRQ has an additional category, “insufficient,” for evidence that is unavailable or does not permit estimation of an effect. The APA uses the “low” rating when evidence is insufficient, because there is low confidence in the conclusion and further research, if conducted, would likely change the estimated effect or confidence in the estimated effect.
Rating the Strength of Guideline Statements
Each guideline statement is separately rated to indicate strength of recommendation and strength of supporting research evidence. Strength of recommendation describes the level of confidence that potential benefits of an intervention outweigh potential harms. This level of confidence is informed by available evidence, which includes evidence from clinical trials as well as expert opinion and patient values and preferences. In contrast to the rating of the strength of supporting research evidence, described in the previous section, this rating is a consensus judgment of the authors of the guideline.
There are two possible ratings: recommendation or suggestion. A recommendation (denoted by the numeral 1 after the guideline statement) indicates confidence that the benefits of the intervention clearly outweigh the harms. A suggestion (denoted by the numeral 2 after the guideline statement) indicates greater uncertainty. Although the benefits of the intervention are still viewed as outweighing the harms, the balance of benefits and harms is more difficult to judge, or the benefits or the harms may be less clear. With a suggestion, patient values and preferences may be more variable, and this can influence the clinical decision that is ultimately made. These strengths of recommendation correspond to ratings of strong or weak (also termed conditional) as defined under the GRADE method for rating recommendations in clinical practice guidelines (described in publications such as Guyatt et al. 2008 and others available on the Web site of the GRADE Working Group at www.gradeworkinggroup.org).
When a negative statement is made, ratings of strength of recommendation should be understood as meaning the inverse of the above (e.g., recommendation indicates confidence that harms clearly outweigh benefits).
The GWG determined ratings of the strength of each guideline statement by a modified Delphi method that used blind, iterative voting and discussion. So that GWG members could ask for clarification about the evidence, the wording of statements, or the process, the vice-chair of the GWG served as a resource and did not vote on statements. The chair and other formally appointed GWG members were eligible to vote.
In weighing potential benefits and harms, GWG members considered the strength of supporting research evidence, their own clinical experiences and opinions, and patient preferences. For a recommendation, at least 9 out of 10 members must have voted to recommend the intervention or assessment after five rounds of voting (i.e., at most one member could vote other than “recommend”). On the basis of the discussion among GWG members, adjustments to the wording of recommendations could be made between voting rounds. If this level of consensus was not achieved, the GWG could agree to make a suggestion rather than a recommendation. No suggestion or statement could be made if three or more members voted “no statement.” Differences of opinion within the GWG about ratings of strength of recommendation, if any, are described in the subsection “Balancing of Potential Benefits and Harms” for each statement in Appendix F.
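As a worked illustration of these voting thresholds, the sketch below classifies a final 10-member tally in Python. It is a simplification under stated assumptions: the tally labels and function name are hypothetical, and the actual Delphi process involved discussion and rewording between rounds, which a one-shot tally cannot capture.

```python
def classify_statement(votes: dict[str, int]) -> str:
    """Classify a final-round tally of 10 GWG votes under the thresholds
    described above. The vote labels ("recommend", "suggest",
    "no statement") are illustrative, not the actual ballot wording."""
    assert sum(votes.values()) == 10, "thresholds above assume 10 voting members"

    # Three or more "no statement" votes block any statement.
    if votes.get("no statement", 0) >= 3:
        return "no statement"
    # A recommendation requires at least 9 of 10 "recommend" votes,
    # i.e., at most one member voting otherwise.
    if votes.get("recommend", 0) >= 9:
        return "recommendation"
    # Short of that consensus, the GWG could agree on a suggestion.
    return "suggestion"

# Example: 8 "recommend" votes fall short of the 9-of-10 threshold.
print(classify_statement({"recommend": 8, "suggest": 2, "no statement": 0}))
# -> suggestion
```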
Use of Guidelines to Enhance Quality of Care
Clinical practice guidelines can help enhance quality by synthesizing available research evidence and delineating recommendations for care on the basis of the available evidence. In some circumstances, practice guideline recommendations will be appropriate to use in developing quality measures. Guideline statements can also be used in other ways, such as for educational activities or electronic clinical decision support, to enhance the quality of care that patients receive. Furthermore, when availability of services is a major barrier to implementing guideline recommendations, improved tracking of service availability and program development initiatives may need to be implemented by health organizations, health insurance plans, federal or state agencies, or other regulatory programs.
Typically, guideline recommendations that are chosen for development into quality measures will advance one or more aims of the Institute of Medicine’s report Crossing the Quality Chasm (Institute of Medicine 2001) and the ongoing work guided by the AHRQ-led National Quality Strategy by facilitating care that is safe, effective, patient-centered, timely, efficient, and equitable. To achieve these aims, a broad range of quality measures (Watkins et al. 2015) is needed that spans the entire continuum of care (e.g., prevention, screening, assessment, treatment, continuing care), addresses the different levels of the health system hierarchy (e.g., system-wide, organization, program/department, individual clinicians), and includes measures of different types (e.g., process, outcome, patient-centered experience). Emphasis is also needed on factors that influence the dissemination and adoption of evidence-based practices (Drake et al. 2008; Greenhalgh et al. 2004; Horvitz-Lennon et al. 2009a).
Measure development is complex and requires detailed specification and pilot testing (Center for Health Policy/Center for Primary Care and Outcomes Research and Battelle Memorial Institute 2011; Fernandes-Taylor and Harris 2012; Iyer et al. 2016; Pincus et al. 2016; Watkins et al. 2011). Generally, however, measure development should be guided by the available evidence and focused on measures that are broadly relevant and meaningful to patients, clinicians, and policy makers. Measure feasibility is another crucial consideration, but feasibility is often judged on the basis of currently available data, which limits opportunities to develop novel measurement concepts. Looking beyond such practical limitations in the early stages of development can foster meaningful measures and spur innovation in workflow and data collection systems.
Often, quality measures will focus on gaps in care or on care processes and outcomes that have significant variability across specialties, health care settings, geographical areas, or patients’ demographic characteristics. Administrative databases, registries, and data from electronic health records can help identify gaps in care and key domains that would benefit from performance improvements (Acevedo et al. 2015; Patel et al. 2015; Watkins et al. 2016). Nevertheless, for some guideline statements, evidence of practice gaps or variability will be based on anecdotal observations if the typical practices of psychiatrists and other health professionals are unknown. Variability in the use of guideline-recommended approaches may reflect appropriate differences that are tailored to the patient’s preferences, treatment of co-occurring illnesses, or other clinical circumstances that may not have been studied in the available research. On the other hand, variability may indicate a need to strengthen clinician knowledge or to address other barriers to adoption of best practices (Drake et al. 2008; Greenhalgh et al. 2004; Horvitz-Lennon et al. 2009a). When performance is compared among organizations, variability may reflect a need for quality improvement initiatives to improve overall outcomes but could also reflect case-mix differences such as socioeconomic factors or the prevalence of co-occurring illnesses.
When a guideline recommendation is considered for development into a quality measure, it must be possible to define the applicable patient group (i.e., the denominator) and the clinical action or outcome of interest (i.e., the numerator) in validated, clear, and quantifiable terms. Furthermore, the health system’s or clinician’s performance on the measure must be readily ascertained from chart review, patient-reported outcome measures, registries, or administrative data. Depending on the practice setting, the documentation needed to support quality measures can be challenging to capture and can pose practical barriers to meaningful interpretation of measures based on guideline recommendations. For example, when recommendations relate to patient assessment or treatment selection, clinical judgment may be needed to determine whether the clinician has addressed the factors that merit emphasis for an individual patient. In other circumstances, standardized instruments can facilitate quality measurement reporting, but it is difficult to assess the appropriateness of clinical judgment in a validated, standardized manner. Furthermore, utilization of standardized assessments remains low (Fortney et al. 2017), and clinical findings are not routinely documented in a standardized format. Many clinicians appropriately use free-text prose to describe symptoms, response to treatment, discussions with family, plans of treatment, and other aspects of care and clinical decision-making. Reviewing these free-text records for measurement purposes would be impractical, and it would be difficult to hold clinicians accountable to such measures without significant increases in the use of electronic medical records and advances in natural language processing technology.
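To make the numerator and denominator terminology concrete: a measure’s performance rate is the proportion of patients meeting the denominator definition for whom the numerator action or outcome is documented. The Python sketch below illustrates the arithmetic; the record fields and the safety-plan example are hypothetical, not a measure proposed by this guideline.

```python
from typing import Callable, Iterable

def performance_rate(patients: Iterable[dict],
                     in_denominator: Callable[[dict], bool],
                     in_numerator: Callable[[dict], bool]) -> float:
    """Share of denominator-eligible patients for whom the measured
    clinical action or outcome was documented."""
    eligible = [p for p in patients if in_denominator(p)]
    if not eligible:
        raise ValueError("no patients meet the denominator definition")
    met = sum(1 for p in eligible if in_numerator(p))
    return met / len(eligible)

# Hypothetical example (field names are illustrative): of patients with
# a BPD diagnosis, what share have a documented safety plan?
records = [
    {"dx": "BPD", "safety_plan_documented": True},
    {"dx": "BPD", "safety_plan_documented": False},
    {"dx": "MDD", "safety_plan_documented": True},
]
rate = performance_rate(
    records,
    in_denominator=lambda p: p["dx"] == "BPD",
    in_numerator=lambda p: p["safety_plan_documented"],
)
print(f"{rate:.0%}")  # -> 50%
```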
Conceptually, quality measures can be developed for purposes of accountability, for internal or health system–based quality improvement, or both. Accountability measures require clinicians to report their rate of performance of a specified process, intermediate outcome, or outcome in a specified group of patients. Because these data are used to determine financial incentives or penalties based on performance, accountability measures must be scientifically validated, have a strong evidence base, and fill gaps in care. In contrast, internal or health system–based quality improvement measures are typically designed by and for individual providers, health systems, or payers. They typically focus on measurements that can suggest ways for clinicians or administrators to improve efficiency and delivery of services within a particular setting. Internal or health system–based quality improvement programs may or may not link performance with payment, and, in general, these measures are not subject to strict testing and validation requirements. Quality improvement activities, including performance measures derived from these guidelines, should yield improvements in quality of care that justify any associated clinician burden (e.g., documentation burden) or administrative costs (e.g., for manual extraction of data from charts, for modifications of electronic medical record systems to capture required data elements). Possible unintended consequences of any derived measures would also need to be addressed in testing of a fully specified measure in a variety of practice settings. For example, highly specified measures may lead to overuse of standardized language that does not accurately reflect what has occurred in practice. If multiple discrete fields are used to capture information on a paper or electronic record form, data will be easily retrievable and reportable, but oversimplification is a possible unintended consequence of measurement. Just as guideline developers must balance the benefits and harms of a particular guideline recommendation, developers of performance measures must weigh the potential benefits, burdens, and unintended consequences when optimizing the design and testing of quality measures.
External Review
This guideline was made available for review from June to July 2023 by the APA membership, scientific and clinical experts, allied organizations, and the public. In addition, a number of patient advocacy organizations were invited to provide input. Forty-seven individuals and 17 organizations submitted comments on the guideline (see the chapter “Individuals and Organizations That Submitted Comments” for a list of names). The chair and vice-chair of the GWG reviewed and addressed all comments received; substantive issues were reviewed by the GWG.
Funding and Approval
This guideline development project was funded and supported by the APA without any involvement of industry or external funding. The guideline was submitted to the APA Assembly and APA Board of Trustees and approved on November 4, 2023, and December 9, 2023, respectively.