There are at least 10,000 smartphone apps targeting mental health (1), and many patients are exploring them (2). The literature on health app ratings offers tools to help clinicians and patients pick apps (3), but so far none of these tools provides a reliable method of evaluating an app’s safety and usefulness. Although a simple metric, such as a user’s five-star rating of a health app, may appear to be a useful gauge of quality, a study of 137 patient-facing apps found that star ratings correlated poorly with the apps’ clinical utility or usability (4). Clinician ratings of individual features of mental health apps also suffer from low interrater reliability, as demonstrated in a study that used existing app rating metrics to evaluate popular smoking cessation and depression apps (5).
The inherently dynamic nature of apps adds to the challenge of developing reliable metrics of app quality. A study tracking the longitudinal availability of mental health apps reported that apps have a half-life: after a certain amount of time, a given app may no longer be available for public use (6). App creators are free to update their apps as often or as rarely as they like; some update frequently, whereas others abandon support and development entirely.
A further challenge in using app ratings is their reliance on absolute rather than relative scores. Just as no single ‘A+’ rated therapy or medication treatment plan is “100% effective” for all patients, an app’s effectiveness varies with the individual user. Apps are tools that must be selected on the basis of individual needs, abilities, preferences, and many other patient-specific factors.
In this column, we describe the rationale, internal testing, and release of the American Psychiatric Association’s (APA) smartphone app evaluation framework. The APA framework serves as a tool to guide informed decision making and evaluation of apps, and, like any rubric, it must be reapplied for each unique patient, unique clinical context, and unique version of the app.
A Framework for Evaluation
The APA app evaluation framework offers clinicians and patients an adaptable scaffold for informed decision making. The framework approach adopted by the APA differs in important ways from prior efforts. Instead of directly rating or scoring a particular app, the framework uses a simple four-stage hierarchical process, asking users to consider safety and privacy first, followed by evidence and benefit, engagement, and, finally, interoperability (Figure 1). [A color schematic of the APA app evaluation framework is available as an online supplement to this column.]
When the framework is used as intended, evaluation begins with safety and privacy and progresses to the next stage only if the app satisfies the clinical needs surrounding the current one. For example, if an app does not satisfy the present clinical needs around privacy and safety, evaluation should stop there. The APA app evaluation framework does not offer specific criteria for judging whether an app satisfies each stage of the hierarchy. Instead, it offers a series of questions intended to guide a personalized determination of the appropriateness of an app for each patient. Users can choose how to weigh each stage, given that certain stages, such as data sharing, may not matter if the app is purely informational. The framework is an evolving tool that will be updated to reflect new knowledge about apps. The latest version is freely available through the APA Web site (https://psychiatry.org/psychiatrists/practice/mental-health-apps/app-evaluation-model).
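To make the stop-on-failure logic concrete, the following is a minimal sketch in Python; the stage names follow the framework, but the `satisfies` callback is hypothetical and stands in for the patient- and context-specific judgment that the framework deliberately leaves to the clinician and patient.

```python
from typing import Callable, Optional

# The four stages, in the order the framework asks users to consider them
STAGES = [
    "risk/privacy and safety",
    "evidence and benefit",
    "engagement",
    "interoperability",
]

def evaluate_app(app_name: str,
                 satisfies: Callable[[str, str], bool]) -> Optional[str]:
    """Walk the stages in order and return the first stage the app fails,
    or None if it passes all four. The satisfies callback encapsulates
    the individualized judgment the framework leaves open."""
    for stage in STAGES:
        if not satisfies(app_name, stage):
            return stage  # evaluation stops at the first unmet stage
    return None

# Hypothetical usage: one clinician's judgments for one patient and one app
judgments = {"risk/privacy and safety": True, "evidence and benefit": False,
             "engagement": True, "interoperability": True}
print(evaluate_app("MoodApp", lambda app, stage: judgments[stage]))
# 'evidence and benefit': evaluation stops, and the app is reconsidered here
```

Because the framework provides guiding questions rather than scoring criteria, the callback in practice represents a conversation between clinician and patient, not a computable rule.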
To understand the rationale for the selection and ordering of the hierarchical stages, it is useful to briefly explore the current state of mental health apps. At first glance, it may be difficult to imagine how apps cause harm; yet there is a growing literature on the potential dangers of app use. Because many apps are marketed directly to consumers and thus exist outside the scope of federal privacy laws (for example, HIPAA), they can be used to collect the personal mental health data of app users, and these data can often be sold, traded, marketed, and indefinitely stored by app companies. Evidence also suggests that the majority of health apps currently lack even basic privacy policies, meaning that simply checking for the existence of a privacy policy will help identify many questionable apps (7). Beyond privacy concerns, apps have been known to offer dangerous and harmful advice (8).
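As a minimal illustration of this first-pass privacy screen, the sketch below flags apps without a linked privacy policy; the metadata structure and field name are hypothetical, because app stores expose this information in different ways, and the check is a screen, not a substitute for reading the policy itself.

```python
# Hypothetical app metadata; real app-store listings expose this differently
catalog = [
    {"name": "MoodApp A", "privacy_policy_url": "https://example.com/privacy"},
    {"name": "MoodApp B", "privacy_policy_url": None},
]

def flag_missing_privacy_policy(apps):
    """Return names of apps with no linked privacy policy, a quick
    first-pass screen consistent with the framework's first stage."""
    return [app["name"] for app in apps if not app.get("privacy_policy_url")]

print(flag_missing_privacy_policy(catalog))  # ['MoodApp B']
```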
When evaluating efficacy, it is important to realize that although many apps appear useful, the actual evidence for clinical efficacy is nascent. This does not mean apps cannot be helpful, but it highlights the importance of considering whether the current evidence for the app in question is sufficient or relevant for a particular patient. Together, stages 1 and 2 of the framework (risk/privacy and safety, and evidence and benefit) constitute basic medical decision making centered on nonmaleficence.
The engagement stage represents the growing awareness that many patients do not stick with apps or may find them difficult to use (9). This likely reflects the lack of patient involvement in the development of mental health apps.
Data sharing, the final stage, reflects the need to ensure that app data are available to the treatment team. Poor interoperability can fragment care by limiting appropriate data sharing and access to information that is necessary to guide care and make treatment decisions.
Preliminary Internal Testing
Like any framework and tool for informed decision making, the APA app evaluation framework will evolve based on user feedback and evaluation. To gain an early understanding of the reliability of the framework and to generate foundational data, we conducted internal interrater reliability testing with five psychiatrists (JBT, SRC, SYG, JWK, and TN). Each psychiatrist was presented with three mood tracking apps (MoodTrack, MoodTools, and T2 Mood Tracker) that closely match the apps used in a recent study of app usability among patients with depression (9). The psychiatrists were asked to use the framework to rate each app at all four stages for use in two clinical situations.
The first clinical case involved a patient “who is tech savvy, in his twenties, suffering from moderate depression, without suicidal ideation, and interested in using an app to monitor mood while on a selective serotonin reuptake inhibitor.” The second clinical case involved a patient “who is less tech savvy but owns a smartphone her daughter gave her, in her late sixties, and suffering from moderate depression. She has two apps on her phone that she rarely uses but would like to monitor her mood while on a selective serotonin reuptake inhibitor.”
The psychiatrists were not provided any further information about the cases. Each reviewer downloaded the apps in October 2016 and was instructed to use each one for at least 15 minutes before reviewing it and to search for any research studies on the apps. We analyzed concordance among all five raters in their ratings of each stage of the framework, taking a Kendall's coefficient of concordance greater than .667 to indicate agreement among the raters.
The Kendall’s coefficient of concordance was .93 (p≤.01) for the risk/privacy and safety stage, .95 (p<.01) for evidence and benefit, .67 (p≤.01) for engagement, and .77 (p<.01) for interoperability.
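For readers who wish to apply the same analysis to their own ratings, below is a minimal sketch of Kendall’s coefficient of concordance (W) for an m-rater by n-app matrix; it omits the correction for tied ranks, and the example scores are hypothetical, not the study data.

```python
import numpy as np
from scipy.stats import rankdata

def kendalls_w(ratings):
    """Kendall's coefficient of concordance for an (m raters x n items)
    matrix of scores; no correction for tied ranks."""
    ratings = np.asarray(ratings, dtype=float)
    m, n = ratings.shape
    # Convert each rater's scores to ranks across the n items
    ranks = np.apply_along_axis(rankdata, 1, ratings)
    # Sum the ranks each item received across raters
    rank_sums = ranks.sum(axis=0)
    # Squared deviation of the rank sums from their mean
    s = ((rank_sums - rank_sums.mean()) ** 2).sum()
    return 12 * s / (m ** 2 * (n ** 3 - n))

# Hypothetical example: five raters scoring three apps on one stage
scores = [[1, 2, 3],
          [1, 2, 3],
          [1, 3, 2],
          [1, 2, 3],
          [2, 1, 3]]
print(round(kendalls_w(scores), 2))  # 0.64
```

A W near 1 indicates that the raters ranked the apps in nearly the same order; in this study, values above .667 were taken to indicate agreement.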
Conclusions
An evaluation framework for informed decision making is a useful response to the current challenges involved in rating apps. In presenting initial, internal reliability metrics of the APA app evaluation framework, we underscore the potential of this simple four-stage hierarchical process model, as well as opportunities to improve it. Although this column focuses on a depression example, the framework is intended for use with patients and apps focused on other conditions, such as schizophrenia (2). For patients with low literacy, impaired cognition, or apathy, the same evaluation process and stages are equally important and relevant. In an effort to better understand how clinicians use this model and to gain data for its further improvement, we have recently begun allowing users to share their app evaluations on the APA Web site.
App evaluation is a complex process involving the input of numerous stakeholder groups (10). Although these initial efforts were developed and tested with psychiatrists, efforts are under way to incorporate diverse stakeholder input into this framework, including the voices of patients and family members. Like apps themselves, app evaluation is a dynamic and evolving process. We hope the APA efforts presented here will stimulate discussion and encourage informed decision making around using apps in clinical care.