Ecological momentary assessment (EMA), also referred to as the experience sampling method (ESM), has been a tool for understanding fluctuating phenomena and within-person dynamics, and the ubiquity of the smartphone has greatly accelerated the accessibility of this method for clinical applications (
67). Programs for the delivery of EMA surveys have become more widely available, and tools for the analysis of intensive longitudinal data have proliferated. In the earlier stages of EMA research, the focus was typically on recording behaviors (e.g., activity, sleep, smoking) or daily life experiences, such as stressors, through diaries (
68–
70). The data gathered enabled examination of within-person change, but required user input and did little to reduce the biases inherent to self-report (
70). These older assessment strategies had no way to accurately time-stamp the reports that were collected. Anecdotal reports of people arriving 20 minutes early for their appointments and completing 14 days’ worth of assessments are confirmed by the results of research studies comparing reported and observed adherence to paper diary assessments (
71).
The kinds of questions that researchers can now ask with these tools have yielded new insights into fundamental issues in mental health. Sometimes these findings are at odds with prevailing theories. It is commonly believed that smokers relapse because of nicotine withdrawal symptoms. Shiffman et al. (
78) evaluated smoking behavior in non-daily smokers and found that negative affect was more important than withdrawal symptoms in relapse, a finding critical for understanding which factors to target to sustain smoking cessation. It is commonly believed that suicidal ideation arises from feelings of hopelessness. Kleiman et al. (
79) found that suicidal thoughts varied markedly throughout the day and that variation in candidate predictors (e.g., hopelessness) did not predict the emergence of this ideation, a finding that had been produced previously in a hospitalized sample (
80). Depp et al. (
81) found that social isolation and number of social interactions did not predict onset of suicidal ideation in people with schizophrenia, but that the anticipation of being alone later was associated with an increase in ideation. Granholm et al. (
82) found that people with schizophrenia (N=100) spent considerably more time home and alone than healthy control subjects (N=71) and, even when home and alone, engaged in fewer productive behaviors. In a follow-up analysis of this sample, Strassnig et al. (
83) found that people with schizophrenia reported fewer activities, spent considerably more time sitting and less time standing, and were considerably more likely to sleep during daytime hours. However, listening to music and watching television were no more common in participants with schizophrenia than in healthy participants, suggesting that activities even less productive than passive recreation were among those more common in participants with schizophrenia.
These are just a few examples from a burgeoning field, highlighting the degree to which active EMA paradigms can be used to advance understanding of the dynamic processes underlying psychiatric diagnoses, extending and sometimes challenging prevailing theories. EMA is a useful strategy to identify targeted features of different conditions on a momentary basis. For example, repeated assessment can identify the proportion of prompts that are answered at home versus away and in the presence of other people versus alone. As these are the central indices of social isolation and social avoidance, the socially relevant impact of negative symptoms in schizophrenia (
85) and current depression in mood disorders can be directly indexed. Research suggests excellent correlations between clinical ratings of symptoms from structured interviews and EMA data, while identifying fluctuations in symptoms that are missed by more widely spaced assessments (
86,
87). These strategies can also be used to examine health-relevant behaviors in mental health populations, as described above. Given the reduced life expectancy associated with severe mental illness and the high prevalence of metabolic syndrome, EMA can be used to estimate the amount of time spent sitting versus standing or otherwise engaged in active behaviors. Because contemporary EMA can record the occurrence of multiple activities since the last survey, it is straightforward to determine whether only a single activity has occurred or whether participants are engaging in multiple concurrent activities, including physical activity (
88). When paired with the passive digital phenotyping described below, a comprehensive EMA assessment can examine location and social context, refine measurements of activity (exercise vs. agitation), detect sleeping during the daytime rather than at night, and assess concurrent subjective emotional responses to these activities.
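To make the social-context indices described above concrete, the following minimal sketch (in Python, using hypothetical field names such as "location" and "alone") shows how the proportion of answered prompts completed at home and while alone could be computed from EMA responses; it illustrates the general idea rather than any specific platform's implementation.

```python
# Minimal sketch: summarizing self-reported social context across answered EMA prompts.
# Field names ("location", "alone") are hypothetical placeholders, not a standard schema.
from dataclasses import dataclass


@dataclass
class EmaResponse:
    location: str  # e.g., "home", "work", "other"
    alone: bool    # whether the participant reported being alone


def social_context_summary(responses: list) -> dict:
    """Proportion of answered prompts completed at home and while alone."""
    n = len(responses)
    if n == 0:
        return {"prop_home": float("nan"), "prop_alone": float("nan")}
    return {
        "prop_home": sum(r.location == "home" for r in responses) / n,
        "prop_alone": sum(r.alone for r in responses) / n,
    }


# Example: 3 of 4 answered prompts at home, 2 of 4 alone
example = [
    EmaResponse("home", True),
    EmaResponse("home", False),
    EmaResponse("work", False),
    EmaResponse("home", True),
]
print(social_context_summary(example))  # {'prop_home': 0.75, 'prop_alone': 0.5}
```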
Passive Digital Phenotyping
A more recent breakthrough involves quantifying clinical outcomes using “passive” digital phenotyping (i.e., unobtrusively collecting data via the internal sensors of a smartphone, a wrist-worn smart band, or another device). Passive measures can reduce certain limitations associated with interview- and questionnaire-based clinical assessments (e.g., cognitive impairment, social desirability, cultural biases [
89]). Numerous passive measures have been evaluated in psychiatric populations (e.g., geolocation, accelerometry, ambient speech recorded from the environment, phone call and text logs, screen on/off time, social media activity, Bluetooth-based proximity social sensing) (
90–
96). However, the validity of these passive measures is only beginning to be established. Goldsack et al. (
97) proposed the V3 framework for determining the validity of passive digital biomarkers, which involves three components: verification, analytical validation, and clinical validation. These components, as reviewed below, provide a useful heuristic for determining whether the level of validity achieved for various passive measures meets clinical standards.
The first component of the V3 model, verification (i.e., efficacy), is a quality-control step for the device of interest that is performed by the manufacturer. It occurs before testing is conducted on human subjects. The goal is to determine whether the sensor captures data accurately and to verify that the software accurately outputs data within a predetermined range of values. For example, accelerometry could be verified by placing a smart band on an object programmed to accelerate at a prespecified rate. Verification is typically done by device/software manufacturers against a reference standard. However, the results of these tests and the analytic methods supporting the devices are typically not published or made available for evaluation, which presents replication challenges. Additionally, common standards do not exist for verifying passive digital phenotyping sensors of interest, and different device models often contain different sensors. Since devices and sensors may require differing levels of verification (e.g., required accuracy) for various clinical purposes, evaluating verification data is a critical step that should occur before passive digital phenotyping measures are applied in studies of clinical populations. For medical devices, such as medical decision-making software, this process may be handled by the U.S. Food and Drug Administration (FDA) as part of Good Manufacturing Practice (GMP) standards. Making test results and analytic methods underlying devices accessible to researchers will help disentangle whether failures of replication are true problems with reproducibility across clinical populations or simply differences in the technical quality of different devices used in studies.
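As a concrete illustration of such a bench test, the sketch below compares a simulated device recording against a programmed reference acceleration profile; the tolerance, sampling rate, and signal shape are assumptions chosen for illustration and do not reflect any manufacturer's actual verification procedure.

```python
# Illustrative bench-verification sketch: does the recorded trace stay within a preset
# tolerance of the programmed reference motion? All parameters here are assumed values.
import numpy as np


def verify_accelerometer(recorded: np.ndarray, reference: np.ndarray,
                         tolerance_g: float = 0.05) -> bool:
    """Pass verification if mean absolute error is within the tolerance (in g)."""
    mae = np.mean(np.abs(recorded - reference))
    return bool(mae <= tolerance_g)


# Simulated example: a 1 Hz sinusoidal reference sampled at 50 Hz, plus small sensor noise
rng = np.random.default_rng(0)
fs = 50
t = np.arange(0, 10, 1 / fs)
reference = 0.5 * np.sin(2 * np.pi * 1.0 * t)        # programmed acceleration profile (g)
recorded = reference + rng.normal(0, 0.01, t.size)   # device output with noise
print(verify_accelerometer(recorded, reference))     # True if within tolerance
```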
The second component, analytical validation (i.e., effectiveness), involves behavioral or physiological validation of a device in human subjects in the real world. A key first step in this process is determining whether the sample-level data output by the device are properly received and whether the algorithms applied to those data perform as expected. The metric resulting from the algorithm, applied in real time or post hoc, should be evaluated against a relevant reference. Although agreed-upon reference standards have not been determined for validating passive digital phenotyping measures, there has been initial analytical validation of some passive measures. For example, phone-based geolocation and accelerometry recorded on the ExpoApp have been validated in relation to a reference wrist-worn actigraph and a travel/activity diary; time in microenvironments and physical activity from the diary demonstrated high agreement with phone-based geolocation and accelerometry measures (
98). Huang and Onnela (
92) analytically validated a phone accelerometer and gyroscope against a ground-truth standard. They had human participants engage in specific physical activities (e.g., sitting, standing, walking, and ascending and descending stairs) with a phone in their front and back pockets. Behavior was filmed throughout as an objective reference. Activity classifications derived from the sensors accurately matched the video-recorded reference behavior. One ongoing challenge is that as smartphones are updated with new software and new phone models incorporate new sensors, prior validation efforts cannot be assumed to remain valid.
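One simple way to picture this kind of analytical-validation check is as an agreement analysis between sensor-inferred activity labels and video-coded reference labels; the sketch below uses hypothetical labels and standard classification metrics and is not the cited authors' actual pipeline.

```python
# Hypothetical agreement check between sensor-inferred activity classes and
# video-coded reference labels, using standard classification metrics.
from sklearn.metrics import accuracy_score, confusion_matrix

video_reference = ["sit", "sit", "stand", "walk", "walk", "stairs_up", "walk", "sit"]
sensor_inferred = ["sit", "sit", "stand", "walk", "stand", "stairs_up", "walk", "sit"]

labels = ["sit", "stand", "walk", "stairs_up"]
print("Accuracy:", accuracy_score(video_reference, sensor_inferred))
print(confusion_matrix(video_reference, sensor_inferred, labels=labels))
```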
The third component, clinical validation (i.e., implementation), involves determining whether the passive digital phenotyping variable of interest adequately predicts a specific clinical outcome within the population of interest. Preliminary evidence for clinical validation exists for several passive measures—although at times results have also been contradictory (
99). For example, in bipolar disorder, incipient depressive symptoms have been predicted by changes in the number of outgoing text messages, the duration of incoming phone calls, geolocation-based mobility measures, and vocal features extracted during phone calls. Manic symptoms of bipolar disorder have been predicted by more outgoing texts and calls, acoustic properties of speech extracted during phone calls (e.g., standard deviation of pitch), and increased movement detected via accelerometry (
100,
101). Clinically elevated and subthreshold depressive symptoms have been predicted by geolocation-derived measures of circadian rhythm, normalized entropy, and location variance, as well as phone usage frequency and speech-derived audio volume (
102–
105). Social anxiety has been predicted by reduced movement on accelerometry and fewer outgoing calls and texts (
106). Relapse of psychotic disorders has been predicted by geolocation mobility metrics and text/call behavior (
90). Negative symptoms of schizophrenia measured via EMA or clinical ratings have been predicted by geolocation-based mobility metrics, voice activity, and actigraphy-based metrics of gesture and activity level (
99,
107–
110). Combining passive measures with EMA surveys may further enhance clinical validation. For example, Raugh et al. (
111) found that the combination of geolocation and EMA surveys was a stronger predictor of clinically rated negative symptoms in schizophrenia than either measure alone. Similarly, Faurholt-Jepsen et al. (
101) found that combining vocal acoustic features extracted from phone calls with EMA reports improved the correct classification of mixed or manic mood states in bipolar disorder beyond either measure alone. Henson et al. (
112) reported that a combination of EMA and passive data, when analyzed for congruence with anomaly detection methods, was associated with early warnings of relapse in people with schizophrenia. Thus, studies suggest that passive measures are promising tools for measuring clinical outcomes. However, there are numerous inconsistencies regarding the predictive value of specific metrics and measures for classifying individual disorders or symptom states, including geolocation, accelerometry, ambient speech, and ambulatory psychophysiology (
113–
116). For example, clinical data on sleep did not match sensor reports in one study (
94), and results are not comparable across studies because of differences in the sensors utilized, in the clinical targets, in the time frames for calculating associations across assessment modalities (e.g., daily or monthly), and in the populations studied. There are also fundamental differences across studies in methods and analyses, such as whether corrections for multiple comparisons are applied when examining correlational data.
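For readers unfamiliar with the geolocation-derived features mentioned above, the sketch below shows simplified versions of location variance and normalized entropy that follow one common operationalization in this literature; exact definitions, location clustering, and preprocessing vary across the cited studies, so this is illustrative only.

```python
# Simplified geolocation features: location variance and normalized entropy.
# Cluster assignments and time-per-cluster values are assumed to be computed upstream.
import numpy as np


def location_variance(latitudes: np.ndarray, longitudes: np.ndarray) -> float:
    """Log of the summed variance of latitude and longitude samples."""
    return float(np.log(np.var(latitudes) + np.var(longitudes) + 1e-12))


def normalized_entropy(time_per_cluster: np.ndarray) -> float:
    """Entropy of time spent across location clusters, scaled toward [0, 1]."""
    p = time_per_cluster / time_per_cluster.sum()
    p = p[p > 0]
    if p.size <= 1:
        return 0.0
    return float(-(p * np.log(p)).sum() / np.log(p.size))


print(normalized_entropy(np.array([40.0, 1.0, 1.0])))  # low: time concentrated in one place
print(normalized_entropy(np.array([8.0, 8.0, 8.0])))   # 1.0: time spread evenly
```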
Clinical validation (i.e., implementation) is of particular concern when passive measures are used as outcomes for clinical interventions. Unlike the situation for traditional interview- or questionnaire-based clinical outcome measures, standards for the level of psychometric evidence needed to declare a passive digital phenotyping measure clinically validated have not yet been established. Proprietary data collection via devices (e.g., a custom wearable device [
117]) and proprietary methods for analysis (e.g., a custom machine learning algorithm [
118]) offer both innovation and a challenge to reproducible clinical research. Further complexity arises from the trend toward more complex analytic methods in passive digital phenotyping, driven by the multilevel nature of the data. For example, machine learning is an increasingly common tool in the clinical validation process, and studies have employed various algorithmic approaches (e.g., classification, regression, unsupervised clustering) to predict a range of clinical outcomes (
119). However, common standards for judging the level of psychometric evidence that constitutes clinical validation for machine learning are not yet uniformly applied across the field. Is predictive accuracy of 70% enough to declare clinical validation, or should a higher standard be set (e.g., 90% accuracy) (
104,
106)? Similar considerations affect simpler analytic methods, such as correlating clinical outcomes with passive data that have been aggregated over a period of time (e.g., 1 week) to form a single value. It seems important that such aggregated values be adjusted for the extent of day-to-day or time-of-day variation. These adjusted correlations tend to be statistically significant but lower (r values ∼0.3–0.5) than the typical standards for convergent validity that would be applied to clinical rating scales or questionnaires (e.g., r values >0.80) (
103,
104,
111). Do these lower correlations reflect inadequate convergent validity, even though they are statistically significant? Or is the lower correlation to be expected (and therefore acceptable) because it averages across differences in temporal variation between measures and across method variance? We suggest that common guidelines for judging what constitutes clinical validation are clearly needed for passive digital phenotyping. There should also be an effort to ensure that clinical validation studies recruit representative, diverse samples, so that algorithms are not trained primarily in populations whose demographic and personal characteristics do not overlap with the clinical populations of interest, and that methodological and analytic approaches remain valid and consistent across the population.
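To illustrate the simple correlational approach discussed above, the sketch below aggregates a synthetic daily passive metric into a single weekly value and correlates it with a concurrent clinical rating; the data, variable names, and effect size are invented for illustration, and adjustments for within-person variation (e.g., partial correlations controlling for day-to-day variability) are omitted for brevity.

```python
# Synthetic illustration: weekly aggregation of a daily passive metric, correlated
# with a clinical rating across participants. All values are simulated.
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(0)
n_participants = 40
daily_mobility = rng.normal(5.0, 2.0, size=(n_participants, 7))  # e.g., km traveled per day
weekly_mobility = daily_mobility.mean(axis=1)                     # aggregate week to one value
clinical_rating = 20 - 1.5 * weekly_mobility + rng.normal(0, 3.0, n_participants)

r, p = pearsonr(weekly_mobility, clinical_rating)
print(f"r = {r:.2f}, p = {p:.3f}")  # significant yet modest correlations are common here
```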
Feasibility of implementation is the next consideration, and barriers and facilitators such as cost, accessibility, tolerability, ease of use, and data failure rates are among the relevant factors. Few studies have evaluated users' experience of interacting with passive measures. However, qualitative studies employing interviews designed to assess patient perceptions have indicated that while many see these technologies as holding promise for clinical detection and self-management, there may also be unintended barriers to use, such as increased stigma or anxiety (
120,
121). One would expect that most passive measures would not be viewed by participants as burdensome, given that they are collected unobtrusively by the background sensors of their device and do not require direct participant action. However, there may be some instances where the device interface proves problematic in clinical populations. For example, in a study of outpatients with chronic schizophrenia, participants had considerable difficulty remedying Bluetooth unpairing of a smart band from a smartphone (
112). People with schizophrenia found this pairing issue more burdensome than did control subjects. It is also unclear whether certain clinical symptoms affect willingness to consent to participate in digital phenotyping studies. For example, by their nature, continuous geolocation and ambient speech monitoring raise questions about privacy and agency. It is not known whether clinical populations, such as individuals with schizophrenia who have delusions of suspicion, experience such technologies as intrusive, and whether these technologies exacerbate symptoms or lead individuals to decline participation out of fear of being monitored.
86) and that EMA reports of location have been validated using GPS coordinates (
108). More generally, issues of systemic racism and mistrust of how passive digital phenotyping information could be (mis)used by law enforcement or other systems of power may influence the implementation of these methods among participants from racial minority groups. Thus, user experience should be carefully evaluated when administering these technologies in clinical populations. As we mention below, the general issue of access to the Internet and experience with technology is a barrier that will need continuous attention.
Combinations of EMA and passive digital phenotyping seem likely to improve interventions and assessment. GPS coordinates provide information about where a person is, but not who is with them. Proximity detection can determine whether another individual with a device is present, and ambient sound sampling can tell whether individuals are interacting or are simply in proximity to each other. Smart bands can detect activity but not the motivation for the activity (exercise vs. agitation). Combining EMA mood sampling with geolocation information can help determine whether social isolation is due to depression or lack of motivation, and facial and vocal affect assessment from participant-captured samples can provide validating information for mood reports. A recent example (
122) suggested that combining passive phenotyping with EMA prompts was feasible: multiple different prompted responses were collected in conjunction with data on location, psychophysiological responses, and ambulatory acoustics in 44 participants with schizophrenia and 19 with bipolar disorder. Thus, an array of different elements of functioning can be captured simultaneously and used to generate a wide-ranging picture of momentary functioning.
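One simple way to combine these streams is to pair each EMA report with the passively inferred context closest in time; the sketch below assumes hypothetical record structures and a 30-minute matching window, and is meant only to show the alignment logic.

```python
# Pair each EMA mood report with the nearest-in-time passive context record
# (hypothetical fields: at_home from geolocation, moving from accelerometry).
from datetime import datetime, timedelta

ema_reports = [
    {"time": datetime(2024, 1, 1, 10, 0), "sad_mood": 4},
    {"time": datetime(2024, 1, 1, 15, 0), "sad_mood": 2},
]
passive_context = [
    {"time": datetime(2024, 1, 1, 9, 55), "at_home": True, "moving": False},
    {"time": datetime(2024, 1, 1, 15, 5), "at_home": False, "moving": True},
]


def nearest_context(report_time, contexts, max_gap=timedelta(minutes=30)):
    """Return the passive context record closest in time, if within max_gap."""
    best = min(contexts, key=lambda c: abs(c["time"] - report_time))
    return best if abs(best["time"] - report_time) <= max_gap else None


for report in ema_reports:
    ctx = nearest_context(report["time"], passive_context)
    print(report["sad_mood"], None if ctx is None else ctx["at_home"])
```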
A challenge in the domain of passive digital phenotyping is that application developers and scientific users are commonly at the mercy of manufacturers, who can restrict applications' access to phone features or push out operating system upgrades that cause software to fail. Further, applications that monitor social media use may also encounter restricted access or requirements that access be granted each time the application attempts to capture data. This is an area where collaboration with manufacturers will be required.