User-generated content (UGC) is a valuable but underutilized source of information about individuals who participate in online cessation interventions. This study represents a first effort to passively detect smoking status among members of an online cessation program using UGC.
Secondary data analysis was performed on data from 826 participants in a web-based smoking cessation randomized trial that included an online community. Domain experts from the online community reviewed each post and comment written by participants and attempted to infer the author’s smoking status at the time it was written. Inferences from UGC were validated by comparison with self-reported 30-day point prevalence abstinence (PPA). Following validation, the impact of this method was evaluated across all individuals and time points in the study period.
Of the 826 participants in the analytic sample, 719 had written at least one post from which content inference was possible. Among participants for whom unambiguous smoking status was inferred during the 30 days preceding their 3-month follow-up survey, concordance with self-report was almost perfect (kappa = 0.94). Posts indicating abstinence tended to be written shortly after enrollment (median = 14 days).
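As an illustration of the concordance statistic reported above, the following minimal sketch computes Cohen's kappa between two sets of binary smoking-status labels. It is not the study's analysis code: the function name `cohens_kappa` and the ten toy labels are hypothetical and chosen only to show the arithmetic, and they do not reproduce the reported value of 0.94.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two equal-length sequences of categorical labels."""
    if len(labels_a) != len(labels_b) or len(labels_a) == 0:
        raise ValueError("label sequences must be non-empty and of equal length")

    n = len(labels_a)

    # Observed agreement: share of participants given the same label by both sources.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement: computed from each source's marginal label frequencies.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    all_labels = counts_a.keys() | counts_b.keys()
    expected = sum(counts_a[label] * counts_b[label] for label in all_labels) / (n * n)

    if expected == 1.0:
        # Both sources used a single identical label; agreement is trivially perfect.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical toy data (NOT study data): status inferred from posts versus
# self-reported 30-day point prevalence abstinence for ten participants.
inferred = ["abstinent", "abstinent", "smoking", "smoking", "abstinent",
            "smoking", "smoking", "abstinent", "smoking", "smoking"]
self_report = ["abstinent", "abstinent", "smoking", "smoking", "abstinent",
               "smoking", "smoking", "smoking", "smoking", "smoking"]

kappa = cohens_kappa(inferred, self_report)
print(f"kappa = {kappa:.2f}")
```

Kappa corrects raw agreement for the agreement expected by chance, which is why it is a stricter measure than percent agreement when one status is much more common than the other.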
Passive inference of smoking status from UGC in online cessation communities is possible and highly reliable for smokers who actively produce content. These results lay the groundwork for further development of observational research tools and intervention innovations.
A proof-of-concept methodology for inferring smoking status from user-generated content in online cessation communities is presented and validated. Content inference of smoking status makes a key cessation variable available for use in observational designs. This method provides a powerful tool for researchers interested in online cessation interventions and establishes a foundation for larger-scale application via machine learning.