Mining User-Generated Content in an Online Smoking Cessation Community to Identify Smoking Status: A Machine Learning Approach

Research Summary:


Smoking causes about 20% of all deaths in the United States. The majority of smokers want to quit, and over 12 million turned to the internet for information about quitting smoking in 2017. For health promotion programs, it is important to provide tailored interventions, as they exert positive effects on health behavior change and program participation. Users are more likely to attend to content they perceive as being personally relevant, and more likely to remain engaged with interventions that they find satisfying or helpful in achieving their goals. Tailored content is thought to elicit a greater degree of cognitive processing as it is more likely to be read, understood, recalled, rated highly, and perceived as credible compared to one-size-fits-all intervention content. Tailored content may also lead to longer and more robust engagement with an intervention. Specifically for smoking cessation, tailored information – delivered via print and Internet interventions – has been shown to be effective in helping people quit. Development of automated decision support tools that can accurately identify an individual’s smoking status will help designers of Internet cessation interventions better deliver tailored support.

Methods & Data

The study involved data from BecomeAnEX, a publicly available web-based smoking cessation program. BecomeAnEX was developed in collaboration with the Mayo Clinic Nicotine Dependence Center and has had over 800,000 users register on the site since it was launched in 2008. To register on BecomeAnEX, individuals must agree to the site’s Terms of Use and Privacy Policy. Our analyses focused on blogs and blog comments since they are the most popular communication channels and typically consist of longer and more elaborate posts from users. Our dataset includes 38,156 blog posts and 316,886 blog comments published by 5435 users in the BecomeAnEX community between January 2012 and May 2015.

Briefly, each post was coded by two domain experts; a study team member served as a tiebreaker for any posts where the two original coders disagreed. Posts were coded for the author’s smoking status at the time the post was written. Available codes were “Clearly smoking,” “Clearly not smoking,” or “Unclear.” Coders were instructed to use inference and make their best guess based on the text and subtext of each post, but to use the “Unclear” code whenever they did not feel confident that a reliable judgment could be made.
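The two-coder procedure with a tiebreaker can be sketched in a few lines of Python. This is a minimal illustration, not the study team's tooling: the function name `resolve_code` and its arguments are hypothetical, and only the three code labels come from the text above.

```python
from typing import Optional

# The three codes available to annotators (taken from the text).
VALID_CODES = {"Clearly smoking", "Clearly not smoking", "Unclear"}


def resolve_code(coder_a: str, coder_b: str, tiebreaker: Optional[str] = None) -> str:
    """Return the final code for one post (hypothetical helper).

    If the two coders agree, their shared code is final. If they
    disagree, the tiebreaker's code is used instead.
    """
    for code in (coder_a, coder_b):
        if code not in VALID_CODES:
            raise ValueError(f"unknown code: {code!r}")
    if coder_a == coder_b:
        return coder_a
    if tiebreaker is None:
        raise ValueError("coders disagree; a tiebreaker code is required")
    if tiebreaker not in VALID_CODES:
        raise ValueError(f"unknown code: {tiebreaker!r}")
    return tiebreaker


# Agreement: no tiebreaker needed.
agreed = resolve_code("Clearly not smoking", "Clearly not smoking")
# Disagreement: the tiebreaker decides.
disputed = resolve_code("Clearly smoking", "Unclear", tiebreaker="Unclear")
```

Raising an error when the coders disagree and no tiebreaker is supplied is a design choice of this sketch, so that an unresolved post cannot pass through silently.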

In the current study, those manual annotations were recoded into a binary scheme that emphasized accurate classification of abstinence. “Clearly not smoking” posts were the positive class; “Clearly smoking” and “Unclear” posts were combined into a single negative class, yielding a binary classification problem. The two classes were relatively balanced: the positive class constituted 48% (n = 1015) of all annotated posts, including 44.4% among blogs and 55.6% among blog comments. The decision to select “Clearly not smoking” as the positive class was based on treatment implications. Specifically, accurately identifying when a smoker has begun a quit attempt (i.e., their first instance of “Clearly not smoking”) would allow a tailored intervention to provide them with relapse prevention support, which differs qualitatively from the skills training support that is most appropriate for smokers before they begin a quit attempt. Other applications might be better served by focusing on identification of “Clearly smoking,” which could be achieved by adapting the methods applied here.
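The recoding into a binary scheme can be illustrated with a short Python sketch. The helper names `to_binary` and `positive_share` and the toy list of codes are assumptions made for illustration; they are not the study's code or data.

```python
from typing import Iterable

# The positive class named in the text; every other code is negative.
POSITIVE_CODE = "Clearly not smoking"


def to_binary(code: str) -> int:
    """Map a three-way annotation to a binary label (hypothetical helper).

    "Clearly not smoking" -> 1 (positive class)
    "Clearly smoking" or "Unclear" -> 0 (negative class)
    """
    return 1 if code == POSITIVE_CODE else 0


def positive_share(codes: Iterable[str]) -> float:
    """Fraction of posts that fall in the positive class."""
    labels = [to_binary(code) for code in codes]
    if not labels:
        raise ValueError("no annotated posts supplied")
    return sum(labels) / len(labels)


# Toy annotations, invented for illustration only.
toy_codes = [
    "Clearly not smoking",
    "Clearly smoking",
    "Unclear",
    "Clearly not smoking",
]
toy_labels = [to_binary(code) for code in toy_codes]
toy_share = positive_share(toy_codes)
```

An application focused on “Clearly smoking” instead could, under the same assumptions, change `POSITIVE_CODE` and leave the rest untouched.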


This proof-of-concept study demonstrated the effectiveness of a new approach to automatically detecting individuals’ smoking status from large-scale data in an online smoking cessation community. Our approach went beyond the traditional approach, which examines only the content of a user’s post. Instead, we incorporated into the machine-learning-based classifier domain-specific features related to an online smoking cessation community, author-specific features related to patterns of user online engagement, and thread-specific features that signaled abstinence. Adding these novel features improved the classifier’s performance by approximately 10% and pointed to the importance of incorporating domain knowledge and of considering characteristics of the author, along with preceding and subsequent posts, when detecting the smoking status of a focal post’s author.
