
Text classification for automatic detection of alcohol use-related tweets: A feasibility study



Abstract

We present a feasibility study that uses text classification to identify tweets about alcohol use. Alcohol is the most widely used substance in the US, and alcohol use is the leading risk factor for premature morbidity and mortality globally. Understanding use patterns and locations is an important step toward prevention, moderation, and control of alcohol outlets. Social media may provide an alternative way to measure alcohol use in real time. This feasibility study explores text classification methodologies for identifying alcohol use tweets. We labeled 34,563 geo-located New York City tweets collected in a 24-hour period over New Year's Day 2012. We preprocessed the tweets into stemmed/unstemmed and unigram/bigram representations. We then applied multinomial naïve Bayes, a linear SVM, Bayesian logistic regression, and random forests to the classification task. Using 10-fold cross-validation, the algorithms achieved areas under the receiver operating characteristic (ROC) curve of 0.66, 0.91, 0.93, and 0.94, respectively. We also compare against a human-constructed Boolean search over the same tweets, and the text classification method is competitive with this hand-crafted search. In conclusion, we show that automatically identifying alcohol-related tweets is highly feasible, which paves the way for future research to improve these classifiers.
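To make the pipeline described in the abstract concrete, the following is a minimal sketch of one of its configurations: unstemmed unigram/bigram features fed to a multinomial naïve Bayes classifier, scored by area under the ROC curve. It is not the authors' implementation. The helper names (`ngrams`, `MultinomialNB`, `auc`), the Laplace smoothing constant, and the toy tweets are all assumptions made for illustration; stemming and 10-fold cross-validation are omitted to keep the example short.

```python
import math
from collections import Counter

def ngrams(text, use_bigrams=True):
    """Lowercase, split on whitespace, return unigram (and optional bigram) tokens."""
    words = text.lower().split()
    feats = list(words)
    if use_bigrams:
        feats += [a + "_" + b for a, b in zip(words, words[1:])]
    return feats

class MultinomialNB:
    """Illustrative multinomial naive Bayes for binary labels 0/1."""

    def fit(self, docs, labels, alpha=1.0):
        # alpha is an assumed Laplace smoothing constant
        self.vocab = {t for d in docs for t in d}
        self.log_prior, self.log_like = {}, {}
        for c in (0, 1):
            class_docs = [d for d, y in zip(docs, labels) if y == c]
            counts = Counter(t for d in class_docs for t in d)
            total = sum(counts.values()) + alpha * len(self.vocab)
            self.log_prior[c] = math.log(len(class_docs) / len(docs))
            self.log_like[c] = {
                t: math.log((counts[t] + alpha) / total) for t in self.vocab
            }
        return self

    def score(self, doc):
        """Log-odds of class 1 versus class 0; unseen tokens are ignored."""
        s = self.log_prior[1] - self.log_prior[0]
        for t in doc:
            if t in self.vocab:
                s += self.log_like[1][t] - self.log_like[0][t]
        return s

def auc(scores, labels):
    """Probability that a random positive outscores a random negative (ties = 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical toy tweets (1 = alcohol-related); not data from the study.
train = [
    ("drinking beer at the bar tonight", 1),
    ("so drunk after the party vodka shots", 1),
    ("wine and champagne for new year", 1),
    ("happy new year to my family", 0),
    ("watching the ball drop in times square", 0),
    ("stuck on the subway going home", 0),
]
test = [
    ("beer and vodka shots tonight", 1),
    ("champagne at the bar", 1),
    ("going home on the subway", 0),
    ("family watching the ball drop", 0),
]

model = MultinomialNB().fit([ngrams(t) for t, _ in train], [y for _, y in train])
scores = [model.score(ngrams(t)) for t, _ in test]
labels = [y for _, y in test]
print("AUC on toy test set:", auc(scores, labels))
```

Because the score is a log-odds value, sweeping a threshold over it traces the ROC curve. The pairwise `auc` function computes the same area directly, which is convenient for small samples but quadratic in the number of tweets.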
