Conference: International conference on Asian-Pacific digital libraries

Identification of Tweets that Mention Books: An Experimental Comparison of Machine Learning Methods

Abstract

In this paper, we address the task of identifying tweets that mention books (TMB) among tweets that contain strings identical to full book titles. Although this task can be treated as a kind of Named Entity Recognition, it is made harder by the fact that book titles often consist of ordinary expressions (such as "The Girl on the Train"). Furthermore, when tweets are gathered through a dictionary-based search, those that contain full book titles are often spam. However, given a complete list of book titles (e.g. from a library union catalogue or a book store's commercial bibliographic data), the task can be solved by text classification. We therefore propose a two-step pipeline consisting of spam filtering and TMB classification, based on supervised learning with a small amount of labelled data. We constructed the best-performing classifiers by comparing combinations of four proven supervised learning methods with different features. Given the difficulty of the task, our pipeline performed well, achieving an F-score of about 0.7.
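
The two-step pipeline described in the abstract can be sketched roughly as follows. This is only a minimal illustration under stated assumptions: the abstract does not name the four learning methods or the features that were compared, so the hand-written Naive Bayes classifier, the bag-of-words features, the label names, and the toy training tweets below are all hypothetical stand-ins, not the authors' setup.

```python
import math
from collections import Counter, defaultdict

def tokenize(text):
    # Lowercased whitespace tokens; an assumed stand-in for real features.
    return text.lower().split()

class NaiveBayes:
    """Tiny multinomial Naive Bayes with add-one smoothing (illustrative)."""

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for text, label in zip(texts, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocab = {w for c in self.word_counts.values() for w in c}
        return self

    def predict(self, text):
        total = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n in self.class_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n / total)  # log prior
            for w in tokenize(text):
                score += math.log((counts[w] + 1) / denom)  # smoothed likelihood
            if score > best_score:
                best_label, best_score = label, score
        return best_label

def contains_title(tweet, titles):
    # Dictionary-based candidate search: case-insensitive full-title match.
    low = tweet.lower()
    return any(t.lower() in low for t in titles)

def classify_tweet(tweet, titles, spam_clf, tmb_clf):
    # Step 0: keep only tweets containing a full book title.
    if not contains_title(tweet, titles):
        return "no_title"
    # Step 1: spam filtering.
    if spam_clf.predict(tweet) == "spam":
        return "spam"
    # Step 2: TMB classification on the remaining tweets.
    return tmb_clf.predict(tweet)

# Invented toy data, for illustration only.
titles = ["The Girl on the Train"]
spam_clf = NaiveBayes().fit(
    ["buy cheap followers now click link the girl on the train",
     "free download click here the girl on the train pdf win prize",
     "just finished reading the girl on the train great novel",
     "saw the girl on the train this morning she looked tired"],
    ["spam", "spam", "ham", "ham"],
)
tmb_clf = NaiveBayes().fit(
    ["just finished reading the girl on the train great novel",
     "the girl on the train is a gripping book by the author",
     "saw the girl on the train this morning she looked tired",
     "the girl on the train next to me was singing loudly"],
    ["tmb", "tmb", "not_tmb", "not_tmb"],
)

print(classify_tweet("finished reading the girl on the train what a great book",
                     titles, spam_clf, tmb_clf))
```

Running the spam filter first means the TMB classifier only has to separate genuine mentions of a book from ordinary uses of the same words, which is the harder distinction the abstract highlights.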