Conference: Australasian Joint Conference on Artificial Intelligence

Improving Authorship Attribution in Twitter Through Topic-Based Sampling

Abstract

Aliases are used as a means of anonymity on the Internet in environments such as IRC (Internet Relay Chat), forums and micro-blogging websites such as Twitter. While there are genuine reasons for the use of aliases, such as journalists operating in politically oppressive countries, they are increasingly being used by cybercriminals and extremist organisations. In recent years, we have seen increased research on authorship attribution of Twitter messages, including authorship analysis of aliases. Previous studies have shown that anti-aliasing of randomly generated sub-aliases yields high accuracy when linking the sub-aliases, but becomes much less accurate when topic-based sub-aliases are used. N-gram methods have previously been demonstrated to perform better than other methods in this situation. This paper investigates the effect of topic-based sampling on authorship attribution accuracy for the popular micro-blogging website Twitter. Features are extracted using character n-grams, which accurately capture differences in authorship style. These features are analysed with support vector machines in a one-versus-all classification scheme. The predictive performance of the algorithm is then evaluated under two different sampling methodologies: authors sampled through a context-sensitive topic-based search, and authors sampled randomly. Topic-based sampling of authors is found to produce more accurate authorship predictions. This paper presents several theories as to why this might be the case.
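The pipeline the abstract describes (character n-gram features fed to one-versus-all support vector machines) can be illustrated with a minimal sketch. This is not the authors' implementation: the n-gram lengths, the unit-length normalisation, the bias-free linear SVM trained by subgradient descent, and every function name and sample text are assumptions chosen only to keep the example short and dependency-free.

```python
from collections import Counter


def char_ngrams(text, n_min=2, n_max=3):
    """Count overlapping character n-grams (lengths are an assumption)."""
    counts = Counter()
    for n in range(n_min, n_max + 1):
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    return counts


def unit_vector(counts):
    """Scale a sparse count vector to unit L2 norm."""
    norm = sum(v * v for v in counts.values()) ** 0.5
    return {g: v / norm for g, v in counts.items()} if norm else {}


def dot(weights, features):
    """Sparse dot product between a weight dict and a feature dict."""
    return sum(weights.get(g, 0.0) * v for g, v in features.items())


def train_binary_svm(samples, targets, epochs=50, lam=0.01, lr=0.1):
    """Bias-free linear SVM: subgradient descent on L2-regularised hinge loss.

    targets must be +1 or -1. Hyperparameters are illustrative guesses.
    """
    weights = {}
    for _ in range(epochs):
        for x, y in zip(samples, targets):
            violated = y * dot(weights, x) < 1  # inside the margin?
            for g in weights:                   # L2 shrinkage step
                weights[g] *= 1 - lr * lam
            if violated:                        # hinge-loss subgradient step
                for g, v in x.items():
                    weights[g] = weights.get(g, 0.0) + lr * y * v
    return weights


def train_one_vs_all(texts, authors):
    """Train one binary SVM per author: that author (+1) versus the rest (-1)."""
    feats = [unit_vector(char_ngrams(t)) for t in texts]
    return {
        a: train_binary_svm(feats, [1 if x == a else -1 for x in authors])
        for a in sorted(set(authors))
    }


def predict_author(models, text):
    """Return the author whose SVM gives the highest decision score."""
    x = unit_vector(char_ngrams(text))
    return max(models, key=lambda a: dot(models[a], x))


if __name__ == "__main__":
    # Hypothetical toy messages; real experiments would use many tweets per author.
    texts = [
        "omg lol!!! sooo gooood",
        "lol omg cant even!!!",
        "Indeed, the results are significant.",
        "The results, indeed, seem notable.",
    ]
    authors = ["user_a", "user_a", "user_b", "user_b"]
    models = train_one_vs_all(texts, authors)
    print(predict_author(models, "lol sooo funny!!!"))
```

In this sketch each author gets a separate binary classifier, and a message is attributed to the author whose classifier scores it highest. A library implementation such as a linear SVM from scikit-learn would normally replace the hand-written trainer; it is written out here only so the example runs with the standard library alone.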