Journal: Information Processing & Management

Authorship attribution based on a probabilistic topic model


Abstract

This paper describes, evaluates and compares the use of latent Dirichlet allocation (LDA) as an approach to authorship attribution. Based on this generative probabilistic topic model, we can model each document as a mixture of topic distributions with each topic specifying a distribution over words. Based on author profiles (aggregation of all texts written by the same writer) we suggest computing the distance with a disputed text to determine its possible writer. This distance is based on the difference between the two topic distributions. To evaluate different attribution schemes, we carried out an experiment based on 5408 newspaper articles (Glasgow Herald) written by 20 distinct authors. To complement this experiment, we used 4326 articles extracted from the Italian newspaper La Stampa and written by 20 journalists. This research demonstrates that the LDA-based classification scheme tends to outperform the Delta rule and the χ² distance, two classical approaches in authorship attribution based on a restricted number of terms. Compared to the Kullback-Leibler divergence, the LDA-based scheme can provide better effectiveness when considering a larger number of terms.
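
The attribution step described in the abstract can be illustrated with a minimal Python sketch. It assumes the topic distributions have already been inferred by an LDA model, for each author profile and for the disputed text. The abstract does not say which distance the authors use, so the L1 distance (sum of absolute differences) is an assumption made here for illustration. The function names, author labels and numbers are all hypothetical.

```python
from typing import Dict, List


def l1_distance(p: List[float], q: List[float]) -> float:
    """Sum of absolute differences between two topic distributions.

    Assumption: the abstract only says the distance is based on the
    difference between the two distributions; L1 is one simple choice.
    """
    if len(p) != len(q):
        raise ValueError("distributions must cover the same topics")
    return sum(abs(a - b) for a, b in zip(p, q))


def attribute(disputed: List[float], profiles: Dict[str, List[float]]) -> str:
    """Return the author whose profile is closest to the disputed text."""
    return min(profiles, key=lambda author: l1_distance(disputed, profiles[author]))


# Hypothetical topic distributions over 4 topics (made-up numbers,
# standing in for the output of a fitted LDA model).
profiles = {
    "author_a": [0.70, 0.10, 0.10, 0.10],
    "author_b": [0.10, 0.60, 0.20, 0.10],
    "author_c": [0.25, 0.25, 0.25, 0.25],
}
disputed = [0.15, 0.55, 0.20, 0.10]

# The disputed text is assigned to the nearest author profile.
predicted = attribute(disputed, profiles)
print(predicted)
```

In this toy example the disputed text is assigned to the profile with the smallest distance. Swapping in another distance only requires replacing `l1_distance`.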
