Conference: International Conference on Intelligent Computing and Intelligent Systems

Two-Layer Classification and Distinguished Representations of Users and Documents for Grouping and Authorship Identification

Abstract

Most studies on authorship identification report a drop in identification performance when the number of authors exceeds 20-25. In this paper, we introduce a new user representation to address this problem and split classification across two layers. The paper makes at least three novel contributions. First, the two-layer approach allows authorship identification to be applied over a larger number of authors (tested over 100 authors), and it is extensible. The authors are divided into groups, each containing a smaller number of authors. Given an anonymous document, the primary layer detects the group to which the document belongs. The secondary layer then determines the particular author within the selected group. To extract groups that link similar authors, clustering is applied over users rather than documents. Hence, the second contribution is a new user representation that differs from the document representation. Without the proposed user representation, clustering over documents would scatter an author's documents across several clusters instead of assigning each author to a single cluster. Third, the extracted clusters describe their users in a meaningful way, because the dimensions have psychological backgrounds. For authorship identification, the documents are labelled with the extracted groups and fed into machine learning algorithms to build classification models that predict the group and the author of a given document. The results show that the documents are highly correlated with their corresponding extracted groups, and that the proposed model can be trained to determine the group and the author identity accurately.
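The two-layer pipeline described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the abstract does not specify the features, the clustering algorithm, or the classifiers. The sketch assumes each document is already a numeric feature vector, uses the mean of an author's document vectors as a stand-in for the proposed user representation, clusters users with a plain k-means, and uses a nearest-centroid rule for both layers. All function names and the toy data are hypothetical.

```python
from collections import defaultdict
from math import dist


def mean(vectors):
    """Component-wise mean of equal-length vectors."""
    n = len(vectors)
    return tuple(sum(col) / n for col in zip(*vectors))


def user_profiles(docs):
    """Assumed user representation: mean feature vector of each author's documents."""
    by_author = defaultdict(list)
    for author, vec in docs:
        by_author[author].append(vec)
    return {author: mean(vecs) for author, vecs in by_author.items()}


def cluster_users(profiles, k, iters=20):
    """Plain k-means over users (not documents); returns {author: group_id}."""
    authors = sorted(profiles)
    # Deterministic farthest-first seeding (an assumption of this sketch).
    centroids = [profiles[authors[0]]]
    while len(centroids) < k:
        centroids.append(max((profiles[a] for a in authors),
                             key=lambda p: min(dist(p, c) for c in centroids)))
    assign = {}
    for _ in range(iters):
        assign = {a: min(range(k), key=lambda g: dist(profiles[a], centroids[g]))
                  for a in authors}
        for g in range(k):
            members = [profiles[a] for a in authors if assign[a] == g]
            if members:
                centroids[g] = mean(members)
    return assign


def fit_centroids(labelled):
    """Nearest-centroid 'model': one mean vector per label."""
    by_label = defaultdict(list)
    for label, vec in labelled:
        by_label[label].append(vec)
    return {label: mean(vecs) for label, vecs in by_label.items()}


def predict(model, vec):
    """Return the label whose centroid is closest to vec."""
    return min(model, key=lambda label: dist(vec, model[label]))


def train_two_layer(docs, k):
    """Cluster users, label documents with groups, then fit both layers."""
    groups = cluster_users(user_profiles(docs), k)
    # Primary layer: documents labelled with their author's group.
    primary = fit_centroids([(groups[a], v) for a, v in docs])
    # Secondary layer: one author classifier per group.
    secondary = {g: fit_centroids([(a, v) for a, v in docs if groups[a] == g])
                 for g in set(groups.values())}
    return groups, primary, secondary


def identify(primary, secondary, vec):
    """Primary layer picks the group; secondary layer picks the author in it."""
    group = predict(primary, vec)
    return group, predict(secondary[group], vec)


# Toy data: (author, feature vector) pairs, invented for illustration only.
DOCS = [
    ("alice", (0.0, 0.1)), ("alice", (0.2, 0.0)),
    ("bob", (1.0, 0.1)), ("bob", (1.2, 0.0)),
    ("carol", (10.0, 10.0)), ("carol", (10.2, 9.9)),
    ("dave", (11.0, 10.1)), ("dave", (11.2, 10.0)),
]
groups, primary, secondary = train_two_layer(DOCS, k=2)
print(groups)
print(identify(primary, secondary, (1.05, 0.0)))
```

Because clustering runs over one profile per author, each author belongs to exactly one group by construction, which is the property the abstract contrasts with clustering over documents. Any real classifier could replace the nearest-centroid rule in either layer.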