Two-Layer Classification and Distinguished Representations of Users and Documents for Grouping and Authorship Identification

Abstract

Most studies on authorship identification report a drop in identification accuracy once the number of authors exceeds 20-25. In this paper, we introduce a new user representation to address this problem and split classification across two layers. The paper makes at least three contributions. First, the two-layer approach allows authorship identification to be applied to a larger number of authors (tested over 100 authors), and it is extensible. The authors are divided into groups, each containing a smaller number of authors. Given an anonymous document, the primary layer detects the group to which the document belongs. The secondary layer then determines the particular author inside the selected group. To extract groups that link similar authors, clustering is applied over users rather than documents. Hence, the second contribution is a new user representation that differs from the document representation. Without the proposed user representation, clustering over documents would scatter each author's documents across several clusters instead of giving each author a single cluster membership. Third, the extracted clusters describe their users meaningfully, because the dimensions have psychological backgrounds. For authorship identification, the documents are labelled with the extracted groups and fed into machine learning to build classification models that predict the group and author of a given document. The results show that the documents are highly correlated with their corresponding extracted groups, and that the proposed model can be accurately trained to determine the group and the author identity.
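
The two-layer flow described in the abstract can be sketched roughly as follows. This is an illustrative sketch only, not the authors' implementation: the abstract does not specify the features, the clustering algorithm, or the classifiers, so the code assumes a user is represented by the mean of that user's document feature vectors, groups users with a plain k-means, and uses a nearest-centroid rule as a stand-in for both classification layers. All names (`train`, `predict`, `TOY_DOCS`) and the toy 2-D feature values are hypothetical.

```python
import math
from collections import defaultdict


def mean_vector(vectors):
    """Component-wise mean of equal-length feature vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def nearest(point, centroids):
    """Return the key of the centroid closest to point (Euclidean distance)."""
    return min(centroids, key=lambda key: math.dist(point, centroids[key]))


def kmeans(points, k, iters=20):
    """Plain k-means over {name: vector}; returns {name: cluster_id}.

    Uses a deterministic farthest-point initialisation (an assumption
    made here for reproducibility, not something the paper states).
    """
    names = sorted(points)
    centroids = {0: points[names[0]]}
    while len(centroids) < k:
        far = max(names, key=lambda n: min(math.dist(points[n], c)
                                           for c in centroids.values()))
        centroids[len(centroids)] = points[far]
    assign = {}
    for _ in range(iters):
        assign = {n: nearest(points[n], centroids) for n in names}
        for c in centroids:
            members = [points[n] for n in names if assign[n] == c]
            if members:
                centroids[c] = mean_vector(members)
    return assign


def train(docs_by_author, k):
    """Cluster users (not documents), then build both layers."""
    # Assumed user representation: one vector per author.
    users = {a: mean_vector(d) for a, d in docs_by_author.items()}
    group_of = kmeans(users, k)
    # Label every document with its author's group.
    group_docs = defaultdict(list)
    for author, docs in docs_by_author.items():
        group_docs[group_of[author]].extend(docs)
    # Primary layer: one centroid per group of documents.
    group_centroids = {g: mean_vector(d) for g, d in group_docs.items()}
    # Secondary layer: per-group centroids for the authors in that group.
    author_centroids = defaultdict(dict)
    for author, g in group_of.items():
        author_centroids[g][author] = users[author]
    return group_of, group_centroids, author_centroids


def predict(doc, group_centroids, author_centroids):
    """Primary layer picks the group; secondary layer picks the author."""
    group = nearest(doc, group_centroids)
    return group, nearest(doc, author_centroids[group])


# Hypothetical toy data: two documents per author, 2-D feature vectors.
TOY_DOCS = {
    "a1": [(0.0, 0.0), (0.2, 0.1)],
    "a2": [(1.0, 1.0), (1.1, 0.9)],
    "b1": [(10.0, 10.0), (10.2, 9.9)],
    "b2": [(11.0, 11.0), (11.1, 10.8)],
}

if __name__ == "__main__":
    group_of, group_centroids, author_centroids = train(TOY_DOCS, k=2)
    print(group_of)
    print(predict((0.9, 1.0), group_centroids, author_centroids))
```

On this toy data, `a1` and `a2` end up in one group and `b1` and `b2` in the other, and the anonymous document `(0.9, 1.0)` is routed first to the `a` group and then to `a2`. Because clustering runs over one vector per user, every author belongs to exactly one group by construction, which is the property the abstract says clustering over documents would not guarantee.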

Bibliographic details

Source: International Conference on Intelligent Computing and Intelligent Systems, 2009, 7 pages
Authors: Haytham Mohtasseb; Amr Ahmed
Original format: PDF
Chinese Library Classification: TP18-53
Keywords: Authorship identification; Similarity detection; Personal blogs; Users lexicon and representation; Keywords extraction
