Region Based Instance Document (RID) Approach Using Compression Features for Authorship Attribution

N. V. Ganapathi Raju; Someswara Rao Chinta

Abstract

Authorship attribution is concerned with identifying the authors of disputed or anonymous documents, which can be significant in legal proceedings (criminal and civil cases), threatening letters, terrorist communications, and computer forensics. There are two basic approaches to authorship attribution: instance-based, which treats each training text individually, and profile-based, which treats an author's training texts cumulatively. Each has its own advantages and disadvantages. This paper proposes a new region-based document model for authorship identification that addresses the dimensionality problem of instance-based approaches and the scalability problem of profile-based approaches. The proposed model concatenates a set of n individual instance documents by an author into a single region-based instance document (RID). A compression-based similarity distance method is then applied to the RID. Compression-based methods require no pre-processing and are easy to apply. This paper uses the Gzip compression algorithm with two compression-based similarity measures, NCD and CDM. The proposed compression model is character-based and can automatically capture non-word features such as word stems and punctuation. The only disadvantage of compression models is their high complexity; the proposed RID approach addresses this by reducing the repeated words in the document. The approach was evaluated on English editorial columns and achieved approximately 98% accuracy in identifying the author.
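
The abstract does not state the formulas, so the following is only a minimal sketch of how gzip-based attribution over RIDs might look. It assumes the commonly cited definitions NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)) (Normalized Compression Distance) and CDM(x, y) = C(xy) / (C(x) + C(y)) (Compression-based Dissimilarity Measure), where C(.) is the compressed size in bytes. The function names, the newline separator, and the nearest-RID decision rule are illustrative assumptions, not the authors' implementation; the repeated-word reduction mentioned in the abstract is not reproduced here.

```python
import gzip
from typing import Callable, Iterable, Mapping


def compressed_size(data: bytes) -> int:
    """Length in bytes of the gzip-compressed form of `data`."""
    return len(gzip.compress(data, compresslevel=9))


def ncd(x: bytes, y: bytes) -> float:
    """Normalized Compression Distance (assumed standard definition)."""
    cx, cy = compressed_size(x), compressed_size(y)
    cxy = compressed_size(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


def cdm(x: bytes, y: bytes) -> float:
    """Compression-based Dissimilarity Measure (assumed standard definition)."""
    return compressed_size(x + y) / (compressed_size(x) + compressed_size(y))


def build_rid(documents: Iterable[str]) -> bytes:
    """Concatenate one author's instance documents into a single RID.

    The newline separator is an assumption made for this sketch.
    """
    return "\n".join(documents).encode("utf-8")


def attribute(
    unknown: str,
    rids: Mapping[str, bytes],
    measure: Callable[[bytes, bytes], float] = ncd,
) -> str:
    """Return the author whose RID is closest to `unknown` under `measure`."""
    query = unknown.encode("utf-8")
    return min(rids, key=lambda author: measure(query, rids[author]))
```

In use, one RID would be built per candidate author from that author's training texts, and a disputed text would be passed to `attribute` together with the resulting mapping. Lower values of either measure indicate greater similarity, which is why the sketch picks the minimum.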

Bibliographic details

Source
Annals Data Science, 2018, Issue 3, pp. 437-451 (15 pages)
Authors
N. V. Ganapathi Raju; Someswara Rao Chinta
Affiliations

Department of CSE, Gokaraja Rangaraju Institute of Engineering and Technology;

Department of CSE, SRKR Engineering College

Original format: PDF
Language: English
Keywords
GZip compressor; NCD and CDM measures; Authorship identification
