Journal: Annals Data Science

Region Based Instance Document (RID) Approach Using Compression Features for Authorship Attribution



Abstract

Authorship attribution is concerned with identifying the authors of disputed or anonymous documents, which can be significant in legal proceedings (criminal and civil cases), threatening letters, terrorist communications, and computer forensics. There are two basic approaches to authorship attribution: instance based (each training text is treated individually) and profile based (the training texts are treated cumulatively). Both approaches have their own advantages and disadvantages. This paper proposes a new region based document model for authorship identification that addresses the dimensionality problem of instance based approaches and the scalability problem of profile based approaches. The proposed model concatenates a set of 'n' individual instance documents by an author into a single region based instance document (RID). A compression based similarity distance method is then applied to the RID. Compression based methods require no pre-processing and are easy to apply. This paper uses the Gzip compression algorithm with two compression based similarity measures, NCD and CDM. The proposed compression model is character based and can automatically capture non-word features such as word stems and punctuation. The only disadvantage of compression models is their high complexity. The proposed RID approach addresses this issue by reducing the repeated words in the document. The approach was evaluated on English editorial columns and achieved approximately 98% accuracy in identifying the author.
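
The pipeline the abstract describes can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: it assumes the commonly cited definitions NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)) and CDM(x, y) = C(xy) / (C(x) + C(y)), where C is the Gzip-compressed size in bytes. The function names (`build_rid`, `attribute`), the newline separator, and the nearest-distance decision rule are hypothetical choices made for this sketch, and the abstract's step of reducing repeated words is not modeled.

```python
import gzip


def compressed_size(data: bytes) -> int:
    """C(x): number of bytes after gzip compression."""
    return len(gzip.compress(data))


def ncd(x: bytes, y: bytes) -> float:
    """Normalized Compression Distance (commonly cited definition)."""
    cx, cy = compressed_size(x), compressed_size(y)
    cxy = compressed_size(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


def cdm(x: bytes, y: bytes) -> float:
    """Compression-based Dissimilarity Measure (commonly cited definition)."""
    cx, cy = compressed_size(x), compressed_size(y)
    cxy = compressed_size(x + y)
    return cxy / (cx + cy)


def build_rid(instance_docs: list[str]) -> bytes:
    """Concatenate one author's instance documents into a single RID.

    Assumption: documents are joined with a newline; the abstract does not
    specify a separator.
    """
    return "\n".join(instance_docs).encode("utf-8")


def attribute(unknown: str, rids: dict[str, bytes], distance=ncd) -> str:
    """Return the author whose RID is closest to the unknown text.

    Assumption: a nearest-distance rule; the abstract does not state the
    decision rule used in the paper.
    """
    query = unknown.encode("utf-8")
    return min(rids, key=lambda author: distance(rids[author], query))


if __name__ == "__main__":
    # Toy data, invented purely for illustration.
    rids = {
        "author_a": build_rid(["First column by A.", "Second column by A."]),
        "author_b": build_rid(["An essay by B.", "Another essay by B."]),
    }
    print(attribute("A disputed column.", rids, distance=cdm))
```

Both measures work directly on raw bytes, which is why no tokenization or feature extraction appears in the sketch; passing `distance=cdm` or `distance=ncd` switches between the two measures named in the abstract.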
