Journal: Journal of computer sciences

Dropping down the Maximum Item Set: Improving the Stylometric Authorship Attribution Algorithm in the Text Mining for Authorship Investigation

Abstract

Problem statement: Stylometric authorship attribution is a text-mining approach that analyzes texts, e.g., novels and plays by famous authors, and tries to measure an author's style by choosing attributes that reveal that style, on the assumption that each writer has a distinctive way of writing that no other writer shares; authorship attribution is thus the task of identifying the author of a given text. In this study, we propose an authorship attribution algorithm that improves the accuracy of the stylometric features of different professionals so that they can be discriminated nearly as well as the fingerprints of different persons, using authorship attributes. Approach: The main aim of this study is to build an algorithm that supports decision-making systems, enabling users to predict and choose the right author for a specific anonymous novel under consideration, by using a learning procedure that teaches the system the author's stylometric map so that it behaves like an expert opinion. Stylometric Authorship Attribution (AA) usually relies on the frequent word as the best available attribute. Many studies have sought other useful attributes, yet the frequent word still gives better results in research and experiments, and the best parameter and technique used so far remains the bag-of-words count with the maximum item set.
Results: To improve AA techniques, we need a new pack of attributes together with a new measurement tool. The first pack of attributes used in this study is the frequent pair, meaning a pair of words that always appear together. This attribute is clearly not new, but it has not been as successful as the frequent word: with maximum item set counters, the word pair made some mistakes, as the experimental results show. We improved the winnow algorithm by combining it with the computational approach, using the CV statistical tool as a conditional threshold for attribute selection. As a result, the frequent pair's error dropped from 50% to 0%, with a clearly higher score than the frequent word attribute. Conclusion/Recommendations: The improvement obtained with the new CV algorithm may enable the use of several new attributes that previously gave unsatisfactory results, which might open a direction for solving some hard cases that could not be solved until now.
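
The idea of selecting frequent pairs with a statistical threshold can be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' algorithm: the abstract does not define "CV", so the sketch assumes it means the coefficient of variation (standard deviation divided by the mean); it treats a "frequent pair" as two adjacent words; and it assumes that pairs with a low CV across an author's samples are the ones kept. The function names and the `max_cv` parameter are hypothetical.

```python
from collections import Counter
from statistics import mean, pstdev


def pair_counts(text):
    """Count adjacent word pairs in a lowercased, whitespace-split text."""
    words = text.lower().split()
    return Counter(zip(words, words[1:]))


def coefficient_of_variation(values):
    """Population standard deviation divided by the mean (0.0 if the mean is 0)."""
    m = mean(values)
    return pstdev(values) / m if m else 0.0


def select_stable_pairs(samples, max_cv):
    """Keep pairs whose relative frequency varies little across the samples.

    Assumption: 'CV' is the coefficient of variation and a low value marks
    a pair the author uses consistently. The abstract does not confirm this.
    """
    counts = [pair_counts(s) for s in samples]
    totals = [max(sum(c.values()), 1) for c in counts]
    all_pairs = set().union(*counts)
    selected = {}
    for pair in all_pairs:
        # Relative frequency of this pair in each sample (0 when absent).
        freqs = [c[pair] / t for c, t in zip(counts, totals)]
        cv = coefficient_of_variation(freqs)
        if cv <= max_cv:
            selected[pair] = cv
    return selected


if __name__ == "__main__":
    samples = [
        "of the sea and of the sky",
        "of the hill and of the road",
    ]
    for pair, cv in sorted(select_stable_pairs(samples, max_cv=0.5).items()):
        print(pair, round(cv, 3))
```

In this toy input, pairs that occur at the same rate in both samples pass the threshold, while pairs that occur in only one sample are dropped. A classifier such as winnow could then be trained on the surviving pairs only; how the study actually combines the two is not specified in the abstract.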