Dropping down the Maximum Item Set: Improving the Stylometric Authorship Attribution Algorithm in the Text Mining for Authorship Investigation

Tareef Kamil Mustafa; Norwati Mustapha; Masrah Azrifah Azmi; Nasir B. Sulaiman


Abstract

Problem statement: Stylometric authorship attribution is a text mining approach that analyzes texts, e.g., novels and plays by famous authors, and tries to measure an author's style by choosing attributes that reveal that author's way of writing, on the assumption that each writer has a distinctive style that no other writer shares. Authorship attribution is thus the task of identifying the author of a given text. In this study, we propose an authorship attribution algorithm that improves the accuracy of stylometric features so that different authors can be discriminated nearly as well as different persons can be by their fingerprints.

Approach: The main target of this study is to build an algorithm that supports decision-making systems, enabling users to predict and choose the right author for a specific anonymous novel under consideration. A learning procedure teaches the system the stylometric map of each author so that it can act as an expert opinion. Stylometric Authorship Attribution (AA) usually relies on the frequent word as the best available attribute. Many studies have sought other useful attributes, yet the frequent word still gives better results in research and experiments, and the best technique used so far remains bag-of-words counting with the maximum item set.

Results: To improve AA techniques, we need a new pack of attributes together with a new measurement tool. The first attribute used in this study is the frequent pair, meaning a pair of words that always appear together. This attribute is clearly not new, but it has not been as successful as the frequent word: with maximum item set counters, the word pair made some mistakes, as the experimental results show. We improve the winnow algorithm by combining it with the computational approach, using the CV statistical tool as a conditional threshold for attribute selection. With this change, the frequent pair result improved from 50% error to 0%, with a clearly higher score than the frequent word attribute.

Conclusion/Recommendations: The improvement obtained with the new CV algorithm may make it possible to use several attributes that previously gave unsatisfying results, which might open a direction for solving some hard cases that could not be solved until now.
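The idea of selecting frequent pairs through a CV threshold can be illustrated with a minimal sketch. It is not the authors' algorithm: the abstract does not expand "CV", so the code assumes it means the coefficient of variation (standard deviation divided by mean), and it treats a "frequent pair" as two adjacent words. The function names, the chunking of an author's text, and the cutoff `max_cv=0.5` are all hypothetical choices made for illustration.

```python
from collections import Counter
from statistics import mean, pstdev


def pair_counts(text):
    """Count adjacent word pairs in a text (assumed reading of 'frequent pair')."""
    words = text.lower().split()
    return Counter(zip(words, words[1:]))


def relative_freqs(chunks):
    """Return one {pair: relative frequency} table per text chunk."""
    tables = []
    for chunk in chunks:
        counts = pair_counts(chunk)
        total = sum(counts.values()) or 1  # avoid division by zero on empty chunks
        tables.append({pair: n / total for pair, n in counts.items()})
    return tables


def coefficient_of_variation(values):
    """Population standard deviation divided by the mean (assumed meaning of CV)."""
    m = mean(values)
    return pstdev(values) / m if m > 0 else float("inf")


def select_stable_pairs(chunks, max_cv=0.5):
    """Keep pairs whose frequency varies little across an author's chunks.

    max_cv is a hypothetical threshold, not a value taken from the paper.
    """
    tables = relative_freqs(chunks)
    all_pairs = set().union(*tables)
    selected = {}
    for pair in all_pairs:
        freqs = [table.get(pair, 0.0) for table in tables]
        cv = coefficient_of_variation(freqs)
        if cv <= max_cv:
            selected[pair] = cv
    return selected


chunks = [
    "the cat sat on the mat",
    "the cat ran on the mat",
    "the cat hid on the mat",
]
stable = select_stable_pairs(chunks)
print(sorted(stable))
```

Under these assumptions, pairs that an author uses at a steady rate across chunks survive the filter, while pairs that appear in only one chunk are dropped. The surviving pairs could then serve as input features for a classifier such as winnow.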


Bibliographic details

Source
Journal of computer sciences | 2010, Issue 3 | pp. 235-243 | 9 pages
Authors
Tareef Kamil Mustafa; Norwati Mustapha; Masrah Azrifah Azmi; Nasir B. Sulaiman
Author affiliations

Faculty of Computer Science and Information Technology, University Putra Malaysia, P.O. Box 43400, UPM Serdang, Selangor, Malaysia (all four authors)
Original format: PDF
Language: English
Keywords
text mining; stylometric attribution; authorship attribution; winnow algorithm; computational stylistic
Added to database: 2022-08-17 13:48:28

Similar literature

1. Dropping down the Maximum Item Set: Improving the Stylometric Authorship Attribution Algorithm in the Text Mining for Authorship Investigation | Science Publications [J]. Masrah A. Azmi, Nasir B. Sulaiman, Norwati Mustapha. Journal of computer sciences. 2010, Issue 3
2. Authorship Attribution of Short Historical Arabic Texts using Stylometric Features and a KNN Classifier with Limited Training Data [J]. Fatma Howedi, Masnizah Mohd, Zahra Aborawi Aborawi. Journal of computer sciences. 2020, Issue 10
3. Stylometric Features for Authorship Attribution of Polish Texts [C]. Piotr Szwed. International conference on artificial intelligence and soft computing. 2017
4. Stylometric Authorship Attribution Techniques and Analysis for Collaborative Platforms [D]. Dauber, Edwin George, Jr. 2020
5. Authorship attribution of source code by using back propagation neural network based on particle swarm optimization [O]. Xinyu Yang, Guoai Xu, Qi Li. 2011
6. Dropping down the Maximum Item Set: Improving the Stylometric Authorship Attribution Algorithm in the Text Mining for Authorship Investigation [O]. Tareef K. Mustafa, Norwati Mustapha, Masrah A. Azmi. 2010
