Conference: International Conference on Computational Linguistics
Authorship Attribution and Verification with Many Authors and Limited Data

Abstract

Most studies in statistical or machine-learning-based authorship attribution focus on two or a few authors. This leads to an overestimation of the importance of the features extracted from the training data and found to be discriminating for these small sets of authors. Most studies also use amounts of training data that are unrealistic for situations in which stylometry is applied (e.g., forensics), and thereby overestimate the accuracy of their approach in these situations. A more realistic interpretation of the task is as an authorship verification problem, which we approximate by pooling data from many different authors as negative examples. In this paper, we show, on the basis of a new corpus with 145 authors, how having many authors affects feature selection and learning, and we demonstrate the robustness of a memory-based learning approach to authorship attribution and verification with many authors and limited training data, compared to eager learning methods such as SVMs and maximum entropy learning.
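
The verification setup described in the abstract, one target author against a pool of negative examples drawn from many other authors, can be sketched with a small memory-based (nearest-neighbour) classifier. This is only an illustration under stated assumptions: the abstract does not specify the features, distance metric, or value of k, so the character trigram profiles, cosine distance, and k=1 below are hypothetical choices, and all function names are invented for this sketch.

```python
from collections import Counter
import math


def char_ngram_profile(text, n=3):
    """Relative frequencies of character n-grams (assumed feature set)."""
    grams = [text[i:i + n] for i in range(len(text) - n + 1)]
    total = len(grams)
    if total == 0:
        return {}
    return {g: c / total for g, c in Counter(grams).items()}


def cosine_distance(p, q):
    """1 - cosine similarity between two sparse profiles (assumed metric)."""
    dot = sum(v * q.get(g, 0.0) for g, v in p.items())
    norm_p = math.sqrt(sum(v * v for v in p.values()))
    norm_q = math.sqrt(sum(v * v for v in q.values()))
    if norm_p == 0 or norm_q == 0:
        return 1.0
    return 1.0 - dot / (norm_p * norm_q)


def make_verification_memory(corpus, target_author):
    """Store every training text as (profile, label).

    The label is True for the target author; texts by all other
    authors are pooled together as negative examples (label False).
    """
    return [
        (char_ngram_profile(text), author == target_author)
        for author, text in corpus
    ]


def verify(memory, text, k=1):
    """Majority vote over the k stored examples nearest to the query."""
    query = char_ngram_profile(text)
    nearest = sorted(memory, key=lambda ex: cosine_distance(ex[0], query))[:k]
    positive_votes = sum(1 for _, label in nearest if label)
    return positive_votes * 2 > len(nearest)
```

A call such as `verify(make_verification_memory(corpus, "a"), some_text)` returns whether the nearest stored examples belong to author "a" rather than to the pooled negatives. Because a memory-based learner keeps all training instances instead of abstracting a model from them, adding authors only adds stored examples; nothing has to be refit.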
