Conference: 22nd International Conference on Computational Linguistics

Authorship Attribution and Verification with Many Authors and Limited Data

Abstract

Most studies of statistical or machine-learning-based authorship attribution focus on two or a few authors. This leads to an overestimation of the importance of features that are extracted from the training data and found to be discriminating for these small sets of authors. Most studies also use amounts of training data that are unrealistic for the situations in which stylometry is applied (e.g., forensics), and thereby overestimate the accuracy of their approach in these situations. A more realistic interpretation of the task is as an authorship verification problem, which we approximate by pooling data from many different authors as negative examples. In this paper, we show, on the basis of a new corpus with 145 authors, how a large number of authors affects feature selection and learning. We also show that a memory-based learning approach is robust in authorship attribution and verification with many authors and limited training data, compared to eager learning methods such as SVMs and maximum entropy learning.
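
The verification setup described in the abstract (one target author against a pool of negative examples drawn from many other authors) can be illustrated with a small sketch. Everything below is an assumption made for illustration and is not taken from the paper: the character trigram features, the cosine similarity, the value of k, the toy texts, and the function names `char_ngrams`, `cosine`, `build_verification_set`, and `knn_verify`. A plain k-nearest-neighbour vote stands in here for memory-based learning in general, since it stores the training examples and defers all generalization to query time.

```python
from collections import Counter
import math


def char_ngrams(text, n=3):
    """Count overlapping character n-grams (an assumed, illustrative feature set)."""
    text = text.lower()
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(v * b[k] for k, v in a.items() if k in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_verification_set(docs_by_author, target):
    """Label the target author's texts 1 and pool all other authors' texts as 0."""
    return [
        (char_ngrams(text), 1 if author == target else 0)
        for author, texts in docs_by_author.items()
        for text in texts
    ]


def knn_verify(train, text, k=3):
    """Return True if most of the k most similar stored examples are by the target."""
    query = char_ngrams(text)
    ranked = sorted(train, key=lambda ex: cosine(query, ex[0]), reverse=True)
    votes = [label for _, label in ranked[:k]]
    return sum(votes) * 2 > len(votes)


# Toy data, invented for this sketch only.
docs = {
    "author_a": ["the cat sat on the mat", "the cat ate the rat", "the cat sat on the hat"],
    "author_b": ["quantum flux xylophone zebra", "zebra xylophone quiz jazz"],
    "author_c": ["buy low sell high profit", "profit margin high yield"],
}
train = build_verification_set(docs, "author_a")
print(knn_verify(train, "the cat sat on the rat"))
print(knn_verify(train, "jazz quiz zebra flux"))
```

Because the negative class pools every non-target author, it grows with the number of authors while the positive class stays small. That imbalance is part of what makes the many-authors, limited-data setting harder than the two-author case.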