Conference: World Congress on Engineering

Mining Online Diaries for Blogger Identification

Abstract

In this paper, we present an investigation of authorship identification on personal blogs or diaries, which differ in their textual properties from other types of text such as essays, emails, or articles. The investigation uses a couple of intuitive feature sets and studies various parameters that affect identification performance. Many studies have addressed authorship identification on manually collected corpora, but only a few have used real data from existing blogs. The complexity of the language model in personal blogs motivates the task of identifying the corresponding author. The main contribution of this work is at least threefold. First, we use the LIWC and MRC feature sets together, both developed with a background in psychology, for the first time for authorship identification on personal blogs. Second, we analyze the effect of various parameters and feature sets on identification performance, including the number of authors in the data corpus, the post size or word count, and the number of posts per author. Finally, we study authorship identification over a limited set of users who share common personality attributes. This analysis is motivated by the lack of standard or solid recommendations in the literature for such a task, especially in the domain of personal blogs. The results and evaluation show that the features used are compact while their performance is highly comparable to that of other, larger feature sets. The analysis also confirms the most effective parameters, their ranges in the data corpus, and the usefulness of the common-users classifier in improving performance on the author identification task.
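The kind of parameter study described in the abstract (varying the number of authors and the number of posts per author) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the abstract does not name a classifier, so a nearest-centroid rule with leave-one-out evaluation is used here, and because LIWC and MRC are licensed resources, a handful of function-word rates plus average word length serve as hypothetical stand-in features. The toy corpus and all function names are likewise invented.

```python
import math

# Hypothetical stand-in features: NOT the LIWC or MRC feature sets.
FUNCTION_WORDS = ("i", "the", "and", "to", "of", "my", "you", "it")

def extract_features(post):
    """Map a post to function-word rates plus a scaled average word length."""
    words = [w.strip(".,!?;:\"'") for w in post.lower().split()]
    words = [w for w in words if w]
    n = max(len(words), 1)
    rates = [sum(1 for w in words if w == fw) / n for fw in FUNCTION_WORDS]
    avg_len = sum(len(w) for w in words) / n
    return rates + [avg_len / 10.0]

def centroid(vectors):
    """Component-wise mean of a non-empty list of feature vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def distance(a, b):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def leave_one_out_accuracy(posts_by_author):
    """Hold out each post in turn and assign it to the nearest author centroid."""
    feats = {a: [extract_features(p) for p in ps]
             for a, ps in posts_by_author.items()}
    correct = total = 0
    for author, vecs in feats.items():
        for i, held_out in enumerate(vecs):
            centroids = {}
            for other, other_vecs in feats.items():
                # Exclude the held-out post from its own author's training set.
                train = [v for j, v in enumerate(other_vecs)
                         if other != author or j != i]
                if train:
                    centroids[other] = centroid(train)
            if centroids:
                predicted = min(centroids,
                                key=lambda a: distance(centroids[a], held_out))
                correct += predicted == author
            total += 1
    return correct / total if total else 0.0

def subsample(posts_by_author, n_authors, posts_per_author):
    """Restrict the corpus to the first n_authors and posts_per_author posts."""
    authors = sorted(posts_by_author)[:n_authors]
    return {a: posts_by_author[a][:posts_per_author] for a in authors}

# Invented toy corpus, only to make the sketch runnable.
corpus = {
    "author_a": ["I went to the market and I bought my bread.",
                 "I think I left my keys at home again.",
                 "My day was long and I am tired of it."],
    "author_b": ["The committee reviewed the proposal of the board.",
                 "The analysis of the results confirmed the hypothesis.",
                 "The structure of the report follows the guidelines."],
    "author_c": ["You know it, you love it, you want it!",
                 "Do you ever wonder what it means to you?",
                 "It is what it is, and you get used to it."],
}

for n_authors in (2, 3):
    for posts_per_author in (2, 3):
        subset = subsample(corpus, n_authors, posts_per_author)
        acc = leave_one_out_accuracy(subset)
        print(f"authors={n_authors} posts/author={posts_per_author} accuracy={acc:.2f}")
```

On a real blog corpus the same loop would be run over far larger ranges of authors, posts, and post lengths; the toy data here says nothing about the paper's actual results.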