Mining Online Diaries for Blogger Identification

Abstract

In this paper, we present an investigation of authorship identification on personal blogs or diaries, whose text properties differ from those of other text types such as essays, emails, or articles. The investigation uses a couple of intuitive feature sets and studies various parameters that affect identification performance. Many studies have addressed authorship identification on manually collected corpora, but only a few have used real data from existing blogs. The complexity of the language in personal blogs makes identifying the corresponding author a motivating problem. The main contribution of this work is at least threefold. First, we use the LIWC and MRC feature sets together, both of which were developed with a background in psychology, for the first time for authorship identification on personal blogs. Second, we analyze the effect of various parameters and feature sets on identification performance. The parameters include the number of authors in the data corpus, the post size (word count), and the number of posts per author. Finally, we study authorship identification over a limited set of users who share common personality attributes. This analysis is motivated by the lack of standard or solid recommendations in the literature for this task, especially in the domain of personal blogs. The results and evaluation show that the features used are compact, while their performance is highly comparable with that of other, larger feature sets. The analysis also confirms the most effective parameters, their ranges in the data corpus, and the usefulness of the common-users classifier in improving performance on the author identification task.
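The three corpus parameters named in the abstract (number of authors, post size in words, and posts per author) can be illustrated with a small subsampling helper. This is a minimal sketch, not the authors' procedure: the function name `subsample_corpus`, the dict-of-lists corpus layout, and the whitespace-based word count are all assumptions made for illustration.

```python
import random


def subsample_corpus(posts_by_author, n_authors, min_words, posts_per_author, seed=0):
    """Build a reduced corpus controlled by three parameters.

    posts_by_author: dict mapping an author name to a list of post strings
                     (hypothetical layout, assumed for this sketch).
    n_authors:        how many authors to keep.
    min_words:        minimum post size, counted by whitespace splitting.
    posts_per_author: how many posts to keep for each selected author.
    """
    rng = random.Random(seed)

    # Keep only posts that meet the size threshold, and only authors
    # who still have enough posts afterwards.
    eligible = {}
    for author, posts in posts_by_author.items():
        long_enough = [p for p in posts if len(p.split()) >= min_words]
        if len(long_enough) >= posts_per_author:
            eligible[author] = long_enough

    if len(eligible) < n_authors:
        raise ValueError("not enough authors satisfy the requested parameters")

    # Sort before sampling so the result depends only on the seed.
    chosen = rng.sample(sorted(eligible), n_authors)
    return {a: rng.sample(eligible[a], posts_per_author) for a in chosen}
```

Calling the helper over a grid of values for the three parameters would yield one corpus per setting, on which any classifier and feature set could then be compared.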

Bibliographic details

Source: World Congress on Engineering, 2009, 8 pages
Authors: Haytham Mohtasseb; Amr Ahmed
Original format: PDF
Chinese Library Classification: T-53
Keywords: Web Mining; Information Extraction; Psycholinguistic; Machine Learning; Authorship Identification

Similar literature

1. PIB: Profiling Influential Blogger in Online Social Networks, A Knowledge Driven Data Mining Approach [J]. G.U. Vasanthakumar, Bagul Prajakta, P. Deepa Shenoy. Procedia Computer Science. 2015, No. 1
2. Understanding the Different Types of Social Support Offered by Audience to A-List Diary-Like and Informative Bloggers [J]. Hsiu-Chia Ko, Li-Ling Wang, Yi-Ting Xu. Cyberpsychology, Behavior, and Social Networking. 2013, No. 3
3. For more events and to book online, please visit www.rsm.ac.uk/diary [J]. Journal of the Royal Society of Medicine. 2020, No. 3
4. Mining Online Diaries for Blogger Identification [C]. Haytham Mohtasseb, Amr Ahmed. World Congress on Engineering. 2009
5. The secret world of women bloggers: A feminist exploration of the Internet diary writing practices of Canadian women [D]. Prior, Elvira M. 2005
6. Understanding the Different Types of Social Support Offered by Audience to A-List Diary-Like and Informative Bloggers [O]. Hsiu-Chia Ko, Li-Ling Wang, Yi-Ting Xu
7. Mining online diaries for blogger identification [O]. Mohtasseb Haytham, Ahmed Amr. 2009
