Applying clustering algorithms to determine authorship of Chinese Twitter messages

Abstract

Authorship attribution research on character-based languages such as Chinese is still in its early stages. In this paper, we study the effectiveness of two popular clustering algorithms in determining the authorship of Chinese Twitter messages. We create a dataset of ten authors with 100 tweets each from publicly available Chinese Twitter profiles. We analyze the data using simple k-means (SKM) and Expectation Maximization (EM), two popular clustering algorithms available in the Waikato Environment for Knowledge Analysis (WEKA). Our feature set includes character n-grams and Chinese function words derived from the literature. We achieve accuracy of up to 44.53% for three authors, 29.24% for five authors, and 20.52% for ten authors. For our datasets and the numbers of authors we compared, SKM returns better accuracy. Lastly, we determine that function words are valuable features in attributing Chinese tweets, and identify which of these Chinese function words were of most value.
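
The feature extraction and clustering described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' WEKA pipeline: the function names, the toy tweets, the two function words, the `top_k` cutoff, and the majority-vote accuracy measure are all assumptions made for illustration, and only k-means is shown (EM is omitted).

```python
import random
from collections import Counter


def char_ngrams(text, n):
    """Count overlapping character n-grams; no word segmentation is needed."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def featurize(texts, n, function_words, top_k=50):
    """Map each text to relative frequencies of common n-grams and function words."""
    per_text = [char_ngrams(t, n) for t in texts]
    totals = Counter()
    for counts in per_text:
        totals.update(counts)
    vocab = [gram for gram, _ in totals.most_common(top_k)]
    vectors = []
    for text, counts in zip(texts, per_text):
        length = max(len(text), 1)  # avoid division by zero on empty text
        ngram_part = [counts[gram] / length for gram in vocab]
        word_part = [text.count(word) / length for word in function_words]
        vectors.append(ngram_part + word_part)
    return vectors


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(vectors, k, iters=20, seed=0):
    """Plain Lloyd's k-means; returns one cluster index per input vector."""
    rng = random.Random(seed)
    centroids = [list(v) for v in rng.sample(vectors, k)]
    assignments = [0] * len(vectors)
    for _ in range(iters):
        assignments = [
            min(range(k), key=lambda c: sq_dist(v, centroids[c]))
            for v in vectors
        ]
        for c in range(k):
            members = [v for v, a in zip(vectors, assignments) if a == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignments


def cluster_accuracy(authors, assignments):
    """Label each cluster with its majority author and score the matches.

    This is one simple scoring choice (an assumption), not necessarily
    the evaluation used in the paper.
    """
    by_cluster = {}
    for author, cluster in zip(authors, assignments):
        by_cluster.setdefault(cluster, Counter())[author] += 1
    correct = sum(c.most_common(1)[0][1] for c in by_cluster.values())
    return correct / len(authors)


if __name__ == "__main__":
    # Toy data invented for illustration only.
    tweets = ["今天的天气真的很好", "我的朋友的书的封面很好", "他去了学校了", "我们吃了饭了"]
    authors = ["A", "A", "B", "B"]
    vectors = featurize(tweets, n=2, function_words=["的", "了"])
    print(cluster_accuracy(authors, kmeans(vectors, k=2)))
```

Character n-grams are convenient for Chinese because they sidestep word segmentation, and dividing counts by the text length keeps short and long tweets comparable.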

Bibliographic details

Source
IEEE MIT Undergraduate Research Technology Conference | 2016 | 118p | 4 pages
Authors
Jinny Yan; Suzanne J. Matthews
Original format: PDF
Chinese Library Classification: TP3-53
Keywords
Twitter; Clustering algorithms; Writing; Government; Libraries; Feature extraction; Data collection
