Conference: IEEE MIT Undergraduate Research Technology Conference

Applying Clustering Algorithms to Determine Authorship of Chinese Twitter Messages


Abstract

Authorship attribution research on character-based languages such as Chinese is still in its early stages. In this paper, we study the effectiveness of two popular clustering algorithms in determining the authorship of Chinese Twitter messages. We create a dataset of ten authors with 100 tweets each, drawn from publicly available Chinese Twitter profiles. We analyze the data using simple k-means (SKM) and Expectation Maximization (EM), two popular clustering algorithms available in the Waikato Environment for Knowledge Analysis (WEKA). Our feature set includes character n-grams and Chinese function words derived from the literature. We achieve accuracy of up to 44.53% for three authors, 29.24% for five authors, and 20.52% for ten authors. For our datasets and the numbers of authors we compared, SKM achieves higher accuracy than EM. Lastly, we determine that function words are valuable features for attributing Chinese tweets, and we identify which of these function words were of most value.
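The pipeline the abstract describes (character n-gram and function-word features, then clustering, then an accuracy score against known authors) can be illustrated with a minimal sketch. This is not the authors' WEKA setup: it is a plain-Python k-means written for illustration, the helper names (`char_ngrams`, `featurize`, `kmeans`, `cluster_accuracy`) and parameters are hypothetical, any function words passed in are placeholders rather than the paper's list, and the accuracy measure here (each cluster is credited with its majority author) is an assumption that may differ from the scoring used in the paper.

```python
from collections import Counter
import random


def char_ngrams(text, n=2):
    """Return the overlapping character n-grams of a string."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def featurize(tweets, function_words, n=2, top_k=50):
    """Turn each tweet into a numeric vector.

    Features: relative frequency of the corpus's top_k character n-grams,
    followed by per-character frequency of each function word.
    The function-word list is supplied by the caller (illustrative only).
    """
    corpus_counts = Counter(g for t in tweets for g in char_ngrams(t, n))
    vocab = [g for g, _ in corpus_counts.most_common(top_k)]
    vectors = []
    for t in tweets:
        grams = Counter(char_ngrams(t, n))
        total = max(sum(grams.values()), 1)
        vec = [grams[g] / total for g in vocab]
        vec += [t.count(w) / max(len(t), 1) for w in function_words]
        vectors.append(vec)
    return vectors


def kmeans(vectors, k, iters=20, seed=0):
    """Basic k-means with squared Euclidean distance; returns cluster labels."""
    rng = random.Random(seed)
    centroids = [list(v) for v in rng.sample(vectors, k)]
    labels = [0] * len(vectors)
    for _ in range(iters):
        # Assignment step: each vector goes to its nearest centroid.
        labels = [
            min(range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(v, centroids[c])))
            for v in vectors
        ]
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [v for v, l in zip(vectors, labels) if l == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def cluster_accuracy(labels, authors):
    """Credit each cluster with its majority author; return the matched fraction."""
    correct = 0
    for c in set(labels):
        members = [a for l, a in zip(labels, authors) if l == c]
        correct += Counter(members).most_common(1)[0][1]
    return correct / len(authors)
```

With a list of tweets and a parallel list of author names, one would call `featurize`, pass the vectors to `kmeans` with `k` set to the number of authors, and score the resulting labels with `cluster_accuracy`. An EM-based comparison would need a mixture-model implementation, which this sketch omits.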
