Journal: ACM Transactions on Asian Language Information Processing

Arabic Authorship Attribution: An Extensive Study on Twitter Posts


Abstract

Law enforcement faces problems in tracing the true identity of offenders in cybercrime investigations. Most offenders mask their true identity, impersonate people of high authority, or use identity deception and obfuscation tactics to avoid detection and traceability. To address the problem of anonymity, authorship analysis is used to identify individuals by their writing styles without knowing their actual identities. Most authorship studies are dedicated to English due to its widespread use over the Internet, but recent cyber-attacks such as the distribution of Stuxnet indicate that Internet crimes are not limited to a certain community, language, culture, ideology, or ethnicity. To effectively investigate cybercrime and to address the problem of anonymity in online communication, there is a pressing need to study authorship analysis of languages such as Arabic, Chinese, Turkish, and so on. Arabic, the focus of this study, is the fourth most widely used language on the Internet. This study investigates authorship of Arabic discourse/text, especially very short texts such as Twitter posts. We benchmark the performance of a profile-based approach that uses n-grams as features and compare it with state-of-the-art instance-based classification techniques. Then we adapt an event-visualization tool that was developed for English to accommodate both Arabic and English and to visualize the attribution evidence. In addition, we investigate the relative effect of the training set, the length of tweets, and the number of authors on authorship classification accuracy. Finally, we show that diacritics have an insignificant effect on the attribution process and that part-of-speech tags are less effective than character-level and word-level n-grams.
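
The profile-based idea mentioned in the abstract can be illustrated with a minimal sketch. It assumes character trigrams, a profile of the 200 most frequent n-grams per author, and a simple relative-difference dissimilarity; none of these choices comes from the abstract, and the function names (char_ngrams, build_profile, dissimilarity, attribute) are hypothetical. It is an illustration of the general technique, not the study's implementation.

```python
from collections import Counter

def char_ngrams(text, n=3):
    """Return all overlapping character n-grams of the text."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def build_profile(texts, n=3, size=200):
    """Pool all texts of one author and keep the `size` most frequent
    n-grams with their relative frequencies (assumed profile definition)."""
    counts = Counter()
    for t in texts:
        counts.update(char_ngrams(t, n))
    total = sum(counts.values())
    if total == 0:
        return {}
    return {g: c / total for g, c in counts.most_common(size)}

def dissimilarity(p, q):
    """Sum of squared relative differences over the union of two profiles.
    An n-gram missing from one profile counts with frequency 0."""
    score = 0.0
    for g in set(p) | set(q):
        a, b = p.get(g, 0.0), q.get(g, 0.0)
        score += ((a - b) / ((a + b) / 2)) ** 2
    return score

def attribute(unknown_text, author_texts, n=3, size=200):
    """Return the candidate author whose profile is closest to the text."""
    profiles = {a: build_profile(ts, n, size) for a, ts in author_texts.items()}
    target = build_profile([unknown_text], n, size)
    return min(profiles, key=lambda a: dissimilarity(profiles[a], target))

# Toy data, purely illustrative (not from the study).
toy_authors = {
    "A": ["aaaa bbbb aaaa", "aaab aaaa"],
    "B": ["xyz xyz xyz", "zyx xyz"],
}
print(attribute("aaaa aab", toy_authors))
```

Because Python strings are Unicode, the same sketch applies to Arabic text unchanged. An instance-based approach would instead treat each tweet as its own training example for a classifier rather than pooling an author's tweets into one profile.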
