International Conference on Advances in ICT for Emerging Regions
A framework for automated corpus compilation for KeyXtract: Twitter model

Abstract

The corpus is a limiting factor in any keyword extraction process that includes a word-matching stage. This paper proposes a framework that automates the corpus generation stage required by the Twitter Model of KeyXtract, an algorithm for extracting essential keywords from tweets. The original algorithm was designed around two manually compiled corpora, which limited the adaptability of the system. The automated framework proposed here extends the keyword extraction process of KeyXtract to address this limitation. It takes the open-class words of the source text and matches them against a bag of words compiled by analyzing the tweets. The automated corpus contained 138 words, 74 of which also appeared in the handpicked corpus of 206 words. When the automated corpus was used with the keyword extraction system, however, the average F1 score of the system decreased by 0.07, indicating that the automated corpus cannot match the human-made corpus in complexity. The reason is that the human-made corpus was compiled using syntactic, semantic and pragmatic features, whereas the automated framework considered only syntactic features. For some individual tweets the F1 score nevertheless increased, so this is a promising first step in the corpus automation process. The framework could be made more accurate by adding semantic analysis of the lexical items. In its present form, it goes a considerable way towards addressing the corpus compilation limitation of the Twitter Model of KeyXtract.
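The matching and evaluation steps described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not give the matching rule, so the `min_count` threshold, the small `CLOSED_CLASS` word list (standing in for a part-of-speech tagger) and all function names are assumptions made for illustration only.

```python
import re
from collections import Counter

# Assumption: a tiny list of closed-class (function) words stands in for a
# real part-of-speech tagger; everything not listed is treated as open-class.
CLOSED_CLASS = {
    "a", "an", "the", "and", "or", "but", "of", "in", "on", "at", "to",
    "for", "with", "after", "is", "are", "was", "were", "it", "this", "that",
}


def tokenize(text):
    """Lowercase the text and return its alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def open_class_words(text):
    """Return the set of tokens that are not in the closed-class list."""
    return {tok for tok in tokenize(text) if tok not in CLOSED_CLASS}


def bag_of_words(tweets):
    """Count how often each token occurs across all tweets."""
    counts = Counter()
    for tweet in tweets:
        counts.update(tokenize(tweet))
    return counts


def compile_corpus(source_text, tweets, min_count=2):
    """Keep open-class source words seen at least min_count times in tweets.

    The threshold is a hypothetical matching rule, not taken from the paper.
    """
    bag = bag_of_words(tweets)
    return {w for w in open_class_words(source_text) if bag[w] >= min_count}


def overlap(automated, handpicked):
    """Return (shared word count, share of the automated corpus that is shared)."""
    shared = len(automated & handpicked)
    ratio = shared / len(automated) if automated else 0.0
    return shared, ratio


def f1_score(predicted, gold):
    """Set-based F1 between predicted keywords and gold keywords."""
    true_positives = len(predicted & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    source = "The storm flooded the city and closed the airport"
    tweets = [
        "Storm warning for the city",
        "city airport closed after storm",
        "stay safe",
    ]
    corpus = compile_corpus(source, tweets)
    print(sorted(corpus))
    print(overlap(corpus, {"storm", "flood"}))
    print(f1_score(corpus, {"storm", "flood", "airport"}))
```

Applied to the figures reported in the abstract, the same overlap calculation gives 74/138, or roughly 54% of the automated corpus, and 74/206, or roughly 36% of the handpicked corpus.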
