KPSpotter: A Flexible Information Gain-based Keyphrase Extraction System


Abstract

To tackle the issue of information overload, we present an Information Gain-based KeyPhrase Extraction System called KPSpotter. KPSpotter is a flexible, web-enabled keyphrase extraction system capable of processing various formats of input data, including web data, and of generating both the extraction model and the list of keyphrases in XML. Two features are used for training and for extracting keyphrases: 1) TF*IDF and 2) Distance from First Occurrence. Input training and testing collections are processed in three stages: 1) Data Cleaning, 2) Data Tokenizing, and 3) Data Discretizing. To measure system performance, the keyphrases extracted by KPSpotter are compared with those assigned by the authors. Our experiments show that KPSpotter performs equivalently to KEA, a well-known keyphrase extraction system. KPSpotter, however, differs from other extraction systems in the following ways. First, KPSpotter employs a new keyphrase extraction technique that combines the Information Gain data mining measure with several Natural Language Processing techniques such as stemming and case-folding. Second, KPSpotter can process various types of input data, such as XML, HTML, and unstructured text, and generates XML output. Third, the user can provide input data and execute KPSpotter over the Internet. Fourth, for reasons of efficiency and performance, KPSpotter stores candidate keyphrases and their related information, such as frequency and stemmed form, in an embedded database management system.
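The abstract names the two features and the Information Gain measure but gives no formulas. The following is a minimal sketch under stated assumptions, not KPSpotter's actual implementation: it uses the KEA-style definitions of TF*IDF and first-occurrence distance, and the standard information gain of a discretized feature with respect to a keyphrase/non-keyphrase label. All function names and parameters are hypothetical, and the tokens are assumed to be already case-folded and stemmed.

```python
import math
from collections import Counter


def count_occurrences(phrase, tokens):
    """Count how often the token tuple `phrase` occurs in `tokens`."""
    n = len(phrase)
    return sum(
        1 for i in range(len(tokens) - n + 1) if tuple(tokens[i:i + n]) == phrase
    )


def tf_idf(phrase, tokens, docs_containing, num_docs):
    """Assumed KEA-style TF*IDF: relative frequency in the document times
    log2(num_docs / docs_containing)."""
    tf = count_occurrences(phrase, tokens) / len(tokens)
    idf = math.log2(num_docs / docs_containing)
    return tf * idf


def first_occurrence_distance(phrase, tokens):
    """Assumed definition: number of tokens before the first occurrence,
    divided by document length. Returns None if the phrase is absent."""
    n = len(phrase)
    for i in range(len(tokens) - n + 1):
        if tuple(tokens[i:i + n]) == phrase:
            return i / len(tokens)
    return None


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum(
        (c / total) * math.log2(c / total) for c in Counter(labels).values()
    )


def information_gain(feature_bins, labels):
    """Information gain of a discretized feature with respect to the labels:
    H(labels) minus the weighted entropy of labels within each bin."""
    total = len(labels)
    by_bin = {}
    for b, y in zip(feature_bins, labels):
        by_bin.setdefault(b, []).append(y)
    conditional = sum(len(ys) / total * entropy(ys) for ys in by_bin.values())
    return entropy(labels) - conditional
```

For instance, with hypothetical data, `information_gain(["low", "low", "high", "high"], [False, False, True, True])` measures how well a two-bin discretization of one feature separates keyphrases from non-keyphrases. How KPSpotter combines the gain values of its two features into a ranking is not stated in the abstract and is left out here.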
