Conference: International Conference on Electrical Engineering/Electronics, Computer, Telecommunications and Information Technology

A Corpus-based Approach for Keyword Identification using Supervised Learning Techniques

Abstract

This paper presents a corpus-based approach for extracting keywords from text written in a language that has no explicit word boundaries. Based on the concept of the Thai character cluster (TCC), running Thai text is first segmented into a sequence of inseparable units, called TCCs. To handle large-scale text, a sorted sistring (or suffix array) is used to calculate a number of statistics for each TCC. Using these statistics, we apply three alternative supervised machine learning techniques (naive Bayes, centroid-based, and k-NN) to learn classifiers for keyword identification. Our method is evaluated on medical text extracted from the WWW. The results show that k-NN achieves the highest performance, with 79.5% accuracy.
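As a rough illustration of the suffix-array step only, the following minimal Python sketch sorts all suffixes of an already segmented sequence of units and counts how often a unit sequence occurs using binary search. It is not the authors' implementation: the function names, the choice of raw frequency as the statistic, and the placeholder units (Latin strings standing in for TCCs) are all assumptions made for the example, and TCC segmentation itself is taken as given.

```python
def build_suffix_array(units):
    """Return the start positions of all suffixes of `units`, sorted lexicographically."""
    return sorted(range(len(units)), key=lambda i: units[i:])


def _bound(units, suffix_array, pattern, upper):
    """Binary search for the first suffix whose prefix is >= pattern
    (or > pattern when `upper` is True)."""
    lo, hi = 0, len(suffix_array)
    n = len(pattern)
    while lo < hi:
        mid = (lo + hi) // 2
        start = suffix_array[mid]
        prefix = units[start:start + n]
        if prefix < pattern or (upper and prefix == pattern):
            lo = mid + 1
        else:
            hi = mid
    return lo


def count_occurrences(units, suffix_array, pattern):
    """Count occurrences of `pattern` (a list of units) in `units`."""
    first = _bound(units, suffix_array, pattern, upper=False)
    last = _bound(units, suffix_array, pattern, upper=True)
    return last - first


# Hypothetical placeholder units; real input would be TCCs from a Thai text.
units = ["ab", "c", "ab", "c", "d"]
suffix_array = build_suffix_array(units)
freq_ab_c = count_occurrences(units, suffix_array, ["ab", "c"])
freq_x = count_occurrences(units, suffix_array, ["x"])
```

Because all suffixes that share a prefix sit next to each other in the sorted order, each frequency lookup needs only two binary searches. The naive sort used here is for readability; a large corpus would need a more efficient suffix-array construction.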