Quality evaluation of topics identification algorithms.

Abstract

The need for effective text retrieval tools, such as search engines, is omnipresent in the corporate marketplace and the defence industry alike. The task of indexing large quantities of text from various sources, such as news and social media, is too large to be accomplished by humans alone. Automatically identifying keywords, or topics, in unstructured text is therefore an important challenge. Extensive computational experiments were conducted using three topic identification methods: the Retrieval Activation and Decay (ReAD) algorithm, the Priming Activation Indexing (PAI) algorithm, and the Term Frequency-Inverse Document Frequency (TFIDF) method. These experiments used a subset of the well-known Reuters financial dataset. Their purpose was to identify the parameters that return higher-quality topics according to several well-known topic quality evaluation measures: F1, precision, recall, and Normalized Mutual Information (NMI). Two novel evaluation measures were also proposed: Simple Match Five (SM5) and Expanded Match Five (EM5). Results were generated using the parameters that returned high-quality topics according to the different computational measures, and an online survey with volunteer evaluators was conducted to validate these results. The parameters that yielded higher topic quality were inconsistent from one measure to the next. For the chosen parameters, human evaluation found that TFIDF produced higher-quality topics than PAI, and PAI produced higher-quality topics than ReAD. Neither the proposed measures nor the established F1 measure was found to be an adequate indicator of topic quality.

Keywords: Topics Identification, Topics Evaluation, Topics Quality.
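As an illustration of the set-based measures named in the abstract, the following minimal sketch computes precision, recall, and F1 for a list of extracted topics against a reference list. It assumes exact, case-insensitive string matching between topics; the function name `precision_recall_f1` and the sample topics are hypothetical, and the thesis may define matching differently.

```python
def precision_recall_f1(predicted, reference):
    """Score extracted topics against reference topics.

    Assumption: two topics match only if they are identical after
    lowercasing. This is an illustrative choice, not the thesis's method.
    """
    predicted_set = {topic.lower() for topic in predicted}
    reference_set = {topic.lower() for topic in reference}
    matches = len(predicted_set & reference_set)

    # Precision: share of extracted topics that appear in the reference.
    precision = matches / len(predicted_set) if predicted_set else 0.0
    # Recall: share of reference topics that were extracted.
    recall = matches / len(reference_set) if reference_set else 0.0
    # F1: harmonic mean of precision and recall (0 when both are 0).
    if precision + recall == 0:
        f1 = 0.0
    else:
        f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical example: four extracted topics, two reference topics.
p, r, f1 = precision_recall_f1(["oil", "OPEC", "prices", "gold"], ["oil", "opec"])
print(f"precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
```

A score like this rewards only literal overlap with the reference labels, which is one plausible reason such measures can disagree with human judgments of topic quality.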

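Bibliographic details

  • Author affiliation: Royal Military College of Canada (Canada)
  • Degree-granting institution: Royal Military College of Canada (Canada)
  • Subject: Computer science; Information science
  • Degree: M.Sc.
  • Year: 2013
  • Pages: 150 p.
  • Total pages: 150
  • Original format: PDF
  • Language of text: English
  • Chinese Library Classification:
  • Keywords: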
