Conference: International Conference on Research and Innovation in Information Systems

Improving multi-term topics focused crawling by introducing term Frequency-Information Content (TF-IC) measure

Abstract

With the rapid growth of the Internet, finding desired information has become a challenging and time-consuming task. Focused crawlers address this problem by mining the Web for pages closely relevant to the desired information, and a variety of methods have been devised and implemented for this purpose. Nonetheless, most of these methods do not favor the more informative terms in a given multi-term topic. In this paper, we propose a new measure, Term Frequency-Information Content (TF-IC), that prioritizes the terms of a multi-term topic according to how informative they are. We experimentally compare our measure against the Term Frequency-Inverse Document Frequency (TF-IDF) and Latent Semantic Indexing (LSI) measures as applied in focused crawlers. The results indicate that TF-IC collects more relevant web pages than both TF-IDF and LSI, for general as well as specialized multi-term topics.
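The abstract does not give the TF-IC formula, so the following is only a rough sketch of the idea of weighting topic terms by how informative they are. It assumes the common definition of information content, IC(t) = -log P(t), with P(t) estimated from a reference corpus; the function names, the add-one smoothing, the toy corpus counts, and the two sample pages are all hypothetical and are not taken from the paper. A smoothed TF-IDF score is included only for contrast.

```python
import math
from collections import Counter


def term_frequencies(tokens):
    """Relative frequency of each token within one page."""
    counts = Counter(tokens)
    total = len(tokens)
    return {t: c / total for t, c in counts.items()} if total else {}


def idf(term, pages):
    """Smoothed inverse document frequency over a small page collection."""
    df = sum(1 for page in pages if term in page)
    return math.log((1 + len(pages)) / (1 + df)) + 1.0


def information_content(term, corpus_counts, corpus_total):
    """Assumed IC: -log of the add-one smoothed corpus probability of the term."""
    vocab_size = len(corpus_counts)
    p = (corpus_counts.get(term, 0) + 1) / (corpus_total + vocab_size)
    return -math.log(p)


def score(page_tokens, topic_terms, weight):
    """Sum of TF times a per-term weight over the topic terms."""
    tf = term_frequencies(page_tokens)
    return sum(tf.get(t, 0.0) * weight(t) for t in topic_terms)


# Hypothetical reference corpus: "information" is common, "retrieval" is rare.
corpus_counts = {"information": 900, "retrieval": 20, "about": 1500, "the": 5000}
corpus_total = sum(corpus_counts.values())

topic = ["information", "retrieval"]
page_a = ["retrieval", "models", "rank", "retrieval", "results"]
page_b = ["information", "about", "travel", "information", "desk"]
pages = [page_a, page_b]


def ic_weight(term):
    return information_content(term, corpus_counts, corpus_total)


def idf_weight(term):
    return idf(term, pages)


tfic_a = score(page_a, topic, ic_weight)
tfic_b = score(page_b, topic, ic_weight)
tfidf_a = score(page_a, topic, idf_weight)
tfidf_b = score(page_b, topic, idf_weight)

print("TF-IC :", tfic_a, tfic_b)
print("TF-IDF:", tfidf_a, tfidf_b)
```

In this toy setup each topic term appears in exactly one of the two pages with the same frequency, so the TF-IDF scores tie, while the IC-based weight ranks the page containing the rarer term "retrieval" higher. Whether the paper's measure behaves this way on real crawls depends on its actual definition of information content.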
