
Genetic mining of HTML structures for effective Web-document retrieval


Abstract

Web documents contain a number of tags that indicate the structure of the text. Text segments marked by HTML tags carry specific meaning, which can be used to improve the performance of document retrieval systems. In this paper, we present a machine learning approach that mines the structure of HTML documents for effective Web-document retrieval. We describe a genetic algorithm that learns importance factors for HTML tags; these factors are then used to re-rank the documents retrieved by standard weighting schemes. The proposed method has been evaluated on artificial text sets and a large-scale TREC document collection. Experimental evidence indicates that the proposed algorithm trains the tag weights well, in accordance with their importance for retrieval, and that the approach significantly improves retrieval accuracy. In particular, the document-structure mining approach tends to move relevant documents to higher ranks, which is especially important in interactive Web-information retrieval environments. [References: 44]
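The abstract gives no implementation details, so the following is only a minimal sketch of the general idea: a genetic algorithm searches for per-tag weights, and those weights are used to re-rank retrieved documents. The tag set, the weighted term-frequency score, the use of average precision as fitness, the GA parameters, and the toy data are all assumptions made for illustration, not the paper's actual method.

```python
import random

# Hypothetical tag set; the paper's actual tags are not listed in the abstract.
TAGS = ["title", "h1", "b", "a", "body"]

def score(doc, query, weights):
    """Weighted term frequency: each query-term hit counts by its tag's weight."""
    return sum(weights[i] * doc["tf"].get((term, tag), 0)
               for term in query for i, tag in enumerate(TAGS))

def rerank(docs, query, weights):
    """Sort documents by descending tag-weighted score (stable for ties)."""
    return sorted(docs, key=lambda d: score(d, query, weights), reverse=True)

def average_precision(ranked, relevant):
    """Average precision of a ranked list against a set of relevant doc ids."""
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranked, start=1):
        if doc["id"] in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0

def fitness(weights, training):
    """Mean average precision over (query, docs, relevant_ids) training triples."""
    return sum(average_precision(rerank(docs, q, weights), rel)
               for q, docs, rel in training) / len(training)

def evolve(training, pop_size=20, generations=30, mutation_rate=0.2, seed=0):
    """Assumed GA: keep the fitter half, one-point crossover, Gaussian mutation."""
    rng = random.Random(seed)
    pop = [[rng.random() for _ in TAGS] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=lambda w: fitness(w, training), reverse=True)
        elite = pop[: pop_size // 2]          # survivors carry over unchanged
        children = []
        while len(elite) + len(children) < pop_size:
            a, b = rng.sample(elite, 2)
            cut = rng.randrange(1, len(TAGS))  # one-point crossover
            child = a[:cut] + b[cut:]
            child = [min(1.0, max(0.0, g + rng.gauss(0, 0.1)))
                     if rng.random() < mutation_rate else g
                     for g in child]
            children.append(child)
        pop = elite + children
    return max(pop, key=lambda w: fitness(w, training))

# Toy training data (invented): the relevant document mentions the query term
# in its title and h1, the irrelevant one repeats it three times in the body.
docs = [
    {"id": "irrelevant", "tf": {("genetic", "body"): 3}},
    {"id": "relevant", "tf": {("genetic", "title"): 1, ("genetic", "h1"): 1}},
]
training = [(["genetic"], docs, {"relevant"})]

uniform = [1.0] * len(TAGS)
learned = evolve(training)
print("uniform fitness:", fitness(uniform, training))
print("learned fitness:", fitness(learned, training))
```

With uniform weights the body-heavy document outranks the relevant one; the search only needs to find weights where the title and h1 weights together exceed three times the body weight. A real system would train on judged queries from a collection such as TREC and would combine the tag weights with a standard term-weighting scheme rather than raw counts.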