
A tree-based learning approach for document structure analysis and its application to web search


Abstract

In this paper, we study the structural analysis of Web documents, with the aim of extracting the sectional hierarchy of a document. In general, a document can be represented as a hierarchy of sections and subsections with corresponding headings and subheadings. We developed two machine learning models: a heading extraction model and a hierarchy extraction model. Heading extraction was formulated as a classification problem, whereas hierarchy extraction used a tree-based learning approach. For this purpose, we developed an incremental learning algorithm based on support vector machines and perceptrons. Both models were evaluated in detail on the heading and hierarchy extraction tasks. For comparison, we used a rule-based baseline that relies on heuristics and on processing the HTML document object model (DOM) tree. The machine learning approach, which is fully automatic, outperformed the rule-based approach. We also analyzed the effect of document structuring on automatic summarization in the context of Web search. A task-based evaluation on TREC queries showed that structured summaries are superior to unstructured summaries in both accuracy and user ratings, and that they enable users to determine the relevance of search results more accurately than search engine snippets do.
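
The representation the abstract describes, a hierarchy of sections and subsections keyed by their headings, can be illustrated with a small sketch. This is not the authors' learning method: it assumes the headings and their nesting levels are already known (in the paper, these are what the two learned models produce), and the names `Section`, `build_hierarchy`, and `outline` are hypothetical, introduced only for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Section:
    """One node of a sectional hierarchy (hypothetical structure, not from the paper)."""
    heading: str
    level: int
    children: List["Section"] = field(default_factory=list)


def build_hierarchy(headings: List[Tuple[int, str]]) -> Section:
    """Nest (level, heading) pairs, given in document order, into a tree.

    Assumption: levels are already known; level 1 is the top-most heading.
    """
    root = Section(heading="<document>", level=0)
    stack = [root]  # path from the root to the most recently added section
    for level, text in headings:
        if level < 1:
            raise ValueError("heading levels must be >= 1")
        # Pop until the top of the stack is a strictly shallower section.
        while stack[-1].level >= level:
            stack.pop()
        node = Section(heading=text, level=level)
        stack[-1].children.append(node)
        stack.append(node)
    return root


def outline(node: Section, depth: int = 0) -> List[str]:
    """Render the tree as indented lines, skipping the artificial root."""
    lines = [] if node.level == 0 else ["  " * (depth - 1) + node.heading]
    for child in node.children:
        lines.extend(outline(child, depth + 1))
    return lines


if __name__ == "__main__":
    demo = [(1, "Intro"), (2, "Background"), (2, "Scope"), (1, "Method")]
    print("\n".join(outline(build_hierarchy(demo))))
```

A stack is used here because a new heading can only attach to the most recent section that is strictly shallower than it. A heading that skips a level (for example, level 3 directly under level 1) is simply attached to that nearest shallower ancestor.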

Bibliographic details

  • Source
    Natural Language Engineering | 2015, Issue 4 | pp. 569-605 | 37 pages
  • Authors

    F. CANAN PEMBE; TUNGA GUENGOER;

  • Author affiliations

    TUEBITAK BILGEM, 41470, Gebze, Kocaeli, Turkey;

    Department of Computer Engineering, Bogazici University, Istanbul, 34342, Turkey;

  • Original format: PDF
  • Language: English
