Ranking XPaths for extracting search result records

Abstract

Extracting search result records (SRRs) from webpages is useful for building an aggregated search engine that combines search results from a variety of search engines. Most automatic approaches to search result extraction are not portable: the complete process has to be rerun on each new search result page. In this paper we describe an algorithm to automatically determine XPath expressions to extract SRRs from webpages. Based on a single search result page, an XPath expression is determined which can be reused to extract SRRs from pages based on the same template. The algorithm is evaluated on six datasets, including two new datasets containing a variety of web, image, video, shopping and news search results. The evaluation shows that for 85% of the tested search result pages, a useful XPath is determined. The algorithm is implemented as a browser plugin and as a standalone application, both of which are available as open source software.
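To illustrate the reuse idea only, here is a minimal sketch of applying one XPath expression to two pages that share a template. It does not reproduce the paper's ranking algorithm: the expression `SRR_XPATH`, the sample pages, and the helper `extract_srrs` are all hypothetical, and the pages are assumed to be well-formed XML so that Python's standard `xml.etree.ElementTree` (which supports only a subset of XPath) can parse them.

```python
import xml.etree.ElementTree as ET

# Hypothetical XPath. In the paper, an expression like this would be
# determined automatically from a single search result page.
SRR_XPATH = ".//div[@class='result']"

# Two made-up result pages built from the same template.
PAGE_A = """<html><body>
<div id="nav"><a href="/">Home</a></div>
<div class="result"><a href="http://example.com/1">First hit</a></div>
<div class="result"><a href="http://example.com/2">Second hit</a></div>
</body></html>"""

PAGE_B = """<html><body>
<div id="nav"><a href="/">Home</a></div>
<div class="result"><a href="http://example.org/a">Other hit</a></div>
</body></html>"""


def extract_srrs(page_source, xpath=SRR_XPATH):
    """Return one dict per node matched by the XPath expression."""
    root = ET.fromstring(page_source)
    records = []
    for node in root.findall(xpath):
        link = node.find(".//a")  # first link inside the record, if any
        records.append({
            "text": "".join(node.itertext()).strip(),
            "url": link.get("href") if link is not None else None,
        })
    return records


if __name__ == "__main__":
    # The same expression is reused on both pages without re-running
    # any wrapper induction step.
    for page in (PAGE_A, PAGE_B):
        print(extract_srrs(page))
```

Real search result pages are rarely well-formed XML, so a practical version would likely need an HTML-tolerant parser; that choice is outside what the abstract specifies.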
