Journal: International Journal of Intelligent Systems and Applications

Web Data Extraction from Scientific Publishers’ Website Using Heuristic Algorithm

Abstract

The World Wide Web is a huge repository of information, and the amount of information available on it is growing exponentially day by day. End users rely on search engines such as Google, Yahoo, and Bing to retrieve information. Search engines use web crawlers, or spiders, which traverse a sequence of web pages to locate relevant pages and return a set of links ordered by relevance. These indexed pages form part of the surface web. Retrieving data from the deep web requires form submission, which search engines do not perform. Data analytics and data mining applications depend on data from deep web pages, but automatic extraction of that data is cumbersome because page structures vary widely. This work proposes a heuristic algorithm for automatic navigation and information extraction starting from a journal's home page. The algorithm is applied to the websites of many publishers, including Nature, Elsevier, BMJ, and Wiley, and the experimental results show that the heuristic technique gives promising precision and recall values.
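The abstract reports its results as precision and recall but does not show how these are computed. As a minimal sketch using the standard definitions (not the authors' evaluation code), the snippet below scores a set of extracted items against a hand-labelled ground truth. The function name `precision_recall` and the sample titles are hypothetical and chosen only for illustration.

```python
def precision_recall(extracted, expected):
    """Return (precision, recall) of extracted items against ground truth.

    precision = correct extractions / all extractions
    recall    = correct extractions / all expected items
    """
    extracted, expected = set(extracted), set(expected)
    correct = len(extracted & expected)
    # Guard against division by zero when either set is empty.
    precision = correct / len(extracted) if extracted else 0.0
    recall = correct / len(expected) if expected else 0.0
    return precision, recall


# Hypothetical data: article titles that should be found on a journal page
# versus what an extractor actually returned (one navigation link slipped in).
expected = {"Title A", "Title B", "Title C", "Title D"}
extracted = {"Title A", "Title B", "Title C", "Subscribe now"}

p, r = precision_recall(extracted, expected)
print(f"precision={p:.2f} recall={r:.2f}")
```

Here three of the four extracted items are correct and three of the four expected items were found, so both measures are 0.75. Comparing exact strings is a simplification; a real evaluation would likely normalise whitespace and case before matching.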