Journal of Integrated Design & Process Science

DATA EXTRACTION FROM REPOSITORIES ON THE WEB: A SEMI-AUTOMATIC APPROACH


Abstract

The World Wide Web (WWW) is becoming the most important source of information for business intelligence and information dissemination. Past information-gathering techniques such as surfing and sifting are proving insufficient for processing the vast volumes of data readily available on the Web. In addition, companies are being forced to integrate this vast data repository within specific cost, time, and reliability constraints. This paper presents the fundamentals of a system called "Browser Harness" (B2H) that extracts requested data from Web sites in a supervised fashion. The system's algorithms are based on the tag structure of web pages, since HTML is the predominant choice for rendering web page content on the WWW. B2H is an interactive tool for harnessing data from semi-structured and structured web pages: it analyzes the tag structure of the input page and locates the data in the HTML code. The extracted data is then exported to XML, delimited text, or database tables.
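The abstract does not give B2H's actual algorithm, so the following is only a minimal sketch of the general idea it names: supervised extraction driven by tag structure. It assumes the user marks one example value on the page, and the code then returns every text node that sits under the same tag path. The names `TagPathCollector`, `extract_like`, `to_delimited`, and `SAMPLE_PAGE` are hypothetical and chosen for illustration; only Python's standard `html.parser` and `csv` modules are used.

```python
import csv
import io
from html.parser import HTMLParser

# Tags that never have a closing tag, so they must not stay on the stack.
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class TagPathCollector(HTMLParser):
    """Record each non-empty text node together with its enclosing tag path."""

    def __init__(self):
        super().__init__()
        self.stack = []  # currently open tags, outermost first
        self.nodes = []  # (tag_path, text) pairs in document order

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; this tolerates some unclosed tags.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.nodes.append(("/".join(self.stack), text))


def extract_like(html, example_text):
    """Return all text nodes that share the tag path of the user's example."""
    collector = TagPathCollector()
    collector.feed(html)
    collector.close()
    target = next((p for p, t in collector.nodes if t == example_text), None)
    if target is None:
        return []
    return [t for p, t in collector.nodes if p == target]


def to_delimited(values, header="value"):
    """Export extracted values as delimited text (one CSV column)."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow([header])
    for value in values:
        writer.writerow([value])
    return buf.getvalue()


# Hypothetical input page, invented for this sketch.
SAMPLE_PAGE = """
<html><body>
<h1>Price list</h1>
<table>
<tr><td><b>Widget</b></td><td>4.50</td></tr>
<tr><td><b>Gadget</b></td><td>7.25</td></tr>
</table>
<p>Contact us</p>
</body></html>
"""

if __name__ == "__main__":
    # The user "supervises" by pointing at one example value per field.
    names = extract_like(SAMPLE_PAGE, "Widget")
    prices = extract_like(SAMPLE_PAGE, "4.50")
    print(to_delimited(names, header="name"))
    print(to_delimited(prices, header="price"))
```

Matching on the full tag path is a deliberately simple choice. A practical tool would likely also need attribute matching, positional indexes among sibling tags, and the XML or database export targets the abstract mentions, none of which are attempted here.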
