Conference: Research Challenges in Information Science, 2009 (RCIS 2009)

AgentMat: Framework for data scraping and semantization

Abstract

Most of the enormous amount of information on the Internet is available only as Web pages made for human readers. These pages offer no common interface for accessing, searching, or browsing the data. Hence, it is hard to extract semantic data from the Web, categorize it, and keep it updated. For this purpose we have designed and implemented a system called AgentMat, intended for efficient extraction of large amounts of data from Web pages. AgentMat processing is based on an XML-based language that describes a given extraction task declaratively. A task description consists of system components which, connected together, perform the desired functionality on a general Web page. Thanks to this scraping system, the raw content of irregularly updated, unstructured Web pages can be kept categorized and accessed together with its semantic metadata. In our pilot implementation we have built the MediaPub system, which extracts information from various Web pages, categorizes it automatically, and checks for duplicates.
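The abstract does not show the task language itself, so the following is only a minimal sketch of the general idea: a declarative XML description naming components that are chained into a pipeline. The element names (`task`, `component`), the attribute names, the three component types, and the sample page are all hypothetical and are not AgentMat's actual schema or API.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical task description. Element and attribute names are invented
# for illustration; they are NOT AgentMat's real schema.
TASK_XML = """
<task name="headlines">
  <component type="extract" pattern="&lt;h2&gt;(.*?)&lt;/h2&gt;"/>
  <component type="normalize"/>
  <component type="deduplicate"/>
</task>
"""


def extract(items, pattern):
    """Apply a regex to every input string and collect the captured groups."""
    return [m for text in items for m in re.findall(pattern, text, flags=re.S)]


def normalize(items):
    """Collapse runs of whitespace so trivially different strings compare equal."""
    return [" ".join(item.split()) for item in items]


def deduplicate(items):
    """Drop repeated items while keeping first-seen order."""
    return list(dict.fromkeys(items))


# Registry mapping the (assumed) component type names to their functions.
COMPONENTS = {
    "extract": extract,
    "normalize": normalize,
    "deduplicate": deduplicate,
}


def run_task(task_xml, page):
    """Build the component chain from the XML and feed the page through it."""
    root = ET.fromstring(task_xml.strip())
    data = [page]
    for node in root.findall("component"):
        attrs = dict(node.attrib)
        func = COMPONENTS[attrs.pop("type")]
        # Remaining attributes become keyword arguments of the component.
        data = func(data, **attrs)
    return data


# Made-up sample page with one repeated headline.
PAGE = (
    "<html><h2>Flood  warning</h2>"
    "<h2>Election results</h2>"
    "<h2>Flood warning</h2></html>"
)

if __name__ == "__main__":
    print(run_task(TASK_XML, PAGE))
```

In this sketch, changing the extraction behavior means editing only the XML, not the Python code, which is the point of a declarative task description. A real scraper would likely use a proper HTML parser instead of a regular expression; the regex here just keeps the example within the standard library.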