Proceedings of 2010 4th International Universal Communication Symposium

InForCE: Forum data crawling with information extraction

Abstract

Forum data acquisition is a prerequisite for forum data analysis tasks such as opinion analysis and online advertising. Since the structure of forum data usually has casual relationships with the page structure, effective forum data acquisition requires integrating Web page crawling with information extraction. In this paper, we propose a system, InForCE, for this purpose. The system has two parts. First, it downloads Web pages from different forums and generates HTML documents. Second, it extracts structured data from those HTML documents according to user requirements. For the extraction step, we propose a novel algorithm that automatically transforms user requirements into XSLT. Our experimental results show that structured data extraction is feasible and efficient.
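The abstract does not describe how user requirements are turned into XSLT, so the following is only a minimal sketch of the general idea, not the InForCE algorithm. It assumes a hypothetical requirement format: one XPath that selects each forum post, plus a mapping from output field names to XPaths relative to that post. The function name `requirement_to_xslt`, the `posts`/`post` output elements, and the sample XPaths are all illustrative assumptions.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import quoteattr

XSL_NS = "http://www.w3.org/1999/XSL/Transform"


def requirement_to_xslt(record_xpath, fields):
    """Build an XSLT 1.0 stylesheet string from a hypothetical user requirement.

    record_xpath: XPath selecting one node per forum post (assumed format).
    fields: dict mapping an output element name to an XPath that is
            evaluated relative to the post node. Names are assumed to be
            valid XML element names.
    """
    lines = [
        f'<xsl:stylesheet version="1.0" xmlns:xsl="{XSL_NS}">',
        '  <xsl:output method="xml" indent="yes"/>',
        '  <xsl:template match="/">',
        '    <posts>',
        # quoteattr escapes the XPath and picks a safe quote character.
        f'      <xsl:for-each select={quoteattr(record_xpath)}>',
        '        <post>',
    ]
    for name, xpath in fields.items():
        lines.append(
            f'          <{name}>'
            f'<xsl:value-of select={quoteattr(xpath)}/>'
            f'</{name}>'
        )
    lines += [
        '        </post>',
        '      </xsl:for-each>',
        '    </posts>',
        '  </xsl:template>',
        '</xsl:stylesheet>',
    ]
    return "\n".join(lines)


if __name__ == "__main__":
    # Illustrative requirement; the class names are made up for this sketch.
    stylesheet = requirement_to_xslt(
        "//div[@class='post']",
        {
            "author": ".//span[@class='author']",
            "body": ".//div[@class='content']",
        },
    )
    # Parsing confirms that the generated stylesheet is well-formed XML.
    ET.fromstring(stylesheet)
    print(stylesheet)
```

The sketch only generates the stylesheet. Applying it to a crawled page would need an XSLT processor, and real forum HTML would typically have to be normalized into well-formed XML first.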