Venue: International World Wide Web Conference (WWW 09)

Incorporating Site-Level Knowledge to Extract Structured Data from Web Forums


Abstract

Web forums have become an important data resource for many web applications, but extracting structured data from unstructured web forum pages remains challenging because of both complex page layout designs and unrestricted user-created posts. In this paper, we study the problem of structured data extraction from various web forum sites. Our goal is a solution that is as general as possible and extracts structured data, such as post title, post author, post time, and post content, from any forum site. In contrast to most existing information extraction methods, which only leverage the knowledge inside an individual page, we incorporate both page-level and site-level knowledge and employ Markov logic networks (MLNs) to effectively integrate all useful evidence by learning their importance automatically. Site-level knowledge includes (1) the linkages among different object pages, such as list pages and post pages, and (2) the interrelationships of pages belonging to the same object. Experimental results on 20 forums show very encouraging information extraction performance and demonstrate the applicability of the proposed approach to various forums. We also show that performance is limited when only page-level knowledge is used, whereas incorporating site-level knowledge can significantly improve both precision and recall.
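
To make the idea of integrating evidence through weighted formulas concrete, the following is a minimal toy sketch of the log-linear scoring that underlies a Markov logic network, where the probability of a labeling is proportional to the exponential of the weighted count of satisfied formulas. It is not the paper's model: the node features (`text_is_date`, `links_to_profile`), the two formulas, and the weights are all invented for illustration, and the weights are fixed by hand here rather than learned.

```python
import math
from itertools import product

# Hypothetical labels for candidate DOM nodes on a forum post page.
LABELS = ("author", "time", "content")

# Invented evidence for three candidate nodes (not data from the paper).
NODES = [
    {"text_is_date": False, "links_to_profile": True},
    {"text_is_date": True, "links_to_profile": False},
    {"text_is_date": False, "links_to_profile": False},
]

# (weight, formula) pairs. Each formula is a biconditional over one node
# and its label. Weights are made up; an MLN would learn them from data.
FORMULAS = [
    # Page-level evidence: the text looks like a timestamp <=> label is "time".
    (1.5, lambda node, label: node["text_is_date"] == (label == "time")),
    # Site-level evidence: the node links to a profile page <=> label is "author".
    (2.0, lambda node, label: node["links_to_profile"] == (label == "author")),
]


def score(world):
    """Weighted count of satisfied formula groundings for one labeling."""
    return sum(
        weight * sum(formula(node, label) for node, label in zip(NODES, world))
        for weight, formula in FORMULAS
    )


def world_probabilities():
    """Normalize exp(score) over every possible labeling of the nodes."""
    worlds = list(product(LABELS, repeat=len(NODES)))
    z = sum(math.exp(score(world)) for world in worlds)
    return {world: math.exp(score(world)) / z for world in worlds}


def most_likely_world():
    """Return the labeling with the highest probability."""
    probs = world_probabilities()
    return max(probs, key=probs.get)


if __name__ == "__main__":
    print(most_likely_world())
```

In this toy, every formula looks at a single node, so the nodes are scored independently. A real MLN would also include formulas that relate several nodes or several pages, which is where site-level knowledge such as links between list pages and post pages would enter, and exact enumeration of all labelings would be replaced by approximate inference.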
