Conference: 2011 Eighth Web Information Systems and Applications Conference

A Bottom-up Approach of Web Data Extraction based on Entity Recognition and Integration

Abstract

Most popular methods for web data extraction (WDE) today are top-down and depend on page structure. However, these techniques do not scale well to complex pages. We therefore propose a bottom-up approach to WDE based on entity recognition and integration, which avoids over-dependence on the structure of web pages. The approach first labels primary text sequences and then takes their repetitive patterns into account. We propose a Two-Level extraction model for entity recognition and a repetitive pattern extraction algorithm for entity integration. Our approach can effectively reduce attribute labeling mistakes. We also support the approach with experimental results, which indicate that it performs better than traditional extraction techniques, especially on complex Web pages.
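
The bottom-up idea (label text sequences first, then group them by repeated label patterns) can be illustrated with a minimal sketch. The abstract does not specify the Two-Level model or the pattern extraction algorithm, so everything below is an assumption made for illustration: the regex labelers stand in for entity recognition, a most-frequent label n-gram stands in for repetitive pattern extraction, and the names `LABELERS`, `label`, `most_common_pattern`, and `integrate` are hypothetical.

```python
import re
from collections import Counter

# Hypothetical attribute labelers (stand-ins for the paper's entity
# recognition step, which the abstract does not describe in detail).
LABELERS = [
    ("PRICE", re.compile(r"^\$\d+(\.\d{2})?$")),
    ("DATE", re.compile(r"^\d{4}-\d{2}-\d{2}$")),
]


def label(text):
    """Assign an attribute label to one text fragment; default is TEXT."""
    for name, pattern in LABELERS:
        if pattern.match(text):
            return name
    return "TEXT"


def most_common_pattern(labels, length):
    """Return the most frequent label n-gram of the given length, or None."""
    grams = Counter(
        tuple(labels[i:i + length]) for i in range(len(labels) - length + 1)
    )
    if not grams:
        return None
    return grams.most_common(1)[0][0]


def integrate(texts, length):
    """Group labeled fragments into records wherever the pattern repeats."""
    labels = [label(t) for t in texts]
    pattern = most_common_pattern(labels, length)
    records = []
    if pattern is None:
        return records
    i = 0
    while i + length <= len(texts):
        if tuple(labels[i:i + length]) == pattern:
            # Keep (label, text) pairs so repeated labels are not lost.
            records.append(list(zip(pattern, texts[i:i + length])))
            i += length
        else:
            i += 1  # fragment does not start a record; skip it
    return records


# Flattened text fragments from a made-up listing page.
fragments = ["Site news", "Widget", "$9.99", "2011-10-21",
             "Gadget", "$19.50", "2011-10-22"]
for record in integrate(fragments, 3):
    print(record)
```

Because the grouping works on label sequences and not on the DOM tree, the stray "Site news" fragment is skipped without any knowledge of the page layout. The fixed record length passed to `integrate` is a simplification of this sketch, not something the abstract states.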
