Conference: 2012 IEEE International Conference on Advanced Communication, Control and Computing Technologies

A novel ensemble vision based deep web data extraction technique for web mining applications


Abstract

Web content extraction is the task of extracting structured information from unstructured and semi-structured machine-readable documents. In most cases this activity involves processing human-language text by means of natural language processing (NLP). Recent work in multimedia document processing, such as automatic annotation and content extraction from images, audio, and video, can also be seen as information extraction. Information retrieval, similarly, is a process driven by the user's query, and the retrieved information is then extracted using web content extraction. The challenges of this kind of web page content extraction are growing. In this work, we study the problem of automatically extracting content from web pages. Much research has addressed this problem, but the existing approaches have limitations: they lack the capacity to handle large numbers of web pages, and they depend on the web page programming language (HTML). The proposed work aims to overcome these limitations. It deals with an information retrieval process in which a vision-based approach is applied, which helps to extract both images and text from web pages. Most studies show that when a page is presented to the user, spatial and visual features play a very important role because they help the user unconsciously divide the page into several semantic parts. The proposed work therefore focuses on the primary visual features of a web page, and extraction is carried out on the basis of these features. This approach can achieve better performance than other traditional methods.
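The abstract does not spell out the extraction procedure, so the following is only a minimal sketch of the general idea of dividing a rendered page into semantic parts using spatial features. It is not the authors' method. The `Block` type, the `segment_by_vertical_gap` function, and the `max_gap` threshold are all hypothetical names introduced for illustration, and the sketch assumes the bounding boxes of text and image elements have already been obtained from a rendering engine.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Block:
    """A rendered page element and its on-screen bounding box (hypothetical type)."""
    kind: str      # "text" or "image"
    content: str   # text content or image URL
    x: float
    y: float
    width: float
    height: float


def segment_by_vertical_gap(blocks: List[Block], max_gap: float = 20.0) -> List[List[Block]]:
    """Group blocks into visual regions (illustrative heuristic, not the paper's algorithm).

    Blocks are scanned top to bottom; a new region starts whenever the vertical
    whitespace between the current region's bottom edge and the next block's
    top edge exceeds max_gap pixels.
    """
    regions: List[List[Block]] = []
    region_bottom = 0.0
    for block in sorted(blocks, key=lambda b: (b.y, b.x)):
        if not regions or block.y - region_bottom > max_gap:
            # Large gap (or first block): open a new region.
            regions.append([block])
            region_bottom = block.y + block.height
        else:
            # Block overlaps or sits close to the current region: merge it in.
            regions[-1].append(block)
            region_bottom = max(region_bottom, block.y + block.height)
    return regions


# Made-up layout data, for illustration only.
page = [
    Block("text", "Site navigation", 0, 0, 800, 30),
    Block("image", "product1.png", 0, 100, 200, 150),
    Block("text", "Product 1 description", 220, 100, 400, 40),
    Block("text", "Footer", 0, 400, 800, 20),
]
regions = segment_by_vertical_gap(page)
for i, region in enumerate(regions):
    print(i, [(b.kind, b.content) for b in region])
```

Because the grouping relies only on rendered positions, the image and its description land in the same region regardless of how the underlying HTML is nested, which is the kind of markup independence the abstract argues for.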
