Conference: International Conference on Web Research

Pattern matching for extraction of core contents from news web pages

Abstract

Besides their core content, web pages contain other elements such as banners, navigational elements, copyright information, and external links. This noisy content often occupies more of the page than the core content and is typically unrelated to the page's main subject. Most information on web pages is represented in XML, HTML, or XHTML, formats that mostly hold semi-structured text documents lacking a formal document structure. Such documents do not separate the text from the schema, and the amount of structure used to represent the text depends on the purpose. No semantics are attached to semi-structured documents. The core content of a text document must therefore be extracted before its words or sentences can be analysed to retrieve relevant information. Although many existing methods formulate content identification as a DOM tree node selection problem, each has some lacunae. Here we propose an approach based on a pattern matching technique. The technique uses a simple heuristic to extract core content from web pages, which are mostly semi-structured in nature. It requires visiting the appropriate news web site by its URL, following the links to each news page of a specified category, and extracting the data, including metadata, from each of these news pages. The approach uses a devised algorithm that applies regular expressions (regexes) to identify the correct pattern for extracting the actual text content from these news documents. The proposed approach handles news web pages of any size and extracts core content efficiently and with high accuracy.
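
The pipeline described in the abstract (collect category links, then pull metadata and body text with regexes) might look roughly like the Python sketch below. The abstract does not publish its actual patterns, so every regex, function name, and the sample page layout here are illustrative assumptions, not the authors' algorithm. Fetching is left out; the functions operate on HTML strings that have already been downloaded.

```python
import html
import re

# Hypothetical patterns: the abstract does not list the regexes it uses.
LINK_RE = re.compile(r'<a\s[^>]*href="([^"]+)"', re.IGNORECASE)
META_RE = re.compile(r'<meta\s+name="([^"]+)"\s+content="([^"]*)"', re.IGNORECASE)
TITLE_RE = re.compile(r'<title>(.*?)</title>', re.IGNORECASE | re.DOTALL)
# Blocks assumed to hold noise (navigation, scripts, footers, and so on).
NOISE_RE = re.compile(
    r'<(script|style|nav|header|footer|aside)\b.*?</\1>',
    re.IGNORECASE | re.DOTALL,
)
PARA_RE = re.compile(r'<p\b[^>]*>(.*?)</p>', re.IGNORECASE | re.DOTALL)
TAG_RE = re.compile(r'<[^>]+>')


def category_links(index_html, category):
    """Return hrefs from an index page whose path contains /category/."""
    marker = "/" + category + "/"
    return [href for href in LINK_RE.findall(index_html) if marker in href]


def clean(fragment):
    """Strip inline tags, decode HTML entities, and collapse whitespace."""
    text = html.unescape(TAG_RE.sub("", fragment))
    return " ".join(text.split())


def extract_article(page_html):
    """Extract title, metadata, and paragraph text from one news page."""
    title = TITLE_RE.search(page_html)
    metadata = dict(META_RE.findall(page_html))
    # Drop assumed noise blocks first so their paragraphs are not collected.
    body_html = NOISE_RE.sub("", page_html)
    paragraphs = [clean(p) for p in PARA_RE.findall(body_html)]
    return {
        "title": clean(title.group(1)) if title else "",
        "metadata": metadata,
        "text": "\n".join(p for p in paragraphs if p),
    }


# Made-up sample input, used only to exercise the functions above.
SAMPLE_INDEX = (
    '<a href="/sports/match-report.html">Match</a> '
    '<a href="/about.html">About</a>'
)
SAMPLE_PAGE = """<html><head><title>Team wins final</title>
<meta name="author" content="A. Reporter">
</head><body>
<nav><p>Home | Sports</p></nav>
<p>The home team won the <b>final</b>. Fans &amp; players celebrated.</p>
<footer><p>Copyright notice</p></footer>
</body></html>"""

links = category_links(SAMPLE_INDEX, "sports")
article = extract_article(SAMPLE_PAGE)
```

Regexes of this kind rely on the target site using a consistent layout; nested or malformed markup would need site-specific patterns, which is presumably where the heuristic mentioned in the abstract comes in.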