Conference: International Conference on Web Research

Pattern matching for extraction of core contents from news web pages

Abstract

Besides their core content, web pages contain other elements such as banners, navigational elements, copyright information, and external links. This noisy content takes up more of the page area and is typically unrelated to the page's main subject. Most information on web pages is represented in XML, HTML, or XHTML, formats that mostly hold semi-structured text documents lacking a formal document structure. Such documents do not discriminate between text and schema, and the amount of structure used to represent the text depends on the purpose. No semantics are applied to semi-structured documents, so the core content of a text document has to be extracted before its words or sentences can be analysed to retrieve relevant information. Although many existing methods formulate content identification as a DOM tree node selection problem, each has shortcomings. We propose an approach based on pattern matching. The technique uses a simple heuristic to extract core content from web pages, which are mostly semi-structured in nature. It visits the appropriate news web site by its URL, follows the links to each news page in a specified category, and extracts the data, including metadata, from each of these pages. The approach uses an algorithm that applies regular expressions (regexes) to identify the correct pattern for extracting the actual text content from these news documents. The proposed approach handles news web pages of any size and extracts core content efficiently and with high accuracy.
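
The abstract does not list the patterns or the algorithm the authors use, so the following Python sketch only illustrates the general idea of regex-based extraction under stated assumptions. Every pattern, the function names `category_links` and `extract_core_content`, the `min_words` threshold, the `example.com` URLs, and the sample page are hypothetical and are not taken from the paper. The sketch collects the links of one category from an index page, strips blocks that usually carry noise, and keeps the title, the `<meta>` name/content pairs, and the longer paragraphs.

```python
import html
import re
from urllib.parse import urljoin

# All patterns below are illustrative assumptions, not the authors' patterns.
LINK = re.compile(r'<a\b[^>]*?\bhref\s*=\s*["\']([^"\']+)["\']', re.IGNORECASE)
COMMENTS = re.compile(r"<!--.*?-->", re.DOTALL)
NOISE_BLOCKS = re.compile(
    r"<(script|style|nav|header|footer|aside|form)\b[^>]*>.*?</\1\s*>",
    re.IGNORECASE | re.DOTALL,
)
TITLE = re.compile(r"<title[^>]*>(.*?)</title\s*>", re.IGNORECASE | re.DOTALL)
META = re.compile(
    r'<meta\s+name=["\']([^"\']+)["\']\s+content=["\']([^"\']*)["\']',
    re.IGNORECASE,
)
PARAGRAPH = re.compile(r"<p\b[^>]*>(.*?)</p\s*>", re.IGNORECASE | re.DOTALL)
TAG = re.compile(r"<[^>]+>")
WHITESPACE = re.compile(r"\s+")


def category_links(index_html: str, base_url: str, category: str) -> list:
    """Return absolute links whose URL contains '/<category>/', in page order."""
    urls = (urljoin(base_url, href) for href in LINK.findall(index_html))
    wanted = [url for url in urls if f"/{category}/" in url]
    return list(dict.fromkeys(wanted))  # drop duplicates, keep order


def clean(fragment: str) -> str:
    """Strip remaining tags, decode entities, and collapse whitespace."""
    text = html.unescape(TAG.sub(" ", fragment))
    return WHITESPACE.sub(" ", text).strip()


def extract_core_content(page: str, min_words: int = 5) -> dict:
    """Extract title, metadata, and body text from one news page."""
    title_match = TITLE.search(page)
    title = clean(title_match.group(1)) if title_match else ""
    metadata = {name.lower(): html.unescape(value)
                for name, value in META.findall(page)}
    # Remove comments and noisy blocks before looking for paragraphs.
    body = NOISE_BLOCKS.sub(" ", COMMENTS.sub(" ", page))
    paragraphs = [clean(p) for p in PARAGRAPH.findall(body)]
    # Heuristic: very short paragraphs are treated as buttons or labels.
    paragraphs = [p for p in paragraphs if len(p.split()) >= min_words]
    return {"title": title, "metadata": metadata, "text": "\n\n".join(paragraphs)}


SAMPLE_PAGE = """<html><head><title>City council approves budget</title>
<meta name="author" content="A. Reporter">
<script>var ads = "<p>buy this product right now today</p>";</script></head>
<body><nav><p>Home World Sports Business Tech Opinion</p></nav>
<p>The city council approved the annual budget on Monday evening.</p>
<p>Share</p>
<footer><p>Copyright notice and all rights reserved here.</p></footer>
</body></html>"""

result = extract_core_content(SAMPLE_PAGE)
print(result["title"])
print(result["metadata"])
print(result["text"])
```

In this sketch the navigation, script, and footer blocks are removed before paragraph matching, so their `<p>` elements never reach the output. Regexes of this kind are brittle on nested or malformed markup, so the patterns would need tuning per news site.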
