Pattern matching for extraction of core contents from news web pages

Abstract

Besides their core content, web pages contain other elements such as banners, navigational elements, copyright information, and external links. This noisy content occupies a large part of a web page and is typically unrelated to the page's main subject. Most information on web pages is represented in XML, HTML, or XHTML, formats that mostly hold semi-structured text documents lacking a formal document structure. Such documents do not discriminate between the text and the schema, and the amount of structure used to represent the text depends on the purpose. No semantics are applied to semi-structured documents. The core content of a text document therefore has to be extracted before its words or sentences can be analysed to retrieve relevant information. Although many existing methods formulate content identification as a DOM tree node selection problem, each has some shortcoming. Here we propose an approach based on pattern matching. The technique uses a simple heuristic to extract core content from web pages, which are mostly semi-structured in nature. It requires visiting the appropriate news web site by its URL, following the links to each news page of a specified category, and extracting the data, including metadata, from each of these news pages. The approach uses an algorithm that applies regular expressions (regexes) to identify the correct pattern for extracting the actual text content from these news documents. The proposed approach handles news web pages of any size and extracts core content efficiently and with high accuracy.
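
The abstract does not give the regular expressions or the heuristic the authors use, so the following Python sketch only illustrates the general idea of regex-based core content extraction. The patterns, the `min_words` threshold, and the names `clean` and `extract_core_content` are assumptions made for illustration, not the paper's algorithm. Fetching the page by URL and following category links is left out so that the example runs offline on an HTML string.

```python
import re
from html import unescape

# Hypothetical patterns chosen for illustration; the paper's regexes are not published in the abstract.
# Blocks that usually carry noise (scripts, styles, navigation, page header/footer, sidebars).
NOISE_BLOCKS = re.compile(
    r"<(script|style|nav|header|footer|aside)\b[^>]*>.*?</\1\s*>",
    re.IGNORECASE | re.DOTALL,
)
# Paragraph elements, assumed here to hold the article text.
PARAGRAPH = re.compile(r"<p\b[^>]*>(.*?)</p\s*>", re.IGNORECASE | re.DOTALL)
# The page title, used as a minimal piece of metadata.
TITLE = re.compile(r"<title\b[^>]*>(.*?)</title\s*>", re.IGNORECASE | re.DOTALL)
# Any remaining tag inside a fragment.
TAG = re.compile(r"<[^>]+>")


def clean(fragment: str) -> str:
    """Strip inline tags, decode HTML entities, and collapse whitespace."""
    return " ".join(unescape(TAG.sub(" ", fragment)).split())


def extract_core_content(html: str, min_words: int = 5) -> dict:
    """Return the title and the paragraphs that look like article text.

    Assumed heuristic: drop known noise blocks, then keep only <p> elements
    with at least `min_words` words, since short paragraphs are often
    buttons or labels.
    """
    title_match = TITLE.search(html)
    title = clean(title_match.group(1)) if title_match else ""

    body = NOISE_BLOCKS.sub(" ", html)
    paragraphs = []
    for raw in PARAGRAPH.findall(body):
        text = clean(raw)
        if len(text.split()) >= min_words:
            paragraphs.append(text)

    return {"title": title, "text": "\n".join(paragraphs)}
```

Calling `extract_core_content(page_html)` on the HTML of a news page returns a dictionary with `title` and `text` keys. Regexes of this kind are brittle on nested or malformed markup, which is one reason other methods work on the DOM tree instead; the sketch trades that robustness for simplicity.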

Bibliographic details

Source
International Conference on Web Research | 2016 | pp. 13-18 | 6 pages
Authors
Sandeep Sirsat; Vinay Chavan
Original format
PDF
Keywords
Document Object Module; Information extraction; Pattern matching; tags

Similar documents

1. Content extraction from news web pages using tag tree [J]. Chandrakala Arya, Sanjay K. Dwivedi. International Journal of Autonomic Computing. 2018, Issue 1
2. Extraction of Core Contents from Web Pages [J]. Sandeep Sirsat. International Journal of Engineering Trends and Technology. 2014, Issue 9
3. Extraction of Frequent Sequential Patterns From Web Usage Data and Their Applications In Pre-Fetching Rules Generation For Effective Web Latency Reduction [J]. Badong Chen, Yueqin Zhu. Advances in applied computational mechanics. 2018, Issue 1
4. Pattern matching for extraction of core contents from news web pages [C]. Sandeep Sirsat, Vinay Chavan. International Conference on Web Research. 2016
5. Combined mining of Web server logs and Web contents for classifying user navigation patterns and predicting users' future requests [D]. Liu, Haibin. 2005
6. Size-specific interaction patterns and size matching in a plant–pollinator interaction web [O]. Martina Stang, Peter G. L. Klinkhamer, Nickolas M. Waser. 2009
7. Transforming user data into user value by novel mining techniques for extraction of web content, structure and usage patterns. The Development and Evaluation of New Web Mining Methods that enhance Information Retrieval and improve the Understanding of User's Web Behavior in Websites and Social Blogs [O]. Ammari Ahmad N. 2010
