Pattern matching for extraction of core contents from news web pages

Abstract

Besides their core content, web pages contain other elements such as banners, navigational elements, copyright information, and external links. This noisy content covers a large share of the page and is typically unrelated to the page's main subject. Most information on web pages is represented in XML, HTML, or XHTML, formats that mostly hold semi-structured text documents lacking a formal document structure. Such documents do not separate the text from the schema, and the amount of structure used to represent the text depends on its purpose. No semantics are applied to semi-structured documents. The core content of a text document must therefore be extracted before its words or sentences can be analysed to retrieve relevant information. Many existing methods formulate content identification as a DOM tree node selection problem, but each has shortcomings. We propose an approach based on pattern matching. It uses a simple heuristic to extract core content from web pages, which are mostly semi-structured. The approach visits a news web site by its URL, follows the links to each news page in a specified category, and extracts the data, including metadata, from each of these pages. An algorithm devised for this purpose applies regular expressions (regexes) to identify the correct pattern for extracting the actual text content from these news documents. The proposed approach handles news web pages of any size and extracts core content efficiently and with high accuracy.
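
The pipeline the abstract describes (select the links of one category, then apply regexes to each news page to pull out metadata and body text) might be sketched roughly as below. The abstract does not list the authors' actual patterns or algorithm, so every regex, function name, and the `/category/` URL convention here is a hypothetical stand-in. Fetching pages over the network is left out so that the sketch runs offline.

```python
import re
from html import unescape

# Hypothetical patterns for illustration; the abstract does not give the
# authors' real regexes.
COMMENT_RE = re.compile(r"<!--.*?-->", re.DOTALL)
# Block-level elements assumed to hold noise rather than article text.
# Non-greedy matching does not handle nested elements of the same name.
NOISE_RE = re.compile(
    r"<(script|style|nav|header|footer|aside)\b[^>]*>.*?</\1\s*>",
    re.IGNORECASE | re.DOTALL,
)
TITLE_RE = re.compile(r"<title\b[^>]*>(.*?)</title\s*>", re.IGNORECASE | re.DOTALL)
# Assumes the attribute order name="..." content="..." with double quotes.
META_RE = re.compile(r'<meta\s+name="([^"]+)"\s+content="([^"]*)"[^>]*>', re.IGNORECASE)
PARA_RE = re.compile(r"<p\b[^>]*>(.*?)</p\s*>", re.IGNORECASE | re.DOTALL)
TAG_RE = re.compile(r"<[^>]+>")
LINK_RE = re.compile(r'<a\s[^>]*href="([^"]+)"', re.IGNORECASE)


def clean_text(fragment: str) -> str:
    """Strip inline tags, decode HTML entities, and collapse whitespace."""
    text = unescape(TAG_RE.sub("", fragment))
    return re.sub(r"\s+", " ", text).strip()


def category_links(index_html: str, category: str) -> list[str]:
    """Return hrefs from an index page whose path contains /category/."""
    return [href for href in LINK_RE.findall(index_html) if f"/{category}/" in href]


def extract_core(page_html: str) -> dict:
    """Return the title, metadata, and paragraph text of one news page."""
    title = TITLE_RE.search(page_html)
    meta = {name.lower(): unescape(value) for name, value in META_RE.findall(page_html)}
    # Remove comments first, then the noisy blocks, before matching paragraphs.
    body = NOISE_RE.sub("", COMMENT_RE.sub("", page_html))
    paragraphs = [clean_text(p) for p in PARA_RE.findall(body)]
    return {
        "title": clean_text(title.group(1)) if title else "",
        "meta": meta,
        "text": [p for p in paragraphs if p],
    }
```

Calling `extract_core` on the HTML of a news page returns a dictionary with `title`, `meta`, and `text` keys, and `category_links` narrows an index page to one section. Regexes like these are brittle against nested or malformed markup, which is why a real system would need patterns tuned to each site.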

Bibliographic record

Source: International Conference on Web Research, 2016, 6 pages
Authors: Sandeep Sirsat; Vinay Chavan
Original format: PDF
CLC classification: Computer networks
Keywords: Document Object Model; Information extraction; Pattern matching; tags
