Automatic Web News Content Extraction Based on Similar Pages

Abstract

Most news pages today are generated from an underlying structured source, so we argue that template-dependent wrappers suit them better than template-independent wrappers. In this paper, we propose a novel automatic template-dependent Web news content extraction approach based on similar pages. First, we choose two similar pages as training samples and represent them as two HTML DOM trees. Second, we create the maximum matching tree between the DOM trees using our simple tree matching and backtracking algorithm. Then, by analyzing the characteristics of the nodes in the maximum matching tree, we eliminate the noise nodes to generate an extraction template. Finally, we build a template-dependent wrapper for target news pages whose structures are similar to the samples. Experimental results indicate that our approach is effective and efficient for Web news content extraction: the average harmonic mean of precision and recall reaches 98.3%.
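
The abstract does not spell out the authors' "simple tree matching and backtracking algorithm", so the following is only a minimal sketch of the general idea. It assumes the classic Simple Tree Matching dynamic program (two nodes match only if their tags are equal, and children are aligned in order) followed by a backtracking pass that recovers the matched node pairs. The `Node` type and the function name `simple_tree_matching` are hypothetical and stand in for a real HTML DOM; noise-node elimination and wrapper generation are not shown.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical stand-in for an HTML DOM node: a tag plus ordered children."""
    tag: str
    children: list["Node"] = field(default_factory=list)


def simple_tree_matching(a: Node, b: Node) -> tuple[int, list[tuple[Node, Node]]]:
    """Return (size, pairs) of a maximum matching between trees a and b.

    Assumption: this is the textbook Simple Tree Matching, not necessarily
    the exact variant used in the paper.
    """
    if a.tag != b.tag:
        return 0, []

    m, n = len(a.children), len(b.children)
    # sub[i][j] holds the matching of child i of a against child j of b.
    sub = [[simple_tree_matching(ca, cb) for cb in b.children] for ca in a.children]
    # best[i][j] is the best score using the first i children of a
    # and the first j children of b.
    best = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            best[i][j] = max(
                best[i - 1][j],
                best[i][j - 1],
                best[i - 1][j - 1] + sub[i - 1][j - 1][0],
            )

    # Backtrack through the table to recover which child pairs were aligned.
    pairs = [(a, b)]
    i, j = m, n
    while i > 0 and j > 0:
        if best[i][j] == best[i - 1][j]:
            i -= 1
        elif best[i][j] == best[i][j - 1]:
            j -= 1
        else:
            pairs.extend(sub[i - 1][j - 1][1])
            i -= 1
            j -= 1
    return best[m][n] + 1, pairs


# Two toy "similar pages" that share a template but differ in detail.
page_a = Node("html", [Node("body", [
    Node("div", [Node("h1"), Node("p")]),
    Node("ul", [Node("li")]),
])])
page_b = Node("html", [Node("body", [
    Node("div", [Node("h1"), Node("p"), Node("p")]),
    Node("span"),
])])

size, pairs = simple_tree_matching(page_a, page_b)
print(size, sorted(x.tag for x, _ in pairs))
```

The matched pairs approximate the shared template of the two sample pages. In an approach like the one described, nodes that fall outside the matching, or matched nodes whose content is identical across pages, would be candidates for noise removal; that filtering step is an interpretation and is left out here.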

Bibliographic details

Source
2010 International Conference on Web Information Systems and Mining | 2010 | pp. 232-236 | 5 pages
Authors
Zhang Chunyuan; Lin Zhiyang
Original format: PDF
Chinese Library Classification: Computer networks
Keywords
Web news content extraction; similar pages; simple tree matching and backtracking algorithm; template-dependent wrapper
