ECON: An Approach to Extract Content from Web News Page

Abstract

This paper presents ECON, a simple but effective approach for fully automatic content extraction from Web news pages. ECON represents a Web news page as a DOM tree and leverages the substantial features of that tree. ECON first finds a snippet-node, a node that wraps part of the news content, and then backtracks from the snippet-node until it finds a summary-node, which wraps the entire news content. During backtracking, ECON removes noise. Experimental results show that ECON achieves high accuracy and fully satisfies the requirements for scalable extraction. Moreover, ECON can be applied to Web news pages written in many popular languages, such as Chinese, English, French, German, Italian, Japanese, Portuguese, Russian, Spanish, and Arabic. ECON is also very easy to implement.
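
The snippet-node / summary-node idea can be illustrated with a minimal Python sketch. The abstract does not say how the snippet-node is chosen, when backtracking stops, or how noise is detected, so everything below is an assumption made for illustration: sentence punctuation is used as the signal for news text, siblings without punctuation are treated as noise, and backtracking stops once a parent contributes no further punctuated text. The names `Node`, `TreeBuilder`, `find_snippet_node`, `find_summary_node`, and `extract` are hypothetical and are not taken from the paper.

```python
from html.parser import HTMLParser

SKIP = {"script", "style"}                      # never counted as content (assumption)
VOID = {"br", "img", "hr", "meta", "link", "input"}
PUNCT = set(",.;!?，。；！？")                    # assumed signal for sentence-like text

class Node:
    def __init__(self, tag, parent=None, text=""):
        self.tag, self.parent, self.text = tag, parent, text
        self.children = []

class TreeBuilder(HTMLParser):
    """Builds a very small DOM-like tree; text is stored in '#text' nodes."""
    def __init__(self):
        super().__init__()
        self.root = Node("root")
        self.cur = self.root
    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.cur)
        self.cur.children.append(node)
        if tag not in VOID:
            self.cur = node
    def handle_endtag(self, tag):
        n = self.cur                            # close the nearest open element with this tag
        while n is not self.root and n.tag != tag:
            n = n.parent
        if n is not self.root:
            self.cur = n.parent
    def handle_data(self, data):
        self.cur.children.append(Node("#text", self.cur, data))

def walk(node):
    yield node
    for child in node.children:
        yield from walk(child)

def text_of(node):
    if node.tag in SKIP:
        return ""
    return node.text + "".join(text_of(c) for c in node.children)

def punct_count(s):
    return sum(ch in PUNCT for ch in s)

def find_snippet_node(root):
    """Assumed heuristic: the element whose own text holds the most punctuation."""
    def own_score(n):
        return sum(punct_count(c.text) for c in n.children if c.tag == "#text")
    elements = [n for n in walk(root) if n.tag not in SKIP and n.tag != "#text"]
    return max(elements, key=own_score)

def find_summary_node(snippet):
    """Backtrack toward the root, pruning siblings that look like noise."""
    node = snippet
    while node.parent is not None:
        parent = node.parent
        parent.children = [c for c in parent.children
                           if c is node or punct_count(text_of(c)) > 0]
        if len(parent.children) == 1:           # parent adds no sentence-like text: stop
            break
        node = parent
    return node

def extract(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    summary = find_summary_node(find_snippet_node(builder.root))
    joined = " ".join(text_of(c) for c in summary.children)
    return " ".join(joined.split())             # normalize whitespace
```

Calling `extract(page_html)` on a page string returns the text under the node where backtracking stopped. The punctuation heuristic is deliberately crude: a footer with full sentences would be kept, so a real system would need a stronger noise test than this sketch provides.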

Bibliographic details

Source
12th Asia Pacific Web Conference (APWeb 2010) | 2010 | pp. 314-320 | 7 pages
Conference location: Busan (KR)
Authors
Guo Yan; Tang Huifeng; Song Linhai; Wang Yu; Ding Guodong
Original format: PDF
Language of text: English
CLC classification: computer networks
Keywords
Web content extraction; Web mining; information extraction
