Automatic Elements Extraction of Chinese Web News Using Prior Information of Content and Structure

Abstract

We propose a set of efficient processes for extracting all four elements of Chinese news web pages: the news title, the release date, the news source and the main text. Our approach is based on a deep analysis of the content and structure features of current Chinese news. We take content indicators as the key to recovering the tree structure of the main text. Additionally, we introduce the concept of Length-Distance Ratio (LDR) to help improve performance. Unlike most existing methods, our method rarely depends on the selection of samples and generalizes well regardless of the training process. We tested our approach on 1721 labeled Chinese news pages from 429 web sites. The results show 87% accuracy for news source extraction and over 95% accuracy for the other three elements.


Bibliographic details

Source
2013 2nd IAPR Asian Conference on Pattern Recognition, 2013, pp. 340-345 (6 pages)
Conference location: Naha (JP)
Authors
Chengru Song; Shifeng Weng; Changshui Zhang
Original format: PDF
Language of text: English
Keywords
LDR; TF-IDF; news extraction; term vector model
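
The keywords name TF-IDF and the term vector model, but this record does not say how the paper applies them. As a generic illustration only, and not the authors' method, the sketch below builds TF-IDF term vectors for already-tokenized texts and compares them with cosine similarity. The function names, the toy tokens and the specific weighting (relative term frequency times log(N/df)) are all assumptions made for this example.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (dict: term -> weight) per document.

    `docs` is a list of token lists. The weighting used here, relative
    term frequency times log(N / document frequency), is one common
    variant chosen for illustration; the paper may use another.
    """
    n = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append(
            {t: (c / len(doc)) * math.log(n / df[t]) for t, c in tf.items()}
        )
    return vectors


def cosine(u, v):
    """Cosine similarity of two sparse vectors; 0.0 if either is all zeros."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Hypothetical toy input: three tokenized text fragments.
docs = [
    ["news", "title", "extraction"],
    ["news", "source", "extraction"],
    ["weather", "forecast", "today"],
]
vecs = tfidf_vectors(docs)
print(cosine(vecs[0], vecs[1]))  # the two fragments share terms
print(cosine(vecs[0], vecs[2]))  # these two share none
```

Chinese text would first need word segmentation to produce the token lists; that step is outside this sketch.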


