IAPR Asian Conference on Pattern Recognition

Automatic Elements Extraction of Chinese Web News Using Prior Information of Content and Structure

Abstract

We propose a set of efficient processes for extracting all four elements of Chinese news web pages: the news title, release date, news source, and main text. Our approach is based on a deep analysis of the content and structure features of current Chinese news. We use content indicators as the key to recovering the tree structure of the main text. Additionally, we introduce the concept of Length-Distance Ratio to help improve performance. Unlike most existing methods, our method depends little on the selection of samples and generalizes well regardless of the training process. We tested our approach on 1721 labeled Chinese news pages from 429 web sites. The results show 87% accuracy for news source extraction and over 95% accuracy for the other three elements.
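
The abstract does not define its content indicators or the Length-Distance Ratio, so the following is only a minimal sketch of what indicator-based extraction of two of the four elements (release date and news source) might look like. The regular expressions, the indicator words (来源, 稿源, 出处), and the function names are all assumptions made for illustration; they are not taken from the paper.

```python
import re
from typing import Optional

# Hypothetical content indicators (assumed for illustration, not from the paper).
# Matches dates such as 2013年11月5日, 2013-11-05, or 2013/11/5.
DATE_PATTERN = re.compile(r"(\d{4})[-/年.](\d{1,2})[-/月.](\d{1,2})日?")
# Matches a source label followed by an ASCII or full-width colon and a name.
SOURCE_PATTERN = re.compile(r"(?:来源|稿源|出处)\s*[::]\s*([^\s|]+)")


def extract_release_date(text: str) -> Optional[str]:
    """Return the first date-like match as YYYY-MM-DD, or None.

    Simplification: only the first match is considered, and it is
    rejected if the month or day is out of range.
    """
    match = DATE_PATTERN.search(text)
    if match is None:
        return None
    year, month, day = (int(group) for group in match.groups())
    if not (1 <= month <= 12 and 1 <= day <= 31):
        return None
    return f"{year:04d}-{month:02d}-{day:02d}"


def extract_news_source(text: str) -> Optional[str]:
    """Return the token that follows a source indicator, or None."""
    match = SOURCE_PATTERN.search(text)
    return match.group(1) if match else None


if __name__ == "__main__":
    byline = "2013年11月5日 09:30 来源:新华网 编辑:张三"
    print(extract_release_date(byline))
    print(extract_news_source(byline))
```

A sketch like this covers only flat text patterns. Recovering the tree structure of the main text, as the abstract describes, would additionally require working on the parsed page structure.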