Conference: IEEE 2nd Symposium on Web Society

HisTrace: A system for mining on news-related articles instead of web pages

Abstract

The Web now plays an important part in people's real-life activities. Researchers in computer science, and also in sociology and economics, may be interested in mining information directly related to real-life events, that is, news-related information on the Web. In this paper we propose a system that enables mining on news-related articles instead of raw web pages. The system performs two tasks: 1) mining for news-related articles and 2) duplicate elimination. For the first task, we present a novel approach for determining the titles, contents and publication-times of news-related articles. Anchor texts are first used to extract titles from HTML bodies, and contents are then extracted right after the titles. After that, crawl-times and [...] are used to compute initial publication-times for all articles. Finally, times extracted from HTML bodies, URLs and anchor texts are used to determine precise publication-times for possible articles. For the second task, we describe a duplicate detection algorithm for news-related articles which is based on LCS (longest common subsequence) and achieves both high precision and high recall. The framework of this algorithm was presented as a general-purpose algorithm for web pages in a previously published paper. In this paper we explain why the algorithm is particularly suitable for news-related articles and present the corresponding implementation details. Evaluations show the effectiveness of our approaches.
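
The abstract names LCS (longest common subsequence) as the basis of the duplicate detection step but gives no implementation details. The following is only a minimal sketch of how an LCS-based similarity between two article bodies might be computed; it is not the HisTrace algorithm itself. The function names, the whitespace tokenization, the normalization by the longer article and the `threshold` value of 0.8 are all assumptions made for illustration.

```python
def lcs_length(a, b):
    """Return the length of the longest common subsequence of two sequences."""
    # Standard dynamic programme, keeping only one previous row:
    # O(len(a) * len(b)) time, O(len(b)) space.
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def lcs_similarity(text_a, text_b):
    """Similarity in [0, 1]: LCS length over the longer token count.

    Assumption: articles are compared as whitespace-separated tokens and
    normalized by the longer article; the paper may do this differently.
    """
    a, b = text_a.split(), text_b.split()
    if not a or not b:
        return 0.0
    return lcs_length(a, b) / max(len(a), len(b))


def is_duplicate(text_a, text_b, threshold=0.8):
    """Flag two articles as duplicates; the threshold is a hypothetical value."""
    return lcs_similarity(text_a, text_b) >= threshold
```

For example, `is_duplicate("the quick brown fox jumps over the lazy dog", "the quick brown fox leaps over the lazy dog")` is true under these assumptions, because 8 of the 9 tokens form a common subsequence. Because LCS preserves token order, a measure of this kind tolerates small insertions or edits in a republished article while still penalizing texts that merely share vocabulary.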