Conference: International Conference on Semantics, Knowledge and Grid

Searching for Historical Events on a Large-Scale Web Archive

Abstract

Finding knowledge on the Web has long been an active research topic. Today the Web has become a popular medium for publishing news and opinion articles, which are important carriers of human knowledge, especially social knowledge. Techniques for automatically collecting and analysing these articles on a large scale are therefore desirable. In this paper we propose techniques for searching for events on the Web, and we have tested them on a large-scale web archive. Given an event, or a news topic that many people care about, the goal is to find nearly all news stories related to it. First, we propose a novel domain-independent approach to extracting news stories from web pages; it is based on anchor text and is applicable to most websites. Experiments show that our approach performs well and outperforms another approach we found. Second, we propose a domain-based method of representing events, in which hundreds of keywords represent an event and compose the query expression. This retrieval setting differs from that of most search engines in that the number of keywords is large. We then propose several retrieval algorithms based on BM25 for this method. Evaluation shows that these algorithms perform better than unmodified BM25 in our setting, and the best one is chosen as the algorithm of our system. Finally, we built an experimental system on a collection of 2 billion web pages and report its running performance, which shows the effectiveness of our approaches.
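For orientation, the unmodified BM25 baseline that the abstract compares against can be sketched as follows. This is a minimal illustration of standard Okapi BM25 only; the abstract does not describe the paper's modified variants, so none are shown. The function name `bm25_scores`, the parameter defaults `k1=1.2` and `b=0.75`, the non-negative IDF form, and the toy documents are all assumptions made for the example, not details from the paper.

```python
import math
from collections import Counter


def bm25_scores(query_terms, docs, k1=1.2, b=0.75):
    """Score tokenized documents against a keyword list with standard BM25.

    Hypothetical helper for illustration; k1 and b are commonly used
    defaults, not values reported by the paper.
    """
    n_docs = len(docs)
    if n_docs == 0:
        return []
    # Guard against division by zero when every document is empty.
    avg_len = (sum(len(d) for d in docs) / n_docs) or 1.0

    # Document frequency: number of documents containing each term.
    df = Counter()
    for d in docs:
        df.update(set(d))

    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        # Each distinct query keyword contributes at most once.
        for term in set(query_terms):
            if term not in tf:
                continue
            # Non-negative IDF variant (assumption): log(1 + (N - df + 0.5) / (df + 0.5)).
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            # Term-frequency saturation with document-length normalization.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


# Toy data: an "event" represented by several keywords.
docs = [
    ["earthquake", "rescue", "relief"],
    ["football", "match", "score"],
    ["earthquake", "football"],
]
query = ["earthquake", "rescue", "relief", "aftershock"]
scores = bm25_scores(query, docs)
```

With a query of hundreds of keywords, as in the setting the abstract describes, the sum simply runs over many more terms; how the paper reweights or adjusts those terms is not stated in the abstract.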
