Journal: Information Retrieval

Graph-based term weighting for information retrieval



Abstract

A standard approach to Information Retrieval (IR) is to model text as a bag of words. Alternatively, text can be modelled as a graph whose vertices represent words and whose edges represent relations between the words, defined on the basis of any meaningful statistical or linguistic relation. Given such a text graph, graph-theoretic computations can be applied to measure various properties of the graph, and hence of the text. This work explores the usefulness of such graph-based text representations for IR. Specifically, we propose a principled graph-theoretic approach to (1) computing term weights and (2) integrating discourse aspects into retrieval. Given a text graph whose vertices denote terms linked by co-occurrence and grammatical modification, we use graph ranking computations (e.g. PageRank; Page et al., The PageRank citation ranking: Bringing order to the Web, Technical report, Stanford Digital Library Technologies Project, 1998) to derive a weight for each vertex, i.e. a term weight, which we use to rank documents against queries. We reason that our graph-based term weights do not necessarily need to be normalised by document length (unlike existing term weights) because they are already scaled by their graph-ranking computation. This is a departure from existing IR ranking functions, and we experimentally show that it performs comparably to a tuned ranking baseline such as BM25 (Robertson et al., NIST Special Publication 500-236: TREC-4, 1995). In addition, we integrate into ranking graph properties such as the average path length or the clustering coefficient, which represent different aspects of the topology of the graph, and by extension of the document represented as a graph. Integrating such properties into ranking allows us to consider issues such as discourse coherence, flow and density during retrieval. We experimentally show that this type of ranking performs comparably to BM25, and can even outperform it, across different TREC (Voorhees and Harman, TREC: Experiment and evaluation in information retrieval, MIT Press, 2005) datasets and evaluation measures.
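
The pipeline described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The window size, damping factor, iteration count, all function names and the plain-sum scoring rule are assumptions made for illustration; the abstract does not specify them. The sketch links terms by co-occurrence only (grammatical modification is omitted), derives term weights with a basic PageRank iteration, and computes the two graph properties named above.

```python
from collections import defaultdict, deque
from itertools import combinations


def build_cooccurrence_graph(tokens, window=2):
    """Undirected text graph: each distinct term is a vertex, and an edge
    links two terms that occur within `window` tokens of each other.
    (The window size is an assumed parameter.)"""
    graph = defaultdict(set)
    for i, term in enumerate(tokens):
        graph[term]  # ensure every term gets a vertex, even if isolated
        for other in tokens[i + 1 : i + 1 + window]:
            if other != term:
                graph[term].add(other)
                graph[other].add(term)
    return dict(graph)


def pagerank(graph, damping=0.85, iterations=50):
    """Basic PageRank by power iteration; returns one weight per vertex."""
    n = len(graph)
    if n == 0:
        return {}
    rank = {v: 1.0 / n for v in graph}
    for _ in range(iterations):
        # Rank held by vertices without neighbours is spread uniformly.
        dangling = sum(rank[v] for v in graph if not graph[v])
        new_rank = {}
        for v in graph:
            incoming = sum(rank[u] / len(graph[u]) for u in graph[v])
            new_rank[v] = (1 - damping) / n + damping * (incoming + dangling / n)
        rank = new_rank
    return rank


def score_document(query_terms, term_weights):
    """Hypothetical scoring rule: sum the graph-based weights of the query
    terms, with no document-length normalisation."""
    return sum(term_weights.get(t, 0.0) for t in query_terms)


def average_clustering(graph):
    """Mean local clustering coefficient over all vertices."""
    if not graph:
        return 0.0
    total = 0.0
    for v, nbrs in graph.items():
        k = len(nbrs)
        if k < 2:
            continue  # vertices with fewer than two neighbours contribute 0
        links = sum(1 for a, b in combinations(nbrs, 2) if b in graph[a])
        total += 2.0 * links / (k * (k - 1))
    return total / len(graph)


def average_path_length(graph):
    """Mean shortest-path length over all reachable ordered vertex pairs."""
    total, pairs = 0, 0
    for source in graph:
        dist = {source: 0}
        queue = deque([source])
        while queue:  # breadth-first search from `source`
            v = queue.popleft()
            for u in graph[v]:
                if u not in dist:
                    dist[u] = dist[v] + 1
                    queue.append(u)
        total += sum(dist.values())
        pairs += len(dist) - 1
    return total / pairs if pairs else 0.0


if __name__ == "__main__":
    text = "graph based term weighting for information retrieval with graph ranking"
    g = build_cooccurrence_graph(text.split())
    weights = pagerank(g)
    print(score_document(["graph", "ranking"], weights))
    print(average_clustering(g), average_path_length(g))
```

Because the PageRank weights of a document's terms sum to one, each weight is already scaled relative to the rest of the graph. This is one way to read the abstract's argument that no separate length normalisation is needed.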
