A theory of indexing is presented, based on viewing a document as constituted of components. A component may be chosen as any text unit that can be: (a) judged as to its relevancy property; and (b) considered as independent within the document. By looking at the constituent components of a document in relation to the universe of all components in the collection, we have been able to apply Bayes' decision theory to derive the index term representation for the document, and to attach an initial probabilistic weight to each term based on a Principle of Document Self-Recovery. It turns out that different choices of document component, such as a word or a whole abstract, lead to different term weighting schemes that have been introduced before on probabilistic grounds; specifically, Edmundson and Wyllys' term significance formula, and Sparck Jones' inverse document frequency together with its later modification by Croft and Harper into the 'combination match' formula. Thus, a unified interpretation of various probabilistic term weighting schemes appears possible.
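For orientation, two of the weighting schemes named above can be sketched in their commonly quoted forms: inverse document frequency as log(N/n), and the combination match weight as C + log((N - n)/n), where N is the collection size, n the number of documents containing the term, and C a constant. This is a minimal illustration and not the derivation given in the paper; the function names, the toy collection, the natural logarithm, and the default value of C are all assumptions made for the example.

```python
import math
from collections import Counter


def document_frequencies(docs):
    """Count, for each term, how many documents contain it at least once."""
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))  # set() so repeats within a document count once
    return df


def idf(num_docs, doc_freq):
    """Inverse document frequency in the common log(N / n) form."""
    return math.log(num_docs / doc_freq)


def combination_match(num_docs, doc_freq, c=1.0):
    """Combination match weight in the form C + log((N - n) / n).

    The constant c is an assumed illustrative value, not one taken from
    the paper. Requires doc_freq < num_docs so the logarithm is defined.
    """
    return c + math.log((num_docs - doc_freq) / doc_freq)


if __name__ == "__main__":
    # Hypothetical toy collection: each document is a list of tokens.
    docs = [
        ["probabilistic", "indexing", "theory"],
        ["term", "weighting", "theory"],
        ["inverse", "document", "frequency", "weighting"],
        ["bayes", "decision", "theory"],
    ]
    df = document_frequencies(docs)
    n_docs = len(docs)
    for term in ("theory", "weighting", "bayes"):
        print(term, df[term],
              round(idf(n_docs, df[term]), 3),
              round(combination_match(n_docs, df[term]), 3))
```

Both weights fall as a term appears in more documents; for rare terms (n much smaller than N) the combination match weight differs from inverse document frequency by roughly the constant C.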