Journal: Information Retrieval

An Ontology-Based Binary-Categorization Approach for Recognizing Multiple-Record Web Documents Using a Probabilistic Retrieval Model



Abstract

The Web contains a tremendous amount of information. It is challenging to determine which Web documents are relevant to a user query, and even more challenging to rank them according to their degrees of relevance. In this paper, we propose a probabilistic retrieval model that uses logistic regression to recognize multiple-record Web documents against an application ontology, which is a simple conceptual-modeling approach. We notice that many Web documents contain a sequence of chunks of textual information, each of which constitutes a "record." Documents of this type are referred to as multiple-record documents. In our categorization approach, a document is represented by a set of term frequencies of index terms, a density heuristic value, and a grouping heuristic value. We first apply logistic regression analysis to the relevance probabilities using the (i) index terms, (ii) density value, and (iii) grouping value of each training document. Thereafter, the relevance probability of each test document is interpolated from the fitted curves. In contrast to other probabilistic retrieval models, our model makes only a weak independence assumption and is capable of handling any important dependency relationships among index terms. In addition, we use logistic regression rather than linear regression analysis because the relevance probabilities of the training documents are discrete. Using one test set of car-ad Web documents and another of obituary Web documents, our probabilistic model achieves an average recall ratio of 100%, a precision ratio of 83.3%, and an accuracy ratio of 92.5%.
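The categorization idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the feature layout (two index-term frequencies, then a density value and a grouping value), the toy training data, the learning rate, and the function names `train_logistic`, `relevance_probability`, and `evaluate` are all assumptions made for illustration. The sketch fits a logistic regression by plain gradient descent, scores a document, and computes recall, precision, and accuracy from binary decisions.

```python
import math

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def train_logistic(X, y, lr=0.5, epochs=2000):
    """Fit weights (bias stored last) by batch gradient descent on log-loss."""
    n_features = len(X[0])
    w = [0.0] * (n_features + 1)
    for _ in range(epochs):
        grad = [0.0] * (n_features + 1)
        for xi, yi in zip(X, y):
            p = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + w[-1])
            err = p - yi
            for j, xj in enumerate(xi):
                grad[j] += err * xj
            grad[-1] += err
        w = [wj - lr * gj / len(X) for wj, gj in zip(w, grad)]
    return w

def relevance_probability(w, x):
    """Estimated probability that a document with features x is relevant."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + w[-1])

def evaluate(y_true, y_pred):
    """Return (recall, precision, accuracy) for binary labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    accuracy = (tp + tn) / len(y_true)
    return recall, precision, accuracy

# Hypothetical training documents; each row is
# [term frequency 1, term frequency 2, density value, grouping value],
# all scaled to [0, 1]. Label 1 = relevant multiple-record document.
X_train = [
    [0.8, 0.6, 0.7, 0.9],
    [0.9, 0.7, 0.8, 0.8],
    [0.7, 0.9, 0.6, 0.7],
    [0.1, 0.2, 0.1, 0.2],
    [0.2, 0.1, 0.2, 0.1],
    [0.3, 0.1, 0.1, 0.3],
]
y_train = [1, 1, 1, 0, 0, 0]

w = train_logistic(X_train, y_train)

# Score an unseen (hypothetical) document and threshold at 0.5.
p_new = relevance_probability(w, [0.85, 0.8, 0.75, 0.85])
is_relevant = p_new >= 0.5
```

A threshold of 0.5 is only one possible decision rule; the abstract does not state how the probability is turned into a binary decision.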
