Journal: Information Processing & Management

PREFCA: A portal retrieval engine based on formal concept analysis


Abstract

The web is a network of linked sites in which each site forms either a physical portal or a standalone page. In the former case, the portal provides an access point to its embedded web pages, which coherently present a specific topic. In the latter case, millions of standalone web pages scattered throughout the web share the same topic and could be conceptually linked together to form virtual portals. Search engines have been developed to help users reach appropriate pages efficiently and effectively. All known current search engine techniques rely on the web page as the basic atomic search unit. They ignore the conceptual links among the retrieved pages, which reveal implicit web-related meanings. However, a semantic model built for a whole portal may contain more semantic information than a model of scattered individual pages. In addition, user queries can be poor and contain imprecise terms that do not reflect the user's real intention. Consequently, retrieving the standalone individual pages that are directly related to the query may not satisfy the user's need. In this paper, we propose PREFCA, a Portal Retrieval Engine based on Formal Concept Analysis that relies on the portal as the main search unit. PREFCA consists of three phases. First, the information extraction phase extracts the portal's semantic data. Second, the formal concept analysis phase uses formal concept analysis to discover the conceptual links among portals and attributes. Finally, in the information retrieval phase, we propose a portal ranking method to retrieve ranked pairs of portals and embedded pages. Additionally, we apply network analysis rules to output some portal characteristics. We evaluated PREFCA using two data sets, namely the Forum for Information Retrieval Evaluation 2010 and ClueWeb09 (category B) test data, for physical and virtual portals respectively. PREFCA achieves higher F-measure accuracy and better Mean Average Precision ranking than other search engine approaches, namely Term Frequency Inverse Document Frequency (TF-IDF), Latent Semantic Analysis (LSA), and BM25 techniques, with comparable network analysis and efficiency results. It also attains high Mean Average Precision in comparison with learning-to-rank techniques. Moreover, PREFCA achieves better reach time than Carrot, a well-known topic-based search engine.
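
The formal concept analysis phase mentioned in the abstract can be illustrated with a small, generic sketch. This is not PREFCA's implementation: the toy context, the portal names, and the helper functions (`common_attributes`, `portals_having`, `formal_concepts`, `concept_for_query`) are all hypothetical and chosen only for illustration. The sketch assumes a binary portal-by-attribute context and enumerates formal concepts by brute force, which is feasible only for tiny inputs.

```python
from itertools import combinations

# Hypothetical toy context (portal -> attribute set); not data from the paper.
CONTEXT = {
    "portal_a": {"football", "league", "scores"},
    "portal_b": {"football", "scores"},
    "portal_c": {"recipes", "cooking"},
}


def common_attributes(portals, context):
    """Attributes shared by every portal in `portals` (derivation on objects)."""
    portals = list(portals)
    if not portals:
        # By convention, the empty set of portals derives to all attributes.
        return set().union(*context.values())
    return set.intersection(*(context[p] for p in portals))


def portals_having(attributes, context):
    """Portals that carry every attribute in `attributes` (derivation on attributes)."""
    return {p for p, attrs in context.items() if attributes <= attrs}


def formal_concepts(context):
    """Enumerate all (extent, intent) pairs by closing every subset of portals.

    Brute force and exponential in the number of portals; toy inputs only.
    """
    concepts = set()
    names = sorted(context)
    for size in range(len(names) + 1):
        for subset in combinations(names, size):
            intent = common_attributes(subset, context)
            extent = portals_having(intent, context)
            concepts.add((frozenset(extent), frozenset(intent)))
    return concepts


def concept_for_query(query_terms, context):
    """Most specific concept whose intent contains all query terms.

    Illustrates how a short query can be expanded with attributes shared by
    the matching portals. This is NOT the ranking method proposed in the paper.
    """
    extent = portals_having(set(query_terms), context)
    intent = common_attributes(extent, context)
    return extent, intent


if __name__ == "__main__":
    for extent, intent in sorted(formal_concepts(CONTEXT), key=lambda c: len(c[0])):
        print(sorted(extent), sorted(intent))
    print(concept_for_query({"football"}, CONTEXT))
```

In this toy context, the one-term query `{"football"}` maps to the concept whose extent is `portal_a` and `portal_b` and whose intent also contains `scores`, which hints at how conceptual links might compensate for a poor query. A practical system would replace the brute-force enumeration with a dedicated lattice-construction algorithm.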
