Applied Soft Computing

An improved focused crawler based on Semantic Similarity Vector Space Model


Abstract

A focused crawler is topic-specific and aims to selectively collect web pages from the Internet that are relevant to a given topic. In many studies, the Vector Space Model (VSM) and the Semantic Similarity Retrieval Model (SSRM) take advantage of cosine similarity and semantic similarity to compute similarities between web pages and the given topic. However, if a web page and the given topic share no common terms, the VSM will not obtain the proper topical similarity of the web page. In addition, if all of the terms between them are synonyms, the SSRM will also not obtain the proper topical similarity. To address these problems, this paper proposes an improved retrieval model, the Semantic Similarity Vector Space Model (SSVSM). The model integrates the TF*IDF values of the terms and the semantic similarities among the terms to construct topic and document semantic vectors that are mapped to the same double-term set, and computes the cosine similarities between these semantic vectors as the topic-relevant similarities of documents, including the full texts and anchor texts of unvisited hyperlinks. Next, the proposed model predicts the priorities of the unvisited hyperlinks by integrating the full-text and anchor-text topic-relevant similarities. The experimental results demonstrate that this approach improves the performance of focused crawlers and outperforms other focused crawlers based on Breadth-First, VSM and SSRM. In conclusion, this method is significant and effective for focused crawlers. (C) 2015 Elsevier B.V. All rights reserved.
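
The abstract gives no formulas, so the following Python sketch is only one plausible reading of the idea, not the paper's actual definition. It assumes that the shared term set is the union of topic and document terms, that each semantic-vector component is a TF*IDF-weighted sum of term similarities, and that link priority is a linear blend with a hypothetical weight `alpha`. The `SEM_SIM` table, the function names, and all numeric weights are illustrative assumptions.

```python
import math

# Hypothetical term-to-term semantic similarities in [0, 1]; a real system
# might derive these from a lexical resource instead of a hand-written table.
SEM_SIM = {
    frozenset(("car", "automobile")): 1.0,
    frozenset(("engine", "motor")): 0.9,
}

def sem_sim(a, b):
    """Semantic similarity of two terms; identical terms score 1.0."""
    if a == b:
        return 1.0
    return SEM_SIM.get(frozenset((a, b)), 0.0)

def cosine(u, v, terms):
    """Cosine similarity of two sparse vectors over a shared term list."""
    dot = sum(u.get(t, 0.0) * v.get(t, 0.0) for t in terms)
    norm_u = math.sqrt(sum(u.get(t, 0.0) ** 2 for t in terms))
    norm_v = math.sqrt(sum(v.get(t, 0.0) ** 2 for t in terms))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def semantic_vector(weights, terms):
    """Assumed mapping: component t = sum of TF*IDF weight * sim(t, own term)."""
    return {t: sum(w * sem_sim(t, s) for s, w in weights.items()) for t in terms}

def vsm_similarity(topic, doc):
    """Plain VSM: cosine over raw TF*IDF weights."""
    return cosine(topic, doc, set(topic) | set(doc))

def semantic_similarity(topic, doc):
    """Cosine over semantic vectors mapped to the same (union) term set."""
    terms = sorted(set(topic) | set(doc))
    return cosine(semantic_vector(topic, terms), semantic_vector(doc, terms), terms)

def link_priority(full_text_sim, anchor_text_sim, alpha=0.5):
    """Assumed linear blend of full-text and anchor-text similarities."""
    return alpha * full_text_sim + (1 - alpha) * anchor_text_sim

# Illustrative TF*IDF weights: the topic and the page share no literal terms.
topic = {"car": 0.8, "engine": 0.6}
page = {"automobile": 0.7, "motor": 0.5}

print(vsm_similarity(topic, page))       # no common terms, so plain VSM gives 0.0
print(semantic_similarity(topic, page))  # positive, because synonyms are matched
```

With these made-up weights, the plain cosine is zero because the two term sets are disjoint, which is the VSM failure case the abstract describes, while the semantic-vector cosine stays high because the synonym pairs contribute to every component.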
