Journal: Computer Science & Information Technology

Topic Modeling: Clustering of Deep Webpages


Abstract

The internet holds a massive amount of information spread across a vast number of webpages. This information can be divided into the surface web and the deep web. Existing search engines make effective use of surface web information, but the deep web remains unexploited. Machine learning techniques have commonly been employed to access deep web content. Within machine learning, topic models provide a simple way to analyze large volumes of unlabeled text. A "topic" consists of a cluster of words that frequently occur together. Using contextual clues, topic models can connect words with similar meanings and distinguish between different uses of words with multiple meanings. Clustering is one of the key approaches to organizing deep web databases. In this paper, we cluster deep web databases based on the relevance found among deep web forms, employing a generative probabilistic model called Latent Dirichlet Allocation (LDA) to model content representative of deep web databases. This is done after preprocessing the set of webpages to extract page contents and form contents. Further, we estimate the "topics per document" and "words per topic" distributions using Gibbs sampling. Experimental results show that the proposed method clearly outperforms the existing clustering methods.
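To make the "topics per document" and "words per topic" distributions concrete, the following is a minimal sketch of collapsed Gibbs sampling for LDA in plain Python. It is an illustration, not the authors' implementation: the function name `lda_gibbs`, the hyperparameter values `alpha` and `beta`, the toy vocabulary, and the final step of assigning each document to its most probable topic are all assumptions made for this example.

```python
import random


def lda_gibbs(docs, num_topics, vocab_size, alpha=0.1, beta=0.01,
              iters=200, seed=0):
    """Collapsed Gibbs sampling for LDA (illustrative sketch).

    docs: list of documents, each a list of integer word ids
          in the range [0, vocab_size).
    Returns (theta, phi):
      theta[d][k] = estimated probability of topic k in document d
      phi[k][w]   = estimated probability of word w in topic k
    """
    rng = random.Random(seed)
    n_dk = [[0] * num_topics for _ in docs]               # topic counts per document
    n_kw = [[0] * vocab_size for _ in range(num_topics)]  # word counts per topic
    n_k = [0] * num_topics                                # total tokens per topic
    z = []                                                # topic of every token

    # Start from a random topic assignment for every token.
    for d, doc in enumerate(docs):
        z_d = []
        for w in doc:
            k = rng.randrange(num_topics)
            z_d.append(k)
            n_dk[d][k] += 1
            n_kw[k][w] += 1
            n_k[k] += 1
        z.append(z_d)

    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the token's current assignment from the counts.
                k = z[d][i]
                n_dk[d][k] -= 1
                n_kw[k][w] -= 1
                n_k[k] -= 1
                # Unnormalised conditional probability of each topic.
                weights = [
                    (n_dk[d][t] + alpha)
                    * (n_kw[t][w] + beta) / (n_k[t] + vocab_size * beta)
                    for t in range(num_topics)
                ]
                # Resample the topic and restore the counts.
                k = rng.choices(range(num_topics), weights=weights)[0]
                z[d][i] = k
                n_dk[d][k] += 1
                n_kw[k][w] += 1
                n_k[k] += 1

    theta = [
        [(n_dk[d][t] + alpha) / (len(doc) + num_topics * alpha)
         for t in range(num_topics)]
        for d, doc in enumerate(docs)
    ]
    phi = [
        [(n_kw[t][w] + beta) / (n_k[t] + vocab_size * beta)
         for w in range(vocab_size)]
        for t in range(num_topics)
    ]
    return theta, phi


# Hypothetical toy corpus: terms as they might appear in two kinds of web forms.
vocab = ["flight", "hotel", "airport", "book", "author", "isbn"]
word_id = {w: i for i, w in enumerate(vocab)}
raw_docs = [
    ["flight", "airport", "hotel", "flight"],
    ["hotel", "flight", "airport"],
    ["book", "author", "isbn", "book"],
    ["author", "isbn", "book"],
]
docs = [[word_id[w] for w in doc] for doc in raw_docs]

theta, phi = lda_gibbs(docs, num_topics=2, vocab_size=len(vocab))

# Assumed clustering rule: each document joins the cluster of its dominant topic.
clusters = [max(range(2), key=lambda t: row[t]) for row in theta]
```

Here `theta` plays the role of the "topics per document" distribution and `phi` the "words per topic" distribution. Grouping documents by their dominant topic is only one simple way to turn these distributions into clusters; the abstract does not say which rule the paper uses.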
