The Internet comprises a massive amount of information in the form of billions of web pages. This information can be categorized into the surface web and the deep web. Existing search engines can effectively exploit surface-web information, but the deep web remains largely untapped. Machine learning techniques have commonly been employed to access deep-web content. Within machine learning, topic models provide a simple way to analyze large volumes of unlabeled text. A "topic" consists of a cluster of words that frequently occur together. Using contextual clues, topic models can connect words with similar meanings and distinguish between words with multiple meanings. Clustering is one of the key approaches to organizing deep-web databases. In this paper, we cluster deep-web databases based on the relevance found among deep-web forms by employing a generative probabilistic model, Latent Dirichlet Allocation (LDA), to model content representative of deep-web databases. This is done after preprocessing the set of web pages to extract page contents and form contents. We then derive the "topics per document" and "words per topic" distributions using Gibbs sampling. Experimental results show that the proposed method clearly outperforms existing clustering methods.
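The clustering idea described above can be sketched in a few lines: fit an LDA model to the text extracted from each deep-web form, then assign each form to its dominant topic as a cluster label. This is a minimal illustration, not the paper's implementation: the form texts below are hypothetical stand-ins for extracted page/form contents, and scikit-learn's `LatentDirichletAllocation` uses online variational Bayes rather than the Gibbs sampling the paper employs.

```python
# Minimal sketch of LDA-based clustering of deep-web form texts.
# Note: sklearn's LDA uses variational inference, not Gibbs sampling,
# and the texts below are hypothetical examples, not real extracted forms.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Toy stand-ins for preprocessed page/form contents (two apparent domains).
form_texts = [
    "flight departure arrival airline ticket fare",
    "airline flight ticket booking departure city",
    "book title author isbn publisher edition",
    "author title publisher book library catalog",
]

counts = CountVectorizer().fit_transform(form_texts)          # term counts
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(counts)   # "topics per document" distribution
clusters = doc_topics.argmax(axis=1)     # dominant topic used as cluster label

print(clusters)
```

Each row of `doc_topics` is a probability distribution over the two topics, and forms sharing a dominant topic land in the same cluster; the "words per topic" distribution is available via `lda.components_`.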