An optimized approach for massive web page classification using entity similarity based on semantic network



Abstract

With the development of mobile technology, users' browsing habits are gradually shifting from pure information retrieval to active recommendation. As the volume and variety of web pages grow, mapping users' interests to web content through classification has become increasingly difficult. Some large news portals and social media companies hire more editors to label new concepts and words, and use computing servers with larger memory to handle massive document classification based on traditional supervised or semi-supervised machine learning methods. This paper presents an optimized algorithm for massive web page classification using semantic networks such as Wikipedia and WordNet. We use the Wikipedia data set and initialize a few category entity words as class words. A weight estimation algorithm based on the depth and breadth of the Wikipedia network calculates the class weight of every Wikipedia entity word. A kinship-relation association based on the content similarity of entities is proposed to mitigate the imbalance that arises when a category node inherits probability from multiple parent nodes. Keywords are extracted from the title and main text of a web page by matching N-grams against Wikipedia entity words, and a Bayesian classifier estimates the page's class probability. Experimental results show that the proposed method achieves good scalability, robustness, and reliability on massive web pages.
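The pipeline outlined in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not give the weight formulas, so the entity weights, class names, and the helpers `inherit_weights`, `extract_entities`, and `classify` are hypothetical. The sketch assumes each entity word already carries a class weight, blends weights from multiple parents in proportion to content similarity, matches N-grams from the title and main text against the entity vocabulary, and combines the matches in a naive-Bayes-style product under a uniform class prior.

```python
import math
import re
from collections import Counter

CLASSES = ("technology", "sports")

# Hypothetical toy class weights per entity word (not taken from the paper).
ENTITY_CLASS_WEIGHTS = {
    "neural network": {"technology": 0.90, "sports": 0.10},
    "football": {"technology": 0.05, "sports": 0.95},
    "world cup": {"technology": 0.10, "sports": 0.90},
}


def inherit_weights(parent_weights, similarities):
    """Blend class weights from several parents, each scaled by its
    content similarity to the child (an assumed stand-in for the
    kinship-relation association)."""
    total = sum(similarities)
    return {
        c: sum(s * w[c] for s, w in zip(similarities, parent_weights)) / total
        for c in CLASSES
    }


def extract_entities(text, max_n=2):
    """Count N-grams (n = 1..max_n) that appear in the entity vocabulary."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    found = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            gram = " ".join(tokens[i:i + n])
            if gram in ENTITY_CLASS_WEIGHTS:
                found[gram] += 1
    return found


def classify(title, body):
    """Return normalized class probabilities for a page.

    With a uniform class prior, multiplying per-entity class weights is
    proportional to a naive Bayes likelihood product."""
    counts = extract_entities(title) + extract_entities(body)
    log_scores = {c: 0.0 for c in CLASSES}
    for entity, k in counts.items():
        for c in CLASSES:
            log_scores[c] += k * math.log(ENTITY_CLASS_WEIGHTS[entity][c])
    # Normalize in log space to avoid underflow on long pages.
    top = max(log_scores.values())
    exp_scores = {c: math.exp(s - top) for c, s in log_scores.items()}
    z = sum(exp_scores.values())
    return {c: v / z for c, v in exp_scores.items()}


if __name__ == "__main__":
    print(classify("World Cup final", "The football match drew a huge crowd."))
```

Working in log space keeps the product of many small weights numerically stable, which matters once a page matches hundreds of entity words.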

Bibliographic details

  • Source
    Future generation computer systems | 2017, Issue 11 | pp. 510-518 | 9 pages
  • Author affiliations

    Key Lab of Big Data Security and Intelligent Processing Institute of Computer Technology, School of Computer Science & Technology, School of Software Nanjing University of Posts and Telecommunications, Nanjing 210023, China;

    The Third Research Institute of the Ministry of Public Security, Shanghai, 201204, China;

    Key Lab of Big Data Security and Intelligent Processing Institute of Computer Technology, School of Computer Science & Technology, School of Software Nanjing University of Posts and Telecommunications, Nanjing 210023, China;

    Key Lab of Big Data Security and Intelligent Processing Institute of Computer Technology, School of Computer Science & Technology, School of Software Nanjing University of Posts and Telecommunications, Nanjing 210023, China;

    Department of Information Systems and Cyber Security, The University of Texas at San Antonio, San Antonio, TX 78249-0631, USA;

  • Indexed in: Science Citation Index (SCI), USA; Engineering Index (EI), USA
  • Original format: PDF
  • Text language: English
  • Chinese Library Classification
  • Keywords

    Web page classification; Semantic network; Kinship-relation association; Entity class probability; Hereditary weight;

