Concept vector extraction from Wikipedia category network

Abstract

The value of machine-readable taxonomies has been demonstrated by applications such as document classification and information retrieval. One of the main topics in automated taxonomy extraction research is Web mining based on statistical NLP (Natural Language Processing), and a significant number of studies have been conducted. However, existing work on automatic dictionary building suffers from accuracy problems, caused by the technical limitations of statistical NLP and by noisy data on the Web. To address these problems, this work focuses on mining Wikipedia, a large-scale Web encyclopedia. Wikipedia offers a large body of high-quality articles and a category system, because users around the world edit and refine both every day. Using Wikipedia avoids the loss of accuracy that NLP introduces. However, affiliation relations cannot be extracted automatically by simply descending the category system, since the Wikipedia category system is not a tree structure but a network structure. We propose concept vectorization methods that are applicable to the category network structure of Wikipedia.
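
The abstract does not spell out the proposed vectorization methods, so the following is only an illustrative sketch of the underlying difficulty, not the authors' algorithm. It assumes a hypothetical toy category network (`PARENTS`, invented for illustration, not real Wikipedia data) in which a category can have several parents, and it builds a simple vector for a concept by weighting each reachable category by `decay ** hops`. The function name `concept_vector` and the parameters `decay` and `max_hops` are assumptions of this sketch.

```python
from collections import deque

# Hypothetical toy category network: node -> list of parent categories.
# A node may have several parents, so this is a network, not a tree.
PARENTS = {
    "Python (language)": ["Programming languages", "Free software"],
    "Programming languages": ["Computer science", "Formal languages"],
    "Free software": ["Software"],
    "Software": ["Computer science"],
    "Formal languages": ["Computer science", "Linguistics"],
    "Computer science": ["Science"],
    "Linguistics": ["Science"],
    "Science": [],
}


def concept_vector(start, parents, decay=0.5, max_hops=3):
    """Map each category reachable from `start` to decay ** (shortest hop count).

    Breadth-first search guarantees the first visit to a category uses the
    fewest hops; the `seen` set keeps multiple paths and cycles from
    re-visiting a category.
    """
    vector = {}
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, hops = queue.popleft()
        if hops == max_hops:
            continue  # do not expand beyond the hop limit
        for parent in parents.get(node, []):
            if parent in seen:
                continue  # already reached by an equal or shorter path
            seen.add(parent)
            vector[parent] = decay ** (hops + 1)
            queue.append((parent, hops + 1))
    return vector


if __name__ == "__main__":
    for category, weight in concept_vector("Python (language)", PARENTS).items():
        print(f"{category}: {weight}")
```

In a tree, each category would be reached along exactly one path; here "Computer science" is reachable along three different paths, which is why the sketch needs a visited set and a rule (shortest hop count) for choosing the weight.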
