IEEE International Conference on Big Data

Learning to Differentiate Between Main-articles and Sub-articles in Wikipedia

Abstract

Current Wikipedia editing practice typically summarizes a named entity with one main-article supplemented by multiple sub-articles that describe various aspects and subtopics of the entity. This separation of articles aims to improve the curation of content-rich Wikipedia entities. However, a wide range of Wikipedia-based technologies critically rely on the article-as-concept assumption, which requires a one-to-one mapping between entities (or concepts) and the articles that describe them. The current editing approaches therefore introduce confusion and ambiguity into knowledge representation and cause problems for a wide range of downstream technologies. In this paper, we present an approach that resolves these problems by differentiating the main-article from the sub-articles, which are not at the core of an entity's representation. We propose a hybrid neural article model that learns from two facets of a Wikipedia article: (i) two neural document encoders capture latent semantic features from the article title and text content, and (ii) a set of explicit features measures and characterizes the symbolic and structural aspects of each article. In this study, we use crowdsourcing to create a large annotated dataset for feature extraction and for evaluating a variety of encoding techniques and learning structures. The resulting optimized model identifies main-articles with near-perfect precision and recall, and outperforms various baselines on the contributed dataset.
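
To make the described architecture concrete, the following is a minimal PyTorch sketch of such a hybrid model, not the authors' implementation: two recurrent encoders produce latent representations of the title and body text, which are concatenated with a vector of explicit structural features and fed to a binary classifier. The GRU encoders, all dimensions, and the HybridArticleClassifier name are illustrative assumptions; the paper itself evaluates a variety of encoding techniques and learning structures.

import torch
import torch.nn as nn

class HybridArticleClassifier(nn.Module):
    """Hypothetical hybrid model: two neural document encoders (title,
    body) plus explicit structural features, fused into a binary
    main-article vs. sub-article decision."""

    def __init__(self, vocab_size, embed_dim=100, hidden_dim=64, n_explicit=10):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        # One latent encoder per textual facet of the article.
        self.title_enc = nn.GRU(embed_dim, hidden_dim, batch_first=True)
        self.body_enc = nn.GRU(embed_dim, hidden_dim, batch_first=True)
        # Fuse the two latent vectors with the explicit feature vector.
        self.classifier = nn.Sequential(
            nn.Linear(2 * hidden_dim + n_explicit, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, title_ids, body_ids, explicit_feats):
        _, h_title = self.title_enc(self.embed(title_ids))  # h: (1, B, H)
        _, h_body = self.body_enc(self.embed(body_ids))
        fused = torch.cat([h_title[-1], h_body[-1], explicit_feats], dim=-1)
        return self.classifier(fused).squeeze(-1)  # logit: main-article score

# Usage on dummy data; training would minimize BCEWithLogitsLoss
# against crowdsourced main-/sub-article labels.
model = HybridArticleClassifier(vocab_size=30000)
logits = model(
    torch.randint(1, 30000, (4, 12)),   # title token ids
    torch.randint(1, 30000, (4, 200)),  # body token ids
    torch.rand(4, 10),                  # explicit structural features
)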
