IEEE International Conference on Big Data

Learning to Differentiate Between Main-articles and Sub-articles in Wikipedia

Abstract

Current Wikipedia editing practice typically summarizes a named entity with one main-article supplemented by multiple sub-articles that describe various aspects and subtopics of the entity. This separation of articles aims to improve the curation of content-rich Wikipedia entities. However, a wide range of Wikipedia-based technologies critically rely on the article-as-concept assumption, which requires a one-to-one mapping between entities (or concepts) and the articles that describe them. The current editing approaches therefore introduce confusion and ambiguity into knowledge representation and cause problems for a wide range of downstream technologies. In this paper, we present an approach that resolves these problems by differentiating the main-article from the sub-articles, which are not at the core of an entity's representation. We propose a hybrid neural article model that learns from two facets of a Wikipedia article: (i) two neural document encoders capture latent semantic features from the article title and text content, and (ii) a set of explicit features measures and characterizes the symbolic and structural aspects of each article. In this study, we use crowdsourcing to create a large annotated dataset for feature extraction and for evaluating a variety of encoding techniques and learning structures. The resulting optimized model identifies main-articles with near-perfect precision and recall, and outperforms various baselines on the contributed dataset.
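
To make the described architecture concrete, the following is a minimal PyTorch sketch of such a hybrid model, not the authors' implementation: two recurrent encoders produce latent representations of the title and body text, which are concatenated with a vector of explicit structural features and fed to a binary classifier. The GRU encoders, all dimensions, and the HybridArticleClassifier name are illustrative assumptions; the paper itself evaluates a variety of encoding techniques and learning structures.

import torch
import torch.nn as nn

class HybridArticleClassifier(nn.Module):
    """Hypothetical hybrid model: two neural document encoders (title,
    body) plus explicit structural features, fused into a binary
    main-article vs. sub-article decision."""

    def __init__(self, vocab_size, embed_dim=100, hidden_dim=64, n_explicit=10):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        # One latent encoder per textual facet of the article.
        self.title_enc = nn.GRU(embed_dim, hidden_dim, batch_first=True)
        self.body_enc = nn.GRU(embed_dim, hidden_dim, batch_first=True)
        # Fuse the two latent vectors with the explicit feature vector.
        self.classifier = nn.Sequential(
            nn.Linear(2 * hidden_dim + n_explicit, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, title_ids, body_ids, explicit_feats):
        _, h_title = self.title_enc(self.embed(title_ids))  # h: (1, B, H)
        _, h_body = self.body_enc(self.embed(body_ids))
        fused = torch.cat([h_title[-1], h_body[-1], explicit_feats], dim=-1)
        return self.classifier(fused).squeeze(-1)  # logit: main-article score

# Usage on dummy data; training would minimize BCEWithLogitsLoss
# against crowdsourced main-/sub-article labels.
model = HybridArticleClassifier(vocab_size=30000)
logits = model(
    torch.randint(1, 30000, (4, 12)),   # title token ids
    torch.randint(1, 30000, (4, 200)),  # body token ids
    torch.rand(4, 10),                  # explicit structural features
)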
