Conference: ACM KDD International Conference on Knowledge Discovery and Data Mining; KDD 2008

Information Extraction from Wikipedia: Moving Down the Long Tail


Abstract

Not only is Wikipedia a comprehensive source of quality information, it has several kinds of internal structure (e.g., relational summaries known as infoboxes), which enable self-supervised information extraction. While previous efforts at extraction from Wikipedia achieve high precision and recall on well-populated classes of articles, they fail in a larger number of cases, largely because incomplete articles and infrequent use of infoboxes lead to insufficient training data. This paper presents three novel techniques for increasing recall from Wikipedia's long tail of sparse classes: (1) shrinkage over an automatically-learned subsumption taxonomy, (2) a retraining technique for improving the training data, and (3) supplementing results by extracting from the broader Web. Our experiments compare design variations and show that, used in concert, these techniques increase recall by a factor of 1.76 to 8.71 while maintaining or increasing precision.
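The abstract names shrinkage over a subsumption taxonomy but gives no algorithmic detail, so the following is only a minimal sketch of the general idea: a sparse class borrows training examples from its ancestors and descendants in the taxonomy, down-weighted by distance. The function name `shrinkage_training_set`, the `decay` parameter, the geometric weighting scheme, and the class names are all illustrative assumptions, not the authors' method.

```python
from collections import defaultdict


def shrinkage_training_set(target, parent_of, examples, decay=0.5):
    """Pool weighted training examples for `target` from related classes.

    Hypothetical sketch: `parent_of` maps each class to its parent
    (assumed to form a tree, with no cycles), `examples` maps a class to
    its own training examples, and `decay` is an assumed per-step
    down-weighting factor. Returns a list of (example, weight) pairs.
    """
    # Invert the parent map so descendants can be visited as well.
    children = defaultdict(list)
    for child, parent in parent_of.items():
        children[parent].append(child)

    # The target's own examples keep full weight.
    weighted = [(ex, 1.0) for ex in examples.get(target, [])]

    # Walk up through the ancestors, shrinking the weight at each step.
    node, weight = parent_of.get(target), decay
    while node is not None:
        weighted.extend((ex, weight) for ex in examples.get(node, []))
        node, weight = parent_of.get(node), weight * decay

    # Walk down through the descendants breadth-first, same weighting.
    frontier = [(c, decay) for c in children[target]]
    while frontier:
        node, weight = frontier.pop(0)
        weighted.extend((ex, weight) for ex in examples.get(node, []))
        frontier.extend((c, weight * decay) for c in children[node])

    return weighted


# Toy taxonomy (illustrative class names): actor and comedian are
# subsumed by performer, which is subsumed by person.
parent_of = {"performer": "person", "actor": "performer", "comedian": "performer"}
examples = {
    "person": ["p1", "p2"],
    "performer": ["f1"],
    "actor": ["a1"],
    "comedian": ["c1"],
}
pooled = shrinkage_training_set("performer", parent_of, examples)
```

In this toy run the sparse class `performer` has one example of its own and gains four more from its parent and children, each at half weight. A real extractor would feed such weighted examples to its learner, and the weighting would need to be tuned rather than fixed.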
