Conference: European Conference on Principles and Practice of Knowledge Discovery in Databases

Using the Web to Reduce Data Sparseness in Pattern-Based Information Extraction

Abstract

Textual patterns have been used effectively to extract information from large text collections. However, they rely heavily on textual redundancy, in the sense that facts must be mentioned in a similar manner in order to be generalized to a textual pattern. Data sparseness thus becomes a problem when trying to extract information from sources with little redundancy, such as corporate intranets, encyclopedic works, or scientific databases. We present results on applying a weakly supervised pattern induction algorithm to Wikipedia to extract instances of arbitrary relations. In particular, we apply different configurations of a basic pattern induction algorithm to seven different datasets. We show that the lack of redundancy creates a need for a large amount of training data, but that integrating Web extraction into the process significantly reduces the required training data while maintaining the accuracy of Wikipedia. In particular, we show that, although using the Web can have effects similar to those produced by increasing the number of seeds, it leads to better results overall. Our approach thus makes it possible to combine the advantages of two sources: the high reliability of a closed corpus and the high redundancy of the Web.
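To make the idea concrete, the following is a minimal sketch of seed-based pattern induction, not the authors' algorithm. It assumes single-word entity names and exact string patterns. The function names, the `min_support` threshold, and the toy `corpus`, `web`, and `seeds` data are all hypothetical. It illustrates how a fact stated only once in a closed corpus yields no reliable pattern, while redundant Web sentences let the same seeds induce a pattern that can then be applied back to the corpus.

```python
import re
from collections import Counter


def induce_patterns(sentences, seeds, min_support=2):
    """Generalize sentences that mention a seed pair into ARG1/ARG2 patterns.

    A pattern is kept only if at least `min_support` seed mentions produce it,
    which is where textual redundancy matters.
    """
    counts = Counter()
    for sentence in sentences:
        for arg1, arg2 in seeds:
            if arg1 in sentence and arg2 in sentence:
                pattern = sentence.replace(arg1, "ARG1").replace(arg2, "ARG2")
                counts[pattern] += 1
    return [p for p, c in counts.items() if c >= min_support]


def apply_patterns(sentences, patterns):
    """Match induced patterns against sentences and return extracted pairs."""
    found = set()
    for pattern in patterns:
        # Escape the literal text, then turn the two slots into capture groups
        # that accept a single capitalized word (a simplifying assumption).
        regex = re.escape(pattern)
        regex = regex.replace("ARG1", r"(?P<a1>[A-Z]\w+)")
        regex = regex.replace("ARG2", r"(?P<a2>[A-Z]\w+)")
        for sentence in sentences:
            match = re.fullmatch(regex, sentence)
            if match:
                found.add((match.group("a1"), match.group("a2")))
    return found


# Hypothetical toy data: the closed corpus words each fact differently,
# while the Web repeats the same phrasing for several facts.
corpus = [
    "Paris is the capital of France.",
    "Berlin, the capital city of Germany, lies on the Spree.",
    "Madrid is the capital of Spain.",
]
web = [
    "Paris is the capital of France.",
    "Berlin is the capital of Germany.",
    "Rome is the capital of Italy.",
]
seeds = {("Paris", "France"), ("Berlin", "Germany")}

patterns_from_corpus = induce_patterns(corpus, seeds)  # too sparse
patterns_from_web = induce_patterns(web, seeds)        # redundant enough
extracted = apply_patterns(corpus, patterns_from_web)
```

In this toy setting, the corpus alone supports no pattern at the chosen threshold, whereas the pattern induced from the Web sentences recovers a corpus fact that was not among the seeds. Extraction is still restricted to the closed corpus, which is the source assumed to be reliable.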