Conference: European Conference on Principles and Practice of Knowledge Discovery in Databases; 17-21 September 2007; Warsaw (PL)

Using the Web to Reduce Data Sparseness in Pattern-Based Information Extraction


Abstract

Textual patterns have been used effectively to extract information from large text collections. However, they rely heavily on textual redundancy: a fact has to be mentioned several times in a similar manner before its mentions can be generalized into a textual pattern. Data sparseness thus becomes a problem when trying to extract information from sources with little redundancy, such as corporate intranets, encyclopedic works, or scientific databases. We present results on applying a weakly supervised pattern induction algorithm to Wikipedia to extract instances of arbitrary relations. In particular, we apply different configurations of a basic pattern induction algorithm to seven different datasets. We show that the lack of redundancy creates the need for a large amount of training data, but that integrating Web extraction into the process significantly reduces the amount of training data required while maintaining the accuracy of extraction from Wikipedia. We also show that, although using the Web can have effects similar to those produced by increasing the number of seeds, it leads to better results overall. Our approach thus combines the advantages of two sources: the high reliability of a closed corpus and the high redundancy of the Web.
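The role of redundancy in seed-based pattern induction can be illustrated with a toy sketch. This is not the authors' algorithm: the function names `induce_patterns` and `apply_patterns`, the `min_support` threshold, the infix-only pattern form, and the example sentences are all assumptions made for illustration. Patterns are learned only when enough seed pairs share the same connecting text, so a sparse closed corpus alone yields nothing, while adding redundant Web-style sentences yields a pattern that can then be applied back to the closed corpus.

```python
import re
from collections import Counter

def induce_patterns(sentences, seeds, min_support=2):
    """Collect the text between seed arguments; keep infixes seen often enough."""
    counts = Counter()
    for x, y in seeds:
        for s in sentences:
            i, j = s.find(x), s.find(y)
            # Require both arguments, with x fully before y.
            if i != -1 and j != -1 and i + len(x) <= j:
                counts[s[i + len(x):j]] += 1
    return [infix for infix, n in counts.items() if n >= min_support]

def apply_patterns(sentences, infixes):
    """Match 'Capitalized-word <infix> Capitalized-word' and return the pairs."""
    found = set()
    for infix in infixes:
        rx = re.compile(r"([A-Z]\w+)" + re.escape(infix) + r"([A-Z]\w+)")
        for s in sentences:
            for m in rx.finditer(s):
                found.add((m.group(1), m.group(2)))
    return found

# Hypothetical data: a closed corpus with little redundancy and a redundant Web sample.
closed_corpus = [
    "Madrid is the capital of Spain.",
    "Paris, the capital of France, lies on the Seine.",
]
web_sample = [
    "Paris is the capital of France.",
    "Berlin is the capital of Germany.",
]
seeds = [("Paris", "France"), ("Berlin", "Germany")]

# Closed corpus alone: each phrasing occurs once, so no pattern reaches min_support.
sparse_patterns = induce_patterns(closed_corpus, seeds)

# Closed corpus plus Web sample: the shared phrasing is now supported by two seeds.
combined_patterns = induce_patterns(closed_corpus + web_sample, seeds)

# Apply the Web-supported patterns back to the closed corpus only.
extracted = apply_patterns(closed_corpus, combined_patterns)
print(sparse_patterns, combined_patterns, extracted)
```

In this sketch the Web is used only to gather support for patterns, while extraction is still restricted to the closed corpus, which mirrors the abstract's idea of combining the redundancy of one source with the reliability of the other.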
