Conference: Data Engineering (ICDE), 2009 IEEE 25th International Conference on

Join Optimization of Information Extraction Output: Quality Matters!

Abstract

Information extraction (IE) systems are trained to extract specific relations from text databases. Real-world applications often require that the output of multiple IE systems be joined to produce the data of interest. To optimize the execution of a join of multiple extracted relations, it is not sufficient to consider only execution time. In fact, the quality of the join output is of critical importance: unlike in the relational world, different join execution plans can produce join results of widely different quality whenever IE systems are involved. In this paper, we develop a principled approach to understand, estimate, and incorporate output quality into the join optimization process over extracted relations. We argue that the output quality is affected by (a) the configuration of the IE systems used to process documents, (b) the document retrieval strategies used to retrieve documents, and (c) the actual join algorithm used. Our analysis considers several alternatives for these factors and predicts the output quality (and, of course, the execution time) of the alternative execution plans. We establish the accuracy of our analytical models and study the effectiveness of a quality-aware join optimizer through a large-scale experimental evaluation over real-world text collections and state-of-the-art IE systems.
