Annual Meeting of the Association for Computational Linguistics (ACL 2012)
Big Data versus the Crowd: Looking for Relationships in All the Right Places

Abstract

Classically, training relation extractors relies on high-quality, manually annotated training data, which can be expensive to obtain. To mitigate this cost, NLU researchers have considered two newly available sources of less expensive (but potentially lower-quality) labeled data: distant supervision and crowdsourcing. There is, however, no study comparing the relative impact of these two sources on the precision and recall of post-learning answers. To fill this gap, we empirically study how state-of-the-art techniques are affected by scaling these two sources. We use corpus sizes of up to 100 million documents and tens of thousands of crowdsourced labeled examples. Our experiments show that increasing the corpus size for distant supervision has a statistically significant, positive impact on quality (F1 score). In contrast, human feedback has a positive and statistically significant, but smaller, impact on precision and recall.
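
For readers unfamiliar with the terms in the abstract, the following is a minimal sketch of the general idea of distant supervision (labeling sentences by matching them against a knowledge base) and of how precision, recall, and F1 are computed. It is not the authors' system: the toy knowledge base `KB`, the function names `distant_labels` and `precision_recall_f1`, and the substring-matching heuristic are all assumptions made for illustration.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical toy knowledge base: (entity1, entity2) -> relation name.
KB: Dict[Tuple[str, str], str] = {("Barack Obama", "Honolulu"): "born_in"}


def distant_labels(sentences: List[str],
                   kb: Dict[Tuple[str, str], str]) -> List[Tuple[str, str, str, str]]:
    """Label a sentence with a KB relation whenever it mentions both entities.

    This is the usual distant-supervision heuristic (an assumption here):
    it is cheap, but it also labels sentences that do not express the relation.
    """
    labeled = []
    for text in sentences:
        for (e1, e2), relation in kb.items():
            if e1 in text and e2 in text:
                labeled.append((text, e1, e2, relation))
    return labeled


def precision_recall_f1(predicted: Set, gold: Set) -> Tuple[float, float, float]:
    """Compute precision, recall, and F1 over sets of predicted and gold facts."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

With this sketch, a sentence such as "Barack Obama visited Honolulu." would also be labeled `born_in`, which illustrates why distantly supervised labels are cheaper but potentially lower quality than manual annotation.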
