Conference: ACM Conference on Information and Knowledge Management (CIKM 09)

Helping Editors Choose Better Seed Sets for Entity Set Expansion



Abstract

Sets of named entities are used heavily at commercial search engines such as Google, Yahoo and Bing. Acquiring sets of entities typically consists of combining semi-supervised expansion algorithms with manual cleaning of the resulting expanded sets. In this paper, we study the effects of different seed sets in a state-of-the-art semi-supervised expansion system and show that expansion performance varies tremendously with the choice of seeds. We further show that human editors generally provide poor seed sets, which perform well below the average random seed set. We identify three factors of seed set composition, namely prototypicality, ambiguity and coverage, and we investigate their effects on expansion performance. Finally, we propose various automatic systems for improving editor-generated seed sets, which seek to remove ambiguous and other error-prone seed instances. An extensive experimental analysis shows that expansion quality, measured in R-precision, can be improved by up to 46% on average by removing the right seeds from a seed set. Our automatic methods outperform the human editors' seed sets and on average improve expansion performance by up to 34% over the original seed sets.
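For readers unfamiliar with the metric, the sketch below illustrates how R-precision and a relative improvement figure of the kind quoted in the abstract might be computed. It uses the standard definition of R-precision (precision at rank R, where R is the size of the gold set). The function names and the toy data are assumptions made for illustration and are not taken from the paper.

```python
from typing import Iterable, Sequence, Set


def r_precision(ranked: Sequence[str], gold: Iterable[str]) -> float:
    """Precision at rank R, where R is the number of gold (correct) entities."""
    gold_set: Set[str] = set(gold)
    r = len(gold_set)
    if r == 0:
        return 0.0
    # Count how many of the top-R ranked candidates are in the gold set.
    hits = sum(1 for entity in ranked[:r] if entity in gold_set)
    return hits / r


def relative_improvement(baseline: float, improved: float) -> float:
    """Relative gain of `improved` over `baseline` (0.46 means +46%)."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return (improved - baseline) / baseline


if __name__ == "__main__":
    # Toy expansion output and gold set (hypothetical data).
    ranked = ["a", "b", "x", "c"]
    gold = {"a", "b", "c"}
    print(r_precision(ranked, gold))
    print(relative_improvement(0.50, 0.73))
```

Under this definition a candidate ranked below position R contributes nothing, so removing a seed that pulls off-topic candidates into the top R can raise the score directly.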
