Venue: Conference on Empirical Methods in Natural Language Processing

Regular Expression Guided Entity Mention Mining from Noisy Web Data



Abstract

Many important entity types in web documents, such as dates, times, email addresses, and course numbers, follow or closely resemble patterns that can be described by Regular Expressions (REs). Because web documents are so diverse and are generated in so many different ways, even seemingly straightforward tasks, such as identifying mentions of dates in a document, become very challenging. It is reasonable to claim that it is impossible to create an RE capable of identifying such entities in web documents with perfect precision and recall. Rather than abandoning REs as a go-to approach for entity detection, this paper explores ways to combine the expressive power of REs, the ability of deep learning to learn from large data, and a human-in-the-loop approach into a new integrated framework for entity identification from web data. The framework starts by creating or collecting existing REs for a particular entity type. Those REs are then applied to a large document corpus to collect weak labels for the entity mentions, and a neural network is trained to predict those RE-generated weak labels. Finally, a human expert is asked to label a small set of documents, and the neural network is fine-tuned on those documents. The experimental evaluation on several entity identification problems shows that the proposed framework achieves impressive accuracy while requiring very modest human effort.
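The weak-labeling step described in the abstract can be illustrated with a minimal sketch. It assumes a hypothetical date pattern (`DATE_RE`), whitespace tokenization, and a helper named `weak_labels`; none of these come from the paper, and the neural network training and human fine-tuning stages are not shown.

```python
import re

# Hypothetical date pattern for illustration only; not taken from the paper.
DATE_RE = re.compile(r"\b\d{1,2}/\d{1,2}/\d{4}\b|\b\d{4}-\d{2}-\d{2}\b")

# Assumption: tokens are maximal runs of non-whitespace characters.
TOKEN_RE = re.compile(r"\S+")


def weak_labels(text, pattern=DATE_RE):
    """Return (token, label) pairs; label is 1 if the token overlaps an RE match."""
    spans = [m.span() for m in pattern.finditer(text)]
    labeled = []
    for tok in TOKEN_RE.finditer(text):
        start, end = tok.span()
        # A token is weakly labeled positive when its character span
        # overlaps any span matched by the regular expression.
        hit = any(start < s_end and s_start < end for s_start, s_end in spans)
        labeled.append((tok.group(), 1 if hit else 0))
    return labeled


doc = "Midterm on 10/12/2018 and final on 2018-12-05, see syllabus."
pairs = weak_labels(doc)
for token, label in pairs:
    print(label, token)
```

Labels produced this way are noisy by construction: a date written in a form the pattern does not cover would be labeled 0. That is the gap the abstract's later stages (training on weak labels, then fine-tuning on a small expert-labeled set) are meant to address.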
