Journal: ACM Transactions on Asian Language Information Processing

On the Construction of Web NER Model Training Tool based on Distant Supervision


Abstract

Named entity recognition (NER) is an important task in natural language understanding, as it extracts the key entities (person, organization, location, date, number, etc.) and objects (product, song, movie, activity name, etc.) mentioned in texts. However, existing natural language processing (NLP) tools (such as Stanford NER) either recognize only general named entities or require annotated training examples and feature engineering for supervised model construction. Since not all languages or entity types have public NER support, a tool for NER model training is essential for information extraction in low-resource languages or for entity types without such support. In this article, we study the problem of developing a tool that prepares a training corpus from the Web, given known seed entities, for custom NER model training via distant supervision. The major challenges of automatic labeling are the long labeling time caused by the large corpus and the large number of seed entities, and the need to avoid false positive and false negative examples caused by short and long seeds. To address these challenges, we adopt locality-sensitive hashing (LSH) for seed entities of various lengths. We conduct experiments on five types of entity recognition tasks, covering Chinese person names, food names, locations, points of interest (POIs), and activity names, to demonstrate the improvements achieved with the proposed Web NER model construction tool. Because the training corpus is obtained by automatically labeling sentences related to the seed entities, one could use either the entire corpus or only the positive sentences for model training. Based on the experimental results, we found that the decision should depend on whether a traditional linear-chain conditional random field (CRF) or a deep neural network-based CRF is used for model training, as well as on the completeness of the provided seed list.
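
To make the automatic-labeling step concrete, the following is a minimal Python sketch of distant supervision with a seed list. It is an illustration under stated assumptions, not the tool described in the abstract: it uses exact longest-match lookup rather than the LSH-based matching the authors adopt, it tags at the character level with a BIO scheme, and it treats a "positive" sentence as one containing at least one labeled entity. The function names `label_sentence` and `build_corpus`, the `positive_only` flag, and the example seeds are all hypothetical.

```python
from typing import Dict, List, Tuple

Tagged = List[Tuple[str, str]]  # (character, BIO tag) pairs


def label_sentence(sentence: str, seeds: Dict[str, str]) -> Tagged:
    """Tag each character by longest exact match against the seed list.

    Assumption: `seeds` maps an entity string to its type, e.g. "POI".
    """
    max_len = max((len(s) for s in seeds), default=0)
    tags = ["O"] * len(sentence)
    i = 0
    while i < len(sentence):
        matched = False
        # Try the longest candidate first so that a short seed cannot
        # split a longer entity that contains it.
        for length in range(min(max_len, len(sentence) - i), 0, -1):
            span = sentence[i:i + length]
            if span in seeds:
                etype = seeds[span]
                tags[i] = "B-" + etype
                for j in range(i + 1, i + length):
                    tags[j] = "I-" + etype
                i += length
                matched = True
                break
        if not matched:
            i += 1
    return list(zip(sentence, tags))


def build_corpus(sentences: List[str], seeds: Dict[str, str],
                 positive_only: bool = False) -> List[Tagged]:
    """Label every sentence; optionally keep only those with an entity."""
    corpus = [label_sentence(s, seeds) for s in sentences]
    if positive_only:
        corpus = [sent for sent in corpus
                  if any(tag != "O" for _, tag in sent)]
    return corpus


if __name__ == "__main__":
    # Hypothetical seeds: a short location name nested in a longer POI name.
    example_seeds = {"台北": "LOC", "台北101": "POI"}
    for char, tag in label_sentence("我去台北101玩", example_seeds):
        print(char, tag)
```

Exact lookup of this kind scans every candidate span of every sentence and cannot match seed variants, which is where an approximate scheme such as LSH would come in. The `positive_only` flag mirrors the choice between training on the entire corpus and training on positive sentences only.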
