Conference: International Conference on Inductive Logic Programming

Incorporating Linguistic Expertise Using ILP for Named Entity Recognition in Data Hungry Indian Languages

Abstract

Developing linguistically sound and data-compliant rules for named entity annotation is usually an intensive and time-consuming process for any developer or linguist. In this work, we present the use of two Inductive Logic Programming (ILP) techniques to construct rules for extracting instances of various named entity classes, thereby reducing the effort required of a linguist or developer. Using ILP for rule development not only reduces the amount of effort required but also provides an interactive framework in which a linguist can incorporate intuition about named entities, for example in the form of mode declarations for refinements (suitably exposed for ease of use by the linguist) and background knowledge (in the form of linguistic resources). We have a small amount of tagged data: approximately 3884 sentences for Marathi and 22748 sentences for Hindi. The paucity of tagged data for Indian languages makes manual development of rules more challenging. However, the ability to fold background knowledge and domain expertise into ILP techniques comes to our rescue, and we have been able to develop rules that are mostly linguistically sound and that yield results comparable to rules handcrafted by linguists. The ILP approach has two advantages over hand-crafting all rules: (i) development time is reduced by a factor of 240 when ILP is used instead of involving a linguist for the entire rule development, and (ii) the ILP technique has the computational edge that it has a complete and consistent view of all significant patterns in the data at the level of abstraction specified through the mode declarations. Point (ii) enables the discovery of rules that could be missed by the linguist and also makes it possible to scale rule development to a larger training dataset.
The rules thus developed can optionally be edited by linguistic experts and consolidated (a) through default ordering (as in TILDE [1]), (b) with an ordering induced using [2], or (c) by using the rules as features in a statistical graphical model such as a conditional random field (CRF) [3]. We report results using WARMR [4] and TILDE to learn rules for named entities in two Indian languages, namely Hindi and Marathi.
