
D6.2 Integrated Final Version of the Components for Lexical Acquisition



Abstract

The PANACEA project has addressed one of the most critical bottlenecks that threaten the development of technologies to support multilingualism in Europe, and to process the huge quantity of multilingual data produced annually. Any attempt at automated language processing, particularly Machine Translation (MT), depends on the availability of language-specific resources. Such Language Resources (LR) contain information about the language's lexicon, i.e. the words of the language and the characteristics of their use. In Natural Language Processing (NLP), LRs contribute information about the syntactic and semantic behaviour of words - i.e. their grammar and their meaning - which inform downstream applications such as MT. To date, many LRs have been generated by hand, requiring significant manual labour from linguistic experts. However, proceeding manually, it is impossible to supply LRs for every possible pair of European languages, textual domain, and genre, which are needed by MT developers. Moreover, an LR for a given language can never be considered complete nor final because of the characteristics of natural language, which continually undergoes changes, especially spurred on by the emergence of new knowledge domains and new technologies. PANACEA has addressed this challenge by building a factory of LRs that progressively automates the stages involved in the acquisition, production, updating and maintenance of LRs required by MT systems. The existence of such a factory will significantly cut down the cost, time and human effort required to build LRs. WP6 has addressed the lexical acquisition component of the LR factory, that is, the techniques for automated extraction of key lexical information from texts, and the automatic collation of lexical information into LRs in a standardized format.
The goal of WP6 has been to take existing techniques capable of acquiring syntactic and semantic information from corpus data, improve upon them, adapt and apply them to multiple languages, and turn them into powerful and flexible techniques capable of supporting massive applications. One focus for improving the scalability and portability of lexical acquisition techniques has been to extend existing techniques with more powerful, less "supervised" methods. In NLP, the amount of supervision refers to the amount of manual annotation which must be applied to a text corpus before machine learning or other techniques are applied to the data to compile a lexicon. More manual annotation means more accurate training data, and thus a more accurate LR. However, given that it is impractical from a cost and time perspective to manually annotate the vast amounts of data required for multilingual MT across domains, it is important to develop techniques which can learn from corpora with less supervision. Less supervised methods are capable of supporting both large-scale acquisition and efficient domain adaptation, even in domains where data is scarce. Another focus of lexical acquisition in PANACEA has been the need of LR users to tune the accuracy level of LRs. Some applications may require increased precision, or accuracy, where the application requires a high degree of confidence in the lexical information used. At other times a greater level of coverage may be required, with information about more words at the expense of some degree of accuracy. Lexical acquisition in PANACEA has investigated confidence thresholds for lexical acquisition to ensure that the ultimate users of LRs can generate lexical data from the PANACEA factory at the desired level of accuracy.
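
The trade-off between precision and coverage that a confidence threshold controls can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the `LexicalEntry` structure, the frame labels, the confidence scores and the toy gold standard are hypothetical and are not taken from the PANACEA tools or data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LexicalEntry:
    """A hypothetical acquired entry: a lemma, one lexical feature, and a score."""
    lemma: str
    feature: str       # e.g. a subcategorization frame label (illustrative only)
    confidence: float  # score in [0, 1] from a hypothetical acquisition step


def filter_by_confidence(entries, threshold):
    """Keep only the entries whose confidence is at or above the threshold."""
    return [e for e in entries if e.confidence >= threshold]


def precision_and_coverage(selected, gold):
    """Compare selected entries with a gold set of (lemma, feature) pairs.

    precision = correct selected pairs / all selected pairs
    coverage  = correct selected pairs / all gold pairs
    """
    pairs = {(e.lemma, e.feature) for e in selected}
    correct = pairs & gold
    precision = len(correct) / len(pairs) if pairs else 0.0
    coverage = len(correct) / len(gold) if gold else 0.0
    return precision, coverage


# Toy data, invented for this sketch.
ENTRIES = [
    LexicalEntry("give", "NP-NP", 0.95),
    LexicalEntry("give", "NP-PP", 0.80),
    LexicalEntry("sleep", "INTRANS", 0.90),
    LexicalEntry("sleep", "NP", 0.40),     # not in the gold set
    LexicalEntry("rely", "PP-on", 0.55),
    LexicalEntry("rely", "NP", 0.30),      # not in the gold set
]
GOLD = {
    ("give", "NP-NP"),
    ("give", "NP-PP"),
    ("sleep", "INTRANS"),
    ("rely", "PP-on"),
}

if __name__ == "__main__":
    for threshold in (0.85, 0.35):
        kept = filter_by_confidence(ENTRIES, threshold)
        p, c = precision_and_coverage(kept, GOLD)
        print(f"threshold={threshold:.2f} precision={p:.2f} coverage={c:.2f}")
```

On this toy data, the stricter threshold of 0.85 keeps two entries, both correct, so precision is 1.0 while coverage drops to 0.5. The looser threshold of 0.35 recovers every gold entry (coverage 1.0) but admits one wrong entry, lowering precision to 0.8. A user who needs high-confidence lexical information would pick the former setting, and one who needs information about more words would pick the latter.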
