Resource generation from structured documents for low-density languages.

Abstract

The availability and use of electronic resources for both manual and automated language-related processing have increased tremendously in recent years. Nevertheless, many resources still exist only in printed form, which restricts their availability and use. This holds especially true for low-density languages, or languages with limited electronic resources. For these documents, automated conversion into electronic resources is highly desirable.

This thesis focuses on the semi-automated conversion of printed structured documents (dictionaries in particular) into usable electronic representations. In the first part we present an entry tagging system that recognizes, parses, and tags the entries of a printed dictionary to reproduce the representation. The system uses the consistent layout and structure of the dictionaries, and the features that impose this structure, to capture and recover lexicographic information. We accomplish this by adapting two methods, one rule-based and one HMM-based. The system is designed to produce results quickly, with minimal human assistance and reasonable accuracy. Using adaptive transformation-based learning as a post-processor at two points in the system yields significant improvements, even with an extremely small amount of user-provided training data.

The second part of this thesis presents Morphology Induction from Noisy Data (MIND), a natural language morphology discovery framework that operates on information from the limited, noisy data obtained from the conversion process. To use the resulting resources effectively, users must be able to search for them using the root form of a morphologically deformed variant found in the text. Stemming and data-driven methods are not suitable when data are sparse. The approach is instead based on a novel application of string searching algorithms. The evaluations show that MIND can segment words into roots and affixes from the noisy, limited data contained in a dictionary, and that it can extract prefixes, suffixes, circumfixes, and infixes. MIND can also identify morphophonemic changes, i.e., phonemic variations between allomorphs of a morpheme, specifically point-of-affixation stem changes. This, in turn, allows non-native speakers to perform multilingual tasks in applications where the response must be rapid and their knowledge is limited. In addition, this analysis can feed other natural language processing tools that require lexicons.
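To make the idea of segmenting a variant into a root and affixes concrete, here is a minimal Python sketch based on plain substring search. It is not the MIND algorithm, whose details the abstract does not give. The function name `split_affixes`, its return format, and the word pairs in the usage lines are all illustrative assumptions. The sketch assumes the root appears unchanged inside the variant or is split by a single infix, so it does not handle the stem changes mentioned above.

```python
def split_affixes(root: str, variant: str) -> dict:
    """Compare a headword with a variant form and label the leftover material.

    Hypothetical helper for illustration only; it is not the method
    described in the thesis.
    """
    # Case 1: the root survives intact inside the variant, so whatever
    # surrounds it is treated as affixal material.
    pos = variant.find(root)
    if pos != -1:
        prefix = variant[:pos]
        suffix = variant[pos + len(root):]
        if prefix and suffix:
            kind = "circumfix"
        elif prefix:
            kind = "prefix"
        elif suffix:
            kind = "suffix"
        else:
            kind = "none"
        return {"kind": kind, "prefix": prefix, "suffix": suffix, "infix": ""}

    # Case 2: try each split point of the root and check whether the
    # variant has the shape root[:i] + extra material + root[i:].
    extra = len(variant) - len(root)
    if extra > 0:
        for i in range(1, len(root)):
            if variant.startswith(root[:i]) and variant.endswith(root[i:]):
                return {"kind": "infix", "prefix": "", "suffix": "",
                        "infix": variant[i:i + extra]}

    # Anything else (stem changes, unrelated words) is left unanalyzed.
    return {"kind": "unknown", "prefix": "", "suffix": "", "infix": ""}


# Illustrative word pairs, not taken from the thesis.
print(split_affixes("sulat", "magsulat"))
print(split_affixes("sulat", "sumulat"))
print(split_affixes("ligaya", "kaligayahan"))
```

A real system working from noisy OCR output would also need to tolerate character errors and stem alternations, which this sketch deliberately leaves out.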

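Bibliographic record

  • Author

    Karagol-Ayan, Burcu

  • Author affiliation

    University of Maryland, College Park, Computer Science

  • Degree-granting institution University of Maryland, College Park, Computer Science
  • Discipline Computer Science
  • Degree Ph.D.
  • Year 2007
  • Pages 247 p.
  • Total pages 247
  • Original format PDF
  • Language of text eng
  • Chinese Library Classification Automation technology, computer technology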