
Improving NLP systems using unconventional, freely-available data.

Abstract

Sentence labeling is a pattern recognition task that involves assigning a categorical label to each word of an observed sentence. Standard supervised sentence-labeling systems often generalize poorly: because they use only words as features in their prediction tasks, it is difficult to estimate parameters for words that appear in the test set but seldom (or never) appear in the training set. Representation learning is a promising technique for discovering features that allow a supervised classifier to generalize from a source-domain dataset to arbitrary new domains. We demonstrate that features learned from distributional representations of unlabeled data can be used to improve performance on out-of-vocabulary words and help the model generalize. We also argue that it is important for a representation learner to be able to incorporate expert knowledge during its search for helpful features.

We investigate techniques for building open-domain sentence-labeling systems that approach the ideal of a system whose accuracy is high and consistent across domains. In particular, we investigate unsupervised techniques for language model representation learning that provide new features which are stable across domains, in that they are predictive in both the training data and out-of-domain test data. In experiments, our best system using the proposed techniques reduces error by as much as 11.4% relative to the previous system using traditional representations on the part-of-speech tagging task. Moreover, we leverage the Posterior Regularization framework and develop an architecture for incorporating biases from prior knowledge into representation learning. We investigate three types of biases: entropy bias, distance bias, and predictive bias. Experiments on two domain adaptation tasks show that our biased learners identify significantly better sets of features than unbiased learners, yielding a relative error reduction of more than 16% on both tasks with respect to existing state-of-the-art representation learning techniques.

We also extend the idea of using additional unlabeled data to improve the system's performance on a different NLP task, word alignment. Traditional word alignment takes only a sentence-level aligned parallel corpus as input and generates word-level alignments. However, as different cultures integrate, more and more people are competent in multiple languages, and they often use elements of multiple languages within a single conversation. Linguistic Code Switching (LCS) is the situation in which two or more languages appear in the context of a single conversation. Traditional machine translation (MT) systems treat LCS data as noise, or simply as regular sentences. However, if LCS data is processed intelligently, it can provide a useful signal for training word alignment and MT models. In this work, we first extract constraints from this code-switching data and then incorporate them into the training procedure of a word alignment model. We also show that, by using the code-switching data, we can jointly train a word alignment model and a language model via co-training. Our techniques for incorporating LCS data improve BLEU score by 2.64 points over a baseline MT system trained using only standard sentence-aligned corpora.
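To make the representation-learning idea concrete, the sketch below shows one way distributional features derived from unlabeled text can supplement word-identity features in a tagger, so that out-of-vocabulary words still receive informative features. This is a minimal illustration only: the toy corpus, the k-means clustering step, and the feature names are assumptions, not the dissertation's language model representation learner.

```python
# A minimal sketch, assuming a toy unlabeled corpus: derive distributional
# representations from unlabeled text, cluster them, and use the cluster ID
# as a tagger feature that still fires for words unseen in labeled data.
import numpy as np
from sklearn.cluster import KMeans

unlabeled = [
    "the cat sat on the mat".split(),
    "a dog sat on a log".split(),
    "the dog chased the cat".split(),
]

# Left/right neighbor counts serve as a simple distributional representation.
vocab = sorted({w for sent in unlabeled for w in sent})
idx = {w: i for i, w in enumerate(vocab)}
contexts = np.zeros((len(vocab), 2 * len(vocab)))
for sent in unlabeled:
    for i, w in enumerate(sent):
        if i > 0:
            contexts[idx[w], idx[sent[i - 1]]] += 1               # left context
        if i + 1 < len(sent):
            contexts[idx[w], len(vocab) + idx[sent[i + 1]]] += 1  # right context

# Words with similar contexts land in the same cluster; the cluster ID is a
# feature that generalizes across domains better than word identity alone.
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(contexts)
cluster_of = {w: int(c) for w, c in zip(vocab, labels)}

def features(word):
    # Word identity is useless for out-of-vocabulary words; the cluster
    # feature still fires whenever the word occurred in unlabeled text.
    return {"word=" + word: 1.0,
            "cluster=%d" % cluster_of.get(word, -1): 1.0}

print(features("dog"))     # seen word: identity plus cluster features
print(features("gerbil"))  # OOV word: cluster feature backs off to -1
```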
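The entropy bias can likewise be sketched as a penalty term added to a representation learner's objective, rewarding sharp posteriors over latent states. The function and data below are a hedged illustration of the idea, not the dissertation's Posterior Regularization machinery.

```python
# A minimal sketch of an entropy bias: penalize a representation learner
# whose posterior over latent states is too flat, encouraging sharper and
# more predictive features. The tiny posteriors are illustrative only.
import numpy as np

def entropy_bias(posteriors, strength=1.0):
    # posteriors: array of shape (num_tokens, num_states), rows summing to 1.
    # Returns a penalty proportional to the mean per-token entropy, which a
    # biased learner would add to its negative log-likelihood objective.
    eps = 1e-12
    ent = -np.sum(posteriors * np.log(posteriors + eps), axis=1)
    return strength * ent.mean()

sharp = np.array([[0.95, 0.04, 0.01], [0.90, 0.05, 0.05]])
flat = np.array([[0.34, 0.33, 0.33], [0.40, 0.30, 0.30]])
print(entropy_bias(sharp))  # small penalty: latent states nearly decided
print(entropy_bias(flat))   # large penalty: representation uninformative
```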

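Finally, one simple way to picture the constraint extraction from code-switched data: tokens that appear verbatim on both sides of a mixed-language sentence pair anchor the alignment. The heuristic, the sentence pair, and the constraint format below are illustrative assumptions, not the dissertation's exact procedure.

```python
# A minimal sketch of turning code-switched (LCS) text into word-alignment
# constraints: any token appearing verbatim on both sides of a
# mixed-language sentence pair is forced to align to itself.
def lcs_constraints(src_tokens, tgt_tokens):
    # Return (src_index, tgt_index) pairs an aligner must respect.
    constraints = []
    for i, s in enumerate(src_tokens):
        for j, t in enumerate(tgt_tokens):
            if s == t:  # token carried over unchanged across languages
                constraints.append((i, j))
    return constraints

# Spanish-English code-switched source keeping two English words verbatim:
src = "quiero un chicken sandwich por favor".split()
tgt = "I want a chicken sandwich please".split()
print(lcs_constraints(src, tgt))  # [(2, 3), (3, 4)]: both anchor themselves
```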
Bibliographic details

  • Author: Huang, Fei
  • Affiliation: Temple University
  • Degree grantor: Temple University
  • Subject: Computer Science
  • Degree: Ph.D.
  • Year: 2013
  • Pagination: 112 p.
  • Total pages: 112
  • Format: PDF
  • Language: English
