Systems and methods for constructing an artificially diverse corpus of training data samples for training a contextually-biased model for a machine learning-based dialogue system

Abstract

Systems and methods for constructing an artificially diverse corpus of training data include: evaluating a corpus of utterance-based training data samples and identifying a slot replacement candidate; deriving distinct skeleton utterances that include the slot replacement candidate, wherein deriving the distinct skeleton utterances includes replacing the slots of each of a plurality of distinct utterance training samples with either a special token or the proper slot classification label; selecting a subset of the distinct skeleton utterances; converting each distinct skeleton utterance of the subset back into a distinct utterance training sample while maintaining the special token at the position of the slot replacement candidate; altering a percentage of the distinct utterance training samples by placing a distinct, randomly generated slot token value at the position of the slot replacement candidate; and constructing the artificially diverse corpus of training samples based on the collection of that percentage of the distinct utterance training samples.
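The steps in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the segment representation `(text, label)`, the `<SLOT>` special token, the `{label}` placeholder syntax, and the names `to_skeleton`, `from_skeleton`, and `build_corpus` are all hypothetical, and the random slot values are plain lowercase strings chosen only for demonstration.

```python
import random
import string

SPECIAL = "<SLOT>"  # hypothetical special token marking the candidate slot


def to_skeleton(sample, candidate):
    """Replace the candidate slot with SPECIAL and other slots with their label."""
    parts = []
    for text, label in sample:
        if label is None:
            parts.append(text)                 # plain text stays as-is
        elif label == candidate:
            parts.append(SPECIAL)              # slot replacement candidate
        else:
            parts.append("{" + label + "}")    # slot classification label
    return tuple(parts)


def from_skeleton(skeleton, slot_values, rng):
    """Fill label placeholders with observed values; keep SPECIAL in place."""
    out = []
    for part in skeleton:
        if part.startswith("{") and part.endswith("}"):
            out.append(rng.choice(slot_values[part[1:-1]]))
        else:
            out.append(part)
    return out


def build_corpus(samples, candidate, subset_size, fraction, seed=0):
    """Assumed pipeline: skeletons -> subset -> restore -> randomize a fraction."""
    rng = random.Random(seed)

    # Collect the slot values observed for each label in the source corpus.
    slot_values = {}
    for sample in samples:
        for text, label in sample:
            if label is not None:
                slot_values.setdefault(label, []).append(text)

    # Distinct skeletons that contain the slot replacement candidate.
    skeletons = sorted({
        to_skeleton(s, candidate)
        for s in samples
        if any(label == candidate for _, label in s)
    })
    subset = rng.sample(skeletons, min(subset_size, len(skeletons)))
    restored = [from_skeleton(sk, slot_values, rng) for sk in subset]

    used = set()

    def fresh_token(length=6):
        # Draw random strings until one is unused, so each value is distinct.
        while True:
            tok = "".join(rng.choice(string.ascii_lowercase) for _ in range(length))
            if tok not in used:
                used.add(tok)
                return tok

    # Alter the chosen fraction of samples at the candidate position.
    n_alter = int(len(restored) * fraction)
    corpus = []
    for tokens in rng.sample(restored, n_alter):
        corpus.append(" ".join(fresh_token() if t == SPECIAL else t for t in tokens))
    return corpus


samples = [
    [("send", None), ("20 dollars", "amount"), ("to", None), ("alice", "recipient")],
    [("pay", None), ("bob", "recipient")],
    [("what is my balance", None)],
]
corpus = build_corpus(samples, "recipient", subset_size=2, fraction=1.0)
```

In this sketch the resulting corpus contains only the altered samples, which is one reading of "a collection of the percentage of the distinct utterance training samples"; whether unaltered samples are also retained is not stated in the abstract.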