IEEE Transactions on Audio, Speech, and Language Processing

Cross-Modality Semantic Integration With Hypothesis Rescoring for Robust Interpretation of Multimodal User Interactions



Abstract

We develop a framework for the automatic semantic interpretation of multimodal user interactions using speech and pen gestures. The two input modalities abstract the user's intended message differently into input events, e.g., key terms/phrases in speech or different types of gestures in the pen modality. The proposed framework begins by generating partial interpretations for each input event as a ranked list of hypothesized semantics. We devise a cross-modality semantic integration procedure to align the pair of hypothesis lists between every speech input event and every pen input event in a multimodal expression. This is achieved by a Viterbi alignment algorithm that enforces the temporal ordering of the input events as well as the semantic compatibility of aligned events. The alignment enables generation of a unimodal, verbalized paraphrase that is semantically equivalent to the original multimodal expression. Our experiments are based on a multimodal corpus in the domain of city navigation. Application of the cross-modality integration procedure to near-perfect (manual) transcripts of the speech and pen modalities shows that correct unimodal paraphrases are generated for over 97% of the training and test sets. However, if we replace these with automatic speech and pen recognition transcripts, the performance drops to 53.7% and 54.8% for the training and test sets, respectively. To address this issue, we devise a hypothesis rescoring procedure that evaluates all candidates of cross-modality integration derived from multiple recognition hypotheses from each modality. The rescoring function incorporates the integration score, the $N$-best purity of recognized spoken locative expressions, and the distances between the coordinates of recognized pen gestures and their interpreted icons on the map. Application of cross-modality hypothesis rescoring improves the performance to 67.5% and 69.9% for the training and test sets, respectively.
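For concreteness, the sketch below illustrates one way the cross-modality alignment described in the abstract could be set up: a monotonic, Viterbi-style dynamic program that pairs speech input events with pen input events, preserving temporal order and rewarding semantically compatible pairs. The `InputEvent` structure, the `compat` score, and the `skip_penalty` are illustrative assumptions, not the paper's exact formulation.

```python
# A minimal sketch of a Viterbi-style monotonic alignment between the speech
# input events and the pen input events of one multimodal expression.
# The event structure, compatibility score, and skip penalty are assumptions
# made for illustration only.

from dataclasses import dataclass
from typing import List, Optional, Tuple

@dataclass
class InputEvent:
    # Ranked list of hypothesized semantics for this event
    # (e.g., candidate map locations for a spoken phrase or a pen gesture).
    semantics: List[str]

def compat(speech_ev: InputEvent, pen_ev: InputEvent) -> float:
    """Toy compatibility: reward a shared hypothesis, discounted by its ranks."""
    best = 0.0
    for i, s in enumerate(speech_ev.semantics):
        for j, p in enumerate(pen_ev.semantics):
            if s == p:
                best = max(best, 1.0 / (1 + i + j))
    return best

def align(speech: List[InputEvent], pen: List[InputEvent],
          skip_penalty: float = 0.25):
    """Dynamic-programming alignment. Both event lists are assumed to be in
    temporal order, so the monotonic index moves enforce temporal ordering."""
    n, m = len(speech), len(pen)
    NEG = float("-inf")
    score = [[NEG] * (m + 1) for _ in range(n + 1)]
    back: List[List[Optional[Tuple[int, int, str]]]] = \
        [[None] * (m + 1) for _ in range(n + 1)]
    score[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            if score[i][j] == NEG:
                continue
            if i < n and j < m:                    # pair speech[i] with pen[j]
                s = score[i][j] + compat(speech[i], pen[j])
                if s > score[i + 1][j + 1]:
                    score[i + 1][j + 1], back[i + 1][j + 1] = s, (i, j, "pair")
            if i < n:                              # leave speech[i] unaligned
                s = score[i][j] - skip_penalty
                if s > score[i + 1][j]:
                    score[i + 1][j], back[i + 1][j] = s, (i, j, "skip_speech")
            if j < m:                              # leave pen[j] unaligned
                s = score[i][j] - skip_penalty
                if s > score[i][j + 1]:
                    score[i][j + 1], back[i][j + 1] = s, (i, j, "skip_pen")
    # Trace back the best path as (speech_index, pen_index) pairs.
    path, i, j = [], n, m
    while (i, j) != (0, 0):
        pi, pj, op = back[i][j]
        path.append((pi if op != "skip_pen" else None,
                     pj if op != "skip_speech" else None))
        i, j = pi, pj
    return score[n][m], list(reversed(path))

if __name__ == "__main__":
    speech = [InputEvent(["LOC:restaurant_A", "LOC:restaurant_B"]),
              InputEvent(["LOC:station_C"])]
    pen = [InputEvent(["LOC:restaurant_A"]), InputEvent(["LOC:station_C"])]
    print(align(speech, pen))
```

In the paper, the aligned pairs are then verbalized into a unimodal paraphrase, and when recognizer outputs are uncertain, the candidate integrations are rescored with the integration score, $N$-best purity, and gesture-to-icon distances; those components are beyond the scope of this sketch.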
