System and method for providing lossless compression of n-gram language models in a real-time decoder

Abstract

Systems and methods for losslessly compressing n-gram language models for use in real-time decoding, whereby the size of the model is significantly reduced without increasing the decoding time of the recognizer. Lossless compression is achieved using various techniques. In one aspect, n-gram records of an N-gram language model are split into (i) a set of common history records that include subsets of n-tuple words having a common history and (ii) sets of hypothesis records that are associated with the common history records. The common history records are separated into a first group of common history records each having only one hypothesis record associated therewith and a second group of common history records each having more than one hypothesis record associated therewith. The first group of common history records is stored, together with the corresponding hypothesis records, in an index portion of a memory block comprising the N-gram language model, and the second group of common history records is stored in the index together with addresses pointing to memory locations holding the corresponding hypothesis records. Other compression techniques include, for instance: mapping word records of the hypothesis records to word numbers and storing the difference between subsequent word numbers; segmenting the addresses and storing indexes to the addresses in each segment to multiples of the addresses; storing word records and probability records as fractions of bytes such that each pair of word-probability records occupies a multiple of bytes, with flags indicating the length; and storing the probability records as indexes to sorted count values that are used to compute the probability on the run.
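The index split described above can be illustrated with a small sketch. Histories with exactly one hypothesis store that hypothesis inline in the index (saving the cost of an address), while histories with several hypotheses store a pointer into a shared record area. All names and the in-memory layout below are illustrative assumptions, not the patent's concrete encoding.

```python
def build_index(ngram_probs):
    """ngram_probs: dict mapping (history_tuple, word) -> probability.

    Returns (index, record_area), where single-hypothesis histories are
    stored inline in the index and multi-hypothesis histories hold an
    address (offset) into the shared record area.
    """
    # Group hypothesis (word, prob) pairs by their common history.
    by_history = {}
    for (history, word), prob in ngram_probs.items():
        by_history.setdefault(history, []).append((word, prob))

    index = {}        # history -> ("INLINE", record) or ("PTR", offset, count)
    record_area = []  # flat storage for multi-hypothesis groups
    for history, hyps in sorted(by_history.items()):
        if len(hyps) == 1:
            # Single hypothesis: store it directly in the index.
            index[history] = ("INLINE", hyps[0])
        else:
            # Multiple hypotheses: store an address into the record area.
            offset = len(record_area)
            record_area.extend(sorted(hyps))
            index[history] = ("PTR", offset, len(hyps))
    return index, record_area
```

In a real decoder the index and record area would be packed binary structures; the dictionary-and-list form here only shows which histories carry addresses and which do not.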
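The difference-of-word-numbers technique can likewise be sketched: when the word numbers within a hypothesis group are kept sorted, storing each number as a gap from its predecessor yields small values that fit short variable-length codes. This is a minimal illustration of gap encoding, not the patent's exact bit-level format.

```python
def delta_encode(word_numbers):
    """Encode a sorted list of word numbers as gaps from the previous value."""
    deltas, prev = [], 0
    for w in word_numbers:
        deltas.append(w - prev)
        prev = w
    return deltas

def delta_decode(deltas):
    """Recover the original word numbers by accumulating the gaps."""
    out, total = [], 0
    for d in deltas:
        total += d
        out.append(total)
    return out
```

The gaps are on average far smaller than the raw word numbers, which is what makes a fraction-of-a-byte representation with length flags, as the abstract describes, worthwhile.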
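Finally, storing probability records as indexes into a table of sorted count values can be sketched as a codebook: each record stores only a small index, and the probability is computed at decode time from the looked-up count. The function names and the division-by-total step are assumptions for illustration.

```python
def build_count_codebook(counts):
    """Replace raw count values with small indexes into a sorted table.

    Returns (table, indexes): table holds the distinct counts in sorted
    order, and indexes[i] points at the table entry for counts[i].
    """
    table = sorted(set(counts))
    lookup = {c: i for i, c in enumerate(table)}
    return table, [lookup[c] for c in counts]

def probability_on_the_run(table, index, history_total):
    """Compute a relative-frequency probability from a stored index."""
    return table[index] / history_total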

