ACM Transactions on Information Systems

Handling Massive N-Gram Datasets Efficiently


Abstract

Two fundamental problems concern the handling of large n-gram language models: indexing, that is, compressing the n-grams and their associated satellite values without compromising retrieval speed, and estimation, that is, computing the probability distribution of the n-grams extracted from a large textual source. Performing these two tasks efficiently is vital for several applications in the fields of Information Retrieval, Natural Language Processing, and Machine Learning, such as auto-completion in search engines and machine translation.

Regarding the problem of indexing, we describe compressed, exact, and lossless data structures that simultaneously achieve high space reductions and no time degradation with respect to the state-of-the-art solutions and related software packages. In particular, we present a compressed trie data structure in which each word of an n-gram following a context of fixed length k, that is, its preceding k words, is encoded as an integer whose value is proportional to the number of words that follow such a context. Since the number of words following a given context is typically very small in natural languages, we lower the space of representation to compression levels that were never achieved before, allowing the indexing of billions of strings. Despite the significant savings in space, our technique introduces a negligible penalty at query time. Specifically, the most space-efficient competitors in the literature, which are both quantized and lossy, do not take less space than our trie data structure and are up to 5 times slower. Conversely, our trie is as fast as the fastest competitor while also retaining an advantage of up to 65% in absolute space.

Regarding the problem of estimation, we present a novel algorithm for estimating modified Kneser-Ney language models, which have emerged as the de-facto choice for language modeling in both academia and industry thanks to their relatively low perplexity. Estimating such models from large textual sources poses the challenge of devising algorithms that make parsimonious use of the disk. The state-of-the-art algorithm uses three sorting steps in external memory: we show an improved construction that requires only one sorting step by exploiting the properties of the extracted n-gram strings. With an extensive experimental analysis performed on billions of n-grams, we show an average improvement of 4.5 times on the total runtime of the previous approach.
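To make the context-based encoding idea concrete, the following is a minimal Python sketch of the remapping described in the abstract, not the paper's actual implementation: each word of an n-gram is replaced by its rank among the distinct words observed after its preceding k-word context, so the stored integers are bounded by the number of successors of that context rather than by the vocabulary size. The function name remap_ngrams and the toy word identifiers are illustrative assumptions.

```python
from collections import defaultdict

# Minimal sketch (not the paper's implementation) of context-based remapping:
# a word following a context of k preceding words is replaced by its rank among
# the distinct words seen after that context, so the stored integers stay small
# and compress well. Names and word ids below are illustrative only.

def remap_ngrams(ngrams, k=1):
    """ngrams: iterable of tuples of word ids; k: context length."""
    successors = defaultdict(dict)   # context -> {word id: small rank}
    remapped = []
    for gram in ngrams:
        codes = []
        for i, word in enumerate(gram):
            context = gram[max(0, i - k):i]   # preceding k words (shorter near the start)
            ranks = successors[context]
            if word not in ranks:
                ranks[word] = len(ranks)      # next unused small integer for this context
            codes.append(ranks[word])
        remapped.append(tuple(codes))
    return remapped, successors

if __name__ == "__main__":
    grams = [(7, 42, 3), (7, 42, 9), (8, 42, 3)]
    codes, _ = remap_ngrams(grams, k=1)
    # Codes are bounded by the number of distinct successors per context,
    # not by the vocabulary size:
    print(codes)   # [(0, 0, 0), (0, 0, 1), (1, 0, 0)]
```

Because the number of distinct words following a fixed context is typically very small in natural language, these remapped integers compress far better than global vocabulary identifiers, which is the intuition behind the space reductions reported above.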
