ACM Transactions on Information Systems

Handling Massive N-Gram Datasets Efficiently


Abstract

Two fundamental problems concern the handling of large n-gram language models: indexing, that is, compressing the n-grams and their associated satellite values without compromising retrieval speed, and estimation, that is, computing the probability distribution of the n-grams extracted from a large textual source. Performing these two tasks efficiently is vital for several applications in the fields of Information Retrieval, Natural Language Processing, and Machine Learning, such as auto-completion in search engines and machine translation.

Regarding the problem of indexing, we describe compressed, exact, and lossless data structures that simultaneously achieve high space reductions and no time degradation with respect to the state-of-the-art solutions and related software packages. In particular, we present a compressed trie data structure in which each word of an n-gram following a context of fixed length k, that is, its preceding k words, is encoded as an integer whose value is proportional to the number of words that follow such a context. Since the number of words following a given context is typically very small in natural languages, we lower the space of representation to compression levels that were never achieved before, allowing the indexing of billions of strings. Despite the significant savings in space, our technique introduces a negligible penalty at query time. Specifically, the most space-efficient competitors in the literature, which are both quantized and lossy, do not take less space than our trie data structure and are up to 5 times slower. Conversely, our trie is as fast as the fastest competitor while also retaining an advantage of up to 65% in absolute space.

Regarding the problem of estimation, we present a novel algorithm for estimating modified Kneser-Ney language models, which have emerged as the de-facto choice for language modeling in both academia and industry thanks to their relatively low perplexity. Estimating such models from large textual sources poses the challenge of devising algorithms that make parsimonious use of the disk. The state-of-the-art algorithm uses three sorting steps in external memory: we show an improved construction that requires only one sorting step by exploiting the properties of the extracted n-gram strings. With an extensive experimental analysis performed on billions of n-grams, we show an average improvement of 4.5 times on the total runtime of the previous approach.
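To make the context-based encoding idea concrete, the following is a minimal Python sketch of the remapping described in the abstract, not the paper's actual implementation: each word of an n-gram is replaced by its rank among the distinct words observed after its preceding k-word context, so the stored integers are bounded by the number of successors of that context rather than by the vocabulary size. The function name remap_ngrams and the toy word identifiers are illustrative assumptions.

```python
from collections import defaultdict

# Minimal sketch (not the paper's implementation) of context-based remapping:
# a word following a context of k preceding words is replaced by its rank among
# the distinct words seen after that context, so the stored integers stay small
# and compress well. Names and word ids below are illustrative only.

def remap_ngrams(ngrams, k=1):
    """ngrams: iterable of tuples of word ids; k: context length."""
    successors = defaultdict(dict)   # context -> {word id: small rank}
    remapped = []
    for gram in ngrams:
        codes = []
        for i, word in enumerate(gram):
            context = gram[max(0, i - k):i]   # preceding k words (shorter near the start)
            ranks = successors[context]
            if word not in ranks:
                ranks[word] = len(ranks)      # next unused small integer for this context
            codes.append(ranks[word])
        remapped.append(tuple(codes))
    return remapped, successors

if __name__ == "__main__":
    grams = [(7, 42, 3), (7, 42, 9), (8, 42, 3)]
    codes, _ = remap_ngrams(grams, k=1)
    # Codes are bounded by the number of distinct successors per context,
    # not by the vocabulary size:
    print(codes)   # [(0, 0, 0), (0, 0, 1), (1, 0, 0)]
```

Because the number of distinct words following a fixed context is typically very small in natural language, these remapped integers compress far better than global vocabulary identifiers, which is the intuition behind the space reductions reported above.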
