
Streaming and Sketch Algorithms for Large Data NLP.



Abstract

The emergence of the World Wide Web, social media, and mobile devices has made large and rich quantities of text data available. Such vast data sets have led to leaps in the performance of many statistically-based NLP tasks. However, given the magnitude of text data available, it is computationally prohibitive to train many complex Natural Language Processing (NLP) models on it. This motivates the hypothesis that simple models trained on big data can outperform more complex models trained on small data. My dissertation provides a solution for effectively and efficiently exploiting large data in many NLP applications. Data sets are growing at an exponential rate, much faster than available memory. To provide a memory-efficient way of handling large data sets, this dissertation shows the limitations of existing streaming and sketch algorithms when applied to canonical NLP problems and proposes several new variants to overcome those shortcomings. Streaming and sketch algorithms process a large data set in one pass and represent it with a compact summary, much smaller than the full size of the input. These algorithms can easily be implemented in a distributed setting and provide a solution that is both memory- and time-efficient. However, the memory and time savings come at the expense of approximate solutions. In this dissertation, I demonstrate that approximate solutions achieved on large data are comparable to exact solutions on the same data and outperform exact solutions on smaller data. I focus on NLP problems that boil down to tracking many statistics: storing approximate counts, computing approximate association scores such as pointwise mutual information (PMI), finding frequent items (such as n-grams), building streaming language models, and measuring distributional similarity. First, I introduce the concept of approximate streaming large-scale language models in NLP. Second, I present a novel variant of the Count-Min sketch that maintains approximate counts of all items. Third, I conduct a systematic study comparing many sketch algorithms that approximate counts of items, with a focus on large-scale NLP tasks. Last, I develop the fast large-scale approximate graph (FLAG) system, which quickly constructs a large-scale approximate nearest-neighbor graph from a large corpus.
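For context, the Count-Min sketch named in the abstract is a standard one-pass summary: a small table of counters with one hash function per row, where a query returns the minimum counter across rows and therefore never underestimates a count. Below is a minimal Python sketch of that textbook structure, not the dissertation's novel variant; the eps/delta parameters, the md5-based hashing, and the toy n-gram stream are illustrative assumptions.

import hashlib
import math

class CountMinSketch:
    # One-pass summary for approximate counts: `depth` rows of `width`
    # counters, one hash function per row. Queries return the minimum
    # counter over the rows, so counts are never underestimated; the
    # overestimate is at most eps * N with probability >= 1 - delta,
    # where N is the total count inserted so far.
    def __init__(self, eps=1e-4, delta=1e-3):
        self.width = math.ceil(math.e / eps)
        self.depth = math.ceil(math.log(1.0 / delta))
        self.table = [[0] * self.width for _ in range(self.depth)]

    def _buckets(self, item):
        # Derive one bucket per row from a row-seeded hash of the item.
        for row in range(self.depth):
            digest = hashlib.md5(f"{row}:{item}".encode()).hexdigest()
            yield row, int(digest, 16) % self.width

    def update(self, item, count=1):
        for row, col in self._buckets(item):
            self.table[row][col] += count

    def query(self, item):
        return min(self.table[row][col] for row, col in self._buckets(item))

# Toy usage: count n-grams seen on a stream, then query approximate counts.
cms = CountMinSketch()
for gram in ["new york", "new york", "new car"]:
    cms.update(gram)
print(cms.query("new york"))  # >= 2; equals 2 unless hash collisions inflate it

The same summary supports the approximate association scores mentioned above: with sketched unigram and bigram counts, PMI(x, y) can be estimated as log(N * c(x, y) / (c(x) * c(y))) without ever storing exact counts.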

Bibliographic record

  • Author: Goyal, Amit.
  • Affiliation: University of Maryland, College Park.
  • Degree-granting institution: University of Maryland, College Park.
  • Subject: Computer science.
  • Degree: Ph.D.
  • Year: 2013
  • Pages: 99 p.
  • Total pages: 99
  • Format: PDF
  • Language: English
