
Streaming and Sketch Algorithms for Large Data NLP.



Abstract

The emergence of the World Wide Web, social media, and mobile devices has made large and rich quantities of text data available. Such vast data sets have led to leaps in the performance of many statistically-based NLP tasks. However, given the magnitude of text data available, it is computationally prohibitive to train many complex Natural Language Processing (NLP) models on it. This motivates the hypothesis that simple models trained on big data can outperform more complex models trained on small data. My dissertation provides a solution for effectively and efficiently exploiting large data in many NLP applications. Data sets are growing at an exponential rate, much faster than available memory. To provide a memory-efficient way of handling large data sets, this dissertation shows the limitations of existing streaming and sketch algorithms when applied to canonical NLP problems and proposes several new variants to overcome those shortcomings. Streaming and sketch algorithms process a large data set in one pass and represent it with a compact summary, much smaller than the full size of the input. These algorithms can easily be implemented in a distributed setting and provide a solution that is both memory- and time-efficient. However, the memory and time savings come at the expense of approximate solutions. In this dissertation, I demonstrate that approximate solutions achieved on large data are comparable to exact solutions on the same data and outperform exact solutions on smaller data. I focus on NLP problems that boil down to tracking many statistics: storing approximate counts, computing approximate association scores such as pointwise mutual information (PMI), finding frequent items (such as n-grams), building streaming language models, and measuring distributional similarity. First, I introduce the concept of approximate streaming large-scale language models in NLP. Second, I present a novel variant of the Count-Min sketch that maintains approximate counts of all items. Third, I conduct a systematic study comparing many sketch algorithms that approximate counts of items, with a focus on large-scale NLP tasks. Last, I develop the fast large-scale approximate graph (FLAG) system, which quickly constructs a large-scale approximate nearest-neighbor graph from a large corpus.
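For context, the Count-Min sketch named in the abstract is a standard one-pass summary: a small table of counters with one hash function per row, where a query returns the minimum counter across rows and therefore never underestimates a count. Below is a minimal Python sketch of that textbook structure, not the dissertation's novel variant; the eps/delta parameters, the md5-based hashing, and the toy n-gram stream are illustrative assumptions.

import hashlib
import math

class CountMinSketch:
    # One-pass summary for approximate counts: `depth` rows of `width`
    # counters, one hash function per row. Queries return the minimum
    # counter over the rows, so counts are never underestimated; the
    # overestimate is at most eps * N with probability >= 1 - delta,
    # where N is the total count inserted so far.
    def __init__(self, eps=1e-4, delta=1e-3):
        self.width = math.ceil(math.e / eps)
        self.depth = math.ceil(math.log(1.0 / delta))
        self.table = [[0] * self.width for _ in range(self.depth)]

    def _buckets(self, item):
        # Derive one bucket per row from a row-seeded hash of the item.
        for row in range(self.depth):
            digest = hashlib.md5(f"{row}:{item}".encode()).hexdigest()
            yield row, int(digest, 16) % self.width

    def update(self, item, count=1):
        for row, col in self._buckets(item):
            self.table[row][col] += count

    def query(self, item):
        return min(self.table[row][col] for row, col in self._buckets(item))

# Toy usage: count n-grams seen on a stream, then query approximate counts.
cms = CountMinSketch()
for gram in ["new york", "new york", "new car"]:
    cms.update(gram)
print(cms.query("new york"))  # >= 2; equals 2 unless hash collisions inflate it

The same summary supports the approximate association scores mentioned above: with sketched unigram and bigram counts, PMI(x, y) can be estimated as log(N * c(x, y) / (c(x) * c(y))) without ever storing exact counts.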

Bibliographic record

  • Author: Goyal, Amit.
  • Affiliation: University of Maryland, College Park.
  • Degree-granting institution: University of Maryland, College Park.
  • Subject: Computer science.
  • Degree: Ph.D.
  • Year: 2013
  • Pages: 99 p.
  • Total pages: 99
  • Format: PDF
  • Language: English
