Venue: ACM Conference on Information and Knowledge Management

Language Pyramid and Multi-Scale Text Analysis

Abstract

The classical Bag-of-Words (BOW) model represents a document as a histogram of word occurrences, losing the spatial information that is invaluable for many text analysis tasks. In this paper, we present the Language Pyramid (LaP) model, which casts a document as a probability distribution over the joint semantic-spatial space and motivates a multi-scale 2D local smoothing framework for nonparametric text coding. LaP efficiently encodes both the semantic and the spatial content of a document into a pyramid of matrices that are smoothed both semantically and spatially at a sequence of resolutions, providing a convenient multi-scale, image-like view for natural language understanding. The LaP representation can be used in text analysis in a variety of ways, of which we investigate two instantiations in the current paper: (1) multi-scale text kernels for document categorization, and (2) multi-scale language models for ad hoc text retrieval. Experimental results show that, for classification, LaP outperforms BOW by up to 4% on moderate-length texts (the RCV1 text benchmark) and by up to 15% on short texts (Yahoo! queries); for retrieval, LaP achieves a 12% MAP improvement over unigram language models on the OHSUMED data set.
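To make the multi-scale idea concrete, the following is a minimal sketch and not the authors' implementation. It assumes hard, equal-width position bins in place of the paper's local smoothing, omits semantic smoothing entirely, and uses a weighted histogram intersection as a stand-in for a multi-scale text kernel. The names `spatial_pyramid` and `pyramid_kernel`, the `levels` parameter, and the weighting scheme are all hypothetical choices made for illustration.

```python
from collections import Counter


def spatial_pyramid(tokens, levels=3):
    """Return, for each level, a list of per-bin word histograms.

    Level l splits the token sequence into 2**l contiguous bins, so level 0
    is the plain bag-of-words histogram and deeper levels retain
    progressively finer positional information. (Hard binning is an
    assumption of this sketch; the paper describes smoothing instead.)
    """
    n = len(tokens)
    pyramid = []
    for level in range(levels):
        bins = 2 ** level
        hists = [Counter() for _ in range(bins)]
        for pos, tok in enumerate(tokens):
            # Map the token position to a bin index in [0, bins).
            hists[pos * bins // n][tok] += 1
        pyramid.append(hists)
    return pyramid


def pyramid_kernel(tokens_a, tokens_b, levels=3):
    """Weighted histogram intersection summed over all levels and bins.

    Finer levels receive larger weights (an illustrative choice), so
    documents that share words in the same regions score higher than
    documents that merely share vocabulary.
    """
    pa = spatial_pyramid(tokens_a, levels)
    pb = spatial_pyramid(tokens_b, levels)
    score = 0.0
    for level, (bins_a, bins_b) in enumerate(zip(pa, pb)):
        weight = 1.0 / 2 ** (levels - 1 - level)
        for ha, hb in zip(bins_a, bins_b):
            # Counter & Counter keeps the minimum count of each shared word.
            score += weight * sum((ha & hb).values())
    return score


same_order = pyramid_kernel(["a", "b"], ["a", "b"], levels=2)
swapped = pyramid_kernel(["a", "b"], ["b", "a"], levels=2)
print(same_order, swapped)  # 3.0 1.0
```

With a single level the two comparisons would be indistinguishable, as in BOW; adding a second level separates them because the shared words no longer fall in the same position bins.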
