
Scalable text semantic clustering around topics



Abstract

Detecting topics in natural language text collections is an important step towards flexible automated text handling, for tasks such as translation and summarization. In the currently dominant paradigm of topic modeling, topics are represented as probability distributions over terms. Although such models are theoretically sound, their high computational complexity makes them difficult to apply to very large collections. In this work we propose an alternative topic modeling paradigm based on a simpler representation of topics as overlapping clusters of semantically similar documents, which can take advantage of highly scalable clustering algorithms. Our Query-based Topic Modeling framework (QTM) is an information-theoretic method that assumes the existence of a "golden" set of queries that can capture most of the semantic information of the collection and produce models with maximum "semantic coherence". QTM was designed with scalability in mind and was executed in parallel using a Map-Reduce implementation; further, we show complexity measures that support our scalability claims. Our experiments show that QTM can produce models of quality comparable or even superior to that of models produced by state-of-the-art probabilistic methods.
