Conference: International Conference on Very Large Data Bases (VLDB 2010)

Interesting-Phrase Mining for Ad-Hoc Text Analytics


Abstract

Large text corpora with news, customer mail and reports, or Web 2.0 contributions offer a great potential for enhancing business-intelligence applications. We propose a framework for performing text analytics on such data in a versatile, efficient, and scalable manner. While much of the prior literature has emphasized mining keywords or tags in blogs or social-tagging communities, we emphasize the analysis of interesting phrases. These include named entities, important quotations, market slogans, and other multi-word phrases that are prominent in a dynamically derived ad-hoc subset of the corpus, e.g., being frequent in the subset but relatively infrequent in the overall corpus. We develop preprocessing and indexing methods for phrases, paired with new search techniques for the top-k most interesting phrases in ad-hoc subsets of the corpus. Our framework is evaluated using a large-scale real-world corpus of New York Times news articles.
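The notion of a phrase being "frequent in the subset but relatively infrequent in the overall corpus" can be illustrated with a small brute-force sketch. This is not the indexing or search method of the paper, which the abstract does not detail. The scoring function (document frequency in the subset divided by document frequency in the whole corpus), the n-gram lengths, the `min_support` threshold, and all function names are assumptions chosen for illustration only.

```python
from collections import Counter
import heapq


def ngrams(tokens, min_len=2, max_len=4):
    """Yield every word n-gram of length min_len..max_len as a tuple."""
    for n in range(min_len, max_len + 1):
        for i in range(len(tokens) - n + 1):
            yield tuple(tokens[i:i + n])


def phrase_doc_freq(docs, min_len=2, max_len=4):
    """Count, for each phrase, the number of documents that contain it."""
    df = Counter()
    for text in docs:
        tokens = text.lower().split()
        # set() so that a phrase is counted at most once per document
        df.update(set(ngrams(tokens, min_len, max_len)))
    return df


def top_k_interesting(corpus, subset_ids, k=3, min_support=2):
    """Return the k phrases with the highest subset-to-corpus frequency ratio.

    Assumed score: df_subset(p) / df_corpus(p). For a fixed subset this ranks
    phrases the same way as the ratio of relative frequencies would.
    Each result is a (score, subset_support, phrase) tuple.
    """
    global_df = phrase_doc_freq(corpus)
    subset_df = phrase_doc_freq(corpus[i] for i in subset_ids)
    scored = (
        (subset_df[p] / global_df[p], subset_df[p], " ".join(p))
        for p in subset_df
        if subset_df[p] >= min_support  # ignore phrases that are rare in the subset
    )
    return heapq.nlargest(k, scored)


# Hypothetical toy corpus; the ad-hoc subset is documents 0 and 1.
corpus = [
    "the central bank raised interest rates again",
    "analysts say the central bank may pause",
    "the home team won the match",
    "fans say the home team played well",
]
for score, support, phrase in top_k_interesting(corpus, subset_ids=[0, 1]):
    print(f"{phrase!r}: score={score:.2f}, subset documents={support}")
```

This sketch rescans every document for each query, which would not scale to a corpus the size of the New York Times archive. The preprocessing and indexing methods mentioned in the abstract are aimed at avoiding exactly that cost.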
