Proceedings of the 45th Annual Hawaii International Conference on System Sciences

Mixed Graph of Terms: Beyond the Bags of Words Representation of a Text


Abstract

The main purpose of text mining techniques is to identify common patterns by observing vectors of features and then to use those patterns to make predictions. Feature vectors are usually made up of weighted words, like those used in text retrieval, and rest on the assumption that a document can be treated as a "bag of words". In this paper, however, we show that features more complex than simple weighted words can yield greater accuracy in analyzing and revealing common patterns. The proposed vector of features is a hierarchical structure, named a mixed Graph of Terms, composed of a directed and an undirected sub-graph of words, which can be constructed automatically from a small set of documents through the probabilistic Topic Model. The graph has demonstrated its efficiency in a classic "ad-hoc" text retrieval problem. Here we consider expanding the initial query with this new structured vector of features.
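To make the idea of a structured feature vector more concrete, the following is a minimal sketch of how a mixed graph of weighted word pairs might be stored and used to expand a query. It is not the authors' implementation: the class name `MixedGraphOfTerms`, the methods `add_undirected`, `add_directed`, and `expand`, the `min_weight` threshold, the example words, and all weights are hypothetical. The abstract states only that the structure combines a directed and an undirected sub-graph of words, and the step that learns the graph from documents with a topic model is not shown.

```python
from dataclasses import dataclass, field


@dataclass
class MixedGraphOfTerms:
    """Toy mixed graph: undirected word-word links plus directed word->word links.

    All names and weights here are illustrative assumptions, not the paper's API.
    """
    undirected: dict = field(default_factory=dict)  # frozenset({a, b}) -> weight
    directed: dict = field(default_factory=dict)    # (source, target) -> weight

    def add_undirected(self, a, b, weight):
        self.undirected[frozenset((a, b))] = weight

    def add_directed(self, source, target, weight):
        self.directed[(source, target)] = weight

    def expand(self, query_terms, min_weight=0.0):
        """Return {term: weight}: query terms at 1.0 plus their direct neighbours."""
        query = set(query_terms)
        expanded = {term: 1.0 for term in query}
        # Undirected links: add the other endpoint when exactly one end is in the query.
        for pair, weight in self.undirected.items():
            if weight < min_weight:
                continue
            hit = pair & query
            if hit:
                for other in pair - hit:
                    expanded[other] = max(expanded.get(other, 0.0), weight)
        # Directed links: follow edges that start at a query term.
        for (source, target), weight in self.directed.items():
            if weight >= min_weight and source in query and target not in query:
                expanded[target] = max(expanded.get(target, 0.0), weight)
        return expanded


# Hypothetical hand-built graph; a real one would be learned from documents.
graph = MixedGraphOfTerms()
graph.add_undirected("graph", "term", 0.8)
graph.add_directed("graph", "edge", 0.6)
graph.add_directed("term", "word", 0.3)
expanded_query = graph.expand(["graph"], min_weight=0.5)
```

In this sketch the expanded query keeps the original term at weight 1.0 and adds only direct neighbours whose link weight reaches the threshold. A retrieval system could then pass the weighted terms to its ranking function in place of the bare query.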
