Rule Mining in Textual Data Using Passages

Abstract

As interest in and the need for Knowledge Discovery and Data Mining (KDD) in texts increase, applying association rule mining, a successful standard KDD method, to texts has attracted great attention. Contrary to expectations, however, most of this work has yielded syntactic rules or word collocations, which are unsatisfying in the context of KDD, where the objective is to extract previously unknown, useful information. One reason for these disappointing results may be that most previous work processes texts on a syntactic basis. For example, past work has used, as items and transactions respectively, words and documents, words and windows, terms and documents, and words and passages (segments of text). Here we propose using passages as items and documents as transactions. According to [5], breaking long texts down into passages improves information retrieval results. This result indicates that passages are a good indication of users' interests. We follow and extend this view, taking passages as an indication of the topics in a document. Our goal is to find associations between topics in documents rather than associations between words. The key issue in using passages is how to compare them, since a passage usually consists of a set of words. Because the number and frequency of the words in a passage differ from passage to passage, passages cannot be compared directly; we must convert them to some other processable representation. In this paper we propose a representation of passages and discuss a way to compare passages that supports soft matching.
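To make the idea of passages as items and documents as transactions concrete, the following is a minimal sketch and not the method of the paper, whose passage representation is not given in this abstract. It assumes a bag-of-words term-frequency vector for each passage, cosine similarity with a threshold as the soft match, and a greedy first-match assignment of passages to topic items. The function names, the thresholds, and the restriction to pairwise rules are all hypothetical choices made for illustration.

```python
import math
from collections import Counter
from itertools import combinations


def passage_vector(passage):
    """Assumed representation: term-frequency counts of lowercased tokens."""
    return Counter(passage.lower().split())


def cosine(a, b):
    """Cosine similarity between two term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def to_transactions(documents, threshold=0.5):
    """Turn each document (a list of passages) into a set of topic item ids.

    A passage is soft-matched to the first stored prototype whose cosine
    similarity reaches the threshold; otherwise it starts a new item.
    This greedy scheme is an illustrative assumption.
    """
    prototypes = []
    transactions = []
    for doc in documents:
        items = set()
        for passage in doc:
            vec = passage_vector(passage)
            for idx, proto in enumerate(prototypes):
                if cosine(vec, proto) >= threshold:
                    items.add(idx)
                    break
            else:
                prototypes.append(vec)
                items.add(len(prototypes) - 1)
        transactions.append(items)
    return transactions


def pair_rules(transactions, min_support=0.5, min_confidence=0.6):
    """Mine rules x -> y between single items using support and confidence."""
    n = len(transactions)
    item_count = Counter(i for t in transactions for i in t)
    pair_count = Counter(
        p for t in transactions for p in combinations(sorted(t), 2)
    )
    rules = []
    for (a, b), count in pair_count.items():
        support = count / n
        if support < min_support:
            continue
        for x, y in ((a, b), (b, a)):
            confidence = count / item_count[x]
            if confidence >= min_confidence:
                rules.append((x, y, support, confidence))
    return rules


if __name__ == "__main__":
    docs = [
        ["stock market prices fall", "central bank raises interest rates"],
        ["stock market prices rise", "central bank cuts interest rates"],
        ["football team wins match"],
    ]
    for rule in pair_rules(to_transactions(docs)):
        print(rule)
```

In this toy input the two finance documents share both topic items even though their passages are not word-for-word identical, which is the effect soft matching is meant to give: the resulting rules relate topics rather than individual words.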
