Finding keywords amongst noise: Automatic text classification without parsing

Abstract

The amount of text stored on the Internet, and in our libraries, continues to expand at an exponential rate. There is a great practical need to locate relevant content. This requires quick automated methods for classifying textual information, according to subject. We propose a quick statistical approach, which can distinguish between 'keywords' and 'noisewords', like 'the' and 'a', without the need to parse the text into its parts of speech. Our classification is based on an F-statistic, which compares the observed Word Recurrence Interval (WRI) with a simple null hypothesis. We also propose a model to account for the observed distribution of WRI statistics and we subject this model to a number of tests.
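The abstract does not define the F-statistic or the null hypothesis, so the following Python is only a rough sketch of the general idea and not the authors' method. It assumes a geometric null, in which a word appears independently at each position with a fixed probability, and compares the observed variance of a word's recurrence intervals with the variance that null predicts. The function names `recurrence_intervals` and `dispersion_ratio` are hypothetical.

```python
from collections import defaultdict
from statistics import mean, variance


def recurrence_intervals(tokens):
    """Map each word to the gaps (in tokens) between its successive occurrences."""
    last_seen = {}
    gaps = defaultdict(list)
    for position, word in enumerate(tokens):
        if word in last_seen:
            gaps[word].append(position - last_seen[word])
        last_seen[word] = position
    return gaps


def dispersion_ratio(intervals):
    """Observed gap variance divided by the variance expected under a geometric null.

    ASSUMPTION: this ratio is a stand-in for the paper's F-statistic, whose exact
    form is not given in the abstract. Under the assumed null, gaps with mean m
    have variance m * (m - 1).
    """
    if len(intervals) < 2:
        return None  # too few recurrences to estimate a variance
    m = mean(intervals)
    if m <= 1:
        return None  # the null variance is zero, so the ratio is undefined
    return variance(intervals) / (m * (m - 1))
```

Under these assumptions, a word that recurs at perfectly regular intervals gives a ratio of 0, while a word whose occurrences cluster in a few passages, as a topical keyword might, gives a ratio above 1. No parsing into parts of speech is needed, only token positions.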
