Conference: European Conference on IR Research

Stopword Detection for Streaming Content

Abstract

Stopword removal is an important preprocessing step in many natural language processing tasks and can improve both task performance and execution time. Many existing methods either rely on a predefined list of stopwords or compute word significance using metrics such as tf-idf. The objective of this paper is to identify stopwords, in an unsupervised way, in streaming textual corpora such as Twitter, which are temporal in nature. We propose to model the dynamics of each word within the streaming corpus in order to identify the words that are less likely to be informative or discriminative. Our approach applies the discrete wavelet transform (DWT) to word signals in order to extract two features, namely scale and energy. We show that the proposed approach is effective in identifying stopwords and improves the quality of topics in the task of topic detection.
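The abstract does not say how the word signals are built or how scale and energy are defined, so the following is only a minimal sketch under stated assumptions, not the authors' method. It assumes a word signal is a list of per-time-window counts whose length is a power of two, uses a hand-written Haar wavelet, takes energy to be the sum of squared detail coefficients, and takes scale to be the decomposition level that carries the most detail energy. The function names `haar_dwt` and `scale_and_energy` are hypothetical.

```python
import math


def haar_dwt(signal):
    """Full Haar DWT. Returns (approximation, details), details[0] = finest level.

    Assumption: len(signal) is a power of two.
    """
    n = len(signal)
    if n < 2 or n & (n - 1):
        raise ValueError("signal length must be a power of two (>= 2)")
    approx = [float(x) for x in signal]
    details = []
    while len(approx) > 1:
        pairs = list(zip(approx[0::2], approx[1::2]))
        # Detail coefficients capture change between neighbouring windows.
        details.append([(a - b) / math.sqrt(2) for a, b in pairs])
        # Approximation coefficients carry the smoothed signal to the next level.
        approx = [(a + b) / math.sqrt(2) for a, b in pairs]
    return approx, details


def scale_and_energy(signal):
    """Return (dominant_level, total_detail_energy) for a word signal.

    Assumed definitions (not taken from the paper): energy is the sum of
    squared detail coefficients; scale is the 1-based level with the most
    detail energy, or None when the signal has no variation at all.
    """
    _, details = haar_dwt(signal)
    level_energy = [sum(d * d for d in level) for level in details]
    total = sum(level_energy)
    if total == 0:
        return None, 0.0
    dominant = max(range(len(level_energy)), key=level_energy.__getitem__) + 1
    return dominant, total


# Hypothetical per-window counts for two words over eight time windows.
steady_word = [5, 5, 5, 5, 5, 5, 5, 5]  # used at a constant rate
bursty_word = [0, 0, 0, 0, 9, 9, 0, 0]  # spikes around one event

print(scale_and_energy(steady_word))
print(scale_and_energy(bursty_word))
```

Under these assumptions, a word used at a constant rate has zero detail energy, while a word tied to a burst of activity has high energy concentrated at some level. How such features are turned into a stopword decision is not described in the abstract and is left out here.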