Stopword Detection for Streaming Content

Abstract

Removing stopwords is an important preprocessing step in many natural language processing tasks and can improve both task performance and execution time. Many existing methods either rely on a predefined list of stopwords or compute word significance from metrics such as tf-idf. The objective of this paper is to identify stopwords, in an unsupervised way, in streaming textual corpora such as Twitter, which are temporal in nature. We propose to model the dynamics of each word within the streaming corpus in order to identify the words that are less likely to be informative or discriminative. Our approach applies the discrete wavelet transform (DWT) to word signals to extract two features, namely scale and energy. We show that the proposed approach is effective in identifying stopwords and improves the quality of topics in the task of topic detection.
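The two DWT features can be pictured with a small sketch. This is not the paper's implementation, and the abstract does not specify the details, so the following are all assumptions made for illustration: a word signal is the word's count per fixed time bin, the wavelet is Haar, energy is the sum of squared detail coefficients, and scale is the decomposition level that holds the most energy. The function names `haar_dwt` and `scale_and_energy` are hypothetical.

```python
import math


def haar_dwt(signal):
    """Full Haar decomposition of a signal whose length is a power of two.

    Returns (details, approx): details[0] holds the finest-level detail
    coefficients, details[-1] the coarsest; approx is the final
    single-element approximation.
    """
    n = len(signal)
    if n == 0 or n & (n - 1):
        raise ValueError("signal length must be a power of two")
    approx = [float(x) for x in signal]
    details = []
    while len(approx) > 1:
        pairs = list(zip(approx[0::2], approx[1::2]))
        # Detail = scaled difference of neighbours, approx = scaled sum.
        details.append([(a - b) / math.sqrt(2) for a, b in pairs])
        approx = [(a + b) / math.sqrt(2) for a, b in pairs]
    return details, approx


def scale_and_energy(signal):
    """Return (dominant_level, total_detail_energy) for a word signal.

    Assumption: 'scale' is the 1-based level with the largest detail
    energy (0 if the signal has no variation at all), and 'energy' is
    the summed squared detail coefficients over all levels.
    """
    details, _ = haar_dwt(signal)
    energies = [sum(c * c for c in level) for level in details]
    total = sum(energies)
    if total == 0:
        return 0, total
    dominant = max(range(len(energies)), key=energies.__getitem__) + 1
    return dominant, total


# Hypothetical per-bin counts: a word used at a constant rate versus
# a word that appears in one short burst.
steady = [5, 5, 5, 5, 5, 5, 5, 5]
bursty = [0, 0, 0, 9, 8, 0, 0, 0]

print(scale_and_energy(steady))
print(scale_and_energy(bursty))
```

Under these assumptions, the constant-rate word has no detail energy at any level, while the bursty word concentrates its energy at the finest level. How the paper turns scale and energy into a stopword decision is not stated in the abstract, so no threshold is shown here.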

