Stopword Detection for Streaming Content

Abstract

Removing stopwords is an important preprocessing step in many natural language processing tasks and can improve both task performance and execution time. Many existing methods either rely on a predefined list of stopwords or compute word significance with metrics such as tf-idf. The objective of this paper is to identify stopwords, in an unsupervised way, in streaming textual corpora such as Twitter, which are temporal in nature. We propose to model the dynamics of each word within the streaming corpus in order to identify the words that are less likely to be informative or discriminative. Our approach applies the discrete wavelet transform (DWT) to word signals to extract two features, namely scale and energy. We show that the proposed approach is effective in identifying stopwords and improves the quality of topics in the task of topic detection.
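To make the idea of wavelet features on a word signal concrete, here is a minimal Python sketch. It is not the authors' implementation: the abstract does not say how a word signal is built, which wavelet is used, or how scale and energy are defined. The sketch assumes that a word signal is the word's count per fixed time window, uses a hand-written Haar transform, takes "scale" to be the decomposition level holding the most detail energy, and takes "energy" to be the share of the signal's energy that sits in the detail coefficients. The function names and the sample counts are hypothetical.

```python
import math


def haar_dwt(signal):
    """Full orthonormal Haar DWT of a signal whose length is a power of two.

    Returns (approx, details): approx is the single coarsest approximation
    coefficient (in a list), details[0] holds the finest-level detail
    coefficients and details[-1] the coarsest.
    """
    n = len(signal)
    if n == 0 or n & (n - 1):
        raise ValueError("signal length must be a power of two")
    approx = [float(x) for x in signal]
    details = []
    while len(approx) > 1:
        pairs = list(zip(approx[0::2], approx[1::2]))
        details.append([(a - b) / math.sqrt(2) for a, b in pairs])
        approx = [(a + b) / math.sqrt(2) for a, b in pairs]
    return approx, details


def wavelet_features(signal):
    """Return (dominant_level, detail_energy_ratio) for a word signal.

    Assumed definitions (not taken from the paper):
    - dominant_level: 1-based level with the largest detail energy,
      or 0 if the signal has no variation at all.
    - detail_energy_ratio: detail energy divided by total signal energy.
    """
    _, details = haar_dwt(signal)
    level_energy = [sum(c * c for c in level) for level in details]
    detail_energy = sum(level_energy)
    if detail_energy == 0:
        return 0, 0.0  # constant signal: nothing to localize
    signal_energy = sum(float(x) * float(x) for x in signal)
    dominant = max(range(len(level_energy)), key=level_energy.__getitem__) + 1
    return dominant, detail_energy / signal_energy


# Hypothetical counts per time window for two words.
steady_word = [50, 50, 50, 50, 50, 50, 50, 50]  # used evenly over time
bursty_word = [0, 0, 0, 40, 45, 0, 0, 0]        # spikes around one event

print(wavelet_features(steady_word))
print(wavelet_features(bursty_word))
```

Under these assumptions, a word that is used evenly over time has almost no detail energy, while a word tied to an event concentrates its energy in the detail coefficients. A stopword detector along these lines would flag words of the first kind, though the thresholds and the exact features would need to follow the paper.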

Bibliographic details

Source
European Conference on IR Research, 2018, pp. 737-743 (7 pages)
Authors
Hossein Fani; Masoud Bashari; Fattane Zarrinkalam; Ebrahim Bagheri; Feras Al-Obeidat
Original format: PDF

