A Splog Filtering Method Based on String Copy Detection

Abstract

In recent years many people have begun publishing blogs, and the blogosphere has become an important information source. It is used for purposes such as trend and reputation analysis and marketing. One problem of the blogosphere is spam, as with e-mail and web links. There are many spam blogs (splogs) that are generated to drive users to specific sites. This paper proposes a splog filtering method. Splogs are usually generated automatically by copying words and phrases from other documents. The proposed method therefore detects strings that appear in multiple blogs and uses the copy rate of strings as a key feature for splog filtering. To evaluate the method, we constructed an evaluation corpus by randomly gathering blogs over a certain period of time and manually judging whether each blog is a splog. Experiments on this corpus reveal several characteristics of splog filtering by copied-string detection. The proposed method uses a suffix array for copied-substring detection and can judge each blog with time complexity O(m^2 log n), where n denotes the total length of the documents used for copy detection and m denotes the length of the blog to be judged.
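
The copy-rate idea in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names, the `min_len` threshold, the NUL separator between reference documents, and the definition of copy rate as the fraction of characters covered by copied substrings are all assumptions made for illustration. The suffix array is built naively by sorting suffixes, which is adequate only for small inputs. Each of the m start positions in the blog runs one binary search of O(log n) steps, and each step compares at most m characters, which is consistent with the O(m^2 log n) bound stated above.

```python
def build_suffix_array(text: str) -> list[int]:
    # Naive construction (assumption: small corpus): sort suffix start
    # positions by the suffix they begin.
    return sorted(range(len(text)), key=lambda i: text[i:])


def _common_prefix_len(a: str, i: int, b: str, j: int) -> int:
    # Number of characters shared by a[i:] and b[j:] from their starts.
    k = 0
    while i + k < len(a) and j + k < len(b) and a[i + k] == b[j + k]:
        k += 1
    return k


def longest_match(corpus: str, sa: list[int], query: str, start: int) -> int:
    """Length of the longest prefix of query[start:] that occurs in corpus."""
    pattern = query[start:]
    lo, hi = 0, len(sa)
    # Binary search for the first suffix that is >= pattern. Truncating the
    # suffix to len(pattern) keeps each comparison at O(m) characters and
    # does not change the outcome of the "<" test.
    while lo < hi:
        mid = (lo + hi) // 2
        if corpus[sa[mid]:sa[mid] + len(pattern)] < pattern:
            lo = mid + 1
        else:
            hi = mid
    # The longest match is attained by a neighbour of the insertion point.
    best = 0
    for idx in (lo - 1, lo):
        if 0 <= idx < len(sa):
            best = max(best, _common_prefix_len(corpus, sa[idx], query, start))
    return best


def copy_rate(corpus: str, sa: list[int], blog: str, min_len: int = 5) -> float:
    """Fraction of blog characters covered by substrings of length >= min_len
    that also occur in the corpus (hypothetical definition of copy rate)."""
    if not blog:
        return 0.0
    copied = [False] * len(blog)
    for start in range(len(blog)):
        length = longest_match(corpus, sa, blog, start)
        if length >= min_len:
            for k in range(start, start + length):
                copied[k] = True
    return sum(copied) / len(blog)


# Reference documents joined with a NUL separator (an assumption) so that
# matches are unlikely to span two documents.
corpus = "\0".join([
    "buy cheap watches online today",
    "the weather in kyoto was lovely",
])
sa = build_suffix_array(corpus)
print(copy_rate(corpus, sa, "buy cheap watches QQQQQQ"))
```

A blog whose copy rate exceeds some threshold would then be flagged as a likely splog. The abstract does not state a threshold or how the feature is combined with others, so that choice is left open here.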
