Recently many people publicize their blogs and the blogosphere becomes an important information source. It is used for various purposes such as analyzing trends and reputations, marketing, etc. One problem of blogosphere is spam like e-mails and web links. There are many spam blogs (splogs) that are generated to make users to access specific sites. This paper proposes a splog filtering method. Splog is usually generated automatically by copying words and phrases from other documents. Therefore, the proposed method detects strings appearing in multiple blogs and uses a copy rate of strings as a key feature for splog filtering. To evaluate the proposed method, we constructed an evaluation corpus by gathering blogs randomly during a certain period of time and manually judged whether each blog is splog or not. The experiment using this corpus reveals several features of splog filtering by copy string detection. The proposed method uses the suffix array for copied substring detection and it can judge each blog with time complexity of O(m{sup}2 log n) where n and m denote total length of documents used for copy detection and the lengths of the blog to be judged, respectively.
展开▼