Improving Statistical Bayesian Spam Filtering Algorithms

Contents

English Abstract

Thesis notes: List of Tables, List of Figures

Hunan University Statement of Originality and Copyright Statement

Acknowledgements

Chapter 1 Introduction

1.1 Spam and its Types

1.2 Anti-spamming Techniques

1.3 Previous Works on Bayesian Spam Filtering

1.4 Contributions

1.5 Thesis Organization

Chapter 2 Statistical Bayesian Spam Filtering Algorithms

2.1 Spam Filtering Steps

2.2 Naive Bayes (NB) Algorithm

2.3 Paul Graham's (PG) Algorithm

2.4 Gary Robinson's (GR) Algorithm

2.5 Dealing with Small Probabilities and Normalization

Chapter 3 Preprocessing and Feature Selection

3.1 Preprocessing

3.2 Feature Extraction or Tokenization

Chapter 4 Filtering Based on Co-weighted Multi-estimations

4.1 Main Idea and Algorithm Description

4.2 Training Algorithm

4.3 Classification Algorithm

Chapter 5 Filtering Based on Co-weighted Multi-area Information

5.1 Main Idea and Algorithm Description

5.2 Training Algorithm

5.3 Classification Algorithm

Chapter 6 Dataset Collections and Evaluation Measures

6.1 Corpora Collections

6.2 Evaluation Measures

Chapter 7 Experiments and Analysis

7.1 Parameters Tuning

7.2 Experiments with Co-weighted Multi-estimations

7.2.1 Experiments and Results

7.2.2 Analysis

7.3 Experiments with Co-weighted Multi-area Information

7.3.1 Experiments and Results

7.3.2 Analysis

Chapter 8 Conclusions and Future Work

8.1 Conclusions

8.2 Future Work

Appendix A Implementation of Filter Application

A.1 Data Structures

A.2 Source Files

A.3 Data Files

Appendix B Application User's Manual

B.1 System Requirements

B.2 Installation of the Application

B.3 Running and Using the Application

B.3.1 Dataset Preparer

B.3.2 Trainer

B.3.3 Classifier

B.3.4 Tester

Appendix C Program Documentation

C.1 Package and Class Summaries

C.1.1 Class Summary

C.1.2 Enum Summary

C.2 Hierarchy For Package rsspambayes

C.2.1 Class Hierarchy

C.2.2 Enum Hierarchy

C.3 Class Details

C.3.1 Algorithm

C.3.2 Category

C.3.3 Classifier

C.3.4 Counts

C.3.5 DatasetPreparer

C.3.6 FreqTable

C.3.7 Frequencies

C.3.8 GRobinsonBayes

C.3.9 NaiveBayes

C.3.10 PGrahamBayes

C.3.11 RShresthaBayes1

C.3.12 RShresthaBayes2

C.3.13 Stats

C.3.14 Tester

C.3.15 Tokenizer

C.3.16 Trainer

C.3.17 Utils

C.4 Enum Details

C.4.1 Algorithms

C.4.2 Areas

C.4.3 EmailCats

C.4.4 Headers

C.4.5 HtmlTags

C.4.6 Method Detail for All Enum Types

Appendix D List of Papers Published

Bibliography

Abstract

The aim of this thesis is to improve the accuracy of Bayesian spam filtering, the most popular and widely used approach to spam filtering. Among the various possible approaches to this aim, two that improved filtering performance are presented in this thesis. Three popular evolutions of Bayesian spam filtering algorithms are reviewed: Naive Bayes, Paul Graham's and Gary Robinson's. The proposed algorithms are formulated on top of those evolutions and incorporate novel ideas.

The first approach is the co-weighting of multiple probability estimations. Though all are based on Bayes' theorem, several ways of computing probability estimations have been proposed and used. Those estimations are examined, and a new, combined, more effective estimation based on co-weighted multi-estimations is proposed. The approach is compared with the individual estimations.

The second approach is based on co-weighted multi-area information. Bayesian spam filters, in general, compute probability estimations for tokens either without considering the email areas in which they occur, except the body, or by treating the same token occurring in different areas as different tokens. In reality, however, occurrences of the same token in different areas are inter-related, and this relation could also play a role in classification. This idea is incorporated by correlating multi-area information through co-weighting, which yields more effective combined probability estimations for tokens. It is shown that this approach also improves the performance of spam filtering. The new approach is compared with individual area-wise estimations and with traditional separate estimations in all areas.

The filters are tested in thorough experiments with three well-known public corpora (Ling Spam, Spam Assassin and Annexia/Xpert) and are evaluated using several performance measures. Both proposed approaches are shown to exhibit significant improvement, stability, robustness and consistency in spam filtering.
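
The abstract does not give the thesis's formulas, so the following Python sketch only illustrates the general idea of co-weighting. It assumes a simplified Graham-style estimate and a Robinson-style estimate for one token, then combines them with a normalized weighted average. The function names, the weights, the per-area counts and the use of a plain weighted average are all hypothetical choices made for illustration, not the method defined in the thesis.

```python
def graham_estimate(spam_count, ham_count, n_spam, n_ham):
    """Simplified Graham-style token spamminess (illustrative assumption).

    Ham counts are doubled to bias against false positives, rarely seen
    tokens fall back to 0.4, and the result is clamped to [0.01, 0.99].
    """
    good = 2 * ham_count
    bad = spam_count
    if good + bad < 5:
        return 0.4  # simplification: treat rare tokens like unseen ones
    p_spam = min(1.0, bad / n_spam)
    p_ham = min(1.0, good / n_ham)
    return max(0.01, min(0.99, p_spam / (p_spam + p_ham)))


def robinson_estimate(spam_count, ham_count, n_spam, n_ham, s=1.0, x=0.5):
    """Robinson-style estimate: raw ratio shrunk toward a prior x with strength s."""
    n = spam_count + ham_count
    if n == 0:
        return x  # no evidence, so fall back to the prior
    spam_ratio = spam_count / n_spam
    ham_ratio = ham_count / n_ham
    raw = spam_ratio / (spam_ratio + ham_ratio)
    return (s * x + n * raw) / (s + n)


def co_weight(estimates, weights):
    """Hypothetical co-weighting: normalized weighted average of estimates."""
    if len(estimates) != len(weights):
        raise ValueError("estimates and weights must have the same length")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(w * p for w, p in zip(weights, estimates)) / total


# Multi-estimation: combine two estimates of the same token.
counts = dict(spam_count=10, ham_count=0, n_spam=100, n_ham=100)
multi_estimation = co_weight(
    [graham_estimate(**counts), robinson_estimate(**counts)],
    [0.5, 0.5],  # hypothetical weights
)

# Multi-area: combine estimates of the same token seen in different email areas.
subject = robinson_estimate(spam_count=8, ham_count=1, n_spam=100, n_ham=100)
body = robinson_estimate(spam_count=20, ham_count=15, n_spam=100, n_ham=100)
multi_area = co_weight([subject, body], [0.6, 0.4])  # hypothetical area weights
```

In this sketch the same `co_weight` helper serves both ideas: in the first case the inputs are different estimators applied to one set of counts, and in the second they are one estimator applied to counts from different email areas. In practice the weights would have to be tuned experimentally.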
