Identification of Deliberately Doctored Text Documents Using Frequent Keyword Chain (FKC) Model

Abstract

Text documents have always been the dominant source of available data. Many classification techniques are used to organize these documents, and most of these algorithms rely on keywords to categorize them. Such algorithms can be misled by inserting keywords that belong to a class different from that of the document ('deliberate doctoring'). This kind of intentional deception is used to rank web pages higher in searches. Because text classification is also used to classify e-mails, deliberate doctoring serves as a way to defeat spam filters. It may also be practiced to avoid detection by security agencies. The cost of such misclassification can be high, and it is a serious problem in many scenarios. In this paper we exhaustively examine the possible methods of doctoring a document that may lead to its misclassification. We conclude that most of these methods involve inserting a number of misleading keywords in close proximity. We propose the Frequent Keyword Chain (FKC) model to identify such local concentrations of keywords. A tool called the FKCLocater, designed around the model, identifies and highlights FKCs in a document and alerts the user to the possibility of misclassification. The tool also lets the user specify various parameters to fine-tune the Frequent Keyword Chain model. Experiments on Newsgroup data sets show that the model is effective.
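The abstract does not say how FKCs are actually computed, so the following is only a minimal sketch of one plausible way to flag a local concentration of keywords: group keyword occurrences into a chain whenever consecutive occurrences are separated by only a few other tokens. The function name `find_keyword_chains` and the parameters `max_gap` and `min_length` are hypothetical illustrations, not names or settings taken from the paper or from the FKCLocater tool.

```python
def find_keyword_chains(tokens, keywords, max_gap=2, min_length=3):
    """Return (start, end) token-index spans of dense keyword runs.

    Assumption (not from the paper): a "chain" is a run of at least
    `min_length` keyword occurrences in which consecutive keywords are
    separated by at most `max_gap` non-keyword tokens.
    `keywords` is expected to be a set of lowercase strings.
    """
    chains = []
    current = []  # token indices of the keywords in the chain being built
    for i, token in enumerate(tokens):
        if token.lower() not in keywords:
            continue
        # Close the current chain if this keyword is too far from the last one.
        if current and i - current[-1] - 1 > max_gap:
            if len(current) >= min_length:
                chains.append((current[0], current[-1]))
            current = []
        current.append(i)
    # Flush the final chain, if it is long enough.
    if len(current) >= min_length:
        chains.append((current[0], current[-1]))
    return chains


# A politics sentence with sports keywords inserted close together.
text = ("parliament passed the budget today goal striker penalty referee "
        "and the vote on taxes followed")
sports_keywords = {"goal", "striker", "penalty", "referee"}
print(find_keyword_chains(text.split(), sports_keywords))  # prints [(5, 8)]
```

In this sketch, keywords that are spread thinly through a document never form a chain, which mirrors the abstract's observation that doctoring tends to place misleading keywords in close proximity. The thresholds play the role of tunable parameters, but their values here are arbitrary.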

Bibliographic details

Source: IEEE International Conference on Information Reuse and Integration, 2003, 8 pages
Authors: Siddharth Kaza; S. N. Jayaram Murthy; Gongzhu Hu
Original format: PDF
CLC classification: Automation technology; computer technology
Keywords: Text document classification; Frequent keywords; Doctored document detection
