Identification of deliberately doctored text documents using frequent keyword chain (FKC) model

Abstract

Text documents have always been the most dominant source of data available. A number of classification techniques are used to organize these documents, and a majority of these classification algorithms use keywords to categorize them. It is possible to mislead such algorithms by inserting keywords belonging to a class different from that of the document ('deliberate doctoring'). Such intentional deception is done in order to rank Web pages higher in searches. As text classification is used to classify e-mails, deliberate doctoring is also done as a spam filter-busting measure. In addition, it may be practiced to avoid detection by security agencies. The cost of such misclassification can be high, and it is a serious problem in many scenarios. In this paper we exhaustively examine the possible methods of doctoring a document that may lead to its misclassification. We conclude that a majority of these methods involve inserting a number of misleading keywords in close proximity. We propose the frequent keyword chain (FKC) model to identify such local concentrations of keywords. A tool called the FKCLocater is designed around the model; it identifies and highlights FKCs in a document and alerts the user to the possibility of misclassification. The tool is also used to specify various parameters to fine-tune the frequent keyword chain model. Experiments on newsgroup data sets show that this model is effective.
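
The abstract does not give the formal definition of a frequent keyword chain, so the following is only a minimal sketch of the general idea of flagging a local concentration of keywords from a foreign class. The function name `find_keyword_chains` and the parameters `max_gap` and `min_length` are hypothetical and are not taken from the paper or from the FKCLocater tool; the actual model may define and score chains differently.

```python
from typing import List, Set, Tuple


def find_keyword_chains(
    tokens: List[str],
    keywords: Set[str],
    max_gap: int = 3,     # hypothetical: max token distance between consecutive hits
    min_length: int = 3,  # hypothetical: min number of hits to report a chain
) -> List[Tuple[int, int]]:
    """Return (start, end) token index spans where keyword hits cluster.

    A span is reported when at least `min_length` keyword hits occur and
    each hit lies within `max_gap` tokens of the previous one.
    """
    # Positions of tokens that match the (lowercase) keyword set.
    hits = [i for i, tok in enumerate(tokens) if tok.lower() in keywords]

    chains: List[Tuple[int, int]] = []
    current: List[int] = []
    for pos in hits:
        # A gap larger than max_gap closes the chain being built.
        if current and pos - current[-1] > max_gap:
            if len(current) >= min_length:
                chains.append((current[0], current[-1]))
            current = []
        current.append(pos)
    # Flush the last chain, if it is long enough.
    if len(current) >= min_length:
        chains.append((current[0], current[-1]))
    return chains


# A sports-like sentence with a burst of finance keywords inserted.
text = ("the match report said the team won cheap loans mortgage "
        "credit refinance today and fans cheered")
finance_keywords = {"loans", "mortgage", "credit", "refinance"}
print(find_keyword_chains(text.split(), finance_keywords))  # [(8, 11)]
```

In this sketch, keywords scattered thinly through a document produce no chain, while a dense run of them is reported as a single span that a tool could highlight for the user.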


Bibliographic details

Source
Information Reuse and Integration, 2003. IRI 2003. IEEE International Conference on | 2003 | pp. 398-405 | 8 pages
Authors
Kaza S.; Murthy S.N.J.; Gongzhu Hu
Original format: PDF
Chinese Library Classification: radio electronics, telecommunications technology
Keywords
text analysis; classification; deliberately doctored text document identification; frequent keyword chain model; Web page ranking; text classification; spam filter-busting measure; misleading keywords; FKCLocater tool; data formats; keyword insertion

Date indexed: 2022-08-26 14:09:07

