Conference: IEEE International Conference on Information Reuse and Integration

Identification of Deliberately Doctored Text Documents Using Frequent Keyword Chain (FKC) Model


Abstract

Text documents have always been the dominant source of available data. A number of classification techniques are used to organize these documents, and a majority of these classification algorithms use keywords to categorize them. It is possible to mislead such algorithms by inserting keywords belonging to a class different from that of the document ('deliberate doctoring'). Such intentional deception is done in order to rank web pages higher in searches. As text classification is also used to classify e-mails, deliberate doctoring is used as a spam filter-busting measure. In addition, it may be practiced to avoid detection by security agencies. The cost of such misclassification can be high, and it is a serious problem in many scenarios. In this paper we exhaustively examine the possible methods of doctoring a document that may lead to its misclassification. We conclude that a majority of these methods involve inserting a number of misleading keywords in close proximity. We propose the Frequent Keyword Chain (FKC) model to identify such local concentrations of keywords. A tool called the FKCLocater is designed around the model; it identifies and highlights FKCs in a document and alerts the user to the possibility of misclassification. The tool is also used to specify various parameters to fine-tune the Frequent Keyword Chain model. Experiments on Newsgroup data sets show that this model is effective.
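The abstract does not give the formal definition of a Frequent Keyword Chain, so the following is only a minimal sketch of the general idea of finding a local concentration of keywords. It assumes a chain is a run of keyword occurrences in which each keyword lies within a fixed number of tokens of the previous one. The names `tokenize`, `find_keyword_chains`, `max_gap`, and `min_keywords` are hypothetical and are not taken from the paper or from the FKCLocater tool.

```python
import re
from typing import List, Set, Tuple


def tokenize(text: str) -> List[str]:
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def find_keyword_chains(
    tokens: List[str],
    keywords: Set[str],
    max_gap: int = 3,       # assumed parameter: max token distance between neighbours
    min_keywords: int = 3,  # assumed parameter: min keyword hits to report a chain
) -> List[Tuple[int, int]]:
    """Return (start, end) token positions of dense keyword runs.

    A run is extended while the next keyword hit is at most `max_gap`
    tokens after the previous hit; it is reported only if it contains
    at least `min_keywords` hits.
    """
    chains: List[Tuple[int, int]] = []
    current: List[int] = []  # positions of keyword hits in the run being built
    for pos, tok in enumerate(tokens):
        if tok not in keywords:
            continue
        # Close the current run if this hit is too far from the previous one.
        if current and pos - current[-1] > max_gap:
            if len(current) >= min_keywords:
                chains.append((current[0], current[-1]))
            current = []
        current.append(pos)
    # Flush the final run.
    if len(current) >= min_keywords:
        chains.append((current[0], current[-1]))
    return chains


if __name__ == "__main__":
    doc = (
        "the match report said the striker scored a goal before "
        "cheap pills discount pharmacy viagra offer "
        "then the referee ended the game"
    )
    off_topic = {"cheap", "pills", "discount", "pharmacy", "viagra", "offer"}
    toks = tokenize(doc)
    for start, end in find_keyword_chains(toks, off_topic):
        print(start, end, " ".join(toks[start:end + 1]))
```

In this sketch, keywords scattered thinly through a document do not form a chain, while a tight cluster of off-topic keywords does. That matches the abstract's observation that doctoring tends to insert misleading keywords in close proximity, though the paper's actual model and parameters may differ.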
