Conference: Information Reuse and Integration, 2003. IRI 2003. IEEE International Conference on

Identification of deliberately doctored text documents using frequent keyword chain (FKC) model


Abstract

Text documents have always been the dominant source of available data. A number of classification techniques are used to organize these documents, and the majority of these algorithms use keywords to categorize them. Such algorithms can be misled by inserting keywords that belong to a class different from that of the document ('deliberate doctoring'). This intentional deception is used to rank Web pages higher in searches. Because text classification is also used to classify e-mails, deliberate doctoring serves as a way to defeat spam filters. In addition, it may be practiced to avoid detection by security agencies. The cost of such misclassification can be high, and it is a serious problem in many scenarios. In this paper we exhaustively examine the possible methods of doctoring a document that may lead to its misclassification. We conclude that the majority of these methods involve inserting a number of misleading keywords in close proximity. We propose the frequent keyword chain (FKC) model to identify such local concentrations of keywords. A tool called the FKCLocater, designed around the model, identifies and highlights FKCs in a document and alerts the user to the possibility of misclassification. The tool is also used to specify various parameters that fine-tune the frequent keyword chain model. Experiments on newsgroup data sets show that this model is effective.
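The abstract does not define how a keyword chain is built, so the following is only a minimal sketch of the general idea of detecting a local concentration of keywords, not the paper's FKC model or the FKCLocater tool. The function name `find_keyword_chains` and the parameters `max_gap` and `min_chain_len` are hypothetical: a chain is assumed here to be a run of keyword hits in which neighbouring hits are at most `max_gap` token positions apart.

```python
from typing import List, Set, Tuple


def find_keyword_chains(
    tokens: List[str],
    keywords: Set[str],
    max_gap: int = 3,
    min_chain_len: int = 3,
) -> List[Tuple[int, int]]:
    """Return (start, end) token index spans where keywords cluster.

    Assumed definition (not taken from the paper): a chain grows while
    consecutive keyword hits are at most `max_gap` positions apart, and
    chains with fewer than `min_chain_len` hits are discarded.
    `keywords` is expected to contain lowercase strings.
    """
    chains: List[Tuple[int, int]] = []
    current: List[int] = []  # positions of keyword hits in the open chain

    for pos, tok in enumerate(tokens):
        if tok.lower() not in keywords:
            continue
        # A hit too far from the previous one closes the open chain.
        if current and pos - current[-1] > max_gap:
            if len(current) >= min_chain_len:
                chains.append((current[0], current[-1]))
            current = []
        current.append(pos)

    # Flush the chain that is still open at the end of the document.
    if len(current) >= min_chain_len:
        chains.append((current[0], current[-1]))
    return chains


if __name__ == "__main__":
    # Hypothetical example: finance keywords inserted into a sports text.
    text = ("the match report covered the final score in detail "
            "mortgage loan credit refinance loan "
            "and the coach praised the team")
    finance = {"mortgage", "loan", "credit", "refinance"}
    words = text.split()
    for start, end in find_keyword_chains(words, finance):
        print(start, end, words[start:end + 1])
```

In this sketch the keyword set would come from a class other than the one the document is assigned to, and `max_gap` and `min_chain_len` stand in for the kind of tunable parameters the abstract mentions without naming.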
