Conference: International Conference on Computational Intelligence and Security
Document Sensitivity Classification for Data Leakage Prevention with Twitter-Based Document Embedding and Query Expansion

Abstract

Document sensitivity classification is essential for preventing leakage of sensitive data belonging to individuals and organizations. Because most existing methods use regular expressions or data fingerprinting to classify sensitive documents, they may not fully exploit the semantics and content of a document, especially for informal messages and files. This motivates us to propose a novel method that classifies document sensitivity in real time with better semantic and content analysis. Taking advantage of deep learning in natural language processing, we use our pre-trained Twitter-based document embedding, TD2V, to encode a document or a text fragment into a fixed-length 300-dimensional vector. We then use retrieval and automatic query expansion to obtain a re-ranked list of semantically similar known documents, and determine the sensitivity score of a new document from the scores of the retrieved documents in this list. Experimental results show that our method achieves classification accuracy above 99.9% on four datasets (Snowden, Mormon, Dyncorp, TM) and 98.34% on the Enron dataset. Furthermore, our method can predict early that a document is sensitive from a short text fragment, with accuracy higher than 98.84%.
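
The retrieve, expand, re-rank, and score pipeline described in the abstract can be illustrated with a minimal sketch. The abstract does not specify the details, so everything here is an assumption made for illustration: the toy 3-dimensional vectors stand in for 300-dimensional TD2V embeddings (the TD2V model itself is not reproduced), cosine similarity is used for retrieval, query expansion is done by averaging the query with its first-pass hits, and the score is a similarity-weighted vote over binary labels. All function names and the parameter `k` are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(query: Vector, index: List[Vector], k: int) -> List[Tuple[int, float]]:
    """Return (position, similarity) pairs for the k most similar indexed vectors."""
    scored = [(i, cosine(query, vec)) for i, vec in enumerate(index)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


def expand_query(query: Vector, index: List[Vector],
                 hits: List[Tuple[int, float]]) -> List[float]:
    """Assumed expansion scheme: mean of the query and its first-pass hits."""
    vectors = [list(query)] + [list(index[i]) for i, _ in hits]
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def sensitivity_score(query: Vector, index: List[Vector],
                      labels: List[int], k: int = 2) -> float:
    """Score in [0, 1]: similarity-weighted vote over the re-ranked neighbours."""
    first_pass = top_k(query, index, k)
    expanded = expand_query(query, index, first_pass)
    reranked = top_k(expanded, index, k)  # second pass with the expanded query
    weights = [max(sim, 0.0) for _, sim in reranked]  # clip negative similarities
    total = sum(weights)
    if total == 0.0:
        return 0.0
    return sum(w * labels[i] for w, (i, _) in zip(weights, reranked)) / total


# Toy stand-ins for embeddings of known documents (1 = sensitive, 0 = not).
known_vectors = [[1.0, 0.1, 0.0], [0.9, 0.2, 0.1], [0.0, 1.0, 0.2], [0.1, 0.9, 0.0]]
known_labels = [1, 1, 0, 0]

sensitive_like = [0.95, 0.15, 0.05]  # close to the two sensitive documents
benign_like = [0.05, 0.95, 0.10]     # close to the two non-sensitive documents

score_sensitive = sensitivity_score(sensitive_like, known_vectors, known_labels)
score_benign = sensitivity_score(benign_like, known_vectors, known_labels)
```

In this toy setup the first query's neighbours are both labelled sensitive, so its score is close to 1, while the second query's neighbours are both non-sensitive, so its score is 0. A real system would replace the toy vectors with embeddings of the actual documents and choose `k` and a decision threshold empirically.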
