Document Sensitivity Classification for Data Leakage Prevention with Twitter-Based Document Embedding and Query Expansion

Abstract

Document sensitivity classification is essential to prevent leakage of sensitive data from individuals and organizations. Because most existing methods use regular expressions or data fingerprinting to classify sensitive documents, they may not fully exploit the semantics and content of a document, especially for informal messages and files. This motivates the authors to propose a novel method that classifies document sensitivity in real time with better semantic and content analysis. Taking advantage of deep learning in natural language processing, we use our pre-trained Twitter-based document embedding, TD2V, to encode a document or a text fragment into a fixed-length vector of 300 dimensions. We then use retrieval and automatic query expansion to obtain a re-ranked list of semantically similar known documents, and determine the sensitivity score of a new document from the scores of the retrieved documents in this list. Experimental results show that our method achieves classification accuracy of more than 99.9% on four datasets (Snowden, Mormon, Dyncorp, TM) and 98.34% on the Enron dataset. Furthermore, our method can predict a sensitive document early, from a short text fragment, with accuracy higher than 98.84%.
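
The retrieve, expand, re-rank, and score steps described in the abstract can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration, since the abstract does not give these details: toy 3-dimensional vectors stand in for the 300-dimensional TD2V embeddings, cosine similarity is used for retrieval, query expansion is done by averaging the query with its top hits, and the score is a similarity-weighted mean of the retrieved labels. The function names and the 0.5 threshold are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query, index, k):
    """Return the k (similarity, vector, label) tuples closest to the query."""
    scored = [(cosine(query, vec), vec, label) for vec, label in index]
    scored.sort(key=lambda t: t[0], reverse=True)
    return scored[:k]

def expand_query(query, hits):
    """Assumed expansion rule: average the query with the retrieved vectors."""
    vectors = [query] + [vec for _, vec, _ in hits]
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def sensitivity_score(query, index, k=3):
    """Score a new document from the labels of its re-ranked neighbours."""
    first_pass = retrieve(query, index, k)
    expanded = expand_query(query, first_pass)
    reranked = retrieve(expanded, index, k)
    # Assumed scoring rule: similarity-weighted mean of neighbour labels,
    # with negative similarities clamped to zero.
    weights = [max(sim, 0.0) for sim, _, _ in reranked]
    total = sum(weights)
    if total == 0:
        return 0.0
    labels = [label for _, _, label in reranked]
    return sum(w * l for w, l in zip(weights, labels)) / total

# Toy index of known documents: (embedding, sensitivity label in [0, 1]).
INDEX = [
    ([0.9, 0.1, 0.0], 1.0),
    ([0.8, 0.2, 0.1], 1.0),
    ([0.7, 0.3, 0.0], 1.0),
    ([0.0, 0.1, 0.9], 0.0),
    ([0.1, 0.0, 0.8], 0.0),
    ([0.0, 0.2, 0.7], 0.0),
]

score = sensitivity_score([0.85, 0.15, 0.05], INDEX, k=3)
is_sensitive = score >= 0.5  # hypothetical decision threshold
```

In a real system the toy vectors would be replaced by embeddings of the known documents, and the choice of k and of the threshold would have to be tuned on labelled data.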

Bibliographic information

Source
International Conference on Computational Intelligence and Security, 2017, pp. 537-542 (6 pages)
Authors
Lap Q. Trieu; Trung-Nguyen Tran; Mai-Khiem Tran; Minh-Triet Tran
Original format
PDF
Keywords
Sensitivity; Semantics; Training; Task analysis; Market research; Natural language processing; Electronic mail

