European Intelligence and Security Informatics Conference

Text Mining in Unclean, Noisy or Scrambled Datasets for Digital Forensics Analytics

Abstract

In our era, most communication between people takes the form of electronic messages, especially through smart mobile devices. As a result, the exchanged text suffers from poor punctuation, misspelled words, continuous runs of several words without spaces, tables, internet addresses and so on, which make traditional text analytics methods difficult or impossible to apply without serious effort to clean the dataset. The method proposed in this paper works on massive noisy and scrambled texts with minimal preprocessing: special characters and spaces are removed to create a continuous string, and all repeated patterns are then detected very efficiently using the Longest Expected Repeated Pattern Reduced Suffix Array (LERP-RSA) data structure and a variant of the All Repeated Patterns Detection (ARPaD) algorithm. Meta-analysis of the results can further help a digital forensics investigator detect important information in the chunk of text analyzed.
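
The pipeline described in the abstract (strip special characters and spaces, then find every repeated pattern in the resulting continuous string) can be illustrated with a small Python sketch. This is not the authors' LERP-RSA or ARPaD implementation, whose details are not given here. It is a naive stand-in that sorts length-capped suffixes and scans adjacent entries for shared prefixes. The function names `normalize` and `repeated_patterns`, the parameters `min_len` and `max_len`, and the choice to keep only ASCII letters and digits are all assumptions made for illustration.

```python
import re
from typing import Dict, List


def normalize(text: str) -> str:
    """Lowercase the text and drop everything except ASCII letters and digits.

    Assumption: the abstract only says special characters and spaces are
    removed; the exact character set kept here is illustrative.
    """
    return re.sub(r"[^a-z0-9]", "", text.lower())


def repeated_patterns(s: str, min_len: int = 3, max_len: int = 12) -> Dict[str, List[int]]:
    """Map each substring of length min_len..max_len that occurs at least
    twice to the sorted list of its start positions.

    Naive stand-in for a suffix-array approach: every suffix is truncated
    to max_len characters (a hypothetical cap on pattern length), the
    truncated suffixes are sorted, and equal prefixes are then contiguous.
    """
    suffixes = sorted((s[i:i + max_len], i) for i in range(len(s)))
    found: Dict[str, List[int]] = {}
    for length in range(min_len, max_len + 1):
        run_key, run_pos = None, []
        for suffix, pos in suffixes:
            # Suffixes shorter than the current length cannot hold a pattern.
            key = suffix[:length] if len(suffix) >= length else None
            if key is not None and key == run_key:
                run_pos.append(pos)
                continue
            # The run ended: record it if the pattern occurred at least twice.
            if run_key is not None and len(run_pos) >= 2:
                found[run_key] = sorted(run_pos)
            run_key, run_pos = key, ([pos] if key is not None else [])
        if run_key is not None and len(run_pos) >= 2:
            found[run_key] = sorted(run_pos)
    return found


if __name__ == "__main__":
    noisy = "Call me at 5pm!! call-me at 6pm, callme@home"
    clean = normalize(noisy)
    for pattern, positions in sorted(repeated_patterns(clean, min_len=4, max_len=8).items()):
        print(pattern, positions)
```

The sketch costs roughly O(n log n) string comparisons for the sort plus one linear scan per pattern length, so it is suitable only for small inputs. The data structure and algorithm named in the abstract are intended for massive texts.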