Conference: Cybercrime and Trustworthy Computing Workshop

A Comparative Study of Likelihood Ratio Based Forensic Text Comparison Procedures: Multivariate Kernel Density with Lexical Features vs. Word N-grams vs. Character N-grams

Abstract

This comparative study empirically investigates the performance of three procedures for calculating authorship attribution likelihood ratios (LRs): 1) a procedure based on multivariate kernel density (MVKD) with lexical features; 2) a procedure based on word N-grams; and 3) a procedure based on character N-grams. Furthermore, the best-performing LRs of the three procedures are fused into single combined LRs using logistic-regression fusion, in order to investigate how much the fusion improves or degrades performance. The test data are chatlog messages that were presented as evidence to prosecute paedophiles. Each message group is modelled with 500 and with 1000 word tokens in order to examine the effect of sample size on system performance. Performance is assessed in terms of validity (= accuracy) and reliability (= precision), using the log-likelihood-ratio cost (Cllr) and 95% credible intervals (CI), respectively. While describing how the outcomes of the three procedures differ in character, the study demonstrates that the MVKD procedure performed best of the three in terms of Cllr. It also demonstrates that logistic-regression fusion is useful for combining the LRs obtained from the three procedures, resulting in a good improvement in performance.
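For readers unfamiliar with the validity metric named in the abstract, the following is a minimal sketch of the log-likelihood-ratio cost in its standard form: half the sum of the mean of log2(1 + 1/LR) over same-author comparisons and the mean of log2(1 + LR) over different-author comparisons. The function name `cllr` and its two argument names are illustrative choices, not taken from the paper, and the sketch does not reproduce the authors' implementation or data.

```python
import math


def cllr(same_author_lrs, different_author_lrs):
    """Log-likelihood-ratio cost for a set of validation LRs (illustrative sketch).

    same_author_lrs: LRs from comparisons where both texts share an author.
    different_author_lrs: LRs from comparisons of texts by different authors.
    Both argument names are hypothetical; all LRs must be positive.
    """
    if not same_author_lrs or not different_author_lrs:
        raise ValueError("both comparison sets must be non-empty")
    if any(lr <= 0 for lr in list(same_author_lrs) + list(different_author_lrs)):
        raise ValueError("likelihood ratios must be positive")

    # Same-author comparisons are penalised more heavily the further LR falls below 1.
    same_penalty = sum(math.log2(1 + 1 / lr) for lr in same_author_lrs) / len(same_author_lrs)
    # Different-author comparisons are penalised more heavily the further LR rises above 1.
    diff_penalty = sum(math.log2(1 + lr) for lr in different_author_lrs) / len(different_author_lrs)

    return 0.5 * (same_penalty + diff_penalty)
```

A system that always outputs LR = 1 carries no information and scores exactly 1 under this definition; values closer to 0 indicate better validity, and values above 1 indicate LRs that mislead more than they inform.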
