Journal: Journal of the American Society for Information Science and Technology

SUDMAD: Sequential and Unsupervised Decomposition of a Multi-Author Document Based on a Hidden Markov Model



Abstract

Decomposing a document written by more than one author into sentences grouped by authorship is of great significance, given the increasing demand for plagiarism detection, forensic analysis, civil law (i.e., disputed copyright issues), and intelligence work involving disputed anonymous documents. Among existing studies on document decomposition, some were limited to specific languages, some decomposed documents according to topic, and some were restricted to documents with two authors; their accuracy also leaves considerable room for improvement. In this paper, we consider the contextual correlation hidden among sentences and propose an algorithm for Sequential and Unsupervised Decomposition of a Multi-Author Document (SUDMAD) written in any language, regardless of topic, through the construction of a Hidden Markov Model (HMM) that reflects the authors' writing styles. To build and learn such a model, an unsupervised statistical approach is first proposed to estimate the initial values of the HMM parameters of a preliminary model; it requires no information about the authors or the document's context other than the number of authors who contributed to the document. To further boost the performance of this approach, a boosted HMM learning procedure is then proposed, in which the initial classification results are used to create labeled training data for learning a more accurate HMM. Moreover, the contextual relationship among sentences is further exploited to refine the classification results. The proposed approach is empirically evaluated on three benchmark datasets that are widely used for authorship analysis of documents. Comparisons with recent state-of-the-art approaches are also presented to demonstrate the significance of the new ideas and the superior performance of the approach.
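The abstract does not specify the emission model, the parameter values, or the decoding procedure, so the following is only a minimal sketch of the general idea of treating authors as hidden states and sentences as observations. It assumes that per-sentence likelihoods under each author's style model are already available, and it uses standard Viterbi decoding with a hypothetical "sticky" transition matrix. The function names (`viterbi`, `sticky_transitions`), the `stay` parameter, and all numbers are illustrative assumptions, not the authors' implementation.

```python
import math


def viterbi(log_emis, log_trans, log_init):
    """Return the most probable hidden-state (author) sequence.

    log_emis[t][k]  : log P(sentence t | author k), assumed to be given
    log_trans[j][k] : log P(author k writes sentence t+1 | author j wrote t)
    log_init[k]     : log P(author k writes the first sentence)
    """
    n_states = len(log_init)
    # score[k] = best log-probability of any path ending in state k
    score = [log_init[k] + log_emis[0][k] for k in range(n_states)]
    backptrs = []
    for t in range(1, len(log_emis)):
        new_score, ptr = [], []
        for k in range(n_states):
            # Best predecessor state for landing in state k at time t
            best_prev = max(range(n_states),
                            key=lambda j: score[j] + log_trans[j][k])
            ptr.append(best_prev)
            new_score.append(score[best_prev] + log_trans[best_prev][k]
                             + log_emis[t][k])
        score = new_score
        backptrs.append(ptr)
    # Trace back from the best final state
    state = max(range(n_states), key=lambda k: score[k])
    path = [state]
    for ptr in reversed(backptrs):
        state = ptr[state]
        path.append(state)
    path.reverse()
    return path


def sticky_transitions(n_states, stay=0.9):
    """Hypothetical transition matrix: consecutive sentences tend to share
    an author. Requires n_states >= 2."""
    off = (1.0 - stay) / (n_states - 1)
    return [[math.log(stay if j == k else off) for k in range(n_states)]
            for j in range(n_states)]


# Toy input (made-up numbers): 7 sentences, 2 authors.
emis = [[0.8, 0.2], [0.7, 0.3], [0.45, 0.55], [0.8, 0.2],
        [0.2, 0.8], [0.1, 0.9], [0.2, 0.8]]
log_emis = [[math.log(p) for p in row] for row in emis]

# Labeling each sentence in isolation ignores its neighbors.
independent = [max(range(2), key=lambda k: row[k]) for row in emis]

# Sequential decoding takes the neighboring sentences into account.
labels = viterbi(log_emis, sticky_transitions(2, stay=0.9),
                 [math.log(0.5)] * 2)
print(independent)
print(labels)
```

In this toy input the third sentence slightly favors the second author when scored in isolation, but sequential decoding keeps it with the first author because two extra author switches cost more than the small gain in emission likelihood. This is one way contextual correlation among sentences can change a per-sentence decision.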

Bibliographic details

  • Author affiliations

    Global Big Data Technologies Centre, University of Technology Sydney, Australia. Fujian Provincial Key Laboratory of Information Processing and Intelligent Control, Minjiang University, Fuzhou, Fujian, 350121, China;

    Global Big Data Technologies Centre, University of Technology Sydney, Australia. School of Software and Microelectronics, Northwestern Polytechnical University, China;

    Global Big Data Technologies Centre, University of Technology Sydney, Australia;

    Department of Industrial Engineering and Engineering Management, National Tsing Hua University, Taiwan;

  • Original format: PDF
  • Language: English (eng)
