Conference: International Multi-Conference on Systems, Signals & Devices

Authors' Writing Styles Based Authorship Identification System Using the Text Representation Vector

Abstract

Text mining is one of the main and typical tasks of machine learning (ML). Authorship identification (AI) is a standard research subject in text mining and natural language processing (NLP) that has evolved remarkably in recent years. The task is to determine the actual author of an anonymous text on the basis of a set of writing samples. Standard text classification often relies on many handcrafted features, such as dictionaries, knowledge bases, and various stylometric characteristics, which often leads to high dimensionality. Unlike traditional approaches, this paper proposes an authorship identification approach based on automatic feature engineering using word2vec word embeddings, taking into account each author's writing style. The system includes two learning phases. The first stage generates a semantic representation of each author by using word2vec to learn and extract the most relevant characteristics of the raw document. The second stage applies a multilayer perceptron (MLP) classifier, trained with the backpropagation learning algorithm, to establish the classification rules. Experiments show that the MLP classifier with the word2vec model achieves an accuracy of 95.83% on an English corpus, suggesting that the word2vec word embedding model can clearly improve identification accuracy compared with classical models such as n-gram frequencies and bag of words.
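The two phases described in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation. The abstract does not say how word vectors are combined into a document representation, so averaging is assumed here. The `EMBEDDINGS` table is a hand-made, hypothetical stand-in for vectors that word2vec would learn from a corpus. `doc_vector`, `TinyMLP`, the layer sizes, the learning rate, and the toy training texts are all illustrative choices.

```python
import math
import random

# Hypothetical toy embeddings standing in for vectors learned by word2vec.
EMBEDDINGS = {
    "whale": [0.9, 0.1, 0.0], "sea": [0.8, 0.2, 0.1], "ship": [0.7, 0.0, 0.2],
    "manor": [0.1, 0.9, 0.1], "ball": [0.0, 0.8, 0.2], "letter": [0.2, 0.7, 0.0],
}
DIM = 3


def doc_vector(tokens):
    """Phase 1 (assumed form): average known word vectors into one text vector."""
    known = [EMBEDDINGS[t] for t in tokens if t in EMBEDDINGS]
    if not known:
        return [0.0] * DIM
    return [sum(col) / len(known) for col in zip(*known)]


class TinyMLP:
    """Phase 2: one tanh hidden layer and a softmax output, trained by backpropagation."""

    def __init__(self, n_in, n_hidden, n_out, seed=0):
        rng = random.Random(seed)
        self.w1 = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_hidden)]
        self.b1 = [0.0] * n_hidden
        self.w2 = [[rng.uniform(-0.5, 0.5) for _ in range(n_hidden)] for _ in range(n_out)]
        self.b2 = [0.0] * n_out

    def forward(self, x):
        h = [math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
             for row, b in zip(self.w1, self.b1)]
        z = [sum(w * hi for w, hi in zip(row, h)) + b
             for row, b in zip(self.w2, self.b2)]
        m = max(z)  # subtract the max for a numerically stable softmax
        e = [math.exp(v - m) for v in z]
        s = sum(e)
        return h, [v / s for v in e]

    def train_step(self, x, y, lr=0.5):
        """One stochastic gradient step on the cross-entropy loss; returns the loss."""
        h, p = self.forward(x)
        # Gradient of cross-entropy with respect to the output pre-activations.
        dz = [pi - (1.0 if k == y else 0.0) for k, pi in enumerate(p)]
        # Propagate the error to the hidden layer before touching w2.
        dh = [sum(dz[k] * self.w2[k][j] for k in range(len(dz))) * (1 - h[j] ** 2)
              for j in range(len(h))]
        for k in range(len(dz)):
            for j in range(len(h)):
                self.w2[k][j] -= lr * dz[k] * h[j]
            self.b2[k] -= lr * dz[k]
        for j in range(len(h)):
            for i in range(len(x)):
                self.w1[j][i] -= lr * dh[j] * x[i]
            self.b1[j] -= lr * dh[j]
        return -math.log(p[y])

    def predict(self, x):
        _, p = self.forward(x)
        return max(range(len(p)), key=p.__getitem__)


# Toy writing samples labelled with a hypothetical author id (0 or 1).
TRAIN = [
    ("whale sea ship".split(), 0),
    ("sea ship".split(), 0),
    ("manor ball letter".split(), 1),
    ("ball letter".split(), 1),
]

model = TinyMLP(n_in=DIM, n_hidden=4, n_out=2)
for _ in range(200):
    for tokens, author in TRAIN:
        model.train_step(doc_vector(tokens), author)
```

After training, `model.predict(doc_vector(tokens))` returns the author id for a new token list. In a real setting the embedding table would come from a trained word2vec model, and the classifier from a tested library, rather than from the hand-written pieces above.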
