Journal: Information Sciences: An International Journal

Linguistic data mining with complex networks: A stylometric-oriented approach


Abstract

Representing a text as a set of words and their co-occurrences yields a word-adjacency network, a reduced representation of a given language sample. This paper studies whether such a network representation can be used to extract information about the individual language styles of literary texts. By determining selected quantitative characteristics of the networks and applying machine learning algorithms, it is possible to distinguish between texts by different authors. Within the studied set of English and Polish texts, properly rescaled weighted clustering coefficients and weighted degrees of only a few nodes in the word-adjacency networks are sufficient to obtain an authorship attribution accuracy of over 90%. A correspondence between text authorship and word-adjacency network structure can therefore be found. The network representation makes it possible to distinguish individual language styles by comparing the way the authors use particular words and punctuation marks. The presented approach can be viewed as a generalization of authorship attribution methods based on simple lexical features.
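
The two node-level quantities named in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact definitions, so several things here are assumptions: punctuation marks are treated as tokens alongside words, edges are undirected and weighted by the number of adjacent occurrences, and the weighted clustering coefficient follows the commonly used formulation of Barrat et al. The "proper rescaling" mentioned in the abstract is not specified and is left out. All function names are hypothetical.

```python
import re
from collections import defaultdict
from itertools import combinations


def tokenize(text):
    """Split text into lowercase words and single punctuation marks (assumed scheme)."""
    return re.findall(r"\w+|[^\w\s]", text.lower())


def build_network(tokens):
    """Undirected word-adjacency network: weights[a][b] counts how often a and b are adjacent."""
    weights = defaultdict(lambda: defaultdict(int))
    for a, b in zip(tokens, tokens[1:]):
        if a != b:  # ignore self-loops from immediate repetitions
            weights[a][b] += 1
            weights[b][a] += 1
    return weights


def weighted_degree(weights, node):
    """Weighted degree (strength): sum of the weights of all edges at the node."""
    return sum(weights[node].values()) if node in weights else 0


def weighted_clustering(weights, node):
    """Weighted clustering coefficient in the Barrat et al. form (an assumption)."""
    if node not in weights:
        return 0.0
    neighbours = list(weights[node])
    k = len(neighbours)
    if k < 2:
        return 0.0
    total = 0.0
    for j, h in combinations(neighbours, 2):
        if h in weights[j]:  # j and h are themselves linked, closing a triangle
            total += weights[node][j] + weights[node][h]
    return total / (weighted_degree(weights, node) * (k - 1))


def features(text, nodes):
    """Feature vector: weighted degree and weighted clustering for a few chosen nodes."""
    net = build_network(tokenize(text))
    vector = []
    for node in nodes:
        vector.append(weighted_degree(net, node))
        vector.append(weighted_clustering(net, node))
    return vector


if __name__ == "__main__":
    sample = "the cat, the dog, the cat."
    print(features(sample, ["the", ","]))
```

Vectors produced this way for a handful of frequent words and punctuation marks could then be passed to any standard classifier; which nodes and which classifier the authors actually used is not stated in the abstract.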
