Conference: Semantic Computing, 2009. ICSC '09

Stopword Graphs and Authorship Attribution in Text Corpora

Abstract

In this work we identify interactions of stopwords (noise words) in text corpora as a fundamental feature for author classification. It is convenient to view such interactions as graphs in which nodes are stopwords and the interaction between a pair of stopwords is represented as an edge weight. We define the interaction in terms of the distances between pairs of stopwords in text documents. Given a list of authors, a graph for each author is computed from that author's undisputed writings. Authorship of a test document is attributed based on the closeness of the graph derived from it to the authors' graphs. To this end, we define a closeness measure, based on the Kullback-Leibler divergence, for comparing such graphs. We illustrate the accuracy of our approach by applying it to examples drawn from the Gutenberg archives. Our results show that the proposed approach is effective not only in binary author classification but also in multiclass author classification for as many as 10 authors at a time, and that it compares favourably with the state of the art in author identification.
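
The pipeline described in the abstract can be illustrated with a minimal Python sketch. The abstract does not give the exact definitions, so everything specific here is an assumption made for illustration: the stopword list, the `window` cutoff, the `1/d` edge weight, the `eps` smoothing, the direction of the divergence, and the function names `stopword_graph`, `kl_closeness`, and `attribute`. It should not be read as the authors' method, only as one plausible instance of "distance-based stopword graphs compared with the Kullback-Leibler divergence".

```python
import math
import re
from collections import Counter

# Illustrative stopword list (assumption; the abstract does not give the list used).
STOPWORDS = {"the", "of", "and", "a", "to", "in", "is", "it", "that", "was"}


def stopword_graph(text, window=5):
    """Build a graph as a dict mapping an unordered stopword pair to a weight.

    Assumption: each co-occurrence of two different stopwords at token
    distance d <= window contributes 1/d; weights are normalised to sum to 1.
    """
    tokens = re.findall(r"[a-z']+", text.lower())
    positions = [(i, t) for i, t in enumerate(tokens) if t in STOPWORDS]
    weights = Counter()
    for a in range(len(positions)):
        i, u = positions[a]
        for b in range(a + 1, len(positions)):
            j, v = positions[b]
            d = j - i
            if d > window:
                break  # positions are sorted, so later pairs are farther away
            if u != v:
                weights[tuple(sorted((u, v)))] += 1.0 / d
    total = sum(weights.values())
    return {edge: w / total for edge, w in weights.items()} if total else {}


def kl_closeness(p, q, eps=1e-6):
    """Smoothed Kullback-Leibler divergence D(p || q) over the union of edges.

    Assumption: additive smoothing with eps keeps the divergence finite when
    an edge is missing from one graph. Lower values mean closer graphs.
    """
    edges = set(p) | set(q)
    if not edges:
        return 0.0
    zp = sum(p.get(e, 0.0) + eps for e in edges)
    zq = sum(q.get(e, 0.0) + eps for e in edges)
    divergence = 0.0
    for e in edges:
        pe = (p.get(e, 0.0) + eps) / zp
        qe = (q.get(e, 0.0) + eps) / zq
        divergence += pe * math.log(pe / qe)
    return divergence


def attribute(test_text, author_texts, window=5):
    """Return (best_author, scores): the author whose graph is closest to the test graph."""
    test_graph = stopword_graph(test_text, window)
    scores = {
        author: kl_closeness(test_graph, stopword_graph(text, window))
        for author, text in author_texts.items()
    }
    return min(scores, key=scores.get), scores


# Toy usage with made-up text, standing in for each author's undisputed writings.
authors = {
    "A": "the house of the king and the court of the queen",
    "B": "it was a day that was to be in a way that it was",
}
best, scores = attribute("the lord of the manor and the lady of the town", authors)
```

In this toy input the test sentence shares its stopword pairs with author "A" only, so "A" receives the smaller divergence. Real use would need far longer training texts per author and a much larger stopword list.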