Journal of the Korean Physical Society

Multi-Label Classification of Historical Documents by Using Hierarchical Attention Networks

Abstract

The quantitative analysis of digitized historical documents has begun in earnest in recent years. Text classification is particularly important for quantitative historical analysis because it helps researchers search the literature efficiently and identify the important subjects of a particular period. Although numerous historians have worked together to classify large-scale historical documents, consistent classification among individual researchers has not been achieved. In this study, we present a classification method for large-scale historical data that uses a recently developed supervised learning algorithm called the Hierarchical Attention Network (HAN). By applying various classification methods to the Annals of the Joseon Dynasty (AJD), we show that HAN is more accurate than conventional techniques that rely on word-frequency-based features. HAN also quantifies how much a particular sentence or word contributes to the classification through a value called 'attention'. We extract the representative keywords of various categories by using the attention mechanism and show the evolution of these keywords over the 472-year span of the AJD. Our results reveal two broad groups of event categories in the AJD. In one group, the representative keywords of the categories were stable over long periods, while in the other group the keywords varied rapidly, reflecting the repeatedly changing character of those categories. Observing such macroscopic changes in representative words may provide insight into how a particular topic changes over a historical period.
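
The two-level attention described in the abstract can be illustrated with a minimal sketch. It assumes the commonly used HAN attention form (a tanh projection scored against a learned context vector, followed by a softmax); the abstract does not confirm the exact architecture. The recurrent encoders are omitted, random vectors stand in for encoder outputs, and all names (`attention_pool`, `random_params`, the toy dimensions, the three toy categories) are hypothetical. The per-category sigmoid output is a common multi-label choice, not something the abstract states.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(vectors, weight, bias, context):
    """Attention pooling: u_i = tanh(W h_i + b), a_i = softmax_i(u_i . c).

    Returns the attention-weighted sum of `vectors` and the weights a_i.
    """
    scores = []
    for h in vectors:
        u = [math.tanh(sum(w * x for w, x in zip(row, h)) + b)
             for row, b in zip(weight, bias)]
        scores.append(sum(ui * ci for ui, ci in zip(u, context)))
    alphas = softmax(scores)
    dim = len(vectors[0])
    pooled = [sum(a * h[d] for a, h in zip(alphas, vectors))
              for d in range(dim)]
    return pooled, alphas


def random_params(dim, rng):
    """Hypothetical untrained parameters; a real model would learn these."""
    weight = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(dim)]
    bias = [rng.uniform(-1, 1) for _ in range(dim)]
    context = [rng.uniform(-1, 1) for _ in range(dim)]
    return weight, bias, context


rng = random.Random(0)
DIM = 4
# Toy document: 3 sentences with 5, 2, and 4 words. Random vectors stand in
# for the outputs of the (omitted) word encoder.
document = [[[rng.uniform(-1, 1) for _ in range(DIM)] for _ in range(n)]
            for n in (5, 2, 4)]

word_params = random_params(DIM, rng)
sent_params = random_params(DIM, rng)

# Word level: pool each sentence's word vectors into one sentence vector.
sentence_vectors, word_attention = [], []
for sentence in document:
    vec, alphas = attention_pool(sentence, *word_params)
    sentence_vectors.append(vec)
    word_attention.append(alphas)

# Sentence level: pool the sentence vectors into one document vector.
doc_vector, sentence_attention = attention_pool(sentence_vectors, *sent_params)

# Assumed multi-label head: one independent sigmoid per toy category.
label_weights = [[rng.uniform(-1, 1) for _ in range(DIM)] for _ in range(3)]
label_probs = [1 / (1 + math.exp(-sum(w * x for w, x in zip(row, doc_vector))))
               for row in label_weights]

print("sentence attention:", sentence_attention)
print("word attention, sentence 0:", word_attention[0])
print("label probabilities:", label_probs)
```

In a trained model, the words with the highest attention weights in documents of a given category would be candidates for that category's representative keywords; here the weights are random and carry no meaning.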