Conference: IEEE International Congress on Big Data
Exploring archives with probabilistic models: Topic modelling for the valorisation of digitised archives of the European Commission

Abstract

Topic Modelling (TM) has gained momentum over the last few years within the humanities to analyze topics represented in large volumes of full text. This paper proposes an experiment with the usage of TM based on a large subset of digitized archival holdings of the European Commission (EC). Currently, millions of scanned and OCRed files are available and hold the potential to significantly change the way historians of the construction and evolution of the European Union can perform their research. However, due to a lack of resources, only minimal metadata are available on a file and document level, seriously undermining the accessibility of this archival collection. The article explores in an empirical manner the possibilities and limits of TM to automatically extract key concepts from a large body of documents spanning multiple decades. By mapping the topics to headings of the EUROVOC thesaurus, the proof of concept described in this paper offers the future possibility to represent the identified topics with the help of a hierarchical search interface for end-users.
