International Workshop on Computational Processing of the Portuguese Language

Indexing Names of Persons in a Large Dataset of a Newspaper



Abstract

An index is an effective tool for finding information in a set of documents. So far, the extant index tools in both the printed and digital newspaper versions are not sufficient to help users find information. Users must browse the entire newspaper to meet their needs, or discover, after spending considerable effort, that the information they were seeking is not available. We propose to use state-of-the-art named entity extraction strategies, applied specifically to person names, and to provide the user with an index of names as a tool for finding names within newspaper pages. The state-of-the-art system considered used the Golden Collection of the First and Second HAREM, a reference for Named Entity Recognition systems in Portuguese, as training and test sets, respectively. Furthermore, we created a new training dataset from the newspaper's own articles. In this case, we processed 100 articles of the newspaper and correctly found 87.0% of the extant names and their respective partial citations.
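The abstract does not describe how the index itself is built, so the following is only a minimal sketch of the general idea under stated assumptions. It assumes a person-name recognizer has already produced `(page, name)` pairs, and it treats a single-token mention as a partial citation of any multi-token name that contains it. The function name `build_name_index`, the input format, and this matching rule are all hypothetical and are not taken from the paper.

```python
from collections import defaultdict


def build_name_index(mentions):
    """Map each person name to the sorted list of pages where it is mentioned.

    `mentions` is an iterable of (page, name) pairs, assumed to be the output
    of a person-name recognizer (hypothetical input format).
    """
    mentions = list(mentions)
    # Assumption: multi-token mentions are full names, single tokens are
    # partial citations (for example, a surname used alone).
    full_names = {name for _, name in mentions if len(name.split()) > 1}
    index = defaultdict(set)
    for page, name in mentions:
        if name in full_names:
            index[name].add(page)
            continue
        # Attach a partial citation to every full name containing it as a token.
        owners = [full for full in full_names if name in full.split()]
        # If no full name matches, index the partial citation under itself.
        for entry in owners or [name]:
            index[entry].add(page)
    return {name: sorted(pages) for name, pages in sorted(index.items())}


if __name__ == "__main__":
    sample = [
        (1, "Maria Silva"),
        (3, "Silva"),
        (2, "Pedro Costa"),
        (3, "Maria Silva"),
        (5, "Pessoa"),
    ]
    for name, pages in build_name_index(sample).items():
        print(name, pages)
```

A real system would need a more careful rule for resolving partial citations, since a surname shared by several people is attached to all of them in this sketch.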
