Conference: International Conference on Computational Processing of Portuguese

Indexing Names of Persons in a Large Dataset of a Newspaper

Abstract

An index is an effective tool for finding information in a set of documents. The index tools currently available in the printed and digital versions of newspapers are not sufficient to help users find information. Users must browse the entire newspaper to meet their needs, or discover only after considerable effort that the information they were seeking is not available. We propose using state-of-the-art named entity extraction strategies, applied specifically to person names, to build an index of names that gives users a tool for finding names within newspaper pages. The state-of-the-art system considered used the Golden Collection of the First and Second HAREM, a reference for Named Entity Recognition systems in Portuguese, as training and test sets, respectively. We also created a new training dataset from the newspaper's own articles. In this case, we processed 100 newspaper articles and correctly found 87.0% of the names present, along with their respective partial citations.
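The indexing step described above can be illustrated with a minimal sketch. It assumes person names have already been extracted by some Named Entity Recognition system and arrive as (name, page) pairs. The function names `build_name_index` and `recall`, the normalization rule, and the sample data are all hypothetical and are not taken from the paper. The `recall` helper reflects one possible reading of the reported percentage as the share of expected names that were found, which is an assumption.

```python
from collections import defaultdict


def build_name_index(mentions):
    """Map each person name to the sorted pages on which it appears.

    `mentions` is an iterable of (name, page) pairs, assumed to come
    from an upstream NER step that is not shown here.
    """
    index = defaultdict(set)
    for name, page in mentions:
        # Hypothetical normalization: collapse whitespace and ignore case
        # so that spelling variants of one name share a single entry.
        key = " ".join(name.split()).casefold()
        index[key].add(page)
    # Sort names alphabetically and pages numerically for display.
    return {name: sorted(pages) for name, pages in sorted(index.items())}


def recall(found, expected):
    """Fraction of expected names that were found (assumed metric)."""
    expected_set = set(expected)
    if not expected_set:
        return 0.0
    return len(set(found) & expected_set) / len(expected_set)


# Made-up sample data, for illustration only.
mentions = [
    ("Maria Silva", 3),
    ("João  Pereira", 1),
    ("maria silva", 7),
    ("Maria Silva", 3),
]
index = build_name_index(mentions)
for name, pages in index.items():
    print(name, pages)
```

A set per name keeps repeated mentions on the same page from producing duplicate entries, which matters when one article cites the same person several times.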