Entity Categorization Over Large Document Collections

Abstract

Extracting entities (such as people, movies) from documents and identifying the categories (such as painter, writer) they belong to enable structured querying and data analysis over unstructured document collections. In this paper, we focus on the problem of categorizing extracted entities. Most prior approaches developed for this task only analyzed the local document context within which entities occur. In this paper, we significantly improve the accuracy of entity categorization by (ⅰ) considering an entity's context across multiple documents containing it, and (ⅱ) exploiting existing large lists of related entities (e.g., lists of actors, directors, books). These approaches introduce computational challenges because (a) the context of entities has to be aggregated across several documents and (b) the lists of related entities may be very large. We develop techniques to address these challenges. We present a thorough experimental study on real data sets that demonstrates the increase in accuracy and the scalability of our approaches.
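
As a rough illustration of the two ideas in the abstract, the Python sketch below aggregates the words surrounding each entity mention across several documents and adds a feature for membership in lists of related entities. Everything in it is an assumption made for illustration: the function names, the window size, the single-token entity matching, the toy documents, and the `related_lists` data. It is not the authors' algorithm, and it ignores the scalability techniques the paper develops.

```python
from collections import Counter, defaultdict

def aggregate_context(docs, entities, window=3):
    """Collect words near each entity mention, summed over all documents.

    Hypothetical helper: `docs` is a list of strings, `entities` a set of
    lowercase single-token entity names, `window` the number of tokens
    kept on each side of a mention.
    """
    context = defaultdict(Counter)
    for doc in docs:
        tokens = doc.lower().split()
        for i, tok in enumerate(tokens):
            if tok in entities:
                left = tokens[max(0, i - window):i]
                right = tokens[i + 1:i + 1 + window]
                # Counts accumulate across documents, not per document.
                context[tok].update(left + right)
    return context

def list_membership(entity, related_lists):
    """Return the names of the related-entity lists that contain `entity`."""
    return sorted(name for name, members in related_lists.items()
                  if entity in members)

# Toy data, invented for this example.
docs = [
    "the painter picasso exhibited in paris",
    "critics praised picasso for a new painting",
]
related_lists = {"painters": {"picasso", "monet"}, "actors": {"chaplin"}}

ctx = aggregate_context(docs, {"picasso"})
features = {
    "context": ctx["picasso"],                            # cross-document context counts
    "lists": list_membership("picasso", related_lists),   # list-based evidence
}
```

A categorizer could then use `features` as input; keeping the context as a single counter per entity reflects the cross-document aggregation the abstract describes, while the list lookup stands in for the use of large related-entity lists.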


Bibliographic details

Source
ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD 2008), 2008, pp. 256-264 (9 pages)
Authors
Venkatesh Ganti; Arnd Christian König; Rares Vernica
Original format: PDF
Chinese Library Classification: Information and knowledge dissemination
Keywords
algorithms; performance; experimentation


Similar literature

1. Categorization of Occupation in Documented Skeletal Collections: Its Relevance for the Interpretation of Activity-Related Osseous Changes [J]. Perréard Lopreno G., Alves Cardoso F., Assis S. International Journal of Osteoarchaeology. 2013, No. 2
2. The Potential of IFLA LRM and RDA Key Entities for Identification of Entities in Textual Documents of Cultural Heritage: The RunA Collection [J]. Anita Rasmane, Anita Goldberga. Cataloging & Classification Quarterly. 2020, No. 5a8
3. NERank+: a graph-based approach for entity ranking in document collections [J]. Wang Chengyu, Zhou Guomin, He Xiaofeng. Frontiers of Computer Science in China. 2018, No. 3
4. Entity categorization over large document collections [C]. Venkatesh Ganti, Arnd C. König, Rares Vernica. ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 2008
5. Tracking Topical Evolution in Large Document Collections [D]. Naim, Sheikh Motahar. 2018
6. A document processing pipeline for annotating chemical entities in scientific documents [O]. David Campos, Sérgio Matos, José L Oliveira. 2015
7. Entity Categorization Over Large Document Collections [O]. Venkatesh Ganti, Arnd Christian König, Rares Vernica. 2011
