
Entity Categorization Over Large Document Collections

Abstract

Extracting entities (such as people or movies) from documents and identifying the categories (such as painter or writer) they belong to enable structured querying and data analysis over unstructured document collections. In this paper, we focus on the problem of categorizing extracted entities. Most prior approaches to this task analyze only the local document context in which an entity occurs. We significantly improve the accuracy of entity categorization by (i) considering an entity's context across the multiple documents that contain it, and (ii) exploiting existing large lists of related entities (e.g., lists of actors, directors, books). These approaches introduce computational challenges because (a) the context of each entity has to be aggregated across several documents and (b) the lists of related entities may be very large. We develop techniques to address these challenges. We present a thorough experimental study on real data sets that demonstrates both the increase in accuracy and the scalability of our approaches.
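
The two signals named in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the function name `aggregate_features`, the `window` parameter, the `ctx:`/`list:` feature prefixes, and the toy data are all assumptions made for illustration, and entities are simplified to single tokens. It pools (i) context words around every mention of an entity across all documents and (ii) counts of related-list members that share a document with the entity.

```python
from collections import Counter, defaultdict


def aggregate_features(documents, entities, related_lists, window=3):
    """Build one feature Counter per entity, pooled over all documents.

    documents:     list of token lists (one list per document)
    entities:      set of entity tokens to categorize (single tokens, an assumption)
    related_lists: dict mapping a list name to a set of known member tokens
    window:        number of tokens taken on each side of a mention (hypothetical)
    """
    features = defaultdict(Counter)
    for tokens in documents:
        present = set(tokens)
        for i, tok in enumerate(tokens):
            if tok not in entities:
                continue
            # (i) context words near this mention, added to the entity's global pool
            lo, hi = max(0, i - window), i + window + 1
            for ctx in tokens[lo:i] + tokens[i + 1:hi]:
                features[tok]["ctx:" + ctx] += 1
            # (ii) number of members of each related list found in the same document
            for name, members in related_lists.items():
                hits = len((present & members) - {tok})
                if hits:
                    features[tok]["list:" + name] += hits
    return features


# Toy data, invented for this example only.
docs = [
    ["Vermeer", "painted", "the", "canvas", "after", "Rembrandt"],
    ["the", "museum", "shows", "Vermeer", "and", "Hals"],
]
related = {"painters": {"Rembrandt", "Hals"}, "writers": {"Austen"}}
features = aggregate_features(docs, {"Vermeer"}, related)
```

The resulting per-entity counters could serve as input to any classifier. Scanning every list for every mention, as done here, is the naive approach; the abstract points out that aggregation across documents and very large lists are exactly where scalability becomes a concern.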
