Conference: International conference on web information systems and technologies

FactRunner: A New System for NLP-Based Information Extraction from Wikipedia

Abstract

Wikipedia is playing an increasing role as a source of human-readable knowledge because it contains an enormous amount of high-quality information written by human authors. Finding a relevant piece of information in this huge collection of natural language text is often time-consuming, as a keyword-based search interface is the main method for querying. Therefore, an iterative process of exploring the document collection is required to find the information of interest. In this paper, we present an approach to extract structured information from unstructured documents to enable structured queries. Information Extraction (IE) systems have been proposed for this task, but due to the complexity of natural language, they often produce unsatisfactory results. As Wikipedia contains, in addition to the plain natural language text, links between documents and other metadata, we propose an approach which exploits this information to extract more accurate structured information. Our proposed system, FactRunner, focuses on extracting structured information from sentences containing such links, because these sentences may yield more accurate information than sentences without links. We evaluated our system on a subset of documents from Wikipedia and compared the results with another existing system. The results show that a natural language parser combined with Wikipedia markup can be exploited to extract facts in the form of triple statements with high accuracy.
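
To make the idea of link-bearing sentences and triple statements concrete, here is a minimal Python sketch. It is not the FactRunner implementation: the abstract says the system combines a natural language parser with Wikipedia markup, whereas this sketch swaps in a naive regular-expression heuristic that treats the text between two adjacent wiki links as a candidate predicate. The function names (`split_sentences`, `linked_sentences`, `candidate_triples`) and the sentence splitter are assumptions made for illustration only.

```python
import re
from typing import List, Tuple

# Matches [[Target]] and [[Target|anchor text]] links in wikitext.
WIKI_LINK = re.compile(r"\[\[([^\[\]|]+)(?:\|([^\[\]]+))?\]\]")


def split_sentences(text: str) -> List[str]:
    """Naive splitter (assumption): break after '.', '!' or '?' plus whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def linked_sentences(wikitext: str) -> List[str]:
    """Keep only the sentences that contain at least one wiki link."""
    return [s for s in split_sentences(wikitext) if WIKI_LINK.search(s)]


def candidate_triples(sentence: str) -> List[Tuple[str, str, str]]:
    """Form (subject, predicate, object) candidates from adjacent link pairs.

    The link targets serve as subject and object; the raw text between
    the two links serves as the predicate. A real system would use a
    parser here instead of this heuristic.
    """
    links = list(WIKI_LINK.finditer(sentence))
    triples = []
    for left, right in zip(links, links[1:]):
        predicate = sentence[left.end():right.start()].strip(" ,;")
        if predicate:
            triples.append(
                (left.group(1).strip(), predicate, right.group(1).strip())
            )
    return triples


if __name__ == "__main__":
    text = "[[Albert Einstein]] was born in [[Ulm|the city of Ulm]]. He liked music."
    for sentence in linked_sentences(text):
        print(candidate_triples(sentence))
```

Using the link target rather than the anchor text as the entity name is a design choice of this sketch: the target identifies a specific Wikipedia article, which is one plausible reason linked sentences can give less ambiguous facts than plain text.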
