International Conference on Information Retrieval and Knowledge Management

Information extraction from semi-structured and un-structured documents using probabilistic context free grammar inference

Abstract

A large number of research papers are available in unstructured (plain text) format, and knowledge discovery in unstructured documents has been recognized as a promising task. These documents are typically formatted for human viewing, and the formatting varies widely from document to document. This variation makes it difficult to construct a global schema, so discovering interesting rules from such documents is a complex and tedious process. Recently, conditional random fields (CRFs) and hand-coded wrappers have been used to label text in research papers (for example Title, Author Name(s), Affiliation, Email, and Contact number). In this paper we propose a novel hybrid approach that infers grammar rules using alignment similarity and a probabilistic context-free grammar (PCFG). The inferred rules help extract the desired information from a document.
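
The abstract does not give the details of the method, so the following is only a rough sketch of the two ingredients it names, not the authors' algorithm. It assumes that each paper header has already been reduced to a sequence of field labels, measures alignment similarity with Python's standard `difflib.SequenceMatcher`, and estimates the probabilities of a flat PCFG (a single hypothetical nonterminal `Header`) by relative frequency. The sample sequences, the function names, and the `Header` symbol are all illustrative assumptions.

```python
from collections import Counter
from difflib import SequenceMatcher

# Hypothetical label sequences for four paper headers (illustrative data,
# not taken from the paper).
headers = [
    ["Title", "Author", "Affiliation", "Email"],
    ["Title", "Author", "Email"],
    ["Title", "Author", "Affiliation", "Email"],
    ["Title", "Author", "Author", "Affiliation"],
]


def alignment_similarity(a, b):
    """Return a score in [0, 1]: 2 * matched labels / total labels."""
    return SequenceMatcher(None, a, b).ratio()


def estimate_rule_probabilities(sequences, start="Header"):
    """Relative-frequency estimate of P(start -> label sequence).

    This is a flat PCFG with one nonterminal; a real grammar would
    factor shared substructure into further nonterminals.
    """
    counts = Counter(tuple(seq) for seq in sequences)
    total = sum(counts.values())
    return {(start, rhs): n / total for rhs, n in counts.items()}


rules = estimate_rule_probabilities(headers)
for (lhs, rhs), p in sorted(rules.items(), key=lambda kv: -kv[1]):
    print(f"{lhs} -> {' '.join(rhs)}  [{p:.2f}]")

print(f"similarity: {alignment_similarity(headers[0], headers[1]):.3f}")
```

In a fuller pipeline, a similarity score of this kind could be used to group headers with near-identical layouts before rules are generalized, so that one rule covers several formatting variants. That step is left out here because the abstract does not describe it.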