A framework for the integration of information retrieval and parse tree database with applications in the genomics domain.

Abstract

With the ever-increasing number of biomedical articles, keeping up with new information has become a major challenge for biomedical researchers. Much of the information biologists need resides in semi-structured biomedical text articles, making it difficult for researchers to realize the full benefits of these findings. Information retrieval (IR) and information extraction (IE) have been the central technologies for seeking information from large corpora of unstructured text. Advances in these technologies can have a direct impact on the methodologies of research areas such as biomedical research.

In this thesis, these issues in IR and IE are tackled by proposing a novel framework called IR+PTQL. The core idea of the framework is to model and store the syntactic and semantic information of the text corpora in a specialized database called the parse tree database. Extraction is then expressed in the form of database queries. A core component is automated query generation, which produces extraction patterns without training data. The evaluation results demonstrate that the query generation component contributes positively to the performance of IR and IE. The applicability of the framework is illustrated with various applications in the genomics domain.

While the fields of IR and IE have matured in the past decade, current technologies have yet to fulfill the promise of supporting biomedical research. In particular, traditional IE technologies adopt a 'black-box' approach, in which biologists have no means of expressing their extraction needs. In addition, typical automated IE technologies rely on manually curated data to learn syntactic patterns for extraction. However, curation of such data is known to be labor-intensive, limiting the applicability of IE in the biomedical domain. While there have been successes in utilizing linguistic structures for IE, linguistic structures have yet to be adopted in current technologies for IR. Syntactic parsing over a large corpus of text is known to be computationally expensive, which is not ideal for IR, since IR is expected to respond to users in a timely manner. However, the lack of usage of linguistic structures leads to suboptimal performance for certain queries in the biomedical domain.
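
The core idea named in the abstract, storing parse trees in a database and phrasing extraction as a database query, can be illustrated with a minimal sketch. This is not the thesis's parse tree database or its PTQL query language: the `node` table layout, the `extract_interactions` helper, the toy sentence "GeneA binds GeneB", and the use of SQLite are all assumptions made purely for illustration.

```python
import sqlite3

# Hypothetical schema: one row per parse-tree node (illustrative only,
# not the schema used by the thesis).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE node (sent_id INTEGER, node_id INTEGER, parent_id INTEGER, "
    "label TEXT, word TEXT, entity TEXT)"
)

# Toy parse of "GeneA binds GeneB":
# (S (NP GeneA) (VP (VBZ binds) (NP GeneB)))
rows = [
    (1, 1, None, "S", None, None),
    (1, 2, 1, "NP", "GeneA", "GENE"),
    (1, 3, 1, "VP", None, None),
    (1, 4, 3, "VBZ", "binds", None),
    (1, 5, 3, "NP", "GeneB", "GENE"),
]
conn.executemany("INSERT INTO node VALUES (?, ?, ?, ?, ?, ?)", rows)


def extract_interactions(conn, verb):
    """Return (subject, object) gene pairs joined by `verb` in an S -> NP VP tree."""
    query = """
        SELECT subj.word, obj.word
        FROM node AS s
        JOIN node AS subj ON subj.sent_id = s.sent_id AND subj.parent_id = s.node_id
        JOIN node AS vp   ON vp.sent_id = s.sent_id AND vp.parent_id = s.node_id
        JOIN node AS v    ON v.sent_id = s.sent_id AND v.parent_id = vp.node_id
        JOIN node AS obj  ON obj.sent_id = s.sent_id AND obj.parent_id = vp.node_id
        WHERE s.label = 'S' AND subj.label = 'NP' AND vp.label = 'VP'
          AND v.word = ? AND obj.label = 'NP'
          AND subj.entity = 'GENE' AND obj.entity = 'GENE'
    """
    return conn.execute(query, (verb,)).fetchall()


pairs = extract_interactions(conn, "binds")
print(pairs)
```

In this sketch the extraction pattern is an ordinary query over tree structure and entity labels, so a user can read and change it, in contrast to the 'black-box' approach the abstract criticizes. Parsing happens once at load time rather than at query time.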

Bibliographic record

  • Author: Ng Tari, Luis Babaji
  • Author affiliation: Arizona State University
  • Degree-granting institution: Arizona State University
  • Discipline: Computer Science
  • Degree: Ph.D.
  • Year: 2009
  • Pages: 297 p.
  • Total pages: 297
  • Original format: PDF
  • Language of text: English
  • Date indexed: 2022-08-17 11:38:17
