An Architecture For Multimodal Information Extraction From Scholarly Documents


Abstract

A scholarly paper (journal article, conference proceeding) contains both unstructured data (text) and semi-structured data sources (tables and figures). An experimental figure such as a line graph is generated from a data table that stores the results of an experiment. Typically that data table is not reported in the paper and hence cannot be queried directly. Similarly, a scholarly table reports the results of an experiment but is not structured enough to support anything more than a keyword query.

This dissertation has two contributions. First, we show methods to reduce these semi-structured data sources to structured content that can support factoid queries such as "What is the best precision for the ImageNet classification task?" or "What is the best BLEU score for English to Arabic translation?"

For scholarly figures, we report an end-to-end system. First, we report a batch extractor that extracts all figures (including vector graphics) and associated metadata from a document with 81% and 87% accuracy. Next, we report image processing algorithms that detect compound figures with 82% accuracy and classify non-compound figures as line graphs or bar charts with 84% average accuracy. We improve the accuracy of text extraction from raster graphics by 39% and show algorithms that classify the text inside the plots with an average accuracy of 90%. The majority of figures in computer science papers are embedded as vector graphics. While previous work has always extracted them as raster graphics, we show methods to extract them in a vector graphics format, which allows us to scalably separate curves in line graphs with 75% average accuracy. This reduces a line graph to the original data points from which it was generated, allowing the factoid queries. We report a similar architecture for scholarly tables that can reduce the tables to data-based triples supporting similar queries.

Second, we show supervised methods to extract scholarly entities from the text of the paper. Specifically, we show that a non-sequential classifier that learns the informativeness of a phrase globally and a sequential classifier that learns the same using the local context can be combined to improve the accuracy of the process.
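As an illustration of how reducing a table to triples can support a factoid query such as "What is the best BLEU score?", the following is a minimal sketch, not the dissertation's implementation. The `Triple` layout, the helper names `table_to_triples` and `best`, and all table values are hypothetical and invented for this example.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    row: str      # row header, e.g. a system name (assumed layout)
    column: str   # column header, e.g. a metric name
    value: float  # numeric cell content


def table_to_triples(header, rows):
    """Flatten a table whose first column is a row label into triples."""
    triples = []
    for row in rows:
        label, cells = row[0], row[1:]
        for column, cell in zip(header[1:], cells):
            triples.append(Triple(label, column, float(cell)))
    return triples


def best(triples, column):
    """Return the triple with the highest value in the given column."""
    candidates = [t for t in triples if t.column == column]
    return max(candidates, key=lambda t: t.value)


# Placeholder table: system names and numbers are made up for illustration.
header = ["System", "BLEU", "Precision"]
rows = [
    ["system_a", "21.5", "0.80"],
    ["system_b", "24.0", "0.75"],
]
triples = table_to_triples(header, rows)
print(best(triples, "BLEU"))
```

Once each cell is stored with its row and column labels, a keyword-only table becomes something a simple filter-and-max query can answer; the same idea would apply to data points recovered from a line graph.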

Bibliographic record

  • Author: Choudhury, Sagnik Ray
  • Author affiliation: The Pennsylvania State University
  • Degree-granting institution: The Pennsylvania State University
  • Subjects: Information science; Computer science
  • Degree: Ph.D.
  • Year: 2017
  • Pages: 133
  • Original format: PDF
  • Language of text: English
