Journal: BMC Bioinformatics

The structural and content aspects of abstracts versus bodies of full text journal articles are different

Abstract

Background: An increase in work on the full text of journal articles and the growth of PubMedCentral have the potential to create a major paradigm shift in how biomedical text mining is done. However, there has so far been no comprehensive characterization of how the bodies of full text journal articles differ from the abstracts that have been the subject of most biomedical text mining research to date.

Results: We examined the structural and linguistic aspects of abstracts and bodies of full text articles, the performance of text mining tools on both, and the distribution of a variety of semantic classes of named entities between them. We found marked structural differences, with longer sentences in the article bodies and much heavier use of parenthesized material in the bodies than in the abstracts. We found content differences with respect to linguistic features: three of the four linguistic features that we examined were distributed statistically significantly differently between the two genres. We also found content differences with respect to the distribution of semantic features. Densities per thousand words differed significantly for three of the four semantic classes, and there were clear differences in the extent to which they appeared in the two genres. With respect to the performance of text mining tools, we found that a mutation finder performed equally well in both genres, but that a wide variety of gene mention systems performed much worse on article bodies than on abstracts. POS tagging was also more accurate in abstracts than in article bodies.

Conclusions: Aspects of structure and content differ markedly between article abstracts and article bodies. A number of these differences may pose problems as the text mining field moves further into processing full-text articles. However, these differences also present a number of opportunities for extracting types of data, particularly those found in parenthesized text, that are present in article bodies but not in article abstracts.
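As a rough illustration of the kind of structural measurement described in the abstract (sentence length and use of parenthesized material, normalized per thousand words), here is a minimal Python sketch. It is not the authors' method: the function name `structural_stats`, the naive regex sentence splitter, and the whitespace tokenization are all assumptions made for illustration, and a real study would use a proper sentence segmenter and tokenizer.

```python
import re


def structural_stats(text):
    """Return mean sentence length (in words) and parenthesized spans per 1,000 words.

    Hypothetical helper for illustration only; the splitting rules are deliberately naive.
    """
    # Naive sentence split: break after ., ! or ? when followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    # Whitespace tokenization; punctuation stays attached to words.
    words = text.split()
    # Non-nested parenthesized spans such as "(p.V600E)".
    parens = re.findall(r"\([^()]*\)", text)
    n_words = len(words)
    return {
        "mean_sentence_length": n_words / len(sentences) if sentences else 0.0,
        "parens_per_1000_words": 1000 * len(parens) / n_words if n_words else 0.0,
    }


if __name__ == "__main__":
    sample = "We found a mutation (p.V600E) in BRAF. It was rare."
    print(structural_stats(sample))
```

Running the same function over an abstract and over the corresponding article body would give two sets of numbers that can be compared directly, since both are normalized by word count.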