Systems and Information Engineering Design Symposium

Supervised Machine Learning and Deep Learning Classification Techniques to Identify Scholarly and Research Content



Abstract

The Internet Archive (IA), one of the largest open-access digital libraries, offers 28 million books and texts as part of its mission to build an open, comprehensive collection. As it organizes its archive to make scholarly content more accessible to researchers, it confronts two needs: to efficiently identify and organize academic documents, and to ensure an inclusive corpus of scholarly work that reflects a "long-tail distribution," ranging from high-visibility, frequently accessed documents to those with low visibility and usage. At the same time, artifacts labeled as research must meet widely accepted criteria and standards of rigor for research or academic work if the collection is to maintain its credibility as a legitimate repository for scholarship. Our project identifies effective supervised machine learning and deep learning classification techniques that quickly and correctly identify research products while ensuring inclusivity across the entire long-tail spectrum. Using data extraction and feature engineering techniques, we identify lexical and structural features, such as page count, file size, and keywords, that indicate structure and content conforming to research-product criteria. We compare the performance of machine learning classification algorithms and identify an efficient set of visual and linguistic features for accurate identification, then apply image classification to more challenging cases, particularly papers written in non-Romance languages. Although we use a large dataset of PDF files from the Internet Archive, our research has broader implications for library science and information retrieval. We hypothesize that key lexical markers and visual document dimensions, extracted through PDF parsing and feature engineering during data processing, can be efficiently obtained from a corpus of documents and combined for highly accurate classification.
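The feature-engineering step the abstract describes can be sketched as turning parsed PDF text and metadata into a feature vector. This is a minimal illustration, not the authors' pipeline: the keyword set, feature names, and citation pattern are assumptions chosen to match the kinds of lexical and structural signals (page count, size, keywords) the abstract names.

```python
# Hypothetical feature extraction for research-product classification.
# Keyword list and regex patterns are illustrative assumptions.
import re

RESEARCH_KEYWORDS = {"abstract", "introduction", "methodology",
                     "results", "references", "doi"}

def extract_features(text: str, num_pages: int, size_bytes: int) -> dict:
    """Build a simple lexical/structural feature dict from parsed PDF
    text plus document metadata."""
    lower = text.lower()
    return {
        "num_pages": num_pages,
        "size_kb": size_bytes / 1024,
        # How many structural marker terms appear anywhere in the text.
        "keyword_hits": sum(1 for kw in RESEARCH_KEYWORDS if kw in lower),
        "has_references": "references" in lower or "bibliography" in lower,
        # Rough citation density: bracketed numbers or 4-digit years
        # in parentheses, normalized by page count.
        "citation_density": len(re.findall(r"\[\d+\]|\(\d{4}\)", text))
                            / max(num_pages, 1),
    }
```

Feature dicts of this shape can then be vectorized and fed to any standard classifier.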
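The performance comparison among classification algorithms can be sketched with scikit-learn. The synthetic data and the particular models (logistic regression, random forest) are stand-ins assumed for illustration; the abstract does not name the algorithms or features actually compared.

```python
# Hypothetical classifier comparison on a document-feature matrix.
# Synthetic features stand in for the real extracted PDF features.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic binary task: research product vs. non-research document.
X, y = make_classification(n_samples=500, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3,
                                          random_state=0)

models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=100,
                                            random_state=0),
}
scores = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    scores[name] = accuracy_score(y_te, model.predict(X_te))
    print(f"{name}: {scores[name]:.3f}")
```

Held-out accuracy is one simple basis for comparison; precision/recall on the minority class would better reflect the inclusivity concern the abstract raises.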


