Systems and Information Engineering Design Symposium

Supervised Machine Learning and Deep Learning Classification Techniques to Identify Scholarly and Research Content



Abstract

The Internet Archive (IA), one of the largest open-access digital libraries, offers 28 million books and texts as part of its mission to build an open, comprehensive collection. As it organizes its archive to make scholarly content more accessible to researchers, it confronts two needs: to efficiently identify and organize academic documents, and to ensure an inclusive corpus of scholarly work that reflects a "long-tail distribution," ranging from high-visibility, frequently accessed documents to those with low visibility and usage. At the same time, artifacts labeled as research must meet widely accepted criteria and standards of rigor for research or academic work if the collection is to maintain its credibility as a legitimate repository for scholarship. Our project identifies effective supervised machine learning and deep learning classification techniques that quickly and correctly identify research products while ensuring inclusivity across the entire long-tail spectrum. Using data extraction and feature engineering techniques, we identify lexical and structural features, such as page count, file size, and keywords, that indicate structure and content conforming to research-product criteria. We compare the performance of machine learning classification algorithms and identify an efficient set of visual and linguistic features for accurate identification, then apply image classification to more challenging cases, particularly papers written in non-Romance languages. Although we use a large dataset of PDF files from the Internet Archive, our research has broader implications for library science and information retrieval. We hypothesize that key lexical markers and visual document dimensions, extracted through PDF parsing and feature engineering during data processing, can be efficiently obtained from a corpus of documents and combined for highly accurate classification.
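The feature-engineering step the abstract describes can be sketched as turning parsed PDF text and metadata into a feature vector. This is a minimal illustration, not the authors' pipeline: the keyword set, feature names, and citation pattern are assumptions chosen to match the kinds of lexical and structural signals (page count, size, keywords) the abstract names.

```python
# Hypothetical feature extraction for research-product classification.
# Keyword list and regex patterns are illustrative assumptions.
import re

RESEARCH_KEYWORDS = {"abstract", "introduction", "methodology",
                     "results", "references", "doi"}

def extract_features(text: str, num_pages: int, size_bytes: int) -> dict:
    """Build a simple lexical/structural feature dict from parsed PDF
    text plus document metadata."""
    lower = text.lower()
    return {
        "num_pages": num_pages,
        "size_kb": size_bytes / 1024,
        # How many structural marker terms appear anywhere in the text.
        "keyword_hits": sum(1 for kw in RESEARCH_KEYWORDS if kw in lower),
        "has_references": "references" in lower or "bibliography" in lower,
        # Rough citation density: bracketed numbers or 4-digit years
        # in parentheses, normalized by page count.
        "citation_density": len(re.findall(r"\[\d+\]|\(\d{4}\)", text))
                            / max(num_pages, 1),
    }
```

Feature dicts of this shape can then be vectorized and fed to any standard classifier.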
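The performance comparison among classification algorithms can be sketched with scikit-learn. The synthetic data and the particular models (logistic regression, random forest) are stand-ins assumed for illustration; the abstract does not name the algorithms or features actually compared.

```python
# Hypothetical classifier comparison on a document-feature matrix.
# Synthetic features stand in for the real extracted PDF features.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic binary task: research product vs. non-research document.
X, y = make_classification(n_samples=500, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3,
                                          random_state=0)

models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=100,
                                            random_state=0),
}
scores = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    scores[name] = accuracy_score(y_te, model.predict(X_te))
    print(f"{name}: {scores[name]:.3f}")
```

Held-out accuracy is one simple basis for comparison; precision/recall on the minority class would better reflect the inclusivity concern the abstract raises.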


