Toward Computer-Assisted Text Curation: Classification Is Easy (Choosing Training Data Can Be Hard...)

Abstract

We aim to design a system for classifying scientific articles based on the presence of protein characterization experiments, intending to aid the curators populating JCVI's Characterized Protein (CHAR) Database of experimentally characterized proteins. We trained two classifiers using small datasets labeled by CHAR curators, and another classifier based on a much larger dataset using annotations from public databases. Performance varied greatly, in ways we did not anticipate. We describe the datasets and the classification method, and discuss the unexpected results.

Bibliographic information

Source: Workshop of the BioLink Special Interest Group on Linking Literature, Information and Knowledge for Biology, 2010, 10 pages
Authors: Robert Denroche; Ramana Madupu; Shibu Yooseph; Granger Sutton; Hagit Shatkay
Original format: PDF
Chinese Library Classification: TP3-53
Keywords: Classification; Biomedical Text Mining; Text Categorization; Database Curation; Imbalanced and Sparse Data
