Scaling Up Text Classification for Large File Systems


Abstract

We combine the speed and scalability of information retrieval with the generally superior classification accuracy offered by machine learning, yielding a two-phase text classifier that can scale to very large document corpora. We investigate the effect of different methods of formulating the query from the training set, as well as varying the query size. In empirical tests on the Reuters RCV1 corpus of 806,000 documents, we find runtime was easily reduced by a factor of 27x, with a somewhat surprising gain in F-measure compared with traditional text classification.
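
The two-phase idea in the abstract can be illustrated with a minimal Python sketch. The abstract does not say which query-formulation methods or which classifier the authors used, so everything here is an assumption made for illustration: the document-frequency-difference term weighting, the simple linear scorer, the toy documents, and all function names (`term_weights`, `formulate_query`, `retrieve`, `two_phase_classify`) are hypothetical. Phase 1 builds a short query from the training set and uses an inverted index to fetch candidate documents; phase 2 runs the classifier only on those candidates.

```python
from collections import Counter, defaultdict


def tokenize(text):
    return text.lower().split()


def doc_freq(docs):
    # Number of documents in which each term appears.
    return Counter(t for d in docs for t in set(tokenize(d)))


def term_weights(pos_docs, neg_docs):
    # Assumed weighting: fraction of positive docs containing the term
    # minus fraction of negative docs containing it.
    pos_df, neg_df = doc_freq(pos_docs), doc_freq(neg_docs)
    return {
        t: pos_df[t] / len(pos_docs) - neg_df[t] / len(neg_docs)
        for t in set(pos_df) | set(neg_df)
    }


def formulate_query(weights, query_size):
    # Phase 1a: keep the query_size highest-weighted terms (ties broken alphabetically).
    ranked = sorted(weights, key=lambda t: (-weights[t], t))
    return ranked[:query_size]


def build_index(corpus):
    # Inverted index: term -> set of document ids.
    index = defaultdict(set)
    for doc_id, text in enumerate(corpus):
        for term in set(tokenize(text)):
            index[term].add(doc_id)
    return index


def retrieve(index, query_terms):
    # Phase 1b: union of the posting lists of the query terms.
    candidates = set()
    for term in query_terms:
        candidates |= index.get(term, set())
    return candidates


def classify(weights, text):
    # Phase 2 stand-in classifier: positive if the summed term weights exceed zero.
    return sum(weights.get(t, 0.0) for t in set(tokenize(text))) > 0


def two_phase_classify(corpus, index, weights, query_size):
    query = formulate_query(weights, query_size)
    candidates = retrieve(index, query)
    # Only the retrieved candidates are scored, not the whole corpus.
    return sorted(d for d in candidates if classify(weights, corpus[d]))


# Toy data, invented for this sketch.
pos_train = ["oil prices rise", "crude oil exports fall"]
neg_train = ["football match result", "election result announced"]
corpus = [
    "oil output rises again",
    "football cup final",
    "crude prices steady",
    "election campaign begins",
    "football match result sponsored by oil company",
]

weights = term_weights(pos_train, neg_train)
query = formulate_query(weights, query_size=2)
index = build_index(corpus)
candidates = retrieve(index, query)
result = two_phase_classify(corpus, index, weights, query_size=2)
print(query, sorted(candidates), result)
```

In this toy run the query retrieves three of the five documents, so the classifier scores only those three. The query size controls the trade-off the abstract mentions: a larger query retrieves more candidates, which costs more classifier time but risks missing fewer relevant documents.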

Bibliographic details

Source
ACM KDD International Conference on Knowledge Discovery and Data Mining; KDD 2008 | 2008 | pp. 221-228 | 8 pages
Authors
George Forman; Shyamsundar Rajaram
Original format: PDF
CLC classification: Information and knowledge dissemination
Keywords
machine learning; text classification; document categorization; information retrieval; enterprise scalability; forensic search

Similar literature

1. Automatic Tag Attachment Scheme based on Text Clustering for Efficient File Search in Unstructured Peer-to-Peer File Sharing Systems [J]. Ting Ting Qin, Satoshi Fujita. Journal of Universal Computer Science. 2012, No. 8
2. Identification and classification of DICOM files with burned-in text content [J]. Vcelak Petr, Kryl Martin, Kratochvil Michal. International Journal of Medical Informatics. 2019, No. JUNa
3. Text classification: Classifying plain source files with neural network [J]. Veber Jaromir. Journal of Systems Integration. 2010, No. 4
4. Scaling up text classification for large file systems [C]. George Forman, Shyamsundar Rajaram. ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 2008
5. Scalable file systems and operating systems support for big data applications [D]. Xu, Lei. 2014
6. Text mining for the Vaccine Adverse Event Reporting System: medical text classification using informative feature selection [O]. Taxiarchis Botsis, Michael D Nguyen, Emily Jane Woo. 2011
7. Scalable full-text search for petascale file systems [O]. Andrew W. Leung, Ethan L. Miller. 2010
