Crowd Sourcing as an Improvement of N-Grams Text Document Classification Algorithm

Abstract

Text classification is a common task in natural language processing, useful for, e.g., spam filtering, document sorting, classification of scientific articles, or plagiarism detection. Humans still perform it best and most accurately; on the other hand, we can often accept a certain classification error in exchange for speed. In automated classification, a natural language processing mechanism transforms natural-language text into a form understandable by a classifier such as K-Nearest Neighbours, Decision Trees, Artificial Neural Networks, or Support Vector Machines. The human element can also be used to improve the accuracy of automated classification by means of crowdsourcing. This work deals with the classification of text documents and its improvement through crowdsourcing. Its goal is to design and implement a prototype text document classifier based on document similarity, and to design a mechanism for evaluation and crowdsourcing-based improvement of the classification. For classification, the N-grams algorithm was chosen and implemented in Java. The crowdsourcing interface was created using the WordPress CMS. In addition to data collection, the purpose of the interface is to evaluate classification accuracy, which leads to an extension of the classifier test data set and thus to more successful classification. We tested our approach on two data sets, with promising preliminary results even across different languages. This led to a real-world implementation, started at the beginning of 2019 in cooperation between two universities: VŠB-TUO and OSU.
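
The abstract does not specify how the N-grams classifier measures document similarity, so the following Java sketch is only an illustration under stated assumptions, not the authors' implementation. It assumes character trigrams, cosine similarity between n-gram frequency profiles, and a nearest-labelled-document decision rule; the class name `NGramClassifierSketch` and all method names are hypothetical. A label confirmed or corrected by the crowd could be fed back through the same `addLabelled` method, which is one possible way to extend the labelled data set.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: character n-gram profiles compared by cosine similarity.
public class NGramClassifierSketch {
    private static final int N = 3; // assumed n-gram length (not stated in the abstract)

    private final List<Map<String, Integer>> profiles = new ArrayList<>();
    private final List<String> labels = new ArrayList<>();

    /** Counts the overlapping character n-grams of the lower-cased text. */
    public static Map<String, Integer> profile(String text) {
        Map<String, Integer> counts = new HashMap<>();
        String s = text.toLowerCase();
        for (int i = 0; i + N <= s.length(); i++) {
            counts.merge(s.substring(i, i + N), 1, Integer::sum);
        }
        return counts;
    }

    /** Cosine similarity of two n-gram count profiles; 0.0 if either is empty. */
    public static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            int x = e.getValue();
            normA += (double) x * x;
            Integer y = b.get(e.getKey());
            if (y != null) {
                dot += (double) x * y;
            }
        }
        for (int y : b.values()) {
            normB += (double) y * y;
        }
        return (normA == 0.0 || normB == 0.0) ? 0.0 : dot / Math.sqrt(normA * normB);
    }

    /** Adds a labelled document, e.g. an initial example or a crowd-verified one. */
    public void addLabelled(String text, String label) {
        profiles.add(profile(text));
        labels.add(label);
    }

    /** Returns the label of the most similar stored document, or null if none is stored. */
    public String classify(String text) {
        Map<String, Integer> query = profile(text);
        String best = null;
        double bestScore = -1.0;
        for (int i = 0; i < profiles.size(); i++) {
            double score = cosine(query, profiles.get(i));
            if (score > bestScore) {
                bestScore = score;
                best = labels.get(i);
            }
        }
        return best;
    }

    public static void main(String[] args) {
        NGramClassifierSketch classifier = new NGramClassifierSketch();
        classifier.addLabelled("win a free prize now, click here", "spam");
        classifier.addLabelled("meeting agenda for the project review", "ham");
        System.out.println(classifier.classify("click here to claim your free prize"));
    }
}
```

Character n-grams need no language-specific tokenizer, which may be one reason an approach of this kind can work across different languages, as the abstract reports.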


Bibliographic record

Source: International Workshop on Semantic and Social Media Adaptation and Personalization, 2020, pp. 1-6 (6 pages)
Authors: Petr Šaloun; David Andršič; Barbora Cigánková; Ioannis Anagnostopoulos
Original format: PDF
Keywords: Classification; text documents; natural language processing; documents similarity; N-grams; crowdsourcing; WordPress; Java; PHP

Similar literature

1. Weights Space Exploration Using Genetic Algorithms for Meta-classifier in Text Document Classification [J]. Radu G. CREJULESCU, Daniel I. MORARIU, Macarie BREAZU. Studies in Informatics and Control, 2012, No. 2.
2. Comparative Study of Five Text Classification Algorithms with their Improvements [J]. Ahmed H. Aliwy, Esraa H. Abdul Ameer. International Journal of Applied Engineering Research, 2017, No. 14aPta2.
3. Algorithm based on modified angle-based outlier factor for open-set classification of text documents [J]. Walkowiak Tomasz, Datko Szymon, Maciejewski Henryk. Applied Stochastic Models in Business and Industry, 2018, No. 5.
4. Classification of Text Documents based on Naive Bayes using N-Gram Features [C]. Mehmet BAYGIN. International Conference on Artificial Intelligence and Data Processing, 2018.
5. Edits Based Categorization of Crowd Sourced Document Corpora with Application to Wikipedia [D]. Fang, Yue. 2018.
6. Computing symmetrical strength of N-grams: a two pass filtering approach in automatic classification of text documents [O]. Deepak Agnihotri, Kesari Verma, Priyanka Tripathi.
7. An Improvement in Support Vector Machines Algorithm with Imperialism Competitive Algorithm for Text Documents Classification [O]. Zahra Asheghi Dizaji, Sakineh Asghari Aghjehdizaj, Farhad Soleimanian Gharehchopogh. 2020.
