Published in: International Workshop on Semantic and Social Media Adaptation and Personalization

Crowd Sourcing as an Improvement of N-Grams Text Document Classification Algorithm


Abstract

A common task in natural language processing is text classification, which is useful for, e.g., spam filtering, document sorting, classification of scientific articles, and plagiarism detection. Humans still perform this task best and most accurately; on the other hand, we can often accept a certain classification error in exchange for speed. In automated classification, a natural language processing mechanism transforms natural-language text into a form understandable by a classifier such as K-Nearest Neighbour, Decision Trees, an Artificial Neural Network, or Support Vector Machines. We can also use the human element to improve the accuracy of automated classification by means of crowdsourcing. This work deals with the classification of text documents and its improvement through crowdsourcing. Its goal is to design and implement a prototype text document classifier based on document similarity, and to design a mechanism for evaluation and crowdsourcing-based classification improvement. For classification, the N-grams algorithm was chosen and implemented in Java. The crowdsourcing interface was created using the WordPress CMS. In addition to data collection, the purpose of the interface is to evaluate classification accuracy, which leads to an extension of the classifier's test data set and thus to more successful classification. We tested our approach on two data sets, with promising preliminary results even across different languages. This led to a real-world implementation, started at the beginning of 2019, in cooperation between two universities: VŠB-TUO and OSU.
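To make the idea of an N-gram classifier based on document similarity more concrete, the following is a minimal Java sketch. It is not the authors' implementation: the abstract does not state the n-gram unit, the value of N, or the similarity measure, so character trigrams, cosine similarity, and nearest-document labelling are assumptions chosen for illustration. The class name `NGramClassifier` and its methods are hypothetical. The `addLabelled` method stands in for the step where a label confirmed through the crowdsourcing interface extends the classifier's data set.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

/** Illustrative sketch: nearest-document classifier over character n-gram profiles. */
final class NGramClassifier {
    private final int n;
    private final List<Map<String, Integer>> profiles = new ArrayList<>();
    private final List<String> labels = new ArrayList<>();

    NGramClassifier(int n) {
        if (n < 1) throw new IllegalArgumentException("n must be positive");
        this.n = n;
    }

    /** Counts overlapping character n-grams of the lower-cased, whitespace-normalized text. */
    Map<String, Integer> profile(String text) {
        String s = text.toLowerCase(Locale.ROOT).replaceAll("\\s+", " ").trim();
        Map<String, Integer> counts = new HashMap<>();
        for (int i = 0; i + n <= s.length(); i++) {
            counts.merge(s.substring(i, i + n), 1, Integer::sum);
        }
        return counts;
    }

    /** Cosine similarity between two n-gram count vectors; 0 if either is empty. */
    static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0, normA = 0, normB = 0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            normA += e.getValue() * (double) e.getValue();
            Integer other = b.get(e.getKey());
            if (other != null) dot += e.getValue() * (double) other;
        }
        for (int v : b.values()) normB += v * (double) v;
        return (normA == 0 || normB == 0) ? 0.0 : dot / Math.sqrt(normA * normB);
    }

    /** Adds a labelled document, e.g. one whose label was confirmed by crowd workers (assumed workflow). */
    void addLabelled(String text, String label) {
        profiles.add(profile(text));
        labels.add(label);
    }

    /** Returns the label of the most similar stored document, or null if none are stored. */
    String classify(String text) {
        Map<String, Integer> query = profile(text);
        String best = null;
        double bestScore = -1.0;
        for (int i = 0; i < profiles.size(); i++) {
            double score = cosine(query, profiles.get(i));
            if (score > bestScore) {
                bestScore = score;
                best = labels.get(i);
            }
        }
        return best;
    }
}
```

In this sketch, a classifier created with `new NGramClassifier(3)` is filled through `addLabelled(text, label)` and queried with `classify(text)`. Character n-grams need no language-specific tokenizer, which is one plausible reason such an approach could carry over between languages, although the abstract does not say which variant was used.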
