Journal: Computing and Informatics

Experiment on Methods for Clustering and Categorization of Polish Text


Abstract

The main goal of this work was to experimentally verify methods for the challenging task of categorizing and clustering Polish text. Supervised learning was used for categorization and unsupervised learning for clustering. The methods were examined in depth on a custom-built corpus of Polish texts, which the authors assembled from Internet resources. The corpus data came from a news portal, so the documents had already been sorted by type by journalists according to their specialization. The presented algorithms employ the Vector Space Model (VSM) and the TF-IDF (Term Frequency-Inverse Document Frequency) weighting scheme. A series of experiments revealed certain properties of the algorithms and their accuracy. Accuracy was evaluated as the algorithms' ability to reproduce the human arrangement of the documents by topic. For both categorization and clustering, the authors used the F-measure to assess the quality of allocation.
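
The building blocks named in the abstract (VSM with TF-IDF weighting, and the F-measure) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not state which TF-IDF variant, similarity measure, or F-measure formulation was used, so the length-normalized term frequency, the unsmoothed logarithmic IDF, the cosine similarity, and the per-class F1 below are assumptions, and the function names and toy documents are hypothetical.

```python
import math
from collections import Counter


def tf_idf_vectors(docs):
    """Return one sparse TF-IDF vector (dict: term -> weight) per tokenized document.

    Assumed variant: tf = count / document length, idf = ln(N / df).
    """
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append(
            {t: (c / len(doc)) * math.log(n_docs / df[t]) for t, c in counts.items()}
        )
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors in the vector space model."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def f_measure(predicted, actual, label):
    """Balanced F-measure (F1) for one class: harmonic mean of precision and recall."""
    pairs = list(zip(predicted, actual))
    tp = sum(p == label and a == label for p, a in pairs)
    fp = sum(p == label and a != label for p, a in pairs)
    fn = sum(p != label and a == label for p, a in pairs)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy, hand-tokenized documents (hypothetical data, not from the paper's corpus).
docs = [
    ["sport", "mecz", "gol"],
    ["sport", "liga", "mecz"],
    ["rząd", "sejm", "ustawa"],
]
vecs = tf_idf_vectors(docs)
sim_same_topic = cosine(vecs[0], vecs[1])
sim_other_topic = cosine(vecs[0], vecs[2])

# Compare predicted labels with the human (journalist) labels for the class "sport".
f1_sport = f_measure(
    ["sport", "sport", "politics", "politics"],
    ["sport", "politics", "politics", "politics"],
    "sport",
)
```

In this toy example the two sports documents share terms and therefore have a positive cosine similarity, while the sports and politics documents share none and score zero. A real pipeline for Polish would also need tokenization and probably lemmatization or stemming, which the sketch leaves out.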
