Classifying unlabeled short texts using a fuzzy declarative approach

Abstract

Web 2.0 provides user-friendly tools that allow people to create and publish content online. User-generated content often takes the form of short texts (e.g., blog posts, news feeds, snippets). This has motivated an increasing interest in the analysis of short texts and, specifically, in their categorisation. Text categorisation is the task of classifying documents into a certain number of predefined categories. Traditional text classification techniques are mainly based on statistical analysis of word frequency and have proved inadequate for the classification of short texts, where words occur too infrequently. Moreover, the classic approach to text categorisation is based on a learning process that requires a large number of labeled training texts to achieve accurate performance. However, labeled documents might not be available, whereas unlabeled documents can be easily collected. This paper presents an approach to text categorisation which does not need a pre-classified set of training documents. The proposed method only requires the category names as user input. Each of these categories is defined by means of an ontology of terms modelled by a set of what we call proximity equations. Hence, our method is not based on the frequency of occurrence of a category, but depends heavily on the definition of that category and on how well the text fits that definition. Therefore, the proposed approach is appropriate for short text classification, where the frequency of occurrence of a category is very small or even zero. Another feature of our method is that the classification process relies on the capacity of Bousi~Prolog, an extension of the standard Prolog language, for flexible matching and knowledge representation. This declarative approach provides a text classifier which is quick and easy to build, and a classification process which is easy for the user to understand. The results of experiments showed that the proposed method achieved a reasonably useful performance.
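
The idea of defining a category through proximity between terms, rather than through word frequency, can be illustrated with a minimal Python sketch. This is not the authors' Bousi~Prolog implementation: the terms, the proximity degrees, the additive scoring rule, and the function names `proximity`, `score`, and `classify` are all assumptions made for illustration.

```python
from typing import Dict, FrozenSet, List

# Hypothetical proximity equations: an unordered pair of terms and a degree in [0, 1].
# The terms and degrees below are invented for illustration only.
PROXIMITY: Dict[FrozenSet[str], float] = {
    frozenset({"sport", "football"}): 0.9,
    frozenset({"sport", "match"}): 0.7,
    frozenset({"sport", "goal"}): 0.6,
    frozenset({"economy", "market"}): 0.9,
    frozenset({"economy", "bank"}): 0.8,
    frozenset({"economy", "goal"}): 0.2,
}


def proximity(a: str, b: str) -> float:
    """Degree to which two terms are close (reflexive and symmetric)."""
    if a == b:
        return 1.0
    return PROXIMITY.get(frozenset({a, b}), 0.0)


def score(text: str, category: str) -> float:
    """Sum, over the words of the text, of each word's proximity to the category name."""
    return sum(proximity(word, category) for word in text.lower().split())


def classify(text: str, categories: List[str]) -> str:
    """Return the category name that the text fits best under the assumed scoring rule."""
    return max(categories, key=lambda c: score(text, c))


print(classify("late goal decides the football match", ["sport", "economy"]))
```

Note that the category name "sport" never appears in the example text; the text is assigned to it only through terms declared close to it. Under this simplified rule, a text with no related terms falls back to the first category listed, which a practical classifier would likely handle with a threshold.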

Bibliographic record

  • Source
    Language Resources and Evaluation | 2013, Issue 1 | pp. 151-178 | 28 pages
  • Author affiliations

    Department of Information Technologies and Systems, University of Castilla La Mancha, Paseo de la Universidad, 4, 13071 Ciudad Real, Spain;

    Department of Information Technologies and Systems, University of Castilla La Mancha, Paseo de la Universidad, 4, 13071 Ciudad Real, Spain;

    Department of Information Technologies and Systems, University of Castilla La Mancha, Paseo de la Universidad, 4, 13071 Ciudad Real, Spain;

    Department of Information Technologies and Systems, University of Castilla La Mancha, Paseo de la Universidad, 4, 13071 Ciudad Real, Spain;

    Department of Computer Science, Universidad Autonoma del Carmen, Ciudad del Carmen, CP 24160 Campeche, Mexico;

  • Indexing information
  • Original format: PDF
  • Language of text: English
  • Chinese Library Classification
  • Keywords

    Text categorization; Ontologies; Thesauri; Unlabeled short texts;

  • Date indexed: 2022-08-17 13:17:19
