
Text classification using a hidden Markov model.


Abstract

Text categorization (TC) is the task of automatically assigning digital text documents to predefined categories by analyzing their contents. The purpose of this study is to develop an effective TC model that addresses the difficulty of automatic classification. The study has two primary goals. First, a hidden Markov model (HMM) is proposed as a relatively new method for text categorization. HMMs have been applied to a wide range of text-processing tasks, such as text segmentation and event tracking, information retrieval, and information extraction, but few studies have applied them to TC. Second, the Library of Congress Classification (LCC) is adopted as the classification scheme for the HMM-based TC model. LCC has been used in only a handful of automatic classification experiments. In the proposed framework, a general prototype for an HMM-based TC model is designed, and an experimental model based on the prototype is implemented to categorize digitized documents into LCC classes. A sample of abstracts from the ProQuest Digital Dissertations database serves as the test base. Dissertation abstracts, which are pre-classified by professional librarians, form an ideal test base for evaluating the proposed model. For comparison, a Naive Bayesian model, which has been used extensively in TC applications, is also implemented. The experimental results show that the proposed model outperforms the Naive Bayesian model, as measured by comparing the automatic classification of abstracts with the manual classification performed by professionals.
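The abstract does not describe the internal design of the HMM-based model, so the following is only an illustrative sketch of one common way to classify text with HMMs, not the dissertation's method. It assumes one small HMM per category and assigns a document to the category whose model gives its word sequence the highest likelihood, computed with the forward algorithm in log space. The function names, the toy vocabulary, all probabilities, and the use of the LCC class letters Q and K as labels are assumptions made up for this example.

```python
import math


def log_sum_exp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    m = max(values)
    if m == float("-inf"):
        return m
    return m + math.log(sum(math.exp(v - m) for v in values))


def forward_log_likelihood(obs, start, trans, emit, unk=1e-6):
    """Return log P(obs | HMM) using the forward algorithm.

    start[s]    : probability of starting in state s
    trans[p][s] : probability of moving from state p to state s
    emit[s]     : dict mapping word -> emission probability in state s
    unk         : assumed floor probability for words a state never emits
    """
    n = len(start)
    # Initialisation: start in each state and emit the first word.
    alpha = [
        math.log(start[s]) + math.log(emit[s].get(obs[0], unk))
        for s in range(n)
    ]
    # Recursion: sum over all predecessor states, then emit the next word.
    for word in obs[1:]:
        alpha = [
            log_sum_exp([alpha[p] + math.log(trans[p][s]) for p in range(n)])
            + math.log(emit[s].get(word, unk))
            for s in range(n)
        ]
    # Termination: sum over all final states.
    return log_sum_exp(alpha)


def classify(obs, models):
    """Pick the category whose HMM scores the word sequence highest."""
    return max(models, key=lambda c: forward_log_likelihood(obs, *models[c]))


# Hypothetical two-state HMMs, one per category (made-up numbers).
models = {
    "Q": (  # LCC class Q: Science
        [0.6, 0.4],
        [[0.7, 0.3], [0.4, 0.6]],
        [
            {"algorithm": 0.5, "model": 0.4, "court": 0.1},
            {"data": 0.6, "model": 0.3, "statute": 0.1},
        ],
    ),
    "K": (  # LCC class K: Law
        [0.5, 0.5],
        [[0.6, 0.4], [0.5, 0.5]],
        [
            {"court": 0.6, "statute": 0.3, "model": 0.1},
            {"statute": 0.5, "court": 0.4, "data": 0.1},
        ],
    ),
}

science_label = classify(["algorithm", "model", "data"], models)
law_label = classify(["court", "statute", "court"], models)
print(science_label, law_label)
```

In a real system the start, transition, and emission probabilities would be estimated from pre-classified training abstracts rather than written by hand, and a Naive Bayesian baseline would be scored on the same documents for comparison.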

Bibliographic record

  • Author: Yi, Kwan
  • Author affiliation: McGill University (Canada)
  • Degree-granting institution: McGill University (Canada)
  • Subject: Information Science; Artificial Intelligence
  • Degree: Ph.D.
  • Year: 2005
  • Pages: 182
  • Original format: PDF
  • Language of text: English
