Published in: Advances in Knowledge Discovery and Data Mining (conference proceedings)

Incremental Extraction of Keyterms for Classifying Multilingual Documents in the Web

Abstract

With the rapid growth of the Web, there is a need for high-performance techniques for document collection and classification. The goal of our research is to develop a platform that discovers English, Traditional Chinese, and Simplified Chinese documents on the Web in the Greater China area and classifies them into a large number of subject classes. Three major challenges are encountered. First, the collection (i.e., the Web) is dynamic: new documents are added and the features of subject classes change constantly. Second, the documents must be classified within a large-scale taxonomy. Third, the collection contains documents written in different languages. A PAT-tree-based approach is developed to deal with document classification in dynamic collections. It uses a PAT tree as a working structure to extract keyterms from the documents in each subject class and then updates the features of the class accordingly. This feedback contributes immediately to the classification of incoming documents. In addition, we use a manually constructed set of keyterms as the basis for document classification in a large-scale taxonomy. Two sets of experiments were conducted to evaluate classification performance in a dynamic collection and in a large-scale taxonomy, respectively. Both experiments yielded encouraging results. We further suggest an approach, extended from the PAT-tree-based working structure, to deal with the classification of multilingual documents.
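The feedback loop described in the abstract (extract keyterms per class, update the class features, classify the next document) can be illustrated with a minimal sketch. This is not the authors' implementation: a real PAT tree is replaced here by plain per-class counts of character n-grams, and the class name `IncrementalKeytermClassifier`, the helper `char_ngrams`, the parameters `max_len` and `min_freq`, and the overlap-based scoring rule are all assumptions made for illustration. Character n-grams are used because they apply equally to English and Chinese text without a word segmenter.

```python
from collections import Counter


def char_ngrams(text, min_len=2, max_len=4):
    """Yield character n-grams inside each whitespace-delimited chunk."""
    for chunk in text.split():
        for n in range(min_len, max_len + 1):
            for i in range(len(chunk) - n + 1):
                yield chunk[i:i + n]


class IncrementalKeytermClassifier:
    """Hypothetical stand-in for the PAT-tree working structure.

    Each class keeps substring frequencies; substrings seen at least
    `min_freq` times in a class are treated as that class's keyterms.
    """

    def __init__(self, max_len=4, min_freq=2):
        self.max_len = max_len
        self.min_freq = min_freq
        self.counts = {}  # class label -> Counter of substring frequencies

    def update(self, label, text):
        """Feed a document into its class, updating the class features."""
        counter = self.counts.setdefault(label, Counter())
        counter.update(char_ngrams(text, max_len=self.max_len))

    def keyterms(self, label):
        """Return the substrings currently frequent enough to be keyterms."""
        counter = self.counts.get(label, {})
        return {t for t, c in counter.items() if c >= self.min_freq}

    def classify(self, text):
        """Pick the class whose keyterms overlap most with the document."""
        grams = set(char_ngrams(text, max_len=self.max_len))
        scores = {lab: len(grams & self.keyterms(lab)) for lab in self.counts}
        best = max(scores, key=scores.get, default=None)
        if best is None or scores[best] == 0:
            return None  # no class has any matching keyterm yet
        return best


clf = IncrementalKeytermClassifier()
clf.update("finance", "股票市场上涨")
clf.update("finance", "股票市场下跌")
clf.update("sports", "足球比赛结果")
clf.update("sports", "足球比赛直播")
print(clf.classify("今日股票市场"))  # finance
```

Because `update` only increments counts, a newly classified document can be fed back into its class at once, and any substring that crosses the `min_freq` threshold becomes available as a keyterm for the next incoming document. A PAT tree would serve the same role while indexing all suffixes compactly instead of enumerating fixed-length n-grams.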