Incremental Extraction of Keyterms for Classifying Multilingual Documents in the Web


Abstract

With the rapid growth of the Web, there is a need for high-performance techniques for document collection and classification. The goal of our research is to develop a platform that discovers English, traditional Chinese, and simplified Chinese documents on the Web in the Greater China area and classifies them into a large number of subject classes. Three major challenges are encountered. First, the collection (i.e., the Web) is dynamic: new documents are added and the features of subject classes change constantly. Second, the documents should be classified in a large-scale taxonomy. Third, the collection contains documents written in different languages. A PAT-tree-based approach is developed to deal with document classification in dynamic collections. It uses a PAT tree as a working structure to extract keyterms from the documents in each subject class and then updates the features of the class accordingly. This feedback immediately contributes to the classification of incoming documents. In addition, we make use of a manually constructed set of keyterms to serve as the base of document classification in a large-scale taxonomy. Two sets of experiments were done to evaluate the classification performance in a dynamic collection and in a large-scale taxonomy, respectively. Both experiments yielded encouraging results. We further suggest an approach, extended from the PAT-tree-based working structure, to deal with the classification of multilingual documents.
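
The extract-then-update loop described in the abstract can be illustrated with a small sketch. This is not the authors' PAT-tree implementation: as a labeled assumption, it swaps in plain per-class character n-gram counters for the PAT tree, and the class name `IncrementalKeytermClassifier`, the scoring formula, and the parameters `min_n`, `max_n`, and `top_k` are all hypothetical. It only shows how folding each new document into its class can change the class's keyterms, and hence later classifications, without retraining from scratch.

```python
import math
from collections import Counter, defaultdict


class IncrementalKeytermClassifier:
    """Toy stand-in for a PAT-tree working structure (assumption, not the paper's method)."""

    def __init__(self, min_n=2, max_n=4, top_k=50):
        self.min_n, self.max_n, self.top_k = min_n, max_n, top_k
        self.counts = defaultdict(Counter)  # class label -> n-gram -> frequency

    def ngrams(self, text):
        # Character n-grams need no word segmentation, so the same code
        # handles English and Chinese text.
        text = text.lower()
        for n in range(self.min_n, self.max_n + 1):
            for i in range(len(text) - n + 1):
                gram = text[i:i + n]
                if not any(ch.isspace() for ch in gram):
                    yield gram

    def update(self, label, text):
        # Incremental step: fold one new document into its class counts.
        self.counts[label].update(self.ngrams(text))

    def keyterms(self, label):
        # Score = frequency in the class, discounted when many classes share the n-gram.
        n_classes = len(self.counts)
        scored = {}
        for gram, freq in self.counts[label].items():
            df = sum(1 for c in self.counts.values() if gram in c)
            scored[gram] = freq * math.log((1 + n_classes) / df)
        top = sorted(scored.items(), key=lambda kv: (-kv[1], kv[0]))[: self.top_k]
        return dict(top)

    def classify(self, text):
        # Pick the class whose current keyterms best cover the document.
        doc = Counter(self.ngrams(text))
        best_label, best_score = None, float("-inf")
        for label in self.counts:
            score = sum(w * doc[g] for g, w in self.keyterms(label).items())
            if score > best_score:
                best_label, best_score = label, score
        return best_label


clf = IncrementalKeytermClassifier()
clf.update("finance", "stock market shares trading stock prices")
clf.update("sports", "football match goal team football league")
print(clf.classify("stock trading today"))
```

Because `update` only adds counts, a document in another language can be folded into an existing class at any time, and its n-grams become candidate keyterms on the next call to `classify`.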


Bibliographic details

Source: Advances in Knowledge Discovery and Data Mining, 2002, pp. 506-516 (11 pages)
Authors: Lee-Feng Chien; Chien-Kang Huang; Hsin-Chen Chiao; Shih-Jui Lin
Original format: PDF
Chinese Library Classification: automation technology, computer technology

Similar literature

1. Information Extraction in Unstructured Multilingual Web Documents [J]. Kolla Bhanu Prakash, M. A. Dorai Rangaswamy, T. V. Ananthan. Indian Journal of Science and Technology, 2015, No. 16.
2. Leveraging Wikipedia knowledge to classify multilingual biomedical documents [J]. Mourino Garcia Marcos Antonio, Perez Rodriguez Roberto, Anido Rifon Luis. Artificial Intelligence in Medicine, 2018, No. JUNa.
3. Classifying web genres in context: A case study documenting the web genres used by a software engineer [J]. Michela Montesi, Trilce Navarrete. Information Processing & Management, 2008, No. 4.
4. Incremental Extraction of Keyterms for Classifying Multilingual Documents in the Web [C]. Lee-Feng Chien, Chien-Kang Huang, Hsin-Chen Chiao. Pacific-Asia Conference on Knowledge Discovery and Data Mining, 2002.
5. Effect of ontology hierarchy on a concept vector machine's ability to classify web documents [D]. Graham, Jeffrey A. 2009.
6. Extraction of a group-pair relation: problem-solving relation from web-board documents [O]. Chaveevan Pechsiri, Rapepun Piriyakul.
7. El proyecto europeo MedIEQ (Quality Labelling of Medical Web Content Using Multilingual Information Extraction): la web semántica al servicio de los usuarios de salud [The European project MedIEQ: the semantic web at the service of health users] [O]. Mayer Pujadas Miquel Ángel, Leis Machín Angela, Karkaletsis Vangelis. 2006.
