
Classification of Skewed and Homogenous Document Corpora with Class-Based and Corpus-Based Keywords

Abstract

In this paper, we examine the performance of two policies for keyword selection over standard document corpora of varying properties. In the corpus-based policy, a single set of keywords is selected globally for all classes, whereas in the class-based policy a distinct set of keywords is selected locally for each class. We use SVM as the learning method and perform experiments with boolean and tf-idf weighting. Contrary to common belief, we show that using keywords instead of all words generally yields better performance, and that tf-idf weighting does not always outperform boolean weighting. Our results reveal that the corpus-based approach performs better for a large number of keywords, while the class-based approach performs better for a small number of keywords. In skewed datasets, class-based keyword selection performs consistently better than the corpus-based approach in terms of macro-averaged F-measure. In homogenous datasets, the performances of the class-based and corpus-based approaches are similar except for a small number of keywords.
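
The difference between the two selection policies can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not say how candidate keywords are scored, so ranking words by document frequency is an assumption made here, and the function names (`top_k_by_document_frequency`, `select_keywords`) and the toy corpus are hypothetical.

```python
from collections import Counter


def top_k_by_document_frequency(docs, k):
    """Return the k words found in the most documents (ties broken alphabetically).

    Document frequency is an assumed scoring metric for this sketch only.
    """
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))  # count each word once per document
    ranked = sorted(df.items(), key=lambda item: (-item[1], item[0]))
    return {word for word, _ in ranked[:k]}


def select_keywords(docs, labels, k, policy):
    """Map each class label to its keyword set under the given policy."""
    classes = sorted(set(labels))
    if policy == "corpus":
        # One global keyword set, shared by every class.
        shared = top_k_by_document_frequency(docs, k)
        return {c: shared for c in classes}
    if policy == "class":
        # A separate keyword set per class, scored on that class's documents only.
        return {
            c: top_k_by_document_frequency(
                [d for d, y in zip(docs, labels) if y == c], k
            )
            for c in classes
        }
    raise ValueError(f"unknown policy: {policy!r}")


# Toy skewed corpus: three "grain" documents, one "metal" document.
docs = [
    ["wheat", "price", "market"],
    ["wheat", "harvest", "market"],
    ["wheat", "export", "market"],
    ["gold", "price", "market"],
]
labels = ["grain", "grain", "grain", "metal"]

corpus_keywords = select_keywords(docs, labels, 2, "corpus")
class_keywords = select_keywords(docs, labels, 2, "class")
print(sorted(corpus_keywords["metal"]))  # ['market', 'wheat']
print(sorted(class_keywords["metal"]))   # ['gold', 'market']
```

In this toy corpus the global selection is dominated by the majority class, so the rare class receives no keyword of its own, while the class-based selection keeps `gold` for it. This is only an illustration of why the policies can diverge on skewed data, not a reproduction of the paper's experiments.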

Bibliographic details

Source: Annual German Conference on AI | 2007 | 11 pages
Authors: Arzucan Ozgur; Tunga Gungor
Original format: PDF
Chinese Library Classification: TP18-53
Date indexed: 2022-08-20 19:45:31
