Bag-of-Concepts representation for document classification based on automatic knowledge acquisition from probabilistic knowledge base

Abstract

Text representation, a crucial step in text mining and natural language processing, transforms unstructured textual data into structured numerical vectors to support various machine learning and data mining algorithms. For document classification, one classical and commonly adopted text representation method is the Bag-of-Words (BoW) model. BoW represents a document as a fixed-length vector of terms, where each term dimension holds a numerical value such as term frequency or tf-idf weight. However, BoW looks only at the surface form of words. It ignores the semantic, conceptual and contextual information of texts, and also suffers from high dimensionality and sparsity. To address these issues, we propose a novel document representation scheme called Bag-of-Concepts (BoC), which automatically acquires useful conceptual knowledge from an external knowledge base, then conceptualizes words and phrases in the document into higher-level semantics (i.e., concepts) in a probabilistic manner, and finally represents the document as a distributed vector in the learned concept space. By utilizing background knowledge from the knowledge base, the BoC representation provides more semantic and conceptual information about texts, as well as better interpretability for human understanding. We also propose the Bag-of-Concept-Clusters (BoCCl) model, which clusters semantically similar concepts together and performs entity sense disambiguation to further improve the BoC representation. In addition, we combine the BoCCl and BoW representations using an attention mechanism to effectively utilize both concept-level and word-level information and achieve optimal performance for document classification. (c) 2019 Published by Elsevier B.V.
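
The contrast between a term-count vector and a probabilistic concept vector can be illustrated with a minimal Python sketch. This is not the authors' implementation: the toy knowledge base `TOY_KB`, its probabilities, the whitespace tokenizer, and the choice to sum P(concept | term) over tokens and normalize are all assumptions made purely for illustration.

```python
from collections import Counter, defaultdict

# Hypothetical toy knowledge base mapping a term to P(concept | term).
# The entries and probabilities are invented for illustration only.
TOY_KB = {
    "apple": {"fruit": 0.6, "company": 0.4},
    "banana": {"fruit": 1.0},
    "microsoft": {"company": 1.0},
}


def tokenize(text):
    """Lowercase the text and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


def bag_of_words(text, vocabulary):
    """Return raw term frequencies, one dimension per vocabulary term."""
    counts = Counter(tokenize(text))
    return [counts[term] for term in vocabulary]


def bag_of_concepts(text, kb, concepts):
    """Return a normalized concept vector, one dimension per concept.

    Assumption: each token contributes P(concept | term) to every concept
    the knowledge base lists for it, and the result is scaled to sum to 1.
    Tokens missing from the knowledge base contribute nothing.
    """
    weights = defaultdict(float)
    for term in tokenize(text):
        for concept, prob in kb.get(term, {}).items():
            weights[concept] += prob
    total = sum(weights.values())
    if total == 0:
        return [0.0] * len(concepts)
    return [weights[concept] / total for concept in concepts]


if __name__ == "__main__":
    doc = "apple banana apple"
    print(bag_of_words(doc, ["apple", "banana", "microsoft"]))
    print(bag_of_concepts(doc, TOY_KB, ["fruit", "company"]))
```

In this sketch the concept vector has one dimension per concept rather than one per vocabulary term, and the ambiguous term "apple" spreads its weight across two concepts, which is the kind of ambiguity that the entity sense disambiguation step mentioned in the abstract is meant to resolve.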

Bibliographic details

Source: Knowledge-Based Systems, 2020, Issue 6, pp. 105436.1-105436.14 (14 pages)
Author affiliations:

Nanyang Technological University, School of Electrical and Electronic Engineering, 50 Nanyang Avenue, Singapore 639798, Singapore
Nanyang Technological University, Interdisciplinary Graduate School, 21 Nanyang Link, Singapore 637371, Singapore

Original format: PDF
Language: English
Keywords:
Natural language processing; Text representation; Document classification; Knowledge base; Interpretability;

Similar documents

1. Fusion of probabilistic knowledge-based classification rules and learning automata for automatic recognition of digital images [J]. Dario Maravall, Javier de Lope, Juan Pablo Fuentes. Pattern Recognition Letters, 2013, Issue 14.
2. Research on Patent-Knowledge Representation and Automatic Classification Based on Situation Mapping [J]. Zilin Xu, Wenqiang Li, Yan Li. Mobile Information Systems, 2020, Issue 1.
3. Knowledge Representation in Probabilistic Spatio-Temporal Knowledge Bases [J]. Grant John, Parisi Francesco. The Journal of Artificial Intelligence Research, 2016, Issue 12.
4. Fully Automatic Acquisition of Taxonomic Knowledge from Large Corpora of Texts: Limited Syntax Knowledge Representation System Based on Natural Language [C]. Lucja Iwanska, Naveen Mata, Kellyn Kruger. International Symposium on Foundations of Intelligent Systems, 1999.
5. A framework for knowledge acquisition, representation and problem-solving in knowledge-based planning [D]. Martinez-Bermudez, Iliana. 2001.
6. Biomedical literature classification using encyclopedic knowledge: a Wikipedia-based bag-of-concepts approach [O]. Marcos Antonio Mouriño García, Roberto Pérez Rodríguez, Luis E. Anido Rifón. 2015.
7. Utilizing Data and Knowledge Mining for Probabilistic Knowledge Bases [R]. Stein, D. J. 1996.
