Text mining is a knowledge-intensive process whose main purpose is to process large amounts of unstructured data effectively and efficiently. Because the amount of raw text available is growing rapidly, there is a strong need for methods that can classify or index it automatically. In this context, an essential task is the semantic processing of natural language, which provides sound input to text classification or categorization. One important step is stemming, the process of reducing a word to its root (or stem). When a text is pre-processed for mining, stemming maps each inflected or derived variant of a word back to its root so that subsequent steps can process the language more reliably. A particular challenge is stemming composite words, which in many languages make up a large part of everyday vocabulary. In this paper we develop a novel rule-based algorithm for stemming composite words, and we show through extensive experiments that stemming composite words greatly improves text classification accuracy.
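
To make the notions of stemming and composite-word stemming concrete, the following is a minimal sketch and not the rule-based algorithm developed in the paper. It assumes a toy suffix list (`SUFFIXES`) and a toy lexicon of known stems (`LEXICON`); both, along with the function names `stem` and `split_composite`, are hypothetical and chosen only for illustration.

```python
from typing import List

# Toy illustration only: NOT the paper's algorithm.
# Both tables below are assumptions made for this sketch.
SUFFIXES = ("ing", "ers", "er", "es", "s")           # assumed suffix list
LEXICON = {"book", "shelf", "foot", "ball", "play"}  # assumed known stems


def stem(word: str) -> str:
    """Strip the first matching suffix, keeping at least three characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def split_composite(word: str) -> List[str]:
    """Split a word into two known stems if possible; otherwise stem it whole."""
    word = word.lower()
    # Try every split point that leaves at least three characters on each side.
    for i in range(3, len(word) - 2):
        head, tail = word[:i], stem(word[i:])
        if head in LEXICON and tail in LEXICON:
            return [head, tail]
    return [stem(word)]


if __name__ == "__main__":
    for w in ("playing", "footballs", "bookshelf", "players"):
        print(w, "->", split_composite(w))
```

Under these assumptions, a composite such as "footballs" is reduced to the two stems "foot" and "ball" rather than to a single truncated string, which is the kind of normalization a downstream classifier could benefit from. A realistic stemmer would need language-specific rules and a far larger lexicon.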