With the rapid growth of the Web, there is a need of high-performance techniques for document collection and classification. The goal of our research is to develop a platform to discover English, traditional and simplified Chinese documents from the Web in the Greater China area and classify them into a large number of subject classes. Three major challenges are encountered. First, the collection (i.e., the Web) is dynamic: new documents are added in and the features of subject classes change constantly. Second, the documents should be classified in a large-scale taxonomy. Third, the collection contains documents written in different languages. A PAT-tree-based approach is developed to deal with document classification in dynamic collections. It uses PAT tree as a working structure to extract keyterms from documents in each subject class and then update the features of the class accordingly. The feedback will contribute to the classification of the incoming documents immediately. In addition, we make use of a manually-constructed keyterms to serve as the base of document classification in a large-scale taxonomy. Two sets of experiments were done to evaluate the classification performance in a dynamic collection and in a large-scale taxonomy respectively. Both of the experiments yielded encouraging results. We further suggest an approach extended from the PAT-tree-based working structure to deal with classification in multilingual documents.
展开▼