Discovering and understanding how research topics develop within a community is useful for identifying important milestones and prominent work. Recent work on detecting topics in scientific corpora has used latent Dirichlet allocation (LDA) to explore the topics of papers. These systems usually use paper abstracts as the corpus instead of full texts. However, LDA is based on the bag-of-words model, so its accuracy is low on such short texts. The current trend for improvement is to add prior knowledge to the analysis process, most recently with Source-LDA, an algorithm presented by Justin Wood et al. at UCLA in 2017. We found that Source-LDA has some shortcomings. First, like LDA, it relies on word counts, so short texts reduce its accuracy. Second, the knowledge source used by the algorithm is constructed manually from labeled text data, which makes Source-LDA a supervised method. We therefore propose an approach that automatically constructs the knowledge source for Source-LDA from unlabeled data, under the assumption that a paper often cites papers covering related topics. This approach both integrates source knowledge in an unsupervised manner and mitigates the short-text issue by using information from the citation network. In its first stage, the proposed method has achieved encouraging results.