The keyphrases of a document can be regarded as its condensed summary. Unsupervised models extract keyphrases using only the information contained in the document itself, without reference to other documents. Supervised models for keyphrase generation can perform well, but they require a massive effort to build training data and do not generalize to new domains. Moreover, in terms of human perception, a user comprehends the topic of a document better after having read other documents on the same topic. Based on this idea, we propose a collaborative keyphrase generation system (CollabKG): a novel semi-supervised method that leverages a limited amount of labeled data, which users enrich over time. We conduct our research on a large-scale dataset of 500,000 Vietnamese administrative documents. In CollabKG, each document is represented as a feature vector, and a cluster pruning algorithm is employed to accelerate the search for the most similar documents. The generated keyphrases were manually evaluated for relevance and accuracy, and the results received a high level of approval. We therefore conclude that CollabKG performs well and is suitable for a real-time system.
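
To illustrate how cluster pruning can speed up the search for similar documents, the following is a minimal sketch of the textbook variant of the technique: roughly sqrt(N) documents are chosen at random as leaders, every document is attached to its closest leader, and a query is compared only against the followers of its closest leader. The abstract does not state how CollabKG implements this step, so the use of cosine similarity, the random choice of leaders, and the names `cosine`, `build_index`, and `most_similar` are all assumptions made for illustration.

```python
import math
import random


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def build_index(vectors, seed=0):
    """Pick about sqrt(N) random leaders and attach each document to its closest leader.

    Returns a dict mapping leader id -> list of follower document ids.
    Assumption: leaders are sampled uniformly at random, as in the textbook variant.
    """
    rng = random.Random(seed)
    n = len(vectors)
    n_leaders = max(1, math.isqrt(n))
    leaders = rng.sample(range(n), n_leaders)
    followers = {leader: [] for leader in leaders}
    for doc_id, vec in enumerate(vectors):
        closest = max(leaders, key=lambda lead: cosine(vec, vectors[lead]))
        followers[closest].append(doc_id)
    return followers


def most_similar(query, vectors, followers, top_k=3):
    """Rank only the followers of the leader closest to the query vector."""
    closest = max(followers, key=lambda lead: cosine(query, vectors[lead]))
    candidates = followers[closest]
    ranked = sorted(candidates, key=lambda d: cosine(query, vectors[d]), reverse=True)
    return ranked[:top_k]
```

With an index built once by `build_index(vectors)`, a call such as `most_similar(query_vector, vectors, followers)` compares the query against the leaders and one cluster rather than against all N documents. The trade-off is that the result is approximate: a truly similar document attached to a different leader can be missed.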