Traditional methods for measuring text similarity are not accurate enough for complex text, especially long text. We find that this is mainly because their feature representations are not expressive enough. To improve the accuracy of long-text similarity, we propose an algorithm that uses a pre-trained deep learning model to extract features from long text. On the THUCNews benchmark corpus, our method achieves an accuracy 5.4% higher than that of the traditional algorithm. In addition, we perform ablation experiments to measure the improvement contributed by fine-tuning.