Venue: International Conference on Intelligent Text Processing and Computational Linguistics

A New Russian Paraphrase Corpus. Paraphrase Identification and Classification Based on Different Prediction Models


Abstract

Our main objectives are to construct a paraphrase corpus for Russian and to develop paraphrase identification and classification models based on this corpus. The corpus consists of pairs of news headlines from different media agencies, which are extracted and analyzed in real time. Paraphrase candidates are extracted using an unsupervised matrix similarity metric: if the metric value satisfies a certain threshold, the corresponding pair of sentences is included in the corpus. These sentence pairs are then annotated via crowdsourcing. We provide a user-friendly online interface for crowdsourced annotation, available at http://paraphraser.ru. The corpus currently contains 7480 annotated sentence pairs, and more are being added. The types and features of these sentence pairs are not disclosed to the annotators. We adopt a three-class classification of paraphrases and distinguish precise paraphrases (conveying the same meaning), loose paraphrases (conveying similar meaning) and non-paraphrases (conveying different meaning).
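
The abstract does not specify the matrix similarity metric or the threshold, so the following is only a minimal sketch of threshold-based candidate filtering under stated assumptions. It builds a token-by-token similarity matrix using exact word match (an assumption; the authors' metric may use a different word-level similarity), averages the best match per row and per column, and keeps a headline pair when the score reaches a hypothetical cutoff. The names `matrix_similarity`, `is_candidate` and `THRESHOLD` are illustrative and do not come from the paper.

```python
from typing import List

# Hypothetical cutoff; the abstract only says the metric must
# "satisfy a certain threshold" and gives no value.
THRESHOLD = 0.5


def tokenize(sentence: str) -> List[str]:
    # Naive lowercase whitespace tokenizer (assumption; real Russian text
    # would likely need lemmatization and punctuation handling).
    return sentence.lower().split()


def similarity_matrix(a: List[str], b: List[str]) -> List[List[float]]:
    # Entry [i][j] is 1.0 when token i of a equals token j of b, else 0.0.
    return [[1.0 if x == y else 0.0 for y in b] for x in a]


def matrix_similarity(s1: str, s2: str) -> float:
    a, b = tokenize(s1), tokenize(s2)
    if not a or not b:
        return 0.0
    m = similarity_matrix(a, b)
    # Best match for each token of the first sentence, averaged.
    row_best = sum(max(row) for row in m) / len(a)
    # Best match for each token of the second sentence, averaged.
    col_best = sum(max(m[i][j] for i in range(len(a)))
                   for j in range(len(b))) / len(b)
    # Symmetric score in the range [0, 1].
    return (row_best + col_best) / 2


def is_candidate(s1: str, s2: str, threshold: float = THRESHOLD) -> bool:
    # A pair becomes a paraphrase candidate when the score reaches the cutoff.
    return matrix_similarity(s1, s2) >= threshold
```

In a pipeline like the one described, pairs passing `is_candidate` would then go to crowdsourced annotation, where each pair receives one of the three labels. A threshold criterion could equally be a range rather than a single lower bound; the abstract leaves this open.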
