Journal: ACM Transactions on Internet Technology

Duplicate Detection in Programming Question Answering Communities


Abstract

Community-based Question Answering (CQA) websites have attracted increasing numbers of users and contributors in recent years. However, duplicate questions occur frequently on CQA websites and are currently identified manually by moderators. Automatic duplicate detection, on the one hand, reduces the laborious effort moderators spend before they take closing actions and, on the other hand, helps question askers find answers quickly. A number of studies have looked into related problems, but very few target duplicate detection in Programming CQA (PCQA), a branch of CQA dedicated to programmers. Existing works frame the task as a supervised learning problem over question pairs and rely only on textual features. Moreover, the issue of selecting candidate duplicates from large volumes of historical questions is often left unaddressed. To tackle these issues, we model duplicate detection as a two-stage "ranking-classification" problem over question pairs. In the first stage, we rank the historical questions by their similarity to the newly issued question and select the top-ranked ones as candidates, which reduces the search space. In the second stage, we develop novel features that capture both textual similarity and latent semantics of question pairs, leveraging techniques from the deep learning and information retrieval literature. Experiments on real-world questions about multiple programming languages demonstrate that our method works very well, in some cases improving on state-of-the-art benchmarks by up to 25%.
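The two-stage "ranking-classification" idea can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the ranking function, the features, or the classifier, so everything below is an assumption chosen for brevity. Stage 1 uses TF-IDF cosine similarity as a stand-in ranker, and stage 2 replaces the trained classifier and its deep-learning features with a simple threshold on two hand-picked similarity scores. All function names, `k`, and `threshold` are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_vectors(docs):
    """Build one sparse TF-IDF vector (a dict) per tokenized document."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    # Smoothed IDF so that terms present in every document keep a weight.
    idf = {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}
    return [{t: tf * idf[t] for t, tf in Counter(doc).items()} for doc in docs]


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def rank_candidates(new_question, history, k=3):
    """Stage 1 (assumed ranker): top-k historical questions by TF-IDF cosine.

    Returns a list of (history_index, score) pairs, best first.
    """
    docs = [tokenize(q) for q in history] + [tokenize(new_question)]
    vecs = tfidf_vectors(docs)
    query = vecs[-1]
    scored = [(cosine(query, v), i) for i, v in enumerate(vecs[:-1])]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [(i, score) for score, i in scored[:k]]


def jaccard(a, b):
    """Token-set overlap between two questions (a placeholder pair feature)."""
    ta, tb = set(tokenize(a)), set(tokenize(b))
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0


def detect_duplicates(new_question, history, k=3, threshold=0.5):
    """Stage 2 (placeholder classifier): keep candidates whose mean feature
    score reaches the threshold. A real system would train a classifier on
    labeled question pairs instead of using this fixed rule.
    """
    duplicates = []
    for i, cos_score in rank_candidates(new_question, history, k):
        features = [cos_score, jaccard(new_question, history[i])]
        if sum(features) / len(features) >= threshold:
            duplicates.append(i)
    return duplicates


if __name__ == "__main__":
    history = [
        "How do I reverse a list in Python?",
        "What is a null pointer exception in Java?",
        "How to sort a dictionary by value in Python?",
    ]
    new_question = "How can I reverse a list in Python?"
    print(rank_candidates(new_question, history, k=2))
    print(detect_duplicates(new_question, history))
```

The point of the split is cost: the cheap ranker runs against the whole history, while the more expensive pairwise features are computed only for the `k` surviving candidates.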