Conference: International Conference on Advanced Language Processing and Web Information Technology

Near-Duplicates Detection for Vietnamese Documents In Large Database

Abstract

Near-duplicate documents exacerbate the problem of information overload. Research on detecting near-duplicates has attracted considerable attention from both industry and academia. In this paper, we address this problem for Vietnamese documents, which, to the best of our knowledge, has not been done before. Most current algorithms were designed for English and are not directly applicable to Vietnamese, a monosyllabic language. We propose combining Charikar's algorithm [2] with a "weighting scheme" and Vietnamese-specific features to address the intricacies of the language. Experimental results indicate that our scheme is effective for detecting near-duplicates in a corpus of Vietnamese documents.
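
For readers unfamiliar with the technique, the following is a minimal sketch of a Charikar-style fingerprint (often called simhash) computed over weighted features. It is not the authors' implementation: the abstract does not specify the weighting scheme or the Vietnamese-specific features, so the sketch assumes whitespace-separated syllables as features and raw counts as weights. The names `feature_hash`, `simhash`, `hamming_distance`, and `syllable_weights` are hypothetical.

```python
import hashlib
from collections import Counter
from typing import Dict, Mapping


def feature_hash(feature: str, bits: int = 64) -> int:
    """Map a feature string to a fixed-width integer hash (MD5 is an arbitrary choice)."""
    digest = hashlib.md5(feature.encode("utf-8")).digest()
    return int.from_bytes(digest[: bits // 8], "big")


def simhash(weighted_features: Mapping[str, float], bits: int = 64) -> int:
    """Charikar-style fingerprint: each feature votes on every bit with its weight."""
    totals = [0.0] * bits
    for feature, weight in weighted_features.items():
        h = feature_hash(feature, bits)
        for i in range(bits):
            # Add the weight where the feature's hash bit is 1, subtract it otherwise.
            if (h >> i) & 1:
                totals[i] += weight
            else:
                totals[i] -= weight
    fingerprint = 0
    for i, total in enumerate(totals):
        if total > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def syllable_weights(text: str) -> Dict[str, float]:
    # Assumption: whitespace-separated syllables as features, raw counts as weights.
    # The paper's actual features and weighting scheme are not given in the abstract.
    return {s: float(c) for s, c in Counter(text.lower().split()).items()}
```

Two documents would then be treated as near-duplicates when the Hamming distance between their fingerprints falls below some threshold; the threshold is a tuning parameter that the abstract does not state.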
