SemGen: Towards a Semantic Data Generator for Benchmarking Duplicate Detectors

Abstract

Benchmarking the quality of duplicate detection methods requires comprehensive knowledge of the duplicate pairs in a test data set, as well as test data of sufficient size and variability. Extending real-world data sets with artificially created data is promising, but current approaches to such synthetic data generation work solely on a quantitative level. As a result, duplicate semantics are represented only implicitly, and the variability of the generated data is insufficiently configurable. In this paper we propose SemGen, a semantics-driven approach to synthetic data generation. SemGen first diversifies real-world objects on a qualitative level and then, in a second step, generates quantitative values. To demonstrate the applicability of SemGen, we show how to define duplicate semantics for the domain of road traffic management. A discussion of lessons learned concludes the paper.