Workshop on e-Commerce and NLP

Scalable Approach for Normalizing E-commerce Text Attributes (SANTA)

Abstract

In this paper, we present SANTA, a scalable framework to automatically normalize E-commerce attribute values (e.g. "Win 10 Pro") to a fixed set of pre-defined canonical values (e.g. "Windows 10"). Earlier work on attribute normalization focused on fuzzy string matching (also referred to as syntactic matching in this paper). In this work, we first perform an extensive study of nine syntactic matching algorithms and establish that 'cosine' similarity leads to the best results, showing a 2.7% improvement over the commonly used Jaccard index. Next, we argue that string similarity alone is not sufficient for attribute normalization, as many surface forms require going beyond syntactic matching (e.g. "720p" and "HD" are synonyms). While semantic techniques like unsupervised embeddings (e.g. word2vec/fastText) have shown good results in word similarity tasks, we observed that they perform poorly at distinguishing between close canonical forms, as these close forms often occur in similar contexts. We propose to learn token embeddings using a twin network with triplet loss. We propose an embedding learning task that leverages raw attribute values and product titles to learn these embeddings in a self-supervised fashion. We show that providing supervision using our proposed task improves over both syntactic and unsupervised-embedding-based techniques for attribute normalization. Experiments on a real-world attribute normalization dataset of 50 attributes show that the embeddings trained using our proposed approach obtain a 2.3% improvement over the best string matching and a 19.3% improvement over the best unsupervised embeddings.
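The two ideas in the abstract, syntactic matching against a fixed canonical set and a triplet objective for learned embeddings, can be illustrated with a minimal Python sketch. This is not the authors' implementation. The abstract does not say how strings are tokenized or which distance the triplet loss uses, so the character bigrams, the squared Euclidean distance, the margin of 1.0, and every function name below (`char_ngrams`, `cosine_similarity`, `jaccard_index`, `normalize`, `triplet_loss`) are assumptions made for illustration.

```python
import math
from collections import Counter


def char_ngrams(text, n=2):
    """Lowercase the text and count its overlapping character n-grams.

    Character bigrams are an assumption; the abstract does not state
    which string representation the syntactic matchers use.
    """
    s = text.lower()
    if len(s) < n:
        return Counter([s]) if s else Counter()
    return Counter(s[i:i + n] for i in range(len(s) - n + 1))


def cosine_similarity(a, b, n=2):
    """Cosine similarity between the n-gram count vectors of two strings."""
    va, vb = char_ngrams(a, n), char_ngrams(b, n)
    dot = sum(count * vb[gram] for gram, count in va.items())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def jaccard_index(a, b, n=2):
    """Jaccard index between the n-gram sets of two strings."""
    sa, sb = set(char_ngrams(a, n)), set(char_ngrams(b, n))
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0


def normalize(raw_value, canonical_values, similarity=cosine_similarity):
    """Map a raw attribute value to its most similar canonical value."""
    return max(canonical_values, key=lambda c: similarity(raw_value, c))


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Hinge-style triplet loss on squared Euclidean distances.

    The loss is zero once the negative is at least `margin` farther
    from the anchor than the positive is. Distance and margin are
    illustrative choices, not taken from the paper.
    """
    d_pos = sum((x - y) ** 2 for x, y in zip(anchor, positive))
    d_neg = sum((x - y) ** 2 for x, y in zip(anchor, negative))
    return max(0.0, d_pos - d_neg + margin)


if __name__ == "__main__":
    canon = ["Windows 10", "Windows 7", "Linux"]
    print(normalize("Win 10 Pro", canon))
    print(cosine_similarity("720p", "HD"))
```

Under these assumptions, `normalize("Win 10 Pro", ["Windows 10", "Windows 7", "Linux"])` selects "Windows 10" with either similarity function, while `cosine_similarity("720p", "HD")` is 0.0 because the two strings share no bigrams. The second case is the kind of synonym pair that string matching cannot resolve, which is the gap the learned embeddings are meant to close.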