Scalable Approach for Normalizing E-commerce Text Attributes (SANTA)

Abstract

In this paper, we present SANTA, a scalable framework to automatically normalize E-commerce attribute values (e.g. "Win 10 Pro") to a fixed set of pre-defined canonical values (e.g. "Windows 10"). Earlier works on attribute normalization focused on fuzzy string matching (also referred to as syntactic matching in this paper). In this work, we first perform an extensive study of nine syntactic matching algorithms and establish that 'cosine' similarity leads to the best results, showing a 2.7% improvement over the commonly used Jaccard index. Next, we argue that string similarity alone is not sufficient for attribute normalization, as many surface forms require going beyond syntactic matching (e.g. "720p" and "HD" are synonyms). While semantic techniques like unsupervised embeddings (e.g. word2vec/fastText) have shown good results in word similarity tasks, we observed that they perform poorly at distinguishing between close canonical forms, as these close forms often occur in similar contexts. We propose to learn token embeddings using a twin network with triplet loss. We propose an embedding learning task leveraging raw attribute values and product titles to learn these embeddings in a self-supervised fashion. We show that providing supervision using our proposed task improves over both syntactic and unsupervised-embedding-based techniques for attribute normalization. Experiments on a real-world attribute normalization dataset of 50 attributes show that the embeddings trained using our proposed approach obtain a 2.3% improvement over the best string matching and a 19.3% improvement over the best unsupervised embeddings.
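The two ideas in the abstract, syntactic matching against canonical values and a triplet loss over embeddings, can be illustrated with a minimal sketch. The abstract does not say how strings are tokenized for cosine similarity, which distance the triplet loss uses, or what margin is applied, so the character-trigram counts, the Euclidean distance, the `margin=1.0` default, and all function names below are assumptions made for illustration and not the authors' implementation.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """Count character n-grams of a lowercased, space-padded string.

    Character trigrams are an assumed tokenization, not taken from the paper.
    """
    padded = f" {text.lower().strip()} "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def cosine_similarity(a, b):
    """Cosine similarity between the n-gram count vectors of two strings."""
    va, vb = char_ngrams(a), char_ngrams(b)
    dot = sum(count * vb[gram] for gram, count in va.items())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def jaccard_index(a, b):
    """Jaccard index between the n-gram sets of two strings."""
    sa, sb = set(char_ngrams(a)), set(char_ngrams(b))
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0


def normalize(raw_value, canonical_values, similarity=cosine_similarity):
    """Return the canonical value most similar to the raw attribute value."""
    return max(canonical_values, key=lambda c: similarity(raw_value, c))


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Standard hinge-style triplet loss on Euclidean distances.

    The loss is zero once the negative is at least `margin` farther from
    the anchor than the positive is. Distance and margin are assumptions.
    """
    d_pos = math.dist(anchor, positive)
    d_neg = math.dist(anchor, negative)
    return max(0.0, d_pos - d_neg + margin)
```

Under these assumptions, `normalize("Win 10 Pro", ["Windows 10", "Windows 7", "macOS"])` selects "Windows 10", while `cosine_similarity("720p", "HD")` is 0.0 because the two strings share no trigrams. That second case is the kind of synonym pair the abstract says needs learned embeddings in place of string matching.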

Bibliographic details

Source
Workshop on e-Commerce and NLP | 2021 | pp. 101-110 | 10 pages
Authors
Ravi Shankar Mishra; Kartik Mehta; Nikhil Rasiwasia
Original format: PDF

