Conference: International Conference on Electronic Commerce and Web Technologies

The WDC Gold Standards for Product Feature Extraction and Product Matching

Abstract

Finding out which e-shops offer a specific product is a central challenge in building integrated product catalogs and comparison shopping portals. Determining whether two offers refer to the same product involves extracting a set of features (product attributes) from the web pages containing the offers and comparing these features using a matching function. The existing gold standards for product matching have two shortcomings: (i) they only contain offers from a small number of e-shops and thus do not properly cover the heterogeneity found on the Web; and (ii) they only provide a small number of generic product attributes and therefore cannot be used to evaluate whether detailed product attributes have been correctly extracted from textual product descriptions. To overcome these shortcomings, we have created two public gold standards. The WDC Product Feature Extraction Gold Standard consists of over 500 product web pages originating from 32 different websites, on which we have annotated all product attributes (338 distinct attributes) that appear in product titles, product descriptions, tables, and lists. The WDC Product Matching Gold Standard consists of over 75,000 correspondences between 150 products (mobile phones, TVs, and headphones) in a central catalog and offers for these products on the 32 websites. To verify that the gold standards are challenging enough, we ran several baseline feature extraction and matching methods, which yielded F-scores in the range 0.39 to 0.67. In addition to the gold standards, we also provide a corpus of 13 million product pages from the same websites, which might be useful as background knowledge for training feature extraction and matching methods.
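To make the two steps named in the abstract concrete (comparing extracted features with a matching function, then scoring the result with an F-score), the following is a minimal sketch. It is not one of the paper's baseline methods: the Jaccard similarity over attribute-value pairs, the 0.5 threshold, and the function names `jaccard`, `is_match`, and `f1_score` are all assumptions chosen for illustration.

```python
from typing import Dict, Set, Tuple


def jaccard(offer: Dict[str, str], product: Dict[str, str]) -> float:
    """Jaccard similarity over (attribute, value) pairs.

    Values are lower-cased so that "Acme" and "acme" count as equal.
    This similarity measure is an illustrative assumption.
    """
    pairs_a = {(k, str(v).lower()) for k, v in offer.items()}
    pairs_b = {(k, str(v).lower()) for k, v in product.items()}
    if not pairs_a and not pairs_b:
        return 0.0
    return len(pairs_a & pairs_b) / len(pairs_a | pairs_b)


def is_match(offer: Dict[str, str], product: Dict[str, str],
             threshold: float = 0.5) -> bool:
    """Hypothetical matching function: match if similarity reaches the threshold."""
    return jaccard(offer, product) >= threshold


def f1_score(predicted: Set[Tuple[str, str]],
             gold: Set[Tuple[str, str]]) -> float:
    """F1 of predicted (offer_id, product_id) correspondences against a gold set."""
    true_positives = len(predicted & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)
```

In this sketch a gold standard plays the role of the `gold` set: each element is one annotated correspondence between an offer and a catalog product, and a matching method is scored by how well its predicted set overlaps with it.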