Web-based closed-domain data extraction on online advertisements

Abstract

Taking advantage of the popularity of the web, online marketplaces such as Ebay (.com), advertisement (ad) websites such as Craigslist (.org), and commercial websites such as Carmax (.com) allow users to post ads for a variety of products and services. Instead of browsing through numerous websites to locate ads of interest, web users would benefit from a single, fully integrated database (DB) of ads in multiple domains, such as Cars-for-Sale and Job-Postings, populated from various online sources so that ads of interest can be retrieved at a centralized site. Since existing ad websites impose their own structures and formats for storing and accessing ads, generating a uniform, integrated ad repository is not a trivial task. The challenges include (i) identifying ad domains, (ii) dealing with the diversity in the structures of ads across ad domains, and (iii) analyzing data that carry different meanings in each ad domain. To handle these problems, we introduce ADEx, a tool that relies on various machine learning approaches to automate the extraction of (un-/semi-/fully-structured) data from online ads and to create ad records archived in an underlying DB, through domain classification, keyword tagging, and identification of valid attribute values. Experimental results on a dataset of 18,000 online ads originating from Craigslist, Ebay, and KSL (.com) show that ADEx outperforms existing text classification, keyword labeling, and data extraction approaches. Further evaluations verify that ADEx either outperforms or performs at least as well as current state-of-the-art information extractors in mapping data from unstructured or (semi-)structured sources into DB records.
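
The three stages named in the abstract (domain classification, keyword tagging, and identification of valid attribute values) can be illustrated with a minimal Python sketch. This is not ADEx itself: the abstract says ADEx relies on machine learning approaches, whereas the sketch swaps in hand-written vocabularies and regular expressions purely to show the data flow from raw ad text to a DB-style record. All names, vocabularies, patterns, and value ranges below are hypothetical.

```python
import re

# Hypothetical toy vocabularies, invented for illustration only.
DOMAIN_VOCAB = {
    "Cars-for-Sale": {"car", "sedan", "miles", "engine", "honda", "toyota"},
    "Job-Postings": {"job", "hiring", "salary", "resume", "experience"},
}
CAR_MAKES = {"honda", "toyota", "ford"}


def words(text):
    """Lowercase alphabetic tokens of an ad."""
    return re.findall(r"[a-z]+", text.lower())


def classify_domain(text):
    """Stage 1: pick the domain whose vocabulary overlaps the ad the most.

    Ties (including zero overlap) fall back to the first domain listed.
    """
    tokens = set(words(text))
    return max(DOMAIN_VOCAB, key=lambda d: len(tokens & DOMAIN_VOCAB[d]))


def to_int(number_text):
    """Convert a string such as '85,000' to the integer 85000."""
    return int(number_text.replace(",", ""))


def tag_car_ad(text):
    """Stage 2: tag candidate attribute values using simple patterns."""
    tags = {}
    year = re.search(r"\b((?:19|20)\d{2})\b", text)
    price = re.search(r"\$\s?(\d[\d,]*)", text)
    mileage = re.search(r"(\d[\d,]*)\s*miles\b", text, re.IGNORECASE)
    make = next((w for w in words(text) if w in CAR_MAKES), None)
    if year:
        tags["year"] = int(year.group(1))
    if price:
        tags["price"] = to_int(price.group(1))
    if mileage:
        tags["mileage"] = to_int(mileage.group(1))
    if make:
        tags["make"] = make
    return tags


def keep_valid(tags):
    """Stage 3: drop numeric values outside an assumed plausible range."""
    limits = {
        "year": (1950, 2030),
        "price": (1, 10_000_000),
        "mileage": (0, 1_000_000),
    }
    return {
        key: value
        for key, value in tags.items()
        if key not in limits or limits[key][0] <= value <= limits[key][1]
    }


def extract_record(text):
    """Run the three stages and return a flat record (a dict)."""
    record = {"domain": classify_domain(text)}
    if record["domain"] == "Cars-for-Sale":
        record.update(keep_valid(tag_car_ad(text)))
    return record


if __name__ == "__main__":
    ad = "2008 Honda Civic sedan, 85,000 miles, runs great. $6,500 OBO"
    print(extract_record(ad))
```

Only the Cars-for-Sale domain has a tagger in this sketch; an ad classified as Job-Postings yields a record containing just its domain. A learned classifier and tagger would replace the vocabulary lookup and the patterns while leaving the overall record-building flow unchanged.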

Bibliographic details

  • Source
    Information Systems | 2013, Issue 2 | pp. 183-197 | 15 pages
  • Author affiliations

    Computer Science Department, Brigham Young University, Provo, UT 84602, United States;

    Computer Science Department, Brigham Young University, Provo, UT 84602, United States;

    Computer Science Department, Brigham Young University, Provo, UT 84602, United States;

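  • Indexed in: Science Citation Index (SCI), USA; Engineering Index (EI), USA
  • Original format: PDF
  • Language of text: English
  • Chinese Library Classification
  • Keywords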

    data extraction; classification; keyword tagging; advertisement;
