Published in: Expert Systems with Applications

Website categorization: A formal approach and robustness analysis in the case of e-commerce detection

Abstract

Website categorization has recently emerged as a very important task in several contexts. A huge amount of information is freely available through websites, and it could be used, for example, to carry out statistical surveys at lower cost. However, the information of interest for a specific categorization has to be mined from that huge amount, which turns out to be a difficult task in practice. In this work we propose a practically viable procedure to perform website categorization, based on the automatic generation of data records summarizing the content of each entire website. These records are obtained by using web scraping and optical character recognition, followed by a number of nontrivial text mining and feature engineering steps. Once such records have been produced, we use classification algorithms to categorize the websites according to the aspect of interest. For this task we compare Convolutional Neural Networks, Support Vector Machines, Random Forest and Logistic classifiers. Since in many practical cases the training set labels are inherently noisy, we analyze the robustness of each technique with respect to the presence of misclassified training records. We present results on real-world data for the problem of detecting websites that provide e-commerce facilities; however, our approach is not structurally limited to this case. (C) 2019 Elsevier Ltd. All rights reserved.
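
The robustness analysis mentioned in the abstract can be illustrated with a minimal sketch. It assumes binary labels (1 = e-commerce, 0 = not e-commerce) and simulates misclassified training records by flipping a chosen fraction of them. The classifier is a toy token-voting model used only as a stand-in; it is not one of the four techniques the authors compare, and the function names `flip_labels`, `train_token_vote` and `robustness_curve` are hypothetical.

```python
import random
from collections import Counter
from typing import Callable, List, Sequence, Tuple

Predictor = Callable[[str], int]


def flip_labels(labels: Sequence[int], rate: float, seed: int = 0) -> List[int]:
    """Return a copy of binary labels with round(rate * n) entries flipped."""
    if not 0.0 <= rate <= 1.0:
        raise ValueError("rate must be in [0, 1]")
    rng = random.Random(seed)
    noisy = list(labels)
    n_flip = round(rate * len(noisy))
    # Pick distinct positions so that exactly n_flip records are mislabeled.
    for i in rng.sample(range(len(noisy)), n_flip):
        noisy[i] = 1 - noisy[i]
    return noisy


def train_token_vote(texts: Sequence[str], labels: Sequence[int]) -> Predictor:
    """Toy stand-in classifier: each token votes +1 or -1 per training text."""
    score: Counter = Counter()
    for text, label in zip(texts, labels):
        for tok in set(text.lower().split()):
            score[tok] += 1 if label == 1 else -1

    def predict(text: str) -> int:
        total = sum(score[tok] for tok in set(text.lower().split()))
        return 1 if total > 0 else 0

    return predict


def robustness_curve(
    train_fn: Callable[[Sequence[str], Sequence[int]], Predictor],
    x_train: Sequence[str],
    y_train: Sequence[int],
    x_test: Sequence[str],
    y_test: Sequence[int],
    rates: Sequence[float],
    seed: int = 0,
) -> List[Tuple[float, float]]:
    """Train on increasingly noisy labels; report clean-test accuracy per rate."""
    curve = []
    for rate in rates:
        predict = train_fn(x_train, flip_labels(y_train, rate, seed))
        correct = sum(predict(x) == y for x, y in zip(x_test, y_test))
        curve.append((rate, correct / len(y_test)))
    return curve
```

Any training function with the same signature could be passed as `train_fn`, so the same loop could wrap a real SVM, Random Forest, logistic or neural model. The test labels are left clean so that the curve reflects only the effect of noise in the training set.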