International Conference on Advanced Computing and Applications
A Cross-Checking Based Method for Fraudulent Detection on E-Commercial Crawling Data


Abstract

Marketing research that collects data from e-commercial websites carries a latent risk of receiving inaccurate data that have been modified before being returned, especially when the crawling processes are carried out by third-party service providers. This risk of data modification is often dismissed in related research on web crawling systems. Avoiding the problem requires an examination phase in which the data are collected a second time for comparison. However, the cost of re-crawling simply to examine all the data is significant, as it doubles the original cost. In this paper, we introduce an efficient approach for choosing the potential data that are most likely to have been modified for later re-crawling. With this approach, we can reduce the cost of examination while still guaranteeing the authenticity of the data. We then measure the efficiency of our scheme by testing its ability to detect fraudulent data in a dataset containing simulated modified data. Results show that our scheme can considerably reduce the amount of data to be re-crawled while still covering most of the fraudulent data. As an example, by applying our scheme to select the data to be re-crawled from a real-world e-commercial website, on a set in which fraudulent data account for 50 percent, we only need to re-collect 50 percent of the total data to detect up to 80 percent of the fraudulent data, which is clearly more efficient than randomly choosing the same amount of data to re-crawl. We conclude by discussing the accuracy measurement of the proposed model.
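The cross-checking idea described in the abstract can be sketched in a few lines: rank records by some suspicion score, re-crawl only the top-ranked ones within a budget, and flag any record whose delivered values disagree with the re-crawled ground truth. This is an illustrative sketch only; the scoring function, field names, and data layout here are hypothetical assumptions, not the paper's actual selection model.

```python
# Illustrative cross-checking sketch (not the paper's actual model).
# `score` is a hypothetical suspicion function supplied by the caller.

def select_for_recrawl(item_ids, score, budget):
    """Pick the `budget` items with the highest suspicion score."""
    ranked = sorted(item_ids, key=score, reverse=True)
    return ranked[:budget]

def cross_check(delivered, recrawled, fields=("price",)):
    """Return ids whose delivered values disagree with re-crawled ones."""
    flagged = []
    for item_id, truth in recrawled.items():
        original = delivered.get(item_id)
        if original is None:
            continue  # item not in the delivered batch; nothing to compare
        if any(original.get(f) != truth.get(f) for f in fields):
            flagged.append(item_id)
    return flagged

# Toy data: the delivered price of item "b" was tampered with.
delivered = {"a": {"price": 10}, "b": {"price": 7}}
recrawled = {"a": {"price": 10}, "b": {"price": 12}}
print(cross_check(delivered, recrawled))  # → ['b']
```

The re-crawl budget is the knob the paper's 50/80-percent trade-off turns on: a better suspicion score lets a smaller budget cover more of the fraudulent records than random selection would.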
