Published in: ACM Transactions on the Web

A Comprehensive Study of Techniques for URL-Based Web Page Language Classification

Abstract

Given only the URL of a Web page, can we identify its language? In this article we examine this question. URL-based language classification is useful when the content of the Web page is not available or when downloading the content would waste bandwidth and time. We built URL-based language classifiers for English, German, French, Spanish, and Italian by applying a variety of algorithms and features. As algorithms we used machine learning algorithms that are widely applied to text classification, as well as state-of-the-art algorithms for language identification of text. As features we used words, n-grams of various sizes, and custom-made features (our novel feature set). We compared our approaches with two baseline methods: classification by country code top-level domain and classification by the IP address of the hosting Web server. We trained and tested our classifiers in a 10-fold cross-validation setup on a dataset obtained from the Open Directory Project and from querying a commercial search engine. With the best-performing classifiers, we obtained the lowest F1-measure for English (94) and the highest F1-measure for German (98). We also evaluated the performance of our methods (i) on a set of Web pages written in Adobe Flash and (ii) as part of a language-focused crawler. In the first case, the content of the Web page is hard to extract; in the second case, downloading pages of the "wrong" language constitutes a waste of bandwidth. In both settings the best classifiers perform well, with an F1-measure between 95 (for English) and 98 (for Italian) for the Adobe Flash pages and a precision between 90 (for Italian) and 97 (for French) for the language-focused crawler.
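To make the two ingredients named in the abstract more concrete, the following is a minimal Python sketch of a country code top-level domain baseline and of word and character n-gram extraction from a URL. It is an illustration under stated assumptions, not the authors' implementation: the mapping `CCTLD_TO_LANG`, the function names, and the tokenization rule (split on non-letter characters) are hypothetical, and the abstract does not specify how the features were actually computed.

```python
import re
from urllib.parse import urlparse

# Hypothetical ccTLD-to-language mapping, for illustration only;
# the abstract does not list the mapping used in the study.
CCTLD_TO_LANG = {
    "de": "German",
    "fr": "French",
    "es": "Spanish",
    "it": "Italian",
    "uk": "English",
}


def cctld_baseline(url):
    """Guess the language from the last label of the host name, or None."""
    host = urlparse(url).hostname or ""
    tld = host.rsplit(".", 1)[-1].lower()
    return CCTLD_TO_LANG.get(tld)


def url_tokens(url):
    """Split a URL into lowercase alphabetic tokens (assumed tokenization)."""
    return [t for t in re.split(r"[^a-z]+", url.lower()) if t]


def char_ngrams(token, n):
    """Return all contiguous character n-grams of a token."""
    return [token[i:i + n] for i in range(len(token) - n + 1)]


if __name__ == "__main__":
    url = "http://www.example.de/nachrichten/wetter"
    print(cctld_baseline(url))
    print(url_tokens(url))
    print(char_ngrams("wetter", 3))
```

In a setup like the one the abstract describes, token and n-gram counts of this kind would serve as input features to a text classifier, while the ccTLD lookup stands alone as a baseline that returns no answer for generic domains such as `.com`.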