Journal: ACM Transactions on Information Systems

Comparing the Archival Rate of Arabic, English, Danish, and Korean Language Web Pages


Abstract

It has long been suspected that web archives and search engines favor Western and English-language webpages. In this article, we quantitatively explore how well Arabic-language webpages are indexed and archived compared to those in other languages. We began by sampling 15,092 unique URIs from three website directories: DMOZ (multilingual), Raddadi, and Star28 (the last two primarily Arabic language). Using language identification tools, we eliminated pages not in Arabic (e.g., English-language versions of Aljazeera pages), which culled the collection to 7,976 Arabic-language webpages. Starting from these 7,976 pages, we crawled the live web and web archives to produce a collection of 300,646 Arabic-language pages. We then compared the results for Arabic-language pages with those for English, Danish, and Korean-language pages. For each of these languages, we sampled unique URIs from DMOZ, kept only pages in the desired language using language identification tools, and crawled the archived and live web to collect a larger sample of pages. In total, we analyzed over 500,000 webpages across the four languages. We discovered: (1) English has the highest archiving rate of the four languages, with 72.04% of English URIs archived. Arabic has a higher archiving rate than Danish and Korean, with 53.36% of Arabic URIs archived, followed by Danish and Korean with 35.89% and 32.81% archived, respectively. (2) Most Arabic and English-language pages are located in the United States; only 14.84% of the Arabic URIs had an Arabic country code top-level domain (e.g., .sa) and only 10.53% had a GeoIP in an Arabic country. Most Danish-language pages were located in Denmark, and most Korean-language pages were located in South Korea. (3) The presence of a webpage in a directory positively impacts indexing, and presence in the DMOZ directory specifically positively impacts archiving, in all four languages.
In this work, we show that web archives and search engines favor English pages. This bias does not extend to all Western-language webpages, however: Arabic webpages have a higher archival rate than Danish-language webpages.
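The percentages reported in findings (1) and (2) are simple proportions over a set of URIs. The following is a minimal sketch of that arithmetic, not the authors' actual pipeline: the sample URIs, the `has_arabic_cctld` and `percentage` helpers, and the ccTLD subset are all hypothetical (only `.sa` is named in the abstract), and the archived flags are made-up inputs rather than results from any web archive.

```python
from urllib.parse import urlparse

# Illustrative subset of Arabic-country ccTLDs (an assumption);
# only ".sa" is named in the abstract.
ARABIC_CCTLDS = {"sa", "eg", "ae", "jo", "ma"}


def has_arabic_cctld(uri: str) -> bool:
    """Return True if the URI's host ends in one of the listed ccTLDs."""
    host = urlparse(uri).hostname or ""
    return host.rsplit(".", 1)[-1].lower() in ARABIC_CCTLDS


def percentage(flags) -> float:
    """Share of True values in an iterable of booleans, as a percentage."""
    flags = list(flags)
    return 100.0 * sum(flags) / len(flags) if flags else 0.0


# Hypothetical sample: (URI, whether at least one archived copy was found).
sample = [
    ("http://example.sa/news", True),
    ("http://example.com/ar/", True),
    ("http://example.org/ar/index.html", False),
    ("http://example.eg/", False),
]

archived_rate = percentage(archived for _, archived in sample)
cctld_rate = percentage(has_arabic_cctld(uri) for uri, _ in sample)
print(f"archived: {archived_rate:.2f}%  Arabic ccTLD: {cctld_rate:.2f}%")
```

In a real study the archived flag would come from querying web archives for each URI, and the language label from a language identification tool; both steps are left out here because the abstract does not specify which tools or services were used.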
