Conference: International Conference on Application of Information and Communication Technologies

Text Analysis Case Study: Determining Word Frequency based on Azerbaijan top 500 websites.


Abstract

Word Frequency Distribution (WFD) is one of the most important sub-areas of Natural Language Processing (NLP) and Computational Linguistics. The reliability and quality of WFD results depend heavily on the size and quality of the corpora. This paper describes an ongoing project that aims to build AzWebCorpus, a corpus of Azerbaijani text. The top 500 websites in Azerbaijan are used as the text source for corpus building. Most of the essential tools, including a Web Crawler, Text Cleaner, and Tokenizer, have been developed, and several open-source tools have been used. Moreover, AzWebCorpus is compared in terms of word frequency with another corpus, AzBookCorpus, which is built on text taken from electronic books. The approach used in this paper is applicable to other languages.
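To make the word-frequency step concrete, the following is a minimal sketch in Python of tokenizing text and counting word frequencies. It is not the authors' tooling: the function names, the token pattern, and the sample strings are assumptions made for illustration. The one language-specific detail it handles is Azerbaijani casing, where "I" lowercases to dotless "ı" and "İ" lowercases to "i".

```python
import re
from collections import Counter

# Assumed token pattern (not from the paper): runs of Unicode letters,
# optionally joined by an internal apostrophe or hyphen.
TOKEN_RE = re.compile(r"[^\W\d_]+(?:['’-][^\W\d_]+)*")


def az_lower(text):
    """Lowercase text using Azerbaijani rules for I/ı and İ/i."""
    # Map the two special capitals first, then lowercase everything else.
    return text.replace("I", "ı").replace("İ", "i").lower()


def tokenize(text):
    """Split text into lowercase word tokens."""
    return TOKEN_RE.findall(az_lower(text))


def word_frequencies(documents):
    """Count token occurrences across an iterable of document strings."""
    counts = Counter()
    for doc in documents:
        counts.update(tokenize(doc))
    return counts


def relative_frequencies(counts):
    """Convert raw counts to proportions so corpora of different sizes compare."""
    total = sum(counts.values())
    if total == 0:
        return {}
    return {word: n / total for word, n in counts.items()}


if __name__ == "__main__":
    # Hypothetical sample documents standing in for cleaned web pages.
    sample = ["Bakı şəhəri böyük şəhərdir.", "Bakı İçərişəhər"]
    freqs = word_frequencies(sample)
    for word, n in freqs.most_common(3):
        print(word, n)
```

Relative rather than raw frequencies are the natural basis for comparing two corpora of different sizes, such as a web corpus and a book corpus, since raw counts scale with corpus size.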
