Published in: Advances in Information Systems (conference proceedings)

A 300 MB Turkish Corpus and Word Analysis

Abstract

To determine the properties of a language, a corpus of that language is needed. To analyze Turkish, a corpus of about 300 MB containing more than 44 million words was first compiled from 10 websites with Turkish content. Statistics for the most frequently used Turkish words were calculated from this corpus. The frequencies of the seven most frequent words were compared with those of their English equivalents, and it was found that the most frequently used words in natural languages are not nouns. The most frequently used words of 1 to 5 letters were then determined and applied to a randomly selected text to test the validity of the process.
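The abstract does not say how the counts were produced, so the following is only a minimal sketch of the kind of analysis it describes: counting word frequencies, ranking the most frequent words of 1 to 5 letters, and measuring how much of a separate text those words cover. The function names (`turkish_lower`, `tokenize`, `top_words_by_length`, `coverage`), the tokenization rule, and the toy corpus are all assumptions made for illustration, not the authors' method or data.

```python
import re
from collections import Counter

# Assumption: a "word" is a maximal run of Unicode letters.
WORD_RE = re.compile(r"[^\W\d_]+")


def turkish_lower(text):
    # Turkish casing differs from the default: "I" lowers to dotless "ı"
    # and "İ" lowers to "i", so map those two letters before str.lower().
    return text.replace("I", "ı").replace("İ", "i").lower()


def tokenize(text):
    # Lowercase with Turkish rules, then extract letter-only tokens.
    return WORD_RE.findall(turkish_lower(text))


def top_words_by_length(counts, max_len=5, k=3):
    # For each length 1..max_len, return the k most frequent words.
    # Ties are broken alphabetically so the result is deterministic.
    result = {}
    for n in range(1, max_len + 1):
        ranked = sorted(
            ((w, c) for w, c in counts.items() if len(w) == n),
            key=lambda wc: (-wc[1], wc[0]),
        )
        result[n] = ranked[:k]
    return result


def coverage(words, tokens):
    # Fraction of tokens in a test text that belong to the given word list.
    if not tokens:
        return 0.0
    word_set = set(words)
    return sum(1 for t in tokens if t in word_set) / len(tokens)


# Toy stand-in for the corpus; the real one is about 300 MB of web text.
corpus = "Bu bir deneme ve bu bir örnek. Bir kitap ve bir kalem."
counts = Counter(tokenize(corpus))
by_length = top_words_by_length(counts)

# Apply the frequent short words to a separate sample text.
frequent = [w for ranked in by_length.values() for w, _ in ranked]
sample_tokens = tokenize("Bir elma ve bir armut")
print(counts.most_common(3))
print(coverage(frequent, sample_tokens))
```

For a corpus of the size reported here, the same counting would typically be done by streaming the files line by line and updating a single `Counter`, rather than loading all of the text into memory at once.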
