Information Technology: Coding and Computing, 2004. Proceedings. ITCC 2004. International Conference on

Turkish word n-gram analyzing algorithms for a large scale Turkish corpus - TurCo

Abstract

To calculate the statistical properties of a language, one first needs samples of that language; such a collection of samples is called a corpus. An unbalanced, large-scale Turkish text corpus (TurCo) of ~362 MB and more than 50 million words was prepared from 12 different sources, including Web sites and novels in Turkish. Different algorithms were tested for obtaining the n-gram values (1 ≤ n ≤ 5). The efficiency of each algorithm was examined by applying it to each piece of the corpus in turn. Detailed results are given only for the two algorithms that do not use database tables, because all the other algorithms take more than one day to run, which makes testing them impractical.
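For readers unfamiliar with word n-grams, the following is a minimal in-memory sketch of counting them for 1 ≤ n ≤ 5. It is not one of the algorithms evaluated in the paper, which the abstract does not describe; the function names, the tokenization rule, and the sample sentence are all assumptions made for illustration.

```python
import re
from collections import Counter


def word_ngrams(tokens, n):
    """Yield every run of n consecutive tokens as a tuple."""
    for i in range(len(tokens) - n + 1):
        yield tuple(tokens[i:i + n])


def count_ngrams(text, max_n=5):
    """Return a dict mapping n to a Counter of word n-gram frequencies.

    Assumption: tokens are runs of Unicode word characters, lowercased
    with str.lower(). That method is not Turkish-aware (dotted and
    dotless I), so real Turkish text would need custom case folding.
    """
    tokens = re.findall(r"\w+", text.lower())
    return {n: Counter(word_ngrams(tokens, n)) for n in range(1, max_n + 1)}


# Hypothetical sample text, not taken from TurCo.
sample = "bir dil bir insan iki dil iki insan"
counts = count_ngrams(sample)

# Frequency of one unigram and one bigram.
print(counts[1][("bir",)])
print(counts[2][("bir", "dil")])
```

Holding every counter in memory like this is only practical for small inputs. A corpus of more than 50 million words would probably need the counts to be built piece by piece or spilled to disk, which may be one reason the paper compares the running times of several approaches.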