BEST Corpus Development and Analysis

Abstract

This document describes the development process of the BEST 2009 word-segmented corpus, the first corpus designed for benchmarking Thai word segmentation software. The corpus covers four genres: news, novels, encyclopedia entries, and academic articles. It contains 509 files with a total size of 64.1 MB. There are 5,036,229 tokens, of which 83,027 are unique. A total of 4,556 tokens appear in all four genres, and together they cover 85.13% of the corpus. The most frequent token in the corpus is ที่ /thi2/. The 50 most frequent tokens cover 37.65% of the corpus, and the 119 most frequent tokens account for about 50% of it. All tokens are grouped into 8 categories. Apart from the Thai spelling category, each of the other categories plays a major part in specific genres.
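
The coverage figures in the abstract (the share of the corpus accounted for by the top 50 or top 119 tokens, and the tokens common to all genres) can be illustrated with a minimal Python sketch. It assumes the corpus has already been split into a list of tokens; the function names and the toy data are hypothetical and are not taken from the paper or from any BEST tooling.

```python
from collections import Counter


def coverage_of_top_k(tokens, k):
    """Fraction of all token occurrences covered by the k most frequent types."""
    counts = Counter(tokens)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return sum(c for _, c in counts.most_common(k)) / total


def smallest_k_for_coverage(tokens, target):
    """Smallest number of most-frequent types whose occurrences reach `target`."""
    counts = Counter(tokens)
    total = sum(counts.values())
    running = 0
    for k, (_, c) in enumerate(counts.most_common(), start=1):
        running += c
        if running / total >= target:
            return k
    return len(counts)  # target not reachable (only happens for an empty list)


def common_types(tokens_by_genre):
    """Set of types that occur in every genre (hypothetical genre -> tokens map)."""
    sets = [set(toks) for toks in tokens_by_genre.values()]
    return set.intersection(*sets) if sets else set()


# Toy data standing in for a segmented corpus (assumption, not real BEST data).
toy = "a a a b b c".split()
print(coverage_of_top_k(toy, 1))          # share covered by the single most frequent type
print(smallest_k_for_coverage(toy, 0.8))  # how many top types are needed for 80% coverage
```

On the real corpus, the same calculation with k = 50 would correspond to the 37.65% figure reported above, and a target of 0.5 would correspond to the 119-token figure.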

Bibliographic details

Source
Asian Language Processing, 2009. IALP '09 | 2009 | pp. 322-327 | 6 pages
Conference location: Singapore (SG)
Authors
Boriboon Monthika; Kriengket Kanyanut; Chootrakool Patcharika; Phaholphinyo Sitthaa; Purodakananda Sumonmas; Thanakulwarapas Tipraporn; Kosawat Krit
Original format: PDF
Keywords
Thai language; corpus annotation; word-segmented corpus

