Conference proceedings: Asian Language Processing, 2009. IALP '09

BEST Corpus Development and Analysis



Abstract

This document describes the development process of the BEST 2009 word-segmented corpus. It is the first corpus designed to benchmark Thai word segmentation software. The corpus covers four genres: news, novels, encyclopedia entries, and academic articles. It contains 509 files totaling 64.1 MB. There are 5,036,229 tokens, of which 83,027 are unique. A total of 4,556 tokens appear in all four genres, and together they cover 85.13% of the corpus. The most frequent token in the corpus is ที่ /thi2/. The 50 most frequent tokens cover 37.65% of the corpus, and the 119 most frequent tokens account for about 50% of it. All tokens are grouped into 8 categories. Except for the Thai spelling category, the categories play different major roles in specific genres.
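The coverage figures in the abstract (the share of the corpus covered by the top-k tokens, and by the tokens common to all genres) can be computed roughly as in the sketch below. It is not the authors' procedure. The function names `coverage_of_top_k` and `common_type_coverage` and the toy data are hypothetical, and the sketch assumes each genre has already been split into a list of tokens.

```python
from collections import Counter


def coverage_of_top_k(tokens, k):
    """Fraction of all token occurrences covered by the k most frequent tokens."""
    counts = Counter(tokens)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return sum(count for _, count in counts.most_common(k)) / total


def common_type_coverage(genres):
    """Return the tokens found in every genre and the share of the corpus they cover.

    `genres` maps a genre name to its list of tokens (hypothetical layout).
    """
    if not genres:
        return set(), 0.0
    common = set.intersection(*(set(toks) for toks in genres.values()))
    total = sum(len(toks) for toks in genres.values())
    covered = sum(1 for toks in genres.values() for t in toks if t in common)
    return common, (covered / total if total else 0.0)


# Toy data standing in for two genres; not taken from the BEST corpus.
toy = {
    "news": ["a", "b", "a", "c"],
    "novel": ["a", "d", "a", "b"],
}
all_tokens = [t for toks in toy.values() for t in toks]

top1 = coverage_of_top_k(all_tokens, 1)
common, common_cov = common_type_coverage(toy)
print(top1, sorted(common), common_cov)
```

Applied to the real corpus, `coverage_of_top_k(all_tokens, 50)` would correspond to the 37.65% figure and `common_type_coverage` to the 85.13% figure, provided tokens are counted the same way as in the paper.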
