Development of a Web-Scale Chinese Word N-gram Corpus with Parts of Speech Information


Abstract

The Web provides a large-scale corpus for researchers studying language usage in the real world. Developing a web-scale corpus requires not only substantial computational resources but also considerable effort to handle the wide variation in web text, such as the character encodings encountered when processing Chinese web pages. In this paper, we develop a web-scale Chinese word N-gram corpus with part-of-speech information, called the NTU PN-Gram corpus, from the ClueWeb09 dataset. We focus on character encoding and several Chinese-specific issues, and we report statistics on the dataset. We will make the resulting corpus publicly available to support Chinese language processing.
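
To make the idea of a word N-gram with part-of-speech information concrete, the following is a minimal sketch in Python. It is not the authors' pipeline: it assumes the text has already been segmented and tagged, and the function names (`pos_ngrams`, `count_ngrams`), the tag labels, and the toy sentences are all hypothetical, chosen only for illustration.

```python
from collections import Counter
from typing import Iterable, Iterator, List, Tuple

# A token is a (word, POS tag) pair. The tag names used below are
# illustrative assumptions, not the tag set used by the paper.
Token = Tuple[str, str]


def pos_ngrams(tokens: List[Token], n: int) -> Iterator[Tuple[Token, ...]]:
    """Yield every contiguous run of n (word, tag) pairs in one sentence."""
    for i in range(len(tokens) - n + 1):
        yield tuple(tokens[i:i + n])


def count_ngrams(sentences: Iterable[List[Token]], n: int) -> Counter:
    """Count POS-annotated n-grams over pre-segmented, pre-tagged sentences."""
    counts: Counter = Counter()
    for sentence in sentences:
        counts.update(pos_ngrams(sentence, n))
    return counts


# Toy input standing in for segmented and tagged web text (hypothetical).
sentences = [
    [("我", "PRON"), ("喜歡", "VERB"), ("語言", "NOUN")],
    [("我", "PRON"), ("喜歡", "VERB"), ("音樂", "NOUN")],
]
bigrams = count_ngrams(sentences, 2)
```

Keying each count on the full sequence of (word, tag) pairs keeps the same word string with different tags as separate entries, which is the extra information a POS-annotated N-gram corpus carries over a plain word N-gram corpus.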


Bibliographic record

Source
International Conference on Language Resources and Evaluation | 2012 | pp. 320-324 | 5 pages
Authors
Chi-Hsin Yu; Yi-jie Tang; Hsin-Hsi Chen
Original format: PDF
Keywords
ClueWeb09; encoding detection; part-of-speech n-grams


Similar literature

1. Speech Recognition Using Function-Word N-Grams and Content-Word N-Grams [J]. Ryosuke ISOTANI, Shoichi MATSUNAGA, Shigeki SAGAYAMA. IEICE Transactions on Information and Systems. 1995, No. 6

2. Oxymoron generation using an association word corpus and a large-scale N-gram corpus [J]. Yamane Hiroaki, Hagiwara Masafumi. Soft computing: A fusion of foundations, methodologies and applications. 2015, No. 4

3. Dealing with Out-of vocabulary Words and Filled Pauses in Word N-gram Based Speech Recognition System [J]. ATSUHIKO KAI, YOSHIFUMI HIROSE, SEIICHI NAKAGAWA. 情報処理学会論文誌. 1999, No. 4

4. Development of a Web-Scale Chinese Word N-gram Corpus with Parts of Speech Information [C]. Chi-Hsin Yu, Yi-jie Tang, Hsin-Hsi Chen. International conference on language resources and evaluation. 2012

5. Moving Pictures, Empty Words: Cinema as Developmental Interface in the Chinese Reconstruction, 1932-1952 [D]. Chen, Hongwei. 2017

6. A fine-grained Chinese word segmentation and part-of-speech tagging corpus for clinical text [O]. Ying Xiong, Zhongmin Wang, Dehuan Jiang, 2019

7. The Research of the Maximum Length n-grams Priority Chinese Word Segmentation Method Based on Corpus Type Frequency Information [O]. Pengyu Lu, Lijun Jin, Bin Jiang. 2012
