Conference: IEEE Advanced Information Technology, Electronic and Automation Control Conference

Research on the Construction Method of Chinese-Vietnamese Parallel Corpus

Abstract

Constructing a Chinese-Vietnamese parallel corpus is a basic research problem in natural language processing. Traditional methods use the DOM tree or element anchors in HTML to extract parallel sentences, with low accuracy and slow alignment speed. This paper therefore proposes a new Web-based scheme for constructing a Chinese-Vietnamese parallel corpus. The scheme identifies parallel web pages using LDA (Latent Dirichlet Allocation) with Gibbs sampling. BeautifulSoup and regular expressions are used to crawl the web page text and clean the corpus. The DOM tree and HTML element anchors are used to optimize the extraction of parallel sentence pairs. Dynamic programming, combined with sentence length and the Champollion algorithm, is adopted to improve the accuracy and recall of sentence alignment. The scheme successfully established a million-scale Chinese-Vietnamese parallel corpus.
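
The abstract names length-based dynamic programming for sentence alignment but gives no details. As a rough illustration only, the following Python sketch shows one way such an aligner could look. The bead types, the penalty values, the `ratio` parameter, and the function names `length_cost` and `align` are all illustrative assumptions; this is not the paper's method and it does not reproduce Champollion's lexical scoring.

```python
from typing import List, Tuple

# Allowed alignment "beads": (source sentences used, target sentences used, penalty).
# Penalty values are illustrative assumptions, not taken from the paper.
BEADS = [(1, 1, 0.0), (1, 0, 3.0), (0, 1, 3.0), (2, 1, 1.0), (1, 2, 1.0)]


def length_cost(src_len: int, tgt_len: int, ratio: float) -> float:
    """Relative deviation of the target length from ratio * source length."""
    expected = src_len * ratio
    return abs(expected - tgt_len) / max(expected, tgt_len, 1.0)


def align(src: List[str], tgt: List[str],
          ratio: float = 1.0) -> List[Tuple[List[int], List[int]]]:
    """Return beads as (source indices, target indices) with minimal total cost."""
    n, m = len(src), len(tgt)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0

    # Forward pass: extend every reachable cell by each allowed bead.
    for i in range(n + 1):
        for j in range(m + 1):
            if cost[i][j] == inf:
                continue
            for di, dj, penalty in BEADS:
                ni, nj = i + di, j + dj
                if ni > n or nj > m:
                    continue
                s_len = sum(len(s) for s in src[i:ni])
                t_len = sum(len(t) for t in tgt[j:nj])
                c = cost[i][j] + penalty + length_cost(s_len, t_len, ratio)
                if c < cost[ni][nj]:
                    cost[ni][nj] = c
                    back[ni][nj] = (i, j)

    # Backward pass: follow the stored predecessors from the end to the start.
    beads = []
    i, j = n, m
    while (i, j) != (0, 0):
        pi, pj = back[i][j]
        beads.append((list(range(pi, i)), list(range(pj, j))))
        i, j = pi, pj
    beads.reverse()
    return beads
```

Here `ratio` stands for the expected target-to-source character-length ratio. For a Chinese-Vietnamese pair it would need to be estimated from data, since Vietnamese text is typically much longer in characters than its Chinese counterpart.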