Journal: MATEC Web of Conferences

Spelling Correction for Text Documents in Bahasa Indonesia Using Finite State Automata and Levinshtein Distance Method


Abstract

Any writing mistake in a document can cause its information to be conveyed incorrectly. Most documents today are written on a computer, so spelling correction is needed to catch such mistakes. This design work describes the construction of a spelling corrector for Indonesian-language documents that takes a document's text as input and produces a .txt file as output. A total of 5 000 news articles were used as training data. The methods used include Finite State Automata (FSA), Levenshtein distance, and N-grams. The results are evaluated by perplexity, correction hit rate, and false positive rate. The smallest perplexity, 1.14, is obtained with the unigram model. The highest correction hit rate, 71.20 %, is achieved by both the bigram and trigram models, but the bigram model has the better average processing time of 01:21.23 min. The unigram, bigram, and trigram models share the same false positive rate of 4.15 %. Because of the disadvantages of the FSA method, a modification was made, which raised the bigram correction hit rate to 85.44 %.
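
As a rough illustration of the Levenshtein distance step named in the abstract, the following minimal Python sketch computes the edit distance between two words and picks the closest dictionary word. It is not the authors' implementation: the function names `levenshtein` and `suggest`, the `max_distance` threshold, the tie-break by unigram frequency, and the toy word counts are all assumptions made for this example, and the FSA and bigram/trigram components are not modeled.

```python
from collections import Counter


def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit-cost insertion, deletion, and substitution."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter word in the inner loop
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def suggest(word: str, unigram_counts: Counter, max_distance: int = 2) -> str:
    """Return the nearest known word, or the input if nothing is close enough.

    Assumption: ties on distance are broken by higher unigram count.
    """
    if word in unigram_counts:
        return word  # already a known word, leave it unchanged
    candidates = [
        (levenshtein(word, known), -count, known)
        for known, count in unigram_counts.items()
    ]
    candidates = [c for c in candidates if c[0] <= max_distance]
    return min(candidates)[2] if candidates else word


# Hypothetical toy counts; the paper trains on 5 000 news articles instead.
counts = Counter({"makan": 12, "makin": 5, "minum": 9, "rumah": 20})
print(suggest("makn", counts))
```

In this sketch both "makan" and "makin" are one edit away from "makn", so the frequency tie-break decides the suggestion. An N-gram model like the one described in the abstract would instead use the neighbouring words to choose between such candidates.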
