International Conference on Electrical Engineering and Informatics

Focused Crawling using Dictionary Algorithm with Breadth First and by Page Length Methods for Javanese and Sundanese Corpus Construction



Abstract

The need for a complete corpus is crucial nowadays, especially for linguists. To assist linguists in constructing corpora, a tool for collecting text in a specific language from the Internet is needed. This paper describes an approach to collecting Javanese and Sundanese text from the Internet. We have modified a focused crawler named WebSPHINX so that it can be used for crawling such text. To determine which pages are crawled, the focused crawler needs a language classifier; in this research, we used a dictionary algorithm to classify the text. To determine the next links to visit, we employed two crawling methods: Breadth First and By Page Length. The purpose of our research is to observe how the algorithm and the crawling methods perform in collecting Javanese and Sundanese text from the Internet. Our experiments have shown that the dictionary algorithm classifies the text by language with an average accuracy of 88.64%, depending on the size of the documents being classified. The experiments also showed that, in general, the Breadth First method outperforms the By Page Length method. In this research, we also compared the dictionary algorithm to the N-Gram algorithm when different crawling methods were employed. The experiments showed that the combination of the Breadth First method and the dictionary algorithm generally outperforms the other combinations. Therefore, we used the combination of the Breadth First method and the dictionary algorithm to crawl the text and then construct the Javanese and Sundanese corpora.


