International Conference on Electrical Engineering and Informatics

Focused Crawling using Dictionary Algorithm with Breadth First and by Page Length Methods for Javanese and Sundanese Corpus Construction



Abstract

The need for a complete corpus is crucial nowadays, especially for linguists. To assist linguists in constructing corpora, a tool for collecting text in a specific language from the Internet is needed. This paper describes an approach to collecting Javanese and Sundanese text from the Internet. We have modified a focused crawler named WebSPHINX so that it can be used for crawling such text. To determine which pages are crawled, the focused crawler needs a language classifier; in this research, we used a dictionary algorithm to classify the text. To determine the next links to visit, we employed two crawling methods: Breadth First and By Page Length. The purpose of our research is to observe how the algorithm and the crawling methods perform in collecting Javanese and Sundanese text from the Internet. Our experiments have shown that the dictionary algorithm classifies the text by language with an average accuracy of 88.64%, depending on the size of the documents being classified. The experiments also showed that, in general, the Breadth First method outperforms the By Page Length method. In this research, we also compared the dictionary algorithm to the N-Gram algorithm when different crawling methods were employed. The experiments showed that the combination of the Breadth First method and the dictionary algorithm generally outperforms the other combinations. Therefore, we used the combination of the Breadth First method and the dictionary algorithm to crawl the text and then construct the Javanese and Sundanese corpora.


