Searchable words on the Web


Abstract

In designing data structures for text databases, it is valuable to know how many different words are likely to be encountered in a particular collection. For example, vocabulary accumulation is central to index construction for text database systems; it is useful to be able to estimate the space requirements and performance characteristics of the main-memory data structures used for this task. However, it is not clear how many distinct words will be found in a text collection or whether new words will continue to appear after inspecting large volumes of data. We propose practical definitions of a word and investigate new word occurrences under these models in a large text collection. We inspected around two billion word occurrences in 45 GB of World Wide Web documents and found just over 9.74 million different words in 5.5 million documents; overall, 1 word in 200 was new. We observe that new words continue to occur, even in very large datasets, and that choosing stricter definitions of what constitutes a word has only limited impact on the number of new words found.
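The vocabulary accumulation described above can be sketched in a few lines. This is a minimal illustration, not the authors' method: the abstract does not state the word definitions used, so the two tokenizers below (`loose_words`, `strict_words`), the 16-character length cap, and the toy documents are all assumptions chosen for demonstration.

```python
import re
from typing import Callable, Iterable, List, Tuple

# Two hypothetical word definitions (assumptions, not taken from the paper).
LOOSE = re.compile(r"[A-Za-z0-9]+")   # any run of letters or digits
STRICT = re.compile(r"[A-Za-z]+")     # letters only


def loose_words(text: str) -> List[str]:
    """Case-folded alphanumeric runs."""
    return [w.lower() for w in LOOSE.findall(text)]


def strict_words(text: str, max_len: int = 16) -> List[str]:
    """Case-folded alphabetic runs no longer than max_len (assumed cap)."""
    return [w.lower() for w in STRICT.findall(text) if len(w) <= max_len]


def accumulate(docs: Iterable[str],
               tokenize: Callable[[str], List[str]]) -> Tuple[int, int]:
    """Return (word occurrences, distinct words) over a document stream."""
    vocab = set()
    occurrences = 0
    for doc in docs:
        for word in tokenize(doc):
            occurrences += 1
            vocab.add(word)  # a word is "new" if it was not yet in vocab
    return occurrences, len(vocab)


# Toy documents, invented for illustration only.
docs = [
    "The cat sat on the mat",
    "The dog sat on the log in 2003",
    "A cat and a dog",
]

for name, tok in (("loose", loose_words), ("strict", strict_words)):
    occ, distinct = accumulate(docs, tok)
    print(f"{name}: {occ} occurrences, {distinct} distinct, "
          f"about 1 new word per {occ / distinct:.1f} occurrences")
```

Applying the same ratio to the figures reported in the abstract, roughly 2 billion occurrences divided by 9.74 million distinct words gives about one new word per 205 occurrences, which is consistent with the stated rate of 1 in 200.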