Searchable words on the Web

Hugh E. Williams; Justin Zobel

Abstract

In designing data structures for text databases, it is valuable to know how many different words are likely to be encountered in a particular collection. For example, vocabulary accumulation is central to index construction for text database systems; it is useful to be able to estimate the space requirements and performance characteristics of the main-memory data structures used for this task. However, it is not clear how many distinct words will be found in a text collection or whether new words will continue to appear after inspecting large volumes of data. We propose practical definitions of a word and investigate new word occurrences under these models in a large text collection. We inspected around two billion word occurrences in 45 GB of World Wide Web documents and found just over 9.74 million different words in 5.5 million documents; overall, 1 word in 200 was new. We observe that new words continue to occur, even in very large datasets, and that choosing stricter definitions of what constitutes a word has only limited impact on the number of new words found.
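The vocabulary accumulation described in the abstract can be illustrated with a minimal sketch. The abstract does not state the authors' word definitions or data structures, so everything here is an assumption made for illustration: the word pattern `WORD_RE` (runs of ASCII letters, case-folded), the function name `accumulate_vocabulary`, the `report_every` parameter, and the use of an in-memory Python set as the vocabulary structure.

```python
import re
from typing import Iterable, List, Set, Tuple

# Hypothetical word definition (not taken from the paper):
# a word is a run of ASCII letters, compared case-insensitively.
WORD_RE = re.compile(r"[A-Za-z]+")


def accumulate_vocabulary(
    documents: Iterable[str], report_every: int = 1000
) -> Tuple[Set[str], List[Tuple[int, int]]]:
    """Scan documents in order, collecting distinct words.

    Returns the vocabulary and a list of (occurrences_seen, distinct_words)
    samples taken every `report_every` word occurrences.
    """
    vocabulary: Set[str] = set()
    occurrences = 0
    growth: List[Tuple[int, int]] = []
    for text in documents:
        for match in WORD_RE.finditer(text):
            occurrences += 1
            # Adding an existing word leaves the set unchanged,
            # so only new words grow the vocabulary.
            vocabulary.add(match.group().lower())
            if occurrences % report_every == 0:
                growth.append((occurrences, len(vocabulary)))
    return vocabulary, growth


# Toy input; a real collection would be streamed from disk.
docs = ["the cat sat on the mat", "the dog sat on the log"]
vocab, growth = accumulate_vocabulary(docs, report_every=4)
print(len(vocab))  # 7 distinct words
print(growth)      # [(4, 4), (8, 6), (12, 7)]
```

Dividing the occurrences seen by the vocabulary size gives the overall new-word rate. With the figures reported in the abstract, roughly two billion occurrences and 9.74 million distinct words work out to about one new word per 205 occurrences, which matches the stated "1 word in 200".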

Bibliographic details

Source
International Journal on Digital Libraries | 2005, No. 2 | pp. 99-105 | 7 pages
Authors
Hugh E. Williams; Justin Zobel
Author affiliation

Department of Computer Science, RMIT University, GPO Box 2476V, Melbourne 3001, Australia;

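Indexing information
Original format: PDF
Language: English
Chinese Library Classification: Library science and librarianship; computing technology and computer technology
Keywords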
web search; terms; word occurrences; indexing;

Added to database: 2022-08-18 02:09:05

Similar documents

1. An Integrated Approach for Measuring Semantic Similarity between Words and Sentences using Web Search Engine [J] . Adhikesavan Kavitha, The International Arab Journal of Information Technology . 2015, No. 6

2. Stability-mutation feature identification of Web search keywords based on keyword concentration change ratio [J] . Hongtao, LU, Guanghui, Chinese Journal of Library and Information Science (English edition) . 2014, No. 3

3. Computing Semantic Similarity Measure Between Words Using Web Search Engine [J] . Pushpa C N, Girish S, Nitin S K, Computer Science & Information Technology . 2013, No. 5

4. Linggle: a Web-scale Linguistic Search Engine for Words in Context [C] . Joanne Boisson, Ting-Hui Kao, Jian-Cheng Wu, Annual meeting of the Association for Computational Linguistics . 2013

5. PREDICTING LETTER SEARCH TIME THROUGH WORDS AND NONWORDS: THE ROLES OF STATISTICAL FREQUENCY AND LEXICAL STATUS IN THE WORD-SUPERIORITY EFFECT [D] . DUTCH, SUSAN ELAINE. 1980

6. A Web Search Method Based on the Temporal Relation of Query Keywords [O] . Tomoyo Kage, Kazutoshi Sumiya -1

7. Strategi Pengembangan Industri Kecil Tahu Solo di Desa Punge Blang Cut Kecamatan Meuraxa Kota Banda Aceh [O] . Muhammad Purba, Lukman Hakim, Muhammad Wardhana 2020

8. Emperor's New Password Manager: Security Analysis of Web-based Password Managers. [R] . Li, Z., He, W., Akhawa, D., 2014
