When one sample is not enough

Abstract

Database selection is an important step when searching over large numbers of distributed text databases. The database selection task relies on statistical summaries of the database contents, which are not typically exported by databases. Previous research has developed algorithms for constructing an approximate content summary of a text database from a small document sample extracted via querying. Unfortunately, Zipf's law practically guarantees that content summaries built this way for any relatively large database will fail to cover many low-frequency words. Incomplete content summaries might negatively affect the database selection process, especially for short queries with infrequent words. To improve the coverage of approximate content summaries, we build on the observation that topically similar databases tend to have related vocabularies. Therefore, the approximate content summaries of topically related databases can complement each other and increase their coverage. Specifically, we exploit a (given or derived) hierarchical categorization of the databases and adapt the notion of "shrinkage" (a form of smoothing that has been used successfully for document classification) to the content summary construction task. A thorough evaluation over 315 real web databases as well as over TREC data suggests that the shrinkage-based content summaries are substantially more complete than their "unshrunk" counterparts. We also describe how to modify existing database selection algorithms to adaptively decide, at run time, whether to apply shrinkage for a query. Our experiments, which rely on TREC data sets, queries, and the associated "relevance judgments," show that our shrinkage-based approach significantly improves state-of-the-art database selection algorithms, and also outperforms a recently proposed hierarchical strategy that exploits database classification as well.
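
The shrinkage idea in the abstract can be illustrated with a minimal sketch. It assumes a content summary is a mapping from words to probability estimates and that shrinkage is a weighted mixture of a database's sampled summary with the summaries of its ancestor categories. The function name `shrink_summary`, the toy summaries, and the fixed mixture weights are all hypothetical; the abstract does not state how the weights are actually chosen.

```python
from collections import defaultdict


def shrink_summary(db_summary, ancestor_summaries, weights):
    """Mix a sampled summary with its category ancestors' summaries.

    db_summary and each ancestor summary map word -> estimated probability.
    weights holds one mixture weight per summary (database first) and must
    sum to 1. The weights here are hand-picked for illustration only.
    """
    summaries = [db_summary, *ancestor_summaries]
    if len(weights) != len(summaries):
        raise ValueError("need exactly one weight per summary")
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("mixture weights must sum to 1")

    shrunk = defaultdict(float)
    for weight, summary in zip(weights, summaries):
        for word, prob in summary.items():
            shrunk[word] += weight * prob
    return dict(shrunk)


# Toy data (hypothetical): the sample missed the rare word "metastasis".
sampled = {"cancer": 0.6, "therapy": 0.4}
health = {"cancer": 0.3, "therapy": 0.3, "metastasis": 0.4}
root = {"cancer": 0.1, "therapy": 0.1, "metastasis": 0.1, "football": 0.7}

shrunk = shrink_summary(sampled, [health, root], [0.7, 0.2, 0.1])
print(sorted(shrunk.items()))
```

In this toy run the word missed by the sample receives a small nonzero estimate from the category summaries, which is the coverage gain the abstract describes. Words unrelated to the database (here "football") also pick up some mass, which suggests why deciding per query whether to apply shrinkage can matter.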

Bibliographic details

Source: ACM SIGMOD international conference on Management of data, 2004, pp. 767-778 (12 pages)
Authors: Panagiotis G. Ipeirotis; Luis Gravano
Original format: PDF
Chinese Library Classification: TP274.23

