Source: Proceedings of the Twenty-third International Conference on Very Large Data Bases

Using taxonomy, discriminants, and signatures for navigating in text databases

Abstract

We explore how to organize a text database hierarchically to aid searching and browsing. We propose to exploit the natural hierarchy of topics, or taxonomy, that many corpora enjoy, such as internet directories, digital libraries, and patent databases. In our system, the user navigates the query response not as a flat, unstructured list but embedded in the familiar taxonomy, annotated with document signatures computed dynamically with respect to the user's current location in the taxonomy. We show how to update such databases with new documents quickly and accurately. We use techniques from statistical pattern recognition to efficiently separate the feature words, or discriminants, from the noise words at each node of the taxonomy. Using these, we build a multi-level classifier. At each node, this classifier can ignore the large number of noise words in a document, so it has a small model size and is very fast. Yet, owing to the use of context-sensitive features, the classifier is also very accurate. We report on experiences with the Reuters newswire benchmark, the US Patent database, and web document samples from Yahoo!.
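
The abstract does not give the scoring function or the classifier model, so the following is only a minimal sketch of the general idea under stated assumptions: a Fisher-style ratio (between-class over within-class variance of term frequency) stands in for discriminant selection at each node, and a smoothed naive Bayes model stands in for the per-node classifier. The names `Node`, `fisher_scores`, `train`, `classify`, the parameter `k`, and the toy corpus are all hypothetical and are not taken from the paper.

```python
import math
from collections import Counter, defaultdict

class Node:
    """A taxonomy node; internal nodes hold a small per-node model."""
    def __init__(self, name, children=()):
        self.name = name
        self.children = list(children)
        self.features = set()   # discriminants kept at this node
        self.counts = {}        # child name -> Counter of feature words
        self.priors = {}        # child name -> smoothed prior

def fisher_scores(docs_by_class):
    """Assumed score: between-class / within-class variance of term frequency."""
    freqs, vocab = {}, set()
    for label, docs in docs_by_class.items():
        freqs[label] = []
        for toks in docs:
            n = len(toks) or 1
            freqs[label].append({t: c / n for t, c in Counter(toks).items()})
            vocab.update(toks)
    scores = {}
    for t in vocab:
        means = {c: sum(d.get(t, 0.0) for d in ds) / len(ds) for c, ds in freqs.items()}
        overall = sum(means.values()) / len(means)
        between = sum((m - overall) ** 2 for m in means.values())
        within = sum(sum((d.get(t, 0.0) - means[c]) ** 2 for d in ds) / len(ds)
                     for c, ds in freqs.items())
        scores[t] = between / (within + 1e-9)
    return scores

def train(node, docs, k, depth=0):
    """docs: list of (path, tokens); path[depth] names the child of `node`."""
    if not node.children:
        return
    by_child = defaultdict(list)
    for path, toks in docs:
        by_child[path[depth]].append((path, toks))
    scores = fisher_scores({c: [t for _, t in ds] for c, ds in by_child.items()})
    # Keep only the k highest-scoring words; everything else is treated as noise here.
    node.features = set(sorted(scores, key=lambda t: (-scores[t], t))[:k])
    for child in node.children:
        ds = by_child.get(child.name, [])
        node.counts[child.name] = Counter(
            t for _, toks in ds for t in toks if t in node.features)
        node.priors[child.name] = (len(ds) + 1) / (len(docs) + len(node.children))
        train(child, ds, k, depth + 1)

def classify(node, tokens):
    """Greedy descent: at each node, score children using that node's features only."""
    path = []
    while node.children:
        kept = [t for t in tokens if t in node.features]
        best, best_lp = None, -math.inf
        for child in node.children:
            counts = node.counts[child.name]
            total = sum(counts.values())
            lp = math.log(node.priors[child.name])
            for t in kept:  # Laplace-smoothed word likelihoods
                lp += math.log((counts[t] + 1) / (total + len(node.features)))
            if lp > best_lp:
                best, best_lp = child, lp
        path.append(best.name)
        node = best
    return path

# Toy taxonomy and corpus, invented purely for illustration.
root = Node("root", [Node("Science", [Node("Physics"), Node("Biology")]),
                     Node("Sports", [Node("Soccer"), Node("Tennis")])])
TRAIN = [
    (("Science", "Physics"), "the quantum particle energy the"),
    (("Science", "Physics"), "the particle energy field of"),
    (("Science", "Biology"), "the cell gene protein of"),
    (("Science", "Biology"), "the gene cell organism the"),
    (("Sports", "Soccer"), "the goal match striker of"),
    (("Sports", "Soccer"), "the match goal keeper the"),
    (("Sports", "Tennis"), "the serve match racket of"),
    (("Sports", "Tennis"), "the racket serve set the"),
]
train(root, [(path, text.split()) for path, text in TRAIN], k=8)
print(classify(root, "the quantum energy of the particle".split()))
```

In this sketch the feature set is chosen separately at every node, so a word such as "match" can help separate Sports from Science at the root and still be reweighted when choosing between the children of Sports. Capping each node at `k` features is one simple way to keep the per-node model small; the paper may choose the cutoff differently.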
