
Semantics-based language models for information retrieval and text mining.



Abstract

The language modeling approach centers on the issue of estimating an accurate model by choosing appropriate language models as well as smoothing techniques. In this thesis, we propose a novel context-sensitive semantic smoothing method referred to as the topic signature language model. It extracts explicit topic signatures from a document and then statistically maps them onto individual words in the vocabulary. To support the new language model, we developed two automated algorithms to extract multiword phrases and ontological concepts, respectively, and an EM-based algorithm to learn semantic mapping knowledge from co-occurrence data. The topic signature language model is applied to three tasks: information retrieval, text classification, and text clustering. Evaluations on a news collection and on biomedical literature demonstrate the effectiveness of the topic signature language model.

In the information retrieval experiments, the topic signature language model consistently outperforms the baseline two-stage language model as well as the context-insensitive semantic smoothing method in all configurations. It also beats the state-of-the-art Okapi models in all configurations. In the text classification experiments, when the number of training documents is small, the Bayesian classifier with semantic smoothing not only outperforms the classifiers with background smoothing and Laplace smoothing, but also beats active learning classifiers and SVM classifiers. On the clustering task, whether or not the dataset to be clustered is small, the model-based k-means with semantic smoothing performs significantly better than the model-based k-means with either background smoothing or Laplace smoothing. It is also superior to spherical k-means in terms of effectiveness.

In addition, we empirically show that, within the framework of topic signature language models, the semantic knowledge learned from one collection can be effectively applied to other collections. In the thesis, we also compare three types of topic signatures (i.e., words, multiword phrases, and ontological concepts) with respect to their effectiveness and efficiency for semantic smoothing. In general, it is more expensive to extract multiword phrases and ontological concepts than individual words, but semantic mapping based on multiword phrases and ontological concepts is more effective in handling data sparsity than mapping based on individual words.
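To make the semantic smoothing idea concrete, the following is a minimal sketch, assuming a Dirichlet-smoothed baseline document model and toy probability tables; the function and parameter names (semantic_smoothed_lm, lam, mu) are illustrative and not taken from the thesis. It interpolates the baseline model p_b(w|d) with the topic-signature translation component sum_k p(w|t_k) p(t_k|d), where p(w|t_k) would be learned from topic-signature/word co-occurrence data, e.g., by an EM-based algorithm as described in the abstract.

```python
# Minimal sketch (not the author's code) of context-sensitive semantic smoothing:
# the document model interpolates a background-smoothed unigram model with a
# "topic signature" translation component.

from collections import Counter


def simple_lm(doc_tokens, background, mu=1000.0):
    """Dirichlet-smoothed unigram model p_b(w|d) over the background vocabulary."""
    counts = Counter(doc_tokens)
    n = len(doc_tokens)
    return {w: (counts.get(w, 0) + mu * background[w]) / (n + mu) for w in background}


def semantic_smoothed_lm(doc_tokens, doc_topics, p_word_given_topic, background,
                         lam=0.4, mu=1000.0):
    """p(w|d) = (1 - lam) * p_b(w|d) + lam * sum_k p(w|t_k) * p(t_k|d).

    doc_topics: {topic_signature: p(t_k|d)} for signatures found in the document.
    p_word_given_topic: {topic_signature: {word: prob}} learned from co-occurrence data.
    """
    baseline = simple_lm(doc_tokens, background, mu)
    smoothed = {}
    for w in background:
        translation = sum(p_word_given_topic.get(t, {}).get(w, 0.0) * p_t
                          for t, p_t in doc_topics.items())
        smoothed[w] = (1 - lam) * baseline[w] + lam * translation
    return smoothed


if __name__ == "__main__":
    # Toy background collection model and a short document.
    background = {"gene": 0.2, "protein": 0.2, "expression": 0.2,
                  "retrieval": 0.2, "model": 0.2}
    doc = ["gene", "expression", "model"]
    # Topic signatures extracted from the document, with p(t_k|d).
    topics = {"gene expression": 0.7, "language model": 0.3}
    # Semantic mapping p(w|t_k), hypothetically learned by EM.
    mapping = {"gene expression": {"gene": 0.5, "expression": 0.4, "protein": 0.1},
               "language model": {"model": 0.6, "retrieval": 0.4}}
    lm = semantic_smoothed_lm(doc, topics, mapping, background)
    print({w: round(p, 3) for w in sorted(lm)})
```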

Bibliographic details

  • Author: Zhou, Xiaohua
  • Affiliation: Drexel University
  • Degree-granting institution: Drexel University
  • Subject: Computer Science
  • Degree: Ph.D.
  • Year: 2008
  • Pagination: 155 p.
  • Total pages: 155
  • Format: PDF
  • Language: English
