Investigation of the Quality of Topic Models for Noisy Data Sources

Abstract

Latent Dirichlet Allocation (LDA) has become the most stable and widely used topic model for deriving topics from document collections, and its success varies across input domains. It is therefore essential to evaluate LDA against the quality of its input, because noise and uncertainty in the content degrade the topic model. The main contribution of this investigation is a critical evaluation of LDA based on the quality of the input sources and on human perception. The empirical study shows the relationship between the quality of the input and the accuracy of the output generated by LDA. Perplexity and coherence were evaluated on three data sets (RCV1, a conference data set, and tweets) whose contents differ in complexity and uncertainty. Topics generated by LDA were compared with human-defined topics. The results demonstrate a strong relationship between the quality of the input and the quality of the generated topics: formally written content yielded highly relevant topics, while noisy and messy content led to meaningless topics. A considerable gap was observed between human-defined topics and LDA-generated topics. Finally, a concept-based topic modeling technique is proposed to improve topic quality by capturing the meaning of the content and eliminating irrelevant and meaningless topics.
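
The abstract does not say which coherence measure was used. As a minimal sketch, assuming the document-co-occurrence ("UMass") variant, the score of one topic can be computed from its top words and a tokenized corpus as below. The function name `umass_coherence`, the toy corpora, and the topic words are all hypothetical and are not taken from the paper.

```python
import math


def umass_coherence(top_words, documents):
    """UMass-style coherence of one topic.

    top_words: the topic's top words, most probable first.
    documents: list of tokenized documents (lists of strings).
    Assumption: this is one common coherence definition, not
    necessarily the one used in the paper.
    """
    doc_sets = [set(doc) for doc in documents]

    def doc_freq(*words):
        # Number of documents containing all of the given words.
        return sum(1 for d in doc_sets if all(w in d for w in words))

    score = 0.0
    for m in range(1, len(top_words)):
        for l in range(m):
            denom = doc_freq(top_words[l])
            if denom == 0:
                continue  # word never occurs in the corpus; skip the pair
            co_occurrences = doc_freq(top_words[m], top_words[l])
            # +1 smoothing avoids log(0) when the pair never co-occurs.
            score += math.log((co_occurrences + 1) / denom)
    return score


# Toy corpora, made up for illustration only.
clean_docs = [
    ["market", "stock", "trade", "price"],
    ["stock", "price", "market", "share"],
    ["trade", "market", "price", "bank"],
]
noisy_docs = [
    ["lol", "market", "omg"],
    ["market", "rt", "haha"],
    ["price", "lol", "rt"],
    ["price", "omg", "haha"],
    ["stock", "rt", "lol"],
]

topic = ["market", "price", "stock"]  # hypothetical top words
clean_score = umass_coherence(topic, clean_docs)
noisy_score = umass_coherence(topic, noisy_docs)
print(clean_score, noisy_score)
```

On this toy data the topic scores higher on the clean corpus, because its top words co-occur in the same documents there and never co-occur in the noisy one. This only illustrates the direction of the measure and says nothing about the magnitudes reported in the study.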

Bibliographic record

Source: 2018, pp. 488-493 (6 pages)
Authors: Yue Xu; Yuefeng Li; Dakshi T. Kapugamam Geeganage
Original format: PDF
Keywords: Electrical engineering; Computer science; Coherence

Similar literature

1. Multi-Label Classification from Multiple Noisy Sources Using Topic Models [J]. Divya Padmanabhan, Satyanath Bhat, Shirish Shevade. Information, 2017, no. 2.
2. Near-infrared monitoring of roller compacted ribbon density: Investigating sources of variation contributing to noisy spectral data [J]. Crowley Mary Ellen, Hegarty Avril, McAuliffe Michael A. P. European journal of pharmaceutical sciences, 2017.
3. Domain-aware Mashup service clustering based on LDA topic model from multiple data sources [J]. Cao Buqing, Liu Xiaoqing (Frank), Liu Jianxun. Information and software technology, 2017.
4. Topic detection in noisy data sources [C]. Denecke Kerstin, Brosowski Marko. Fifth International Conference on Digital Information Management, 2010.
5. A prediction modeling framework for noisy welding quality data [D]. Park, Junheung. 2015.
6. Robust heart rate estimation from multiple asynchronous noisy sources using signal quality indices and a Kalman filter [O]. Q Li, R G Mark, G D Clifford.
7. Multi-Label Classification from Multiple Noisy Sources Using Topic Models [O]. Divya Padmanabhan, Satyanath Bhat, Shirish Shevade. 2017.
8. Water-Quality Characteristics and Trends for Selected Sites in or Near the Earth Resources Observation Systems (EROS) Data Center, South Dakota, 1973-2000; Water-resources investigations rept [R]. Neitzert, K. M. 2003.
