Journal: ACM Transactions on Knowledge Discovery from Data

Probabilistic Modeling for Frequency Vectors Using a Flexible Shifted-Scaled Dirichlet Distribution Prior

Abstract

Burstiness and overdispersion in count vectors pose significant challenges for modeling such data accurately. The dependency assumption of the multinomial distribution causes it to fail at modeling frequency vectors in several machine learning and data mining applications, but researchers have found that extending the multinomial distribution to the Dirichlet Compound Multinomial (DCM) addresses both phenomena. However, the Dirichlet distribution is not the best choice of prior, given its negative-correlation and equal-confidence requirements. We therefore propose using a flexible generalization of the Dirichlet distribution, the shifted-scaled Dirichlet, as a prior for the multinomial, which enables the model to better fit real data; we call the new model the Multinomial Shifted-Scaled Dirichlet (MSSD). Because the likelihood function plays a key role in statistical inference, e.g., in maximum likelihood estimation and in investigating the Fisher information matrix, we propose to compute the MSSD log-likelihood more efficiently by approximating it with Bernoulli polynomials and evaluating the approximation with the proposed mesh algorithm. Moreover, given the sparse and high-dimensional nature of count vectors, we propose to further improve computational efficiency by approximating the novel MSSD as a member of the exponential family of distributions, which we call EMSSD. Clustering is based on mixture models, and a model selection approach is seamlessly integrated with parameter estimation during learning. The merits of the proposed approach are validated on challenging real-world applications such as hate speech detection on Twitter, real-time recognition of criminal actions, and anomaly detection in crowded scenes. The results reveal that the proposed clustering frameworks offer a good compromise between other state-of-the-art techniques and outperform other approaches previously used for modeling frequency vectors. In addition, compared to the MSSD, the EMSSD approximation reduces computational complexity in high-dimensional feature spaces.
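
For orientation, the DCM baseline that the abstract builds on has a well-known closed-form likelihood. The following is a minimal sketch of that standard formula only; it is not the MSSD or EMSSD likelihood, nor the mesh algorithm, none of which are specified in this abstract. The function name `dcm_log_likelihood` and its arguments are illustrative assumptions.

```python
import math


def dcm_log_likelihood(counts, alpha):
    """Log-probability of one count vector under the Dirichlet Compound
    Multinomial (standard textbook form; names here are illustrative).

    counts: non-negative integer counts x_1..x_D
    alpha:  positive Dirichlet parameters alpha_1..alpha_D
    """
    if len(counts) != len(alpha):
        raise ValueError("counts and alpha must have the same length")
    if any(a <= 0 for a in alpha):
        raise ValueError("all alpha values must be positive")

    n = sum(counts)        # total number of events in the vector
    a_sum = sum(alpha)     # sum of the Dirichlet parameters

    # log of the multinomial coefficient n! / prod(x_d!)
    log_coef = math.lgamma(n + 1) - sum(math.lgamma(x + 1) for x in counts)
    # log of Gamma(A) / Gamma(A + n)
    log_norm = math.lgamma(a_sum) - math.lgamma(a_sum + n)
    # log of prod Gamma(alpha_d + x_d) / Gamma(alpha_d)
    log_terms = sum(
        math.lgamma(a + x) - math.lgamma(a) for a, x in zip(alpha, counts)
    )
    return log_coef + log_norm + log_terms


if __name__ == "__main__":
    # With a uniform prior alpha = (1, 1), every split of n = 2 events
    # is equally likely, so the probability is about 1/3.
    print(math.exp(dcm_log_likelihood([1, 1], [1.0, 1.0])))
```

Working in log space with `math.lgamma` avoids overflow in the Gamma functions for large counts, which is presumably one reason likelihood computation is a concern for high-dimensional count data.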
