Symbolic data are distributions constructed from data points. When a large dataset can be organised into groups, one may first summarise each group by a symbol and then analyse the resulting symbolic dataset directly. Reducing the dataset to a more manageable size enables exploratory analysis and statistical inference that would otherwise be infeasible for the original large dataset. In the first half of this thesis, we develop a probabilistic approach for constructing likelihood functions for two types of symbolic data: interval-valued data and histogram-valued data. Existing methods ignore the process by which the symbolic data are constructed, namely the aggregation of real-valued data generated from some underlying process. We develop the foundations of likelihood-based statistical inference for random symbols that directly incorporates this generative procedure into the analysis. This permits the direct fitting of models for the underlying real-valued data given only their symbolic summaries. Our approach overcomes several problems associated with existing methods and can jointly model intra- and inter-symbol variation. The new methods are illustrated by simulated and real data analyses.

Latent variable models are powerful tools for extracting unobserved features from large datasets. Well-known examples are latent Dirichlet allocation (LDA) and hierarchical Dirichlet process mixtures (HDP-M) for topic modelling. Collapsed Gibbs samplers are routinely used for Bayesian inference in these models because of their superior chain mixing. In the second half of the thesis, we propose a blocking scheme for collapsed Gibbs samplers for the LDA and HDP-M models that improves chain-mixing efficiency. For the LDA model, we develop an O(log K)-step nested sampling scheme, where K is the number of topics, to simulate the latent variables for each block directly.
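The abstract does not spell out the O(log K) sampler; as a rough illustration of how a topic can be drawn in logarithmically many steps, the sketch below uses a binary tree of partial weight sums, which supports both sampling a topic proportional to its weight and updating a single weight in O(log K) time. This is an assumed, generic construction for illustration only, not the thesis's exact scheme; all names are hypothetical.

```python
import random

class SumTree:
    """Binary (segment) tree over K topic weights. Sampling a topic
    index proportional to its weight, and updating one weight, both
    take O(log K) steps instead of the O(K) of a linear scan."""

    def __init__(self, weights):
        self.K = len(weights)
        self.tree = [0.0] * (2 * self.K)
        # Leaves live at positions K .. 2K-1; internal node i stores
        # the sum of its two children 2i and 2i+1.
        for i, w in enumerate(weights):
            self.tree[self.K + i] = w
        for i in range(self.K - 1, 0, -1):
            self.tree[i] = self.tree[2 * i] + self.tree[2 * i + 1]

    def update(self, k, w):
        """Set the weight of topic k and repair sums up to the root."""
        i = self.K + k
        self.tree[i] = w
        i //= 2
        while i >= 1:
            self.tree[i] = self.tree[2 * i] + self.tree[2 * i + 1]
            i //= 2

    def sample(self, rng=random):
        """Descend from the root, splitting a uniform draw between the
        left and right subtree sums, until a leaf (topic) is reached."""
        u = rng.random() * self.tree[1]
        i = 1
        while i < self.K:
            if u < self.tree[2 * i]:
                i = 2 * i
            else:
                u -= self.tree[2 * i]
                i = 2 * i + 1
        return i - self.K
```

For example, `SumTree([0.0, 2.0, 0.0, 0.0]).sample()` always returns topic 1, and after `update` the root `tree[1]` always holds the current total weight.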
To obtain such a blocking scheme for the HDP-M model, we introduce residual allocation processes (RAP), which construct random partitions induced by Dirichlet processes in a class-wise manner, and we propose a hierarchical RAP for constructing random partitions induced by the HDP. Derived from these residual allocation constructions, the blocking scheme consists of nested sampling of the latent variables for existing topics and residual allocation sampling of the latent variables for new topics. We demonstrate that the blocking scheme achieves substantial improvements in chain mixing and a significant reduction in computation time.
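The residual allocation construction is only named in the abstract; a minimal sketch of the class-wise idea, assuming the standard stick-breaking weights V_k ~ Beta(1, α) of a Dirichlet process DP(α), is given below. Class k keeps each remaining item independently with probability V_k, and the survivors pass to the next class; items sharing a class form one block of the induced partition. The function name and interface are illustrative, not taken from the thesis.

```python
import random

def rap_partition(n, alpha, rng=None):
    """Class-wise (residual allocation) construction of a random
    partition of n items with the DP(alpha)-induced distribution.

    Each item lands in class k with probability
    V_k * prod_{j < k} (1 - V_j), the stick-breaking weight beta_k,
    because it must be passed over by classes 1..k-1 first."""
    rng = rng or random.Random()
    remaining = list(range(n))
    classes = []
    while remaining:
        v = rng.betavariate(1.0, alpha)  # V_k ~ Beta(1, alpha)
        kept, rest = [], []
        for i in remaining:
            (kept if rng.random() < v else rest).append(i)
        if kept:                         # skip classes that catch no item
            classes.append(kept)
        remaining = rest                 # residual items go to class k+1
    return classes
```

Smaller α concentrates the items into fewer, larger classes, since early sticks V_k tend to be longer; larger α spreads them over many small classes.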