Traditional models of distributional semantics suffer from computational issues such as data sparsity for individual lexemes and complexities of modeling semantic composition when dealing with structures larger than single lexical items. In this work, we present a frequency-driven paradigm for robust distributional semantics in terms of semantically cohesive lineal constituents, or motifs. The framework subsumes issues such as differential compositional as well as non-compositional behavior of phrasal con-situents, and circumvents some problems of data sparsity by design. We design a segmentation model to optimally partition a sentence into lineal constituents, which can be used to define distributional contexts that are less noisy, semantically more interpretable, and linguistically dis-ambiguated. Hellinger PCA embeddings learnt using the framework show competitive results on empirical tasks.
展开▼