Non-parametric Bayesian methods for structured topic models

Date

2011

Authors

Du, Lan

Abstract

The proliferation of large electronic document archives requires new techniques for automatically analysing large collections, and this poses several new and interesting research challenges. Topic modelling, a promising statistical technique, has gained significant momentum in recent years in information retrieval, sentiment analysis, image processing, and related fields. Beyond existing topic models, the field still needs to be explored with more powerful tools. One potentially useful direction is to directly consider document structure, ranging from semantically high-level segments (e.g., chapters, sections, or paragraphs) to low-level segments (e.g., sentences or words), in topic modelling. This thesis introduces a family of structured topic models for statistically modelling text documents together with their intrinsic document structures. These models take advantage of non-parametric Bayesian techniques, such as the two-parameter Poisson-Dirichlet process (PDP), and Markov chain Monte Carlo methods.

Two preliminary contributions of this thesis are:

1. The Compound Poisson-Dirichlet process (CPDP): an extension of the PDP that can be applied to multiple input distributions.

2. Two Gibbs sampling algorithms for the PDP in a finite state space: both samplers are based on the Chinese restaurant process, which provides an elegant analogy for incremental sampling from the PDP. The first, a two-stage Gibbs sampler, arises from a table multiplicity representation of the PDP; the second is built on a table indicator representation. In a simple controlled setting of multinomial sampling, both new samplers converge quickly.

These support the major contribution of this thesis, a set of structured topic models:

- Segmented Topic Model (STM), which models a simple document structure with a four-level hierarchy by mapping the document layout to a hierarchical subject structure. It performs significantly better than the latent Dirichlet allocation (LDA) model and other segmented models at predicting unseen words.

- Sequential Latent Dirichlet Allocation (SeqLDA), which is motivated by topical correlations among adjacent segments (i.e., the sequential document structure). This model uses the PDP and a simple first-order Markov chain to link a set of LDAs together, providing a novel approach for exploring topic evolution within each individual document.

- Adaptive Topic Model (AdaTM), which embeds the CPDP in a simple directed acyclic graph to jointly model both hierarchical and sequential document structures. This model demonstrates, in terms of per-word predictive accuracy and topic-distribution profile analysis, that it is beneficial to consider both forms of structure in topic modelling.

- provided by Candidate.
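To illustrate the Chinese restaurant process analogy the abstract refers to, the following is a minimal sketch of sequential seating under the two-parameter process (the incremental view of the PDP). This is not the thesis's Gibbs samplers; the function and parameter names (`crp_seating`, `discount`, `concentration`) are illustrative assumptions. Customer n+1 joins an occupied table k with probability proportional to (n_k - discount) and opens a new table with probability proportional to (concentration + discount * K), where K is the current number of tables:

```python
import random

def crp_seating(n_customers, discount, concentration, seed=0):
    """Sketch of the two-parameter Chinese restaurant process.

    discount in [0, 1); concentration > -discount.
    Returns per-table customer counts and each customer's table index.
    """
    rng = random.Random(seed)
    tables = []       # tables[k] = number of customers seated at table k
    assignments = []  # assignments[n] = table chosen by customer n
    for n in range(n_customers):
        # Unnormalised seating probabilities: existing tables, then a new table.
        weights = [c - discount for c in tables]
        weights.append(concentration + discount * len(tables))
        u = rng.random() * sum(weights)
        k, acc = 0, weights[0]
        while u > acc:
            k += 1
            acc += weights[k]
        if k == len(tables):
            tables.append(1)   # customer opens a new table
        else:
            tables[k] += 1     # customer joins table k
        assignments.append(k)
    return tables, assignments
```

With discount = 0 this reduces to the one-parameter Dirichlet-process restaurant; a positive discount produces the heavier-tailed, power-law table-size behaviour that makes the PDP attractive for text.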

Type

Thesis (PhD)

Access Statement

Open Access
