
Non-parametric bayesian methods for structured topic models

Du, Lan


dc.contributor.author: Du, Lan
dc.date.accessioned: 2018-11-22T00:04:08Z
dc.date.available: 2018-11-22T00:04:08Z
dc.date.copyright: 2011
dc.identifier.other: b3095318
dc.identifier.uri: http://hdl.handle.net/1885/149800
dc.description.abstract: The proliferation of large electronic document archives requires new techniques for automatically analysing large collections, which has posed several new and interesting research challenges. Topic modelling, a promising statistical technique, has gained significant momentum in recent years in information retrieval, sentiment analysis, image processing, etc. Beyond existing topic models, the field of topic modelling still needs to be explored further using more powerful tools. One potentially useful direction is to directly consider the document structure, ranging from semantically high-level segments (e.g., chapters, sections, or paragraphs) to low-level segments (e.g., sentences or words), in topic modelling. This thesis introduces a family of structured topic models for statistically modelling text documents together with their intrinsic document structures. These models take advantage of non-parametric Bayesian techniques (e.g., the two-parameter Poisson-Dirichlet process (PDP)) and Markov chain Monte Carlo methods. Two preliminary contributions of this thesis are:
1. The Compound Poisson-Dirichlet process (CPDP): an extension of the PDP that can be applied to multiple input distributions.
2. Two Gibbs sampling algorithms for the PDP in a finite state space: these two samplers are based on the Chinese restaurant process, which provides an elegant analogy of incremental sampling for the PDP. The first, a two-stage Gibbs sampler, arises from a table multiplicity representation for the PDP. The second is built on top of a table indicator representation. In a simple controlled environment of multinomial sampling, the two new samplers converge quickly.
These support the major contribution of this thesis, which is a set of structured topic models:
Segmented Topic Model (STM), which models a simple document structure with a four-level hierarchy by mapping the document layout to a hierarchical subject structure. It performs significantly better than the latent Dirichlet allocation (LDA) model and other segmented models at predicting unseen words.
Sequential Latent Dirichlet Allocation (SeqLDA), which is motivated by topical correlations among adjacent segments (i.e., the sequential document structure). This new model uses the PDP and a simple first-order Markov chain to link a set of LDAs together. It provides a novel approach for exploring the topic evolution within each individual document.
Adaptive Topic Model (AdaTM), which embeds the CPDP in a simple directed acyclic graph to jointly model both hierarchical and sequential document structures. This new model demonstrates, in terms of per-word predictive accuracy and topic distribution profile analysis, that it is beneficial to consider both forms of structure in topic modelling.
- provided by Candidate.
dc.format.extent: xix, 166 leaves.
dc.language.iso: en_AU
dc.rights: Author retains copyright
dc.subject.lcsh: Mathematical statistics Data processing.
dc.subject.lcsh: Natural language processing (Computer science)
dc.subject.lcsh: Information storage and retrieval systems Data processing.
dc.subject.lcsh: Markov processes Mathematical models
dc.title: Non-parametric bayesian methods for structured topic models
dc.type: Thesis (PhD)
local.description.notes: Thesis (Ph.D.)--Australian National University
dc.date.issued: 2011
local.type.status: Accepted Version
local.contributor.affiliation: Australian National University.
local.identifier.doi: 10.25911/5d5fcb84aa3c8
dc.date.updated: 2018-11-20T00:31:02Z
dcterms.accessRights: Open Access
local.mintdoi: mint
Collections: Open Access Theses

Download

File: b30953182_Du_Lan.pdf
Size: 44.09 MB
Format: Adobe PDF


Items in Open Research are protected by copyright, with all rights reserved, unless otherwise indicated.

Updated:  17 November 2022/ Responsible Officer:  University Librarian/ Page Contact:  Library Systems & Web Coordinator