A segmented topic model based on the two-parameter Poisson-Dirichlet process

Du, Lan; Buntine, Wray; Jin, Huidong (Warren)

Description

Documents come naturally with structure: a section contains paragraphs which itself contains sentences; a blog page contains a sequence of comments and links to related blogs. Structure, of course, implies something about shared topics. In this paper we take the simplest form of structure, a document consisting of multiple segments, as the basis for a new form of topic model. To make this computationally feasible, and to allow the form of collapsed Gibbs sampling that has worked well to date...

dc.contributor.author: Du, Lan
dc.contributor.author: Buntine, Wray
dc.contributor.author: Jin, Huidong (Warren)
dc.date.accessioned: 2015-12-07T22:16:56Z
dc.identifier.issn: 0885-6125
dc.identifier.uri: http://hdl.handle.net/1885/18292
dc.description.abstract: Documents come naturally with structure: a section contains paragraphs which itself contains sentences; a blog page contains a sequence of comments and links to related blogs. Structure, of course, implies something about shared topics. In this paper we take the simplest form of structure, a document consisting of multiple segments, as the basis for a new form of topic model. To make this computationally feasible, and to allow the form of collapsed Gibbs sampling that has worked well to date with topic models, we use the marginalized posterior of a two-parameter Poisson-Dirichlet process (or Pitman-Yor process) to handle the hierarchical modelling. Experiments using either paragraphs or sentences as segments show the method significantly outperforms standard topic models on either whole document or segment, and previous segmented models, based on the held-out perplexity measure.
dc.publisher: Kluwer Academic Publishers
dc.source: Machine Learning
dc.subject: Keywords: Dirichlet process; Document structure; Gibbs sampling; Latent Dirichlet allocation; New forms; Topic model; Internet; Models; Document structure; Latent Dirichlet allocation; Segmented topic model; Two-parameter Poisson-Dirichlet process
dc.title: A segmented topic model based on the two-parameter Poisson-Dirichlet process
dc.type: Journal article
local.description.notes: Imported from ARIES
local.identifier.citationvolume: 81
dc.date.issued: 2010
local.identifier.absfor: 080109 - Pattern Recognition and Data Mining
local.identifier.ariespublication: u3968803xPUB4
local.type.status: Published Version
local.contributor.affiliation: Du, Lan, College of Engineering and Computer Science, ANU
local.contributor.affiliation: Buntine, Wray, College of Engineering and Computer Science, ANU
local.contributor.affiliation: Jin, Huidong (Warren), College of Engineering and Computer Science, ANU
local.description.embargo: 2037-12-31
local.bibliographicCitation.startpage: 5
local.bibliographicCitation.lastpage: 19
local.identifier.doi: 10.1007/s10994-010-5197-4
local.identifier.absseo: 970108 - Expanding Knowledge in the Information and Computing Sciences
dc.date.updated: 2016-02-24T10:21:26Z
local.identifier.scopusID: 2-s2.0-77955656991
local.identifier.thomsonID: 000280845700002
Collections: ANU Research Publications

Download

File: 01_Du_A_segmented_topic_model_based_2010.pdf
Size: 565.71 kB
Format: Adobe PDF


Items in Open Research are protected by copyright, with all rights reserved, unless otherwise indicated.
