Large-alphabet sequence modelling - a comparative study

Shao, Wen

dc.contributor.author: Shao, Wen
dc.date.accessioned: 2019-02-18T23:45:19Z
dc.date.available: 2019-02-18T23:45:19Z
dc.date.copyright: 2014
dc.identifier.other: b3579067
dc.identifier.uri: http://hdl.handle.net/1885/156315
dc.description.abstract: Most raw data is not binary but is drawn from an alphabet that is often large and structured. It is sometimes convenient to work with binarised data sequences, but exploiting the original structure of the data typically improves performance significantly in practical applications. In this thesis, we study Martin-Löf random sequences, which are maximally incompressible, and provide a topological view on the size of the set of random sequences. We also investigate the relationship between binary data compression techniques and natural language text modelling, where the latter uses raw unbinarised data sequences over a large alphabet. We perform an experimental comparative study of the two, including an empirical comparison of Kneser-Ney (KN) variants with the regular Context Tree Weighting (CTW) algorithm and phase CTW, and with large-alphabet CTW using different estimators. We also apply the idea of Hutter's adaptive sparse Dirichlet-multinomial coding to the KN method and provide a heuristic to make the discounting parameter adaptive. KN with this adaptive discounting parameter outperforms the traditional KN method on the Large Calgary corpus.
dc.format.extent: 80 leaves.
dc.subject.lcsh: Character sets (Data processing)
dc.subject.lcsh: Heuristic algorithms
dc.subject.lcsh: Big data
dc.title: Large-alphabet sequence modelling - a comparative study
dc.type: Thesis (MPhil)
local.contributor.supervisor: Hutter, Marcus
local.description.notes: Thesis (M.Phil.)--Australian National University, 2014.
dc.date.issued: 2014
local.contributor.affiliation: Australian National University. Research School of Computer Science
local.identifier.doi: 10.25911/5d51415ba1671
dc.date.updated: 2019-01-10T08:26:26Z
dcterms.accrualMethod: ANU Deposit Copy; Received: 20140909
local.mintdoi: mint
Collections: Open Access Theses

Download

File: b35790672-Shao_W.pdf
Size: 62.81 MB
Format: Adobe PDF


Items in Open Research are protected by copyright, with all rights reserved, unless otherwise indicated.