Large-alphabet sequence modelling - a comparative study

Date

2014

Authors

Shao, Wen

Abstract

Most raw data is not binary but is drawn from an alphabet that is often large and structured. Although it is sometimes convenient to work with a binarised data sequence, exploiting the original structure of the data typically improves performance significantly in practical applications. In this thesis, we study Martin-Löf random sequences, which are maximally incompressible, and provide a topological view of the size of the set of random sequences. We also investigate the relationship between binary data compression techniques and natural language text modelling, where the latter operates on raw, unbinarised data sequences over a large alphabet. We carry out an experimental comparative study of these approaches, including an empirical comparison of Kneser-Ney (KN) variants with the regular Context Tree Weighting (CTW) algorithm, with phase CTW, and with large-alphabet CTW using different estimators. We also apply the idea behind Hutter's adaptive sparse Dirichlet-multinomial coding to the KN method and provide a heuristic that makes the discounting parameter adaptive. KN with this adaptive discounting parameter outperforms the traditional KN method on the Large Calgary corpus.
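
For readers unfamiliar with the discounting parameter mentioned in the abstract, the following is a minimal sketch of the textbook interpolated Kneser-Ney bigram model with a fixed discount, applied directly to unbinarised symbols. It is not the thesis's implementation, and it does not include the adaptive heuristic. The function name `train_kn_bigram`, the default discount of 0.75, and the sample string are all illustrative assumptions.

```python
from collections import Counter, defaultdict


def train_kn_bigram(seq, discount=0.75):
    """Return prob(w, u): an interpolated Kneser-Ney bigram estimate.

    Assumption: this is the standard fixed-discount textbook form,
    used for illustration only (hypothetical helper, 0 < discount <= 1).
    """
    # Count every adjacent symbol pair (u, w) in the sequence.
    bigrams = Counter(zip(seq, seq[1:]))
    context_total = Counter()      # c(u): how often u occurs as a context
    followers = defaultdict(set)   # distinct symbols seen after u
    preceders = defaultdict(set)   # distinct symbols seen before w
    for (u, w), c in bigrams.items():
        context_total[u] += c
        followers[u].add(w)
        preceders[w].add(u)
    n_types = len(bigrams)         # number of distinct bigram types

    def prob(w, u):
        # Continuation probability: share of bigram types that end in w.
        p_cont = len(preceders.get(w, ())) / n_types
        total = context_total.get(u, 0)
        if total == 0:
            # Unseen context: fall back to the continuation distribution.
            return p_cont
        # Subtract the discount from each observed count ...
        discounted = max(bigrams.get((u, w), 0) - discount, 0.0) / total
        # ... and hand the freed mass to the continuation distribution.
        backoff_weight = discount * len(followers[u]) / total
        return discounted + backoff_weight * p_cont

    return prob


# Usage: a character-level model over a short sample string.
text = "abracadabra"
prob = train_kn_bigram(text)
vocab = sorted(set(text[1:]))
total_mass = sum(prob(w, "a") for w in vocab)
```

In this sketch the discount is a single constant shared by all contexts. The abstract describes a heuristic that makes this parameter adaptive, which the sketch does not attempt to reproduce.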

Type

Thesis (MPhil)

DOI

10.25911/5d51415ba1671
