Automatic extraction of topic hierarchies based on WordNet
|Collections||Digital Humanities Australasia: Building, Mapping, Connecting (2012)|
|Title:||Automatic extraction of topic hierarchies based on WordNet|
|Publisher:||Australasian Association for Digital Humanities|
|Citation:||Brey, G. & Vieira, M. (March 2012). Automatic extraction of topic hierarchies based on WordNet. Presentation at the Digital Humanities Australasia 2012: Building, Mapping, Connecting [Conference][aaDH2012]. Canberra, Australia: ANU|
The research described here aims to generate a topic hierarchy automatically from WordNet and to use that hierarchy as the basis for a faceted browser interface, with a collection of 19th-century periodical texts as the test corpus. Our work was motivated by the Castanet algorithm, which was developed for, and successfully applied to, short descriptions of documents. We adapt the algorithm so that it can be applied to the full text of documents. The algorithm for generating the topic hierarchy has three main processes: (1) data preparation, in which the texts are processed so that the information they contain is more easily accessible; (2) target term extraction, in which the terms considered relevant for classifying each text are selected; and (3) topic tree generation, in which the tree is built from the target terms. We evaluated samples of the resulting topic tree and found that over 90% of the topics are relevant, i.e. they clearly illustrate what the articles are about, and the topic hierarchy adequately relates to the content of the articles. Future work will address problems resulting from mis-OCRed words, erroneous disambiguation, and language anachronisms. Faceted browsing interfaces based on topic hierarchies are easy and intuitive to navigate and, as our results demonstrate, topic hierarchies form an appropriate basis for this type of data navigation. We are confident that our approach can be applied successfully to other corpora and should yield even better results where there are no OCR issues to contend with. Since WordNet is available in several languages, it should also be possible to apply our approach to corpora in other languages.
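The three processes named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not give that level of detail. The small `TOY_HYPERNYM_PATHS` table is a hypothetical stand-in for WordNet hypernym paths, the frequency-based term selection is an assumption, and the collapsing of single-child chains in `compress` is an assumption added to keep the tree readable. All function names are illustrative.

```python
from collections import Counter

# Hypothetical stand-in for WordNet: each term maps to one hypernym path,
# listed from the most general concept down to the term itself.
TOY_HYPERNYM_PATHS = {
    "sonnet": ["entity", "communication", "writing", "poem", "sonnet"],
    "ballad": ["entity", "communication", "writing", "poem", "ballad"],
    "novel": ["entity", "communication", "writing", "fiction", "novel"],
    "railway": ["entity", "artifact", "way", "railway"],
}

STOPWORDS = {"the", "a", "of", "and", "on", "in"}


def prepare(text):
    """Data preparation (assumed): lowercase, strip punctuation, drop stopwords."""
    tokens = [t.strip(".,;:!?\"'()").lower() for t in text.split()]
    return [t for t in tokens if t and t not in STOPWORDS]


def extract_target_terms(tokens, top_n=3):
    """Target term extraction (assumed): most frequent tokens with a known path."""
    counts = Counter(t for t in tokens if t in TOY_HYPERNYM_PATHS)
    return [term for term, _ in counts.most_common(top_n)]


def new_node():
    return {"children": {}, "docs": set()}


def build_topic_tree(doc_terms):
    """Topic tree generation: merge the hypernym paths of all target terms."""
    root = new_node()
    for doc_id, terms in doc_terms.items():
        for term in terms:
            node = root
            for label in TOY_HYPERNYM_PATHS[term]:
                node = node["children"].setdefault(label, new_node())
            node["docs"].add(doc_id)  # the document is filed under its term
    return root


def compress(node):
    """Replace each document-free node that has exactly one child by that child."""
    compressed = {}
    for label, child in node["children"].items():
        compress(child)
        while len(child["children"]) == 1 and not child["docs"]:
            label, child = next(iter(child["children"].items()))
        compressed[label] = child
    node["children"] = compressed
    return node


def topic_tree_for(docs):
    """Run the three processes over a mapping of document id to full text."""
    doc_terms = {
        doc_id: extract_target_terms(prepare(text)) for doc_id, text in docs.items()
    }
    return compress(build_topic_tree(doc_terms))
```

Calling `topic_tree_for({"d1": "The sonnet and the ballad: a sonnet of the railway.", "d2": "A novel on the railway"})` files both documents under a shared `railway` topic and groups the two verse forms under `poem`. A real pipeline would also need word-sense disambiguation before choosing a hypernym path, which is one of the error sources the abstract lists for future work.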
|Brey_Automatic2012.pdf||PowerPoint presentation slides||1.63 MB||Adobe PDF|