Automatic extraction of topic hierarchies based on WordNet

Brey, Gerhard; Vieira, Miguel

dc.contributor.author	Brey, Gerhard
dc.contributor.author	Vieira, Miguel
dc.date.accessioned	2012-04-17T23:39:36Z
dc.date.available	2012-04-17T23:39:36Z
dc.date.created	2012-03
dc.identifier.citation	Brey, G. & Vieira, M. (March 2012). Automatic extraction of topic hierarchies based on WordNet. Presentation at the Digital Humanities Australasia 2012: Building, Mapping, Connecting [Conference][aaDH2012]. Canberra, Australia: ANU
dc.identifier.uri	http://hdl.handle.net/1885/8990
dc.description.abstract	The aim of the research described here is the automatic generation of a topic hierarchy, using WordNet as the basis for a faceted browser interface, with a collection of 19th-century periodical texts as the test corpus. Our research was motivated by the Castanet algorithm, which was developed and successfully applied to short descriptions of documents. In our research we adapt the algorithm so that it can be applied to the full text of documents. The algorithm for the automatic generation of the topic hierarchy has three main processes: data preparation, wherein data is prepared so that the information contained within the texts is more easily accessible; target term extraction, wherein terms that are considered relevant to classify each text are selected; and topic tree generation, wherein the tree is built using the target terms. We evaluated samples of the resulting topic tree and found that over 90% of the topics are relevant, i.e. they clearly illustrate what the articles are about, and the topic hierarchy adequately relates to the content of the articles. Future work will address problems resulting from mis-OCRed words, erroneous disambiguation, and language anachronisms. Faceted browsing interfaces based on topic hierarchies are easy and intuitive to navigate, and as our results demonstrate, topic hierarchies form an appropriate basis for this type of data navigation. We are confident that our approach can successfully be applied to other corpora and should yield even better results if there are no OCR issues to contend with. Since WordNet is available in several languages, it should also be possible to apply our approach to corpora in other languages.
dc.description.sponsorship	Australian Academy of the Humanities; the ANU College of Arts and Social Sciences
dc.format.extent	20 slides
dc.format.mimetype	application/pdf
dc.language.iso	en_AU
dc.publisher	Australasian Association for Digital Humanities
dc.relation.ispartof	Australasian Association for Digital Humanities Conference (1st : 2012 : The Australian National University, Canberra, ACT)
dc.rights	Author/s retain copyright
dc.title	Automatic extraction of topic hierarchies based on WordNet
dc.type	Conference presentation
local.description.notes	Inaugural Conference of the Australasian Association for Digital Humanities held 27-30 March, 2012. Presentation given by Jamie Norrish
local.publisher.url	http://aa-dh.org/conference/
local.type.status	Published Version
local.contributor.affiliation	Brey, Gerhard, King's College London, Department of Digital Humanities
local.contributor.affiliation	Vieira, Miguel, King's College London, Department of Digital Humanities
dcterms.accessRights	Open Access
dc.provenance	The copyright is owned by the authors. The conference organisers make no claim over copyright
Collections	Digital Humanities Australasia: Building, Mapping, Connecting (2012)
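
The three processes named in the abstract (data preparation, target term extraction, topic tree generation) can be illustrated with a minimal Python sketch. It is not the authors' implementation. The hypernym paths are a hand-written toy table standing in for WordNet lookups, all function names are hypothetical, frequency-based term selection is an assumed stand-in for the relevance criterion (which the abstract does not specify), and the chain-collapsing step is an added simplification that the abstract does not describe.

```python
import re
from collections import Counter

# ASSUMPTION: toy hypernym paths (root -> term) standing in for WordNet lookups.
TOY_HYPERNYM_PATHS = {
    "locomotive": ["entity", "artifact", "vehicle", "locomotive"],
    "carriage": ["entity", "artifact", "vehicle", "carriage"],
    "sermon": ["entity", "communication", "speech", "sermon"],
}


def prepare(text):
    """Data preparation (simplified): lowercase and split into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def extract_target_terms(tokens, vocabulary, min_count=1):
    """Target term extraction (assumed criterion): keep tokens that occur at
    least min_count times and have an entry in the vocabulary."""
    counts = Counter(tokens)
    return sorted(t for t, c in counts.items() if t in vocabulary and c >= min_count)


def build_topic_tree(terms, paths):
    """Topic tree generation: merge each term's hypernym path into a nested dict."""
    tree = {}
    for term in terms:
        node = tree
        for label in paths[term]:
            node = node.setdefault(label, {})
    return tree


def collapse_single_chains(tree):
    """Added simplification: replace a node that has exactly one child by that child."""
    out = {}
    for label, children in tree.items():
        children = collapse_single_chains(children)
        while len(children) == 1:
            ((label, children),) = children.items()
        out[label] = children
    return out
```

For example, running `prepare` on a short text, passing the tokens to `extract_target_terms` with `TOY_HYPERNYM_PATHS` as the vocabulary, and then calling `build_topic_tree` yields a nested dictionary whose shared hypernyms (here "vehicle") group related terms under one topic. With a real corpus, the toy table would be replaced by lookups against a WordNet installation.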

Download

File	Description	Size	Format	Image
Brey_Automatic2012.pdf	PowerPoint presentation slides	1.63 MB	Adobe PDF