Automatic identification of the most important elements in an XML collection
Authors
Krumpholz, Alexander
Hadad, Amir
Studeny, Nina
Gedeon, Tamas (Tom)
Hawking, David
Publisher
RMIT University
Abstract
An important problem in XML retrieval is determining the most useful element types to retrieve, e.g. book, chapter, section, paragraph or caption. An automated system for doing this could be based on features of element types related to size, depth, frequency of occurrence, etc. We consider a large number of such features and assess their usefulness in predicting the types of elements judged relevant in INEX evaluations for the IEEE and Wikipedia 2006 corpora. For each feature we automatically assign Useful / Not-Useful labels to element types using Fuzzy c-Means Clustering. We then rank the features by the accuracy with which they predict the manual judgments. We find strong overlap between the top-ten most predictive features for the two collections, and that seven features achieve high average accuracy (F-measure > 65%) across them. We hypothesize that an XML retrieval system working on an unlabelled corpus could use these features to decide which retrieval units are most appropriate to return to the user.
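The pipeline outlined in the abstract (cluster each feature into two groups, label one group Useful, then rank features by F-measure against the judgments) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the feature names, the toy values, the set of judged-relevant element types, the fuzzifier `m=2.0`, the rule that the higher-valued cluster is the Useful one, and the helper names `fuzzy_c_means_1d`, `label_useful` and `f_measure`.

```python
def fuzzy_c_means_1d(values, m=2.0, iters=100, tol=1e-9):
    """Two-cluster fuzzy c-means on scalar values; returns the two centres."""
    centers = [min(values), max(values)]  # deterministic start at the extremes
    exp = 2.0 / (m - 1.0)
    for _ in range(iters):
        memberships = []
        for x in values:
            d = [abs(x - c) for c in centers]
            if min(d) == 0.0:
                # Point sits exactly on a centre: give it full membership there.
                hits = [1.0 if dist == 0.0 else 0.0 for dist in d]
                u = [h / sum(hits) for h in hits]
            else:
                u = [1.0 / sum((d[i] / d[j]) ** exp for j in range(2))
                     for i in range(2)]
            memberships.append(u)
        new_centers = []
        for k in range(2):
            weights = [u[k] ** m for u in memberships]
            new_centers.append(
                sum(w * x for w, x in zip(weights, values)) / sum(weights))
        shift = max(abs(a - b) for a, b in zip(centers, new_centers))
        centers = new_centers
        if shift < tol:
            break
    return centers


def label_useful(feature_values):
    """Label element types nearest the higher centre as Useful (assumption)."""
    centers = fuzzy_c_means_1d(list(feature_values.values()))
    low, high = min(centers), max(centers)
    return {name for name, v in feature_values.items()
            if abs(v - high) < abs(v - low)}


def f_measure(predicted, judged):
    """Balanced F-measure of predicted Useful types against judged ones."""
    tp = len(predicted & judged)
    if tp == 0:
        return 0.0
    precision, recall = tp / len(predicted), tp / len(judged)
    return 2 * precision * recall / (precision + recall)


# Hypothetical per-element-type feature values (not data from the paper).
features = {
    "avg_length": {"sec": 1200, "p": 900, "ss1": 1000, "it": 10,
                   "b": 8, "ref": 20, "fig": 30},
    "occurrences": {"sec": 50, "p": 400, "ss1": 60, "it": 300,
                    "b": 500, "ref": 350, "fig": 40},
}
judged_relevant = {"sec", "p", "fig"}  # hypothetical manual judgments

# Rank features by how well their Useful labels match the judgments.
ranking = sorted(
    ((name, f_measure(label_useful(vals), judged_relevant))
     for name, vals in features.items()),
    key=lambda pair: pair[1], reverse=True)
```

In this toy setup `ranking` lists the features from most to least predictive. A real system would also need to decide, per feature, whether high or low values indicate usefulness, which this sketch fixes by assumption.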
Source
Proceedings of the Sixteenth Australasian Document Computing Symposium
Restricted until
2037-12-31