
Data mining methodological weaknesses and suggested fixes

dc.contributor.author: Maindonald, John [en]
dc.date.accessioned: 2025-12-31T18:41:52Z
dc.date.available: 2025-12-31T18:41:52Z
dc.date.issued: 2006 [en]
dc.description.abstract: Predictive accuracy claims should give explicit descriptions of the steps followed, with access to the code used. This allows referees and readers to check for common traps, and to repeat the same steps on other data. Feature selection and/or model selection and/or tuning must be independent of the test data. For use of cross-validation, such steps must be repeated at each fold. Even then, such accuracy assessments have the limitation that the target population, to which results will be applied, is commonly different from the source population. Commonly, it is shifted forward in time, and it may differ in other respects also. A consequence of source/target differences is that highly sophisticated modeling may be pointless or even counter-productive. At best, model effects in the target population may be broadly similar. Investigation of the pattern of changes over time is required. Such studies are unusual in the data mining literature, in part because relevant data have not been available. Several recent investigations are noted that shed interesting light on the comparison between observational and experimental studies, with particular relevance when there is an interest in giving parameter estimates a causal interpretation. Data mining activity would benefit from wider co-operation in the development and deployment of computing tools, and from better integration of those tools into the publication process. [en]
dc.description.status: Peer-reviewed [en]
dc.format.extent: 8 [en]
dc.identifier.issn: 1445-1336 [en]
dc.identifier.scopus: 84870549537 [en]
dc.identifier.uri: https://hdl.handle.net/1885/733797838
dc.language.iso: en [en]
dc.relation.ispartofseries: 5th Australasian Data Mining Conference, AusDM 2006 [en]
dc.source: Conferences in Research and Practice in Information Technology Series [en]
dc.subject: Comparison of algorithms [en]
dc.subject: Data mining [en]
dc.subject: Observational data [en]
dc.subject: Predictive accuracy [en]
dc.subject: Reject inference [en]
dc.subject: Selection bias [en]
dc.subject: Statistics [en]
dc.subject: Target population [en]
dc.title: Data mining methodological weaknesses and suggested fixes [en]
dc.type: Conference paper [en]
dspace.entity.type: Publication [en]
local.bibliographicCitation.lastpage: 16 [en]
local.bibliographicCitation.startpage: 9 [en]
local.contributor.affiliation: Maindonald, John; Mathematics Programs, Mathematical Sciences Institute, ANU College of Systems and Society, The Australian National University [en]
local.identifier.ariespublication: u3488905xPUB43 [en]
local.identifier.citationvolume: 61 [en]
local.identifier.pure: 4b697d30-fc83-4134-9d75-47f218866ac5 [en]
local.identifier.url: https://www.scopus.com/pages/publications/84870549537 [en]
local.type.status: Published [en]
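
Illustrative note: the abstract's requirement that feature selection be independent of the test data, and be repeated at each cross-validation fold, can be sketched as follows. This is a minimal illustration and not code from the paper. The function names (make_noise, select_features, centroid_predict, cv_accuracy), the mean-difference filter, the nearest-centroid classifier, and the data sizes are all assumptions chosen for brevity. The features are pure noise, so an honest estimate should sit near chance, whereas selecting features once on all rows typically yields an optimistic figure.

```python
import random


def make_noise(n, p, seed=0):
    """Pure-noise features with balanced 0/1 labels: no real signal exists."""
    rng = random.Random(seed)
    X = [[rng.gauss(0.0, 1.0) for _ in range(p)] for _ in range(n)]
    y = [i % 2 for i in range(n)]
    return X, y


def select_features(X, y, k):
    """Indices of the k features with the largest absolute difference
    between class means (a simple filter, assumed here for illustration)."""
    pos = [row for row, label in zip(X, y) if label == 1]
    neg = [row for row, label in zip(X, y) if label == 0]
    scores = []
    for j in range(len(X[0])):
        mean_pos = sum(r[j] for r in pos) / len(pos)
        mean_neg = sum(r[j] for r in neg) / len(neg)
        scores.append((abs(mean_pos - mean_neg), j))
    return [j for _, j in sorted(scores, reverse=True)[:k]]


def centroid_predict(X_train, y_train, X_test, features):
    """Nearest-centroid classifier restricted to the chosen features."""
    def centroid(label):
        rows = [r for r, l in zip(X_train, y_train) if l == label]
        return [sum(r[j] for r in rows) / len(rows) for j in features]

    c0, c1 = centroid(0), centroid(1)
    preds = []
    for row in X_test:
        d0 = sum((row[j] - c) ** 2 for j, c in zip(features, c0))
        d1 = sum((row[j] - c) ** 2 for j, c in zip(features, c1))
        preds.append(1 if d1 < d0 else 0)
    return preds


def cv_accuracy(X, y, k, n_folds, select_inside_fold, seed=0):
    """Cross-validated accuracy. If select_inside_fold is False, features are
    chosen once from all rows, so the held-out rows leak into selection."""
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::n_folds] for i in range(n_folds)]
    leaky_features = None if select_inside_fold else select_features(X, y, k)
    correct = 0
    for test_idx in folds:
        held_out = set(test_idx)
        train_idx = [i for i in idx if i not in held_out]
        X_tr = [X[i] for i in train_idx]
        y_tr = [y[i] for i in train_idx]
        if select_inside_fold:
            # Selection sees only the training rows of this fold.
            features = select_features(X_tr, y_tr, k)
        else:
            features = leaky_features
        preds = centroid_predict(X_tr, y_tr, [X[i] for i in test_idx], features)
        correct += sum(p == y[i] for p, i in zip(preds, test_idx))
    return correct / len(X)


if __name__ == "__main__":
    X, y = make_noise(n=40, p=500, seed=1)
    print("selection on all data:", cv_accuracy(X, y, 10, 5, False))
    print("selection per fold:   ", cv_accuracy(X, y, 10, 5, True))
```

Running the script prints both estimates for one simulated data set. The gap between them depends on the seed and on the assumed sizes, so it is worth repeating over several seeds before drawing conclusions.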
