Data mining methodological weaknesses and suggested fixes

Maindonald, John

dc.contributor.author	Maindonald, John	en
dc.date.accessioned	2025-12-31T18:41:52Z
dc.date.available	2025-12-31T18:41:52Z
dc.date.issued	2006	en
dc.description.abstract	Predictive accuracy claims should give explicit descriptions of the steps followed, with access to the code used. This allows referees and readers to check for common traps, and to repeat the same steps on other data. Feature selection and/or model selection and/or tuning must be independent of the test data. For use of cross-validation, such steps must be repeated at each fold. Even then, such accuracy assessments have the limitation that the target population, to which results will be applied, is commonly different from the source population. Commonly, it is shifted forward in time, and it may differ in other respects also. A consequence of source/target differences is that highly sophisticated modeling may be pointless or even counter-productive. At best, model effects in the target population may be broadly similar. Investigation of the pattern of changes over time is required. Such studies are unusual in the data mining literature, in part because relevant data have not been available. Several recent investigations are noted that shed interesting light on the comparison between observational and experimental studies, with particular relevance when there is an interest in giving parameter estimates a causal interpretation. Data mining activity would benefit from wider co-operation in the development and deployment of computing tools, and from better integration of those tools into the publication process.	en
dc.description.status	Peer-reviewed	en
dc.format.extent	8	en
dc.identifier.issn	1445-1336	en
dc.identifier.scopus	84870549537	en
dc.identifier.uri	https://hdl.handle.net/1885/733797838
dc.language.iso	en	en
dc.relation.ispartofseries	5th Australasian Data Mining Conference, AusDM 2006	en
dc.source	Conferences in Research and Practice in Information Technology Series	en
dc.subject	Comparison of algorithms	en
dc.subject	Data mining	en
dc.subject	Observational data	en
dc.subject	Predictive accuracy	en
dc.subject	Reject inference	en
dc.subject	Selection bias	en
dc.subject	Statistics	en
dc.subject	Target population	en
dc.title	Data mining methodological weaknesses and suggested fixes	en
dc.type	Conference paper	en
dspace.entity.type	Publication	en
local.bibliographicCitation.lastpage	16	en
local.bibliographicCitation.startpage	9	en
local.contributor.affiliation	Maindonald, John; Mathematics Programs, Mathematical Sciences Institute, ANU College of Systems and Society, The Australian National University	en
local.identifier.ariespublication	u3488905xPUB43	en
local.identifier.citationvolume	61	en
local.identifier.pure	4b697d30-fc83-4134-9d75-47f218866ac5	en
local.identifier.url	https://www.scopus.com/pages/publications/84870549537	en
local.type.status	Published	en
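
The abstract's point that feature selection must be repeated at each cross-validation fold can be illustrated with a minimal sketch. This is not code from the paper. The helper names (`select_features`, `fit_centroids`, `predict`, `cv_accuracy`), the class-mean-difference ranking, and the nearest-centroid classifier are all assumptions chosen to keep the example small and free of dependencies.

```python
import random
from statistics import mean


def select_features(X, y, k):
    """Rank features by |mean(class 1) - mean(class 0)|; return the top-k indices."""
    scores = []
    for j in range(len(X[0])):
        pos = [row[j] for row, label in zip(X, y) if label == 1]
        neg = [row[j] for row, label in zip(X, y) if label == 0]
        # A feature cannot be scored if one class is absent from the rows given.
        scores.append(abs(mean(pos) - mean(neg)) if pos and neg else 0.0)
    return sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)[:k]


def fit_centroids(X, y, features):
    """Compute one centroid per class, restricted to the selected features."""
    centroids = {}
    for label in set(y):
        rows = [row for row, lab in zip(X, y) if lab == label]
        centroids[label] = [mean(r[j] for r in rows) for j in features]
    return centroids


def predict(centroids, row, features):
    """Assign the class whose centroid is nearest in squared Euclidean distance."""
    def dist(centroid):
        return sum((row[j] - c) ** 2 for j, c in zip(features, centroid))
    return min(centroids, key=lambda label: dist(centroids[label]))


def cv_accuracy(X, y, k_features, n_folds=5, select_inside_fold=True, seed=0):
    """Cross-validated accuracy, with selection either per fold or (leaky) once on all rows."""
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::n_folds] for i in range(n_folds)]
    # Leaky variant: features are chosen once, using rows that later serve as test data.
    global_features = None if select_inside_fold else select_features(X, y, k_features)
    correct = 0
    for test_idx in folds:
        held_out = set(test_idx)
        train_idx = [i for i in idx if i not in held_out]
        X_train = [X[i] for i in train_idx]
        y_train = [y[i] for i in train_idx]
        if select_inside_fold:
            # Honest variant: selection sees only this fold's training rows.
            features = select_features(X_train, y_train, k_features)
        else:
            features = global_features
        centroids = fit_centroids(X_train, y_train, features)
        correct += sum(predict(centroids, X[i], features) == y[i] for i in test_idx)
    return correct / len(X)


if __name__ == "__main__":
    # Pure-noise data (an assumption for illustration): no feature carries real signal.
    rng = random.Random(1)
    X = [[rng.gauss(0, 1) for _ in range(500)] for _ in range(40)]
    y = [i % 2 for i in range(40)]
    print("selection inside each fold:", cv_accuracy(X, y, 10, select_inside_fold=True))
    print("selection on all data first:", cv_accuracy(X, y, 10, select_inside_fold=False))
```

On data with no real signal, the per-fold variant would be expected to stay near chance, while selecting on all rows first tends to report an optimistic figure. The exact numbers depend on the seed and the data.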

Collections

ANU Research Publications
