
Quality and Complexity Measures for Data Linkage and Deduplication

Christen, Peter; Goiser, Karl

Description

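Deduplicating one data set or linking several data sets are increasingly important tasks in the data preparation steps of many data mining projects. The aim of such linkages is to match all records relating to the same entity. Research interest in this area has increased in recent years, with techniques originating from statistics, machine learning, information retrieval, and database research being combined and applied to improve the linkage quality, as well as to increase performance and efficiency when linking or deduplicating very large data sets. Different measures have been used to characterise the quality and complexity of data linkage algorithms, and several new metrics have been proposed. An overview of the issues involved in measuring data linkage and deduplication quality and complexity is presented in this chapter. It is shown that measures in the space of record pair comparisons can produce deceptive quality results. Various measures are discussed and recommendations are given on how to assess data linkage and deduplication quality and complexity.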
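
The abstract's remark that measures in the space of record pair comparisons can be deceptive can be illustrated with a minimal sketch. This is not code from the chapter: the function name `pair_space_measures`, the data set sizes, and the match counts are all hypothetical, and the sketch assumes two data sets are linked by comparing every record in one with every record in the other. Because almost all candidate pairs are true non-matches, any measure that counts true negatives (such as accuracy) stays close to 1 even when the classifier finds only half of the true matches.

```python
def pair_space_measures(n_a, n_b, true_matches, tp, fp):
    """Compute pair-space quality measures for linking two data sets.

    n_a, n_b     -- number of records in each data set (hypothetical sizes)
    true_matches -- number of record pairs that truly refer to the same entity
    tp, fp       -- true and false positives reported by some classifier
    """
    total_pairs = n_a * n_b              # full comparison space, no blocking
    fn = true_matches - tp               # true matches the classifier missed
    tn = total_pairs - tp - fp - fn      # everything else is a true non-match

    accuracy = (tp + tn) / total_pairs
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return {"tn": tn, "accuracy": accuracy,
            "precision": precision, "recall": recall}


# Hypothetical scenario: two data sets of 1,000 records each with 500 true
# matches; the classifier finds 250 of them and adds 250 false matches.
measures = pair_space_measures(1000, 1000, 500, 250, 250)
for name, value in measures.items():
    print(f"{name}: {value}")
```

In this hypothetical scenario accuracy is 0.9995 while precision and recall are both 0.5, which is why measures that ignore the true negatives tend to give a more honest picture of linkage quality.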

dc.contributor.author: Christen, Peter
dc.contributor.author: Goiser, Karl
dc.date.accessioned: 2015-12-08T22:33:28Z
dc.identifier.isbn: 9783540449119
dc.identifier.uri: http://hdl.handle.net/1885/34693
dc.description.abstract: Deduplicating one data set or linking several data sets are increasingly important tasks in the data preparation steps of many data mining projects. The aim of such linkages is to match all records relating to the same entity. Research interest in this area has increased in recent years, with techniques originating from statistics, machine learning, information retrieval, and database research being combined and applied to improve the linkage quality, as well as to increase performance and efficiency when linking or deduplicating very large data sets. Different measures have been used to characterise the quality and complexity of data linkage algorithms, and several new metrics have been proposed. An overview of the issues involved in measuring data linkage and deduplication quality and complexity is presented in this chapter. It is shown that measures in the space of record pair comparisons can produce deceptive quality results. Various measures are discussed and recommendations are given on how to assess data linkage and deduplication quality and complexity.
dc.publisher: Springer
dc.relation.ispartof: Quality Measures in Data Mining: Studies in Computational Intelligence
dc.relation.isversionof: 1st Edition
dc.subject: Keywords: Data integration and matching; Data mining pre-processing; Data or record linkage; Deduplication; Quality and complexity measures
dc.title: Quality and Complexity Measures for Data Linkage and Deduplication
dc.type: Book chapter
local.description.notes: Imported from ARIES
dc.date.issued: 2007
local.identifier.absfor: 080109 - Pattern Recognition and Data Mining
local.identifier.absfor: 080201 - Analysis of Algorithms and Complexity
local.identifier.ariespublication: U3594520xPUB116
local.type.status: Published Version
local.contributor.affiliation: Christen, Peter, College of Engineering and Computer Science, ANU
local.contributor.affiliation: Goiser, Karl, College of Engineering and Computer Science, ANU
local.description.embargo: 2037-12-31
local.bibliographicCitation.startpage: 127
local.bibliographicCitation.lastpage: 151
local.identifier.doi: 10.1007/978-3-540-44918-8_6
dc.date.updated: 2015-12-08T09:36:06Z
local.bibliographicCitation.placeofpublication: USA
local.identifier.scopusID: 2-s2.0-33846428121
Collections: ANU Research Publications

Download

File                                          Size       Format
01_Christen_Quality_and_Complexity_2007.pdf   263.69 kB  Adobe PDF


Items in Open Research are protected by copyright, with all rights reserved, unless otherwise indicated.
