Febrl - A Parallel Open Source Data Linkage System
Date
2004
Authors
Christen, Peter
Churches, Tim
Hegland, Markus
Publisher
Springer
Abstract
In many data mining projects, information from multiple data sources needs to be integrated, combined or linked in order to allow more detailed analysis. The aim of such linkages is to merge all records relating to the same entity, such as a patient or a customer. Most of the time the linkage process is challenged by the lack of a common unique entity identifier, and thus becomes non-trivial. Linking today's large data collections becomes increasingly difficult using traditional linkage techniques. In this paper we present an innovative data linkage system called Febrl, which includes a new probabilistic approach for improved data cleaning and standardisation, innovative indexing methods, a parallelisation approach which is implemented transparently to the user, and a data set generator which allows the random creation of records containing names and addresses. Implemented as open source software, Febrl is an ideal experimental platform for new linkage algorithms and techniques.
Keywords
Algorithms; Computer software; Information retrieval; Probability; Project management; Standardization; User interfaces; Data cleaning and standardisation; Data matching; Data mining; Data mining preprocessing; Parallel processing; Record linkage
Source
Advances in Knowledge Discovery and Data Mining. 8th Pacific-Asia Conference, PAKDD 2004 Proceedings
Type
Conference paper