Febrl - A Parallel Open Source Data Linkage System




Christen, Peter
Churches, Tim
Hegland, Markus





In many data mining projects, information from multiple data sources needs to be integrated, combined or linked in order to allow more detailed analysis. The aim of such linkages is to merge all records relating to the same entity, such as a patient or a customer. Most of the time the linkage process is challenged by the lack of a common unique entity identifier, and thus becomes non-trivial. Linking today's large data collections becomes increasingly difficult using traditional linkage techniques. In this paper we present an innovative data linkage system called Febrl, which includes a new probabilistic approach for improved data cleaning and standardisation, innovative indexing methods, a parallelisation approach which is implemented transparently to the user, and a data set generator which allows the random creation of records containing names and addresses. Implemented as open source software, Febrl is an ideal experimental platform for new linkage algorithms and techniques.



Keywords: Algorithms; Computer software; Information retrieval; Probability; Project management; Standardization; User interfaces; Data cleaning and standardization; Data matching; Data mining; Data mining preprocessing; Parallel processing; Record linkage



Advances in Knowledge Discovery and Data Mining. 8th Pacific-Asia Conference, PAKDD 2004 Proceedings


Conference paper
