Febrl - A Parallel Open Source Data Linkage System

Date

2004

Authors

Christen, Peter
Churches, Tim
Hegland, Markus

Publisher

Springer

Abstract

In many data mining projects, information from multiple data sources needs to be integrated, combined or linked in order to allow more detailed analysis. The aim of such linkages is to merge all records relating to the same entity, such as a patient or a customer. Most of the time the linkage process is challenged by the lack of a common unique entity identifier, and thus becomes non-trivial. Linking today's large data collections becomes increasingly difficult using traditional linkage techniques. In this paper we present an innovative data linkage system called Febrl, which includes a new probabilistic approach for improved data cleaning and standardisation, innovative indexing methods, a parallelisation approach which is implemented transparently to the user, and a data set generator which allows the random creation of records containing names and addresses. Implemented as open source software, Febrl is an ideal experimental platform for new linkage algorithms and techniques.

Keywords

Algorithms; Computer software; Information retrieval; Probability; Project management; Standardization; User interfaces; Data cleaning and standardisation; Data matching; Data mining; Data mining preprocessing; Parallel processing; Record linkage

Source

Advances in Knowledge Discovery and Data Mining. 8th Pacific-Asia Conference, PAKDD 2004 Proceedings

Type

Conference paper
