Scalable entity resolution using probabilistic signatures on parallel databases

Date

2018

Authors

Zhang, Yuhang
Ng, Kee Siong
Churchill, Tania
Christen, Peter

Publisher

Association for Computing Machinery (ACM)

Abstract

Accurate and efficient entity resolution is an open challenge of particular relevance to intelligence organisations that collect large datasets from disparate sources with differing levels of quality and standard. Starting from a first-principles formulation of entity resolution, this paper presents a novel entity resolution algorithm that introduces a data-driven blocking and record linkage technique based on the probabilistic identification of entity signatures in data. The scalability and accuracy of the proposed algorithm are evaluated using benchmark datasets and shown to achieve state-of-the-art results. The proposed algorithm can be implemented simply on modern parallel databases, which we have done in the financial intelligence domain with tens of terabytes of noisy data.

Keywords

Large-scale entity resolution, connected components, probabilistic signature, in-database analytics

Source

International Conference on Information and Knowledge Management, Proceedings

Type

Conference paper

Restricted until

2099-12-31