Efficient Entity Resolution with Adaptive and Interactive Training Data Selection

Christen, Peter; Vatsalan, Dinusha; Wang, Qing (Ms)

Date

2015

Authors

Christen, Peter

Vatsalan, Dinusha

Wang, Qing (Ms)

Publisher

IEEE Computer Society

Abstract

Entity resolution (ER) is the task of deciding which records in one or more databases refer to the same real-world entities. A crucial step in ER is the accurate classification of pairs of records into matches and non-matches. In most practical ER applications, obtaining training data of high quality is costly and time-consuming. Various techniques have been proposed for ER to interactively generate training data and learn an accurate classifier. We propose an approach for training data selection for ER that exploits the cluster structure of the weight vectors (similarities) calculated from compared record pairs. Our approach adaptively selects an optimal number of informative training examples for manual labeling based on a user-defined sampling error margin, and recursively splits the set of weight vectors to find pure enough subsets for training. We consider two aspects of ER that are highly significant in practice: a limited budget for the amount of manual labeling that can be done, and a noisy oracle where manual labels might be incorrect. Experiments on four real public data sets show that our approach can significantly reduce manual labeling efforts for training an ER classifier while achieving matching quality comparable to fully supervised classifiers.

URI

http://hdl.handle.net/1885/103819

Collections

ANU Research Publications

Source

Efficient Entity Resolution with Adaptive and Interactive Training Data Selection

Type

Conference paper

DOI

10.1109/ICDM.2015.63

Restricted until

2037-12-31

Downloads

File

Description

01_Christen_Efficient_Entity_Resolution_2015.pdf (363.7 KB)
