Entity Resolution with Active Learning
Date
2021
Authors
Shao, Jingyu
Abstract
Entity Resolution refers to the process of identifying records that represent the same real-world entity across one or more datasets. In the big data era, large numbers of entities need to be resolved, which leads to several key challenges, especially for learning-based ER approaches: (1) as the number of records increases, the computational complexity of the algorithms grows exponentially; (2) a considerable number of samples are needed for training, yet only a limited number of labels are available, especially when the training samples are highly imbalanced. Blocking techniques improve time efficiency by grouping potentially matched records into the same block. To address the above two challenges, in this thesis we first introduce a novel blocking scheme learning approach based on active learning techniques. With a limited label budget, our approach learns a blocking scheme that generates high-quality blocks. Two strategies, called active sampling and active branching, are proposed to select samples and generate blocking schemes efficiently. Additionally, a skyblocking framework is proposed as an extension, which aims to learn scheme skylines. In this framework, each blocking scheme is mapped to a point in a multi-dimensional scheme space, where each blocking measure represents one dimension. A scheme skyline contains the blocking schemes that are not dominated by any other blocking scheme in the scheme space. We develop three scheme skyline learning algorithms for efficiently learning scheme skylines under a given number of blocking measures and within a label budget limit.
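To make the scheme skyline concrete, the sketch below computes in plain Python which candidate blocking schemes are non-dominated when each scheme is scored on several blocking measures (here, hypothetically, pair completeness and pair quality, with higher taken as better). The scheme names, scores, and choice of measures are illustrative assumptions and are not taken from the thesis.

```python
from typing import Dict, List, Tuple

def dominates(a: Tuple[float, ...], b: Tuple[float, ...]) -> bool:
    """True if point a dominates point b: at least as good on every
    measure (higher is better here) and strictly better on at least one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

def scheme_skyline(schemes: Dict[str, Tuple[float, ...]]) -> List[str]:
    """Return the blocking schemes not dominated by any other scheme."""
    skyline = []
    for name, point in schemes.items():
        if not any(dominates(other, point)
                   for other_name, other in schemes.items() if other_name != name):
            skyline.append(name)
    return skyline

# Hypothetical schemes scored on (pair completeness, pair quality).
schemes = {
    "title-token":     (0.95, 0.20),
    "author-exact":    (0.60, 0.85),
    "title+author":    (0.55, 0.80),  # dominated by "author-exact"
    "soundex(author)": (0.80, 0.50),
}
print(scheme_skyline(schemes))  # ['title-token', 'author-exact', 'soundex(author)']
```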
While blocking is well established, we further develop the Learning To Sample approach to address the second challenge, i.e., training a learning-based model with a small number of labeled samples. This approach has two key components: a sampling model and a boosting model, which mutually learn from each other over iterations to improve each other's performance. Within this framework, the sampling model incorporates uncertainty sampling and diversity sampling into a unified optimization process, enabling us to actively select the most representative and informative samples based on an optimized integration of uncertainty and diversity. In contrast to the setting of training with a limited number of samples, a powerful machine learning model may overfit by memorizing all the sample features. Inspired by recent advances in generative adversarial networks (GANs), in this thesis we propose a novel deep learning method, called ERGAN, to address this challenge. ERGAN consists of two key components: a label generator and a discriminator, which are optimized alternately through adversarial learning. To alleviate the issues of overfitting and highly imbalanced distributions, we design two novel modules for diversity and propagation, which greatly improve the model's generalization power. We theoretically prove that ERGAN overcomes the mode collapse and convergence problems of the original GAN. We also conduct extensive experiments to empirically verify the labeling and learning efficiency of ERGAN.
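As an illustration of how uncertainty and diversity sampling can be folded into a single selection step, the following sketch greedily picks a batch of unlabeled candidate pairs by mixing an uncertainty score with a distance-based diversity score. The weighting, the feature representation, and the interaction with a boosting model are assumptions for illustration only and do not reproduce the thesis's Learning To Sample formulation.

```python
import numpy as np

def select_batch(probs, features, k, alpha=0.5):
    """Greedily pick k samples scoring high on a mix of uncertainty and diversity.

    probs    : (n,) predicted match probabilities from the current model
    features : (n, d) feature vectors for the unlabeled candidate pairs
    alpha    : assumed trade-off between uncertainty and diversity
    """
    n = len(probs)
    # Uncertainty is highest near the 0.5 decision boundary.
    uncertainty = 1.0 - 2.0 * np.abs(probs - 0.5)
    selected = []
    for _ in range(k):
        if selected:
            # Diversity: distance to the nearest already-selected sample.
            chosen = features[selected]                               # (m, d)
            dists = np.linalg.norm(features[:, None, :] - chosen[None, :, :], axis=2)
            diversity = dists.min(axis=1)
            diversity = diversity / (diversity.max() + 1e-12)
        else:
            diversity = np.ones(n)
        score = alpha * uncertainty + (1 - alpha) * diversity
        score[selected] = -np.inf                                     # never re-pick
        selected.append(int(np.argmax(score)))
    return selected

# Toy usage with random candidate pairs.
rng = np.random.default_rng(0)
print(select_batch(rng.random(100), rng.random((100, 8)), k=5))
```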
Type
Thesis (PhD)
Downloads
File
Description
Thesis Material