Entity Resolution with Active Learning

dc.contributor.author: Shao, Jingyu
dc.date.accessioned: 2021-10-04T16:13:53Z
dc.date.available: 2021-10-04T16:13:53Z
dc.date.issued: 2021
dc.description.abstract: Entity Resolution (ER) refers to the process of identifying records that represent the same real-world entity from one or more datasets. In the big data era, large numbers of entities need to be resolved, which leads to several key challenges, especially for learning-based ER approaches: (1) as the number of records increases, the computational cost of the algorithms grows exponentially; (2) a large number of samples is needed for training, but only a limited number of labels is available, especially when the training samples are highly imbalanced. Blocking techniques help to improve time efficiency by grouping potentially matched records into the same block. To address the above two challenges, in this thesis we first introduce a novel blocking scheme learning approach based on active learning techniques. With a limited label budget, our approach can learn a blocking scheme that generates high-quality blocks. Two strategies, called active sampling and active branching, are proposed to select samples and generate blocking schemes efficiently. Additionally, a skyblocking framework is proposed as an extension, which aims to learn scheme skylines. In this framework, each blocking scheme is mapped to a point in a multi-dimensional scheme space, where each blocking measure represents one dimension. A scheme skyline contains the blocking schemes that are not dominated by any other blocking scheme in the scheme space. We develop three scheme skyline learning algorithms for efficiently learning scheme skylines under a given number of blocking measures and within a label budget limit. Once blocks are well established, we further develop the Learning To Sample approach to deal with the second challenge, i.e. training a learning-based ER model with a small number of labeled samples. This approach has two key components, a sampling model and a boosting model, which iteratively learn from each other to improve each other's performance. Within this framework, the sampling model incorporates uncertainty sampling and diversity sampling into a unified optimization process, enabling us to actively select the most representative and informative samples based on an optimized integration of uncertainty and diversity. In contrast to training with a limited number of samples, a powerful machine learning model may overfit by memorizing all the sample features. Inspired by recent advances in generative adversarial networks (GANs), we propose in this thesis a novel deep learning method, called ERGAN, to address this challenge. ERGAN consists of two key components, a label generator and a discriminator, which are optimized alternately through adversarial learning. To alleviate the issues of overfitting and highly imbalanced distributions, we design two novel modules for diversity and propagation, which can greatly improve the model's generalization power. We theoretically prove that ERGAN overcomes the mode collapse and convergence problems of the original GAN. We also conduct extensive experiments to empirically verify the labeling and learning efficiency of ERGAN.
dc.identifier.other: b73317561
dc.identifier.uri: http://hdl.handle.net/1885/250431
dc.language.iso: en_AU
dc.title: Entity Resolution with Active Learning
dc.type: Thesis (PhD)
local.contributor.authoremail: u6160749@anu.edu.au
local.contributor.supervisor: Wang, Qing
local.contributor.supervisorcontact: u5170295@anu.edu.au
local.identifier.doi: 10.25911/8XJS-DR76
local.identifier.proquest: Yes
local.mintdoi: mint
local.thesisANUonly.author: 7f0d2db4-daab-4734-a64c-f714f9deebbb
local.thesisANUonly.key: aa006f62-84e4-559f-d2d2-1900aeaf8fd3
local.thesisANUonly.title: 000000015864_TC_1
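
The abstract above defines a scheme skyline as the set of blocking schemes not dominated by any other scheme in a multi-dimensional space of blocking measures. The sketch below is a minimal illustration of that dominance test and skyline computation; the scheme names, measure dimensions, and scores are hypothetical and not taken from the thesis, and only the dominance definition (at least as good in every measure, strictly better in at least one) follows the abstract.

```python
from typing import List, Tuple

# A blocking scheme is reduced here to its vector of blocking-measure scores
# (e.g. two measures scored in [0, 1]); higher is assumed better in every
# dimension. All names and numbers below are illustrative placeholders.
Scheme = Tuple[str, Tuple[float, ...]]


def dominates(a: Tuple[float, ...], b: Tuple[float, ...]) -> bool:
    """True if vector `a` dominates `b`: at least as good in every
    measure and strictly better in at least one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def scheme_skyline(schemes: List[Scheme]) -> List[Scheme]:
    """Return the schemes that are not dominated by any other scheme."""
    return [
        (name, vec)
        for name, vec in schemes
        if not any(
            dominates(other_vec, vec)
            for other_name, other_vec in schemes
            if other_name != name
        )
    ]


if __name__ == "__main__":
    candidates = [
        ("same-soundex(surname)", (0.95, 0.40)),
        ("same-prefix(title, 3)", (0.80, 0.70)),
        ("same-value(year)",      (0.75, 0.30)),  # dominated by the prefix scheme
    ]
    # Expected skyline: the soundex and prefix schemes, which are incomparable
    # (each is better than the other in one measure).
    print(scheme_skyline(candidates))
```

This brute-force version checks every pair of schemes and is only meant to make the dominance definition concrete; the thesis's skyline learning algorithms additionally work within a label budget rather than assuming all measure scores are known up front.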

Downloads

Original bundle

Name: Thesis_final.pdf
Size: 1.79 MB
Format: Adobe Portable Document Format
Description: Thesis Material