Entity Resolution with Active Learning

dc.contributor.author: Shao, Jingyu
dc.date.accessioned: 2021-10-04T16:13:53Z
dc.date.available: 2021-10-04T16:13:53Z
dc.date.issued: 2021
dc.description.abstract: Entity Resolution (ER) refers to the process of identifying records that represent the same real-world entity from one or more datasets. In the big data era, large numbers of entities need to be resolved, which leads to several key challenges, especially for learning-based ER approaches: (1) as the number of records increases, the computational cost of the algorithms grows exponentially; (2) a large number of samples is needed for training, but only a limited number of labels is available, especially when the training samples are highly imbalanced. Blocking techniques help to improve time efficiency by grouping potentially matched records into the same block. To address the above two challenges, in this thesis we first introduce a novel blocking scheme learning approach based on active learning techniques. With a limited label budget, our approach can learn a blocking scheme that generates high-quality blocks. Two strategies, called active sampling and active branching, are proposed to select samples and generate blocking schemes efficiently. Additionally, a skyblocking framework is proposed as an extension, which aims to learn scheme skylines. In this framework, each blocking scheme is mapped to a point in a multi-dimensional scheme space, where each blocking measure represents one dimension. A scheme skyline contains the blocking schemes that are not dominated by any other blocking scheme in the scheme space. We develop three scheme skyline learning algorithms for efficiently learning scheme skylines under a given number of blocking measures and within a label budget limit. Once blocks are well established, we further develop the Learning To Sample approach to deal with the second challenge, i.e. training a learning-based ER model with a small number of labeled samples. This approach has two key components, a sampling model and a boosting model, which iteratively learn from each other to improve each other's performance. Within this framework, the sampling model incorporates uncertainty sampling and diversity sampling into a unified optimization process, enabling us to actively select the most representative and informative samples based on an optimized integration of uncertainty and diversity. In contrast to training with a limited number of samples, a powerful machine learning model may overfit by memorizing all the sample features. Inspired by recent advances in generative adversarial networks (GANs), we propose in this thesis a novel deep learning method, called ERGAN, to address this challenge. ERGAN consists of two key components, a label generator and a discriminator, which are optimized alternately through adversarial learning. To alleviate the issues of overfitting and highly imbalanced distributions, we design two novel modules for diversity and propagation, which can greatly improve the model's generalization power. We theoretically prove that ERGAN overcomes the mode collapse and convergence problems of the original GAN. We also conduct extensive experiments to empirically verify the labeling and learning efficiency of ERGAN.
dc.identifier.other: b73317561
dc.identifier.uri: http://hdl.handle.net/1885/250431
dc.language.iso: en_AU
dc.title: Entity Resolution with Active Learning
dc.type: Thesis (PhD)
local.contributor.authoremail: u6160749@anu.edu.au
local.contributor.supervisor: Wang, Qing
local.contributor.supervisorcontact: u5170295@anu.edu.au
local.identifier.doi: 10.25911/8XJS-DR76
local.identifier.proquest: Yes
local.mintdoi: mint
local.thesisANUonly.author: 7f0d2db4-daab-4734-a64c-f714f9deebbb
local.thesisANUonly.key: aa006f62-84e4-559f-d2d2-1900aeaf8fd3
local.thesisANUonly.title: 000000015864_TC_1
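
The abstract above defines a scheme skyline as the set of blocking schemes not dominated by any other scheme in a multi-dimensional space of blocking measures. The sketch below is a minimal illustration of that dominance test and skyline computation; the scheme names, measure dimensions, and scores are hypothetical and not taken from the thesis, and only the dominance definition (at least as good in every measure, strictly better in at least one) follows the abstract.

```python
from typing import List, Tuple

# A blocking scheme is reduced here to its vector of blocking-measure scores
# (e.g. two measures scored in [0, 1]); higher is assumed better in every
# dimension. All names and numbers below are illustrative placeholders.
Scheme = Tuple[str, Tuple[float, ...]]


def dominates(a: Tuple[float, ...], b: Tuple[float, ...]) -> bool:
    """True if vector `a` dominates `b`: at least as good in every
    measure and strictly better in at least one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def scheme_skyline(schemes: List[Scheme]) -> List[Scheme]:
    """Return the schemes that are not dominated by any other scheme."""
    return [
        (name, vec)
        for name, vec in schemes
        if not any(
            dominates(other_vec, vec)
            for other_name, other_vec in schemes
            if other_name != name
        )
    ]


if __name__ == "__main__":
    candidates = [
        ("same-soundex(surname)", (0.95, 0.40)),
        ("same-prefix(title, 3)", (0.80, 0.70)),
        ("same-value(year)",      (0.75, 0.30)),  # dominated by the prefix scheme
    ]
    # Expected skyline: the soundex and prefix schemes, which are incomparable
    # (each is better than the other in one measure).
    print(scheme_skyline(candidates))
```

This brute-force version checks every pair of schemes and is only meant to make the dominance definition concrete; the thesis's skyline learning algorithms additionally work within a label budget rather than assuming all measure scores are known up front.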

Downloads

Original bundle

Name: Thesis_final.pdf
Size: 1.79 MB
Format: Adobe Portable Document Format
Description: Thesis Material