
Towards Scalable Real-Time Entity Resolution using a Similarity-Aware Inverted Index Approach

Christen, Peter; Gayler, Ross

Description

Most research into entity resolution (also known as record linkage or data matching) has concentrated on the quality of the matching results. In this paper, we focus on matching time and scalability, with the aim to achieve large-scale real-time entity resolution. Traditional entity resolution techniques have assumed the matching of two static databases. In our networked and online world, however, it is becoming increasingly important for many organisations to be able to conduct entity...

dc.contributor.author: Christen, Peter
dc.contributor.author: Gayler, Ross
dc.coverage.spatial: Glenelg Australia
dc.date.accessioned: 2015-12-10T21:56:40Z
dc.date.created: November 27-28 2008
dc.identifier.isbn: 9781920682682
dc.identifier.uri: http://hdl.handle.net/1885/39539
dc.description.abstract: Most research into entity resolution (also known as record linkage or data matching) has concentrated on the quality of the matching results. In this paper, we focus on matching time and scalability, with the aim to achieve large-scale real-time entity resolution. Traditional entity resolution techniques have assumed the matching of two static databases. In our networked and online world, however, it is becoming increasingly important for many organisations to be able to conduct entity resolution between a collection of often very large databases and a stream of query or update records. The matching should be done in (near) real-time, and be as automatic and accurate as possible, returning a ranked list of matched records for each given query record. This task therefore becomes similar to querying large document collections, as done for example by Web search engines, however based on a different type of documents: structured database records that, for example, contain personal information, such as names and addresses. In this paper, we investigate inverted indexing techniques, as commonly used in Web search engines, and employ them for real-time entity resolution. We present two variations of the traditional inverted index approach, aimed at facilitating fast approximate matching. We show encouraging initial results on large real-world data sets, with the inverted index approaches being up to one hundred times faster than the traditionally used standard blocking approach. However, this improved matching speed currently comes at a cost, in that matching quality for larger data sets can be lower compared to when standard blocking is used, and thus more work is required.
dc.publisher: Association for Computing Machinery Inc (ACM)
dc.relation.ispartofseries: Australasian Data Mining Conference (AusDM 2008)
dc.source: Proceedings of the seventh Australasian Data Mining Conference (AusDM 2008)
dc.subject: Keywords: Approximate matching; Data matching; Data sets; Document collection; Inverted indexing; Inverted indices; Matching speed; Personal information; Real world data; Record linkage; Resolution techniques; Similarity measure; Standard blocking; String compariso; Approximate string comparisons; Data matching; Record linkage; Scalability; Similarity measures.
dc.title: Towards Scalable Real-Time Entity Resolution using a Similarity-Aware Inverted Index Approach
dc.type: Conference paper
local.description.notes: Imported from ARIES
local.description.refereed: Yes
dc.date.issued: 2008
local.identifier.absfor: 080108 - Neural, Evolutionary and Fuzzy Computation
local.identifier.ariespublication: U3594520xPUB179
local.type.status: Published Version
local.contributor.affiliation: Christen, Peter, College of Engineering and Computer Science, ANU
local.contributor.affiliation: Gayler, Ross, Veda Advantage
local.description.embargo: 2037-12-31
local.bibliographicCitation.startpage: 10
dc.date.updated: 2016-02-24T10:17:11Z
local.identifier.scopusID: 2-s2.0-67650216370
Collections: ANU Research Publications

Download

File: 01_Christen_Towards_Scalable_Real-Time_2008.pdf
Size: 224.19 kB
Format: Adobe PDF


Items in Open Research are protected by copyright, with all rights reserved, unless otherwise indicated.
