Efficient Cryptanalysis Techniques for Privacy-Preserving Record Linkage

Vidanage, Anushka

Date

2022

Authors

Vidanage, Anushka

Abstract

The linking of records across databases has attracted increasing interest over the last few decades in domains ranging from national census and healthcare to crime and fraud detection. This interest stems from the ability of record linkage (RL) to improve data quality and facilitate advanced data mining. Because unique entity identifiers are absent across the databases to be linked, RL is generally based on quasi-identifying (QID) attribute values of entities, such as their names, addresses, and dates of birth. However, the use of such personal identifying information often raises individual, ethical, and legal concerns about privacy and confidentiality. Privacy-preserving record linkage (PPRL) seeks to develop techniques that allow linkage of databases without compromising the privacy of the entities whose records are being linked. In general, PPRL techniques encode and/or encrypt QID values in sensitive databases in order to protect the privacy of the entities while allowing accurate linkage of records using the encoded and/or encrypted values. However, certain PPRL techniques, such as the popular Bloom filter encoding, have been shown to be susceptible to privacy attacks, and a number of such attacks have been proposed over the years. While these attacks reveal different weaknesses in PPRL techniques, they also have limitations, including requiring the adversary to know specific parameters and consuming significant memory and time. Therefore, further research into both analysing existing privacy attacks and exploring novel attack methods is vital to better understand the risks associated with real-world PPRL projects. In this thesis, we present a comprehensive study of privacy attacks on PPRL. We start by proposing a taxonomy of attacks on PPRL in which the existing attacks are categorised along twelve dimensions. Our taxonomy can be used to analyse the characteristics of privacy attacks and identify their limitations.
Next, we propose a framework to quantify the vulnerabilities associated with both plaintext and encoded values of a sensitive database. Such a framework will help data custodians to assess the privacy guarantees of various PPRL techniques applied to their own databases and to make informed decisions based on that assessment. We then propose three novel privacy attacks on different PPRL techniques that overcome the drawbacks of existing attacks. Our attacks exploit weaknesses of PPRL techniques in order to reidentify encoded QID values. Compared to previous attacks, our attacks require less knowledge of the encoding process, and one of them can be applied to a diverse range of PPRL techniques. Finally, we conduct an experimental evaluation of all three proposed attack methods using real-world databases and compare them against several existing privacy attacks on PPRL. The results of this evaluation show that our attacks outperform existing attacks on PPRL in terms of reidentification accuracy and scalability to large databases.