Accurate Synthetic Generation of Realistic Personal Information

Date

2009

Authors

Christen, Peter
Pudjijono, Agus

Publisher

Springer

Abstract

A large portion of the data collected by many organisations today is about people, and it often contains personal identifying information such as names and addresses. Privacy and confidentiality are of great concern when such data is shared between organisations or made publicly available. Research in (privacy-preserving) data mining and data linkage suffers from a lack of publicly available real-world data sets that contain personal information, which makes experimental evaluations difficult to conduct. To overcome this problem, we have developed a data generator that allows flexible creation of synthetic data containing personal information with realistic characteristics, such as frequency distributions, attribute dependencies, and error probabilities. Our generator improves significantly on earlier approaches and allows the generation of data for individuals, families and households.
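
As a rough illustration of the three characteristics named in the abstract, the sketch below draws a surname from a frequency table, picks a postcode that depends on the chosen suburb, and then corrupts a duplicate record with a given error probability. It is a minimal sketch and not the authors' generator: the lookup tables, the function names (`generate_record`, `corrupt`, `generate_pair`) and the single character-deletion error model are all assumptions made for this example.

```python
import random

# Hypothetical frequency table: surname -> relative count (illustrative values only).
SURNAME_FREQ = {"Smith": 50, "Nguyen": 30, "Jones": 20}

# Hypothetical attribute dependency: the postcode depends on the suburb.
SUBURB_POSTCODES = {"Northtown": ["1000", "1001"], "Southville": ["2000"]}


def generate_record(rng):
    """Draw one clean record using the frequency table and the dependency."""
    surnames = list(SURNAME_FREQ)
    weights = [SURNAME_FREQ[s] for s in surnames]
    # Frequency distribution: common surnames are drawn more often.
    surname = rng.choices(surnames, weights=weights, k=1)[0]
    # Attribute dependency: the postcode is chosen only among those of the suburb.
    suburb = rng.choice(sorted(SUBURB_POSTCODES))
    postcode = rng.choice(SUBURB_POSTCODES[suburb])
    return {"surname": surname, "suburb": suburb, "postcode": postcode}


def corrupt(value, error_prob, rng):
    """With probability error_prob, delete one character from value."""
    if len(value) > 1 and rng.random() < error_prob:
        pos = rng.randrange(len(value))
        return value[:pos] + value[pos + 1:]
    return value


def generate_pair(error_prob, seed=None):
    """Return an (original, duplicate) pair; the duplicate may contain errors."""
    rng = random.Random(seed)
    original = generate_record(rng)
    duplicate = {k: corrupt(v, error_prob, rng) for k, v in original.items()}
    return original, duplicate
```

Calling `generate_pair(0.2, seed=42)` would return a clean record and a copy in which each field has a 20% chance of losing one character. Pairs of this kind are what a data linkage experiment would then try to match; a realistic generator would need far larger lookup tables and a richer set of error types.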

Keywords

Artificial data; Data linkage; Data matching; Error probabilities; Experimental evaluation; Frequency distributions; Personal information; Privacy; Privacy preserving; Real world data; Synthetic data; Synthetic generation; Mining; Probability distribution; Data mining pre-processing

Type

Book chapter

Book Title

Advances in Knowledge Discovery and Data Mining

Restricted until

2037-12-31